Improved epoch-based GC #3
Conversation
This RFC is still a work in progress. The "Motivation" section offers a comprehensive overview of different aspects of epoch-based garbage collection and the problems that come up with it. I wrote the text mostly by recollecting previous discussions on GitHub, the Rust forums, and IRC. I'd like to get some eyeballs on all this and hear what you think. The text is already long and dense, but there might still be important topics I missed, so let me know! The "Detailed design" section is empty, but I will write it up after we get a clearer picture of what kind of design we want. Here are my thoughts on all this:
@joshtriplett If you have time and interest in participating in this RFC, your expertise might be of great help!
Some initial thoughts: +1 for "style 4" tracking; it's understandable and there's a working implementation. Any possible extreme scaling issues could be tackled later. +1 for thread garbage buffers for the reasons already stated in the RFC. -1 for delayed garbage collection; we really shouldn't try to tackle problems that we don't have (yet). I also feel that people requiring that level of precision need much more specific things for us to discuss at this point (knobs being a big one). Regarding drops, I feel that trying to protect against bad interactions is a lost battle. As long as they're run, there's no way to make it completely safe. That being said, we should provide just the bare minimum safety net (Send + 'static), and the collection/user using crossbeam knows better and should figure out/document the rest.
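The "bare minimum safety net" mentioned above can be sketched as a deferred-drop hook that only accepts `Send + 'static` values, so the destructor can safely run later, possibly on another thread. This is an illustrative sketch, not crossbeam's actual API; `DeferQueue` and its methods are hypothetical names.

```rust
// Hypothetical sketch of a deferred-drop queue with the minimal
// `Send + 'static` safety net. Not the real crossbeam API.
struct DeferQueue {
    deferred: Vec<Box<dyn FnOnce() + Send>>,
}

impl DeferQueue {
    fn new() -> Self {
        DeferQueue { deferred: Vec::new() }
    }

    /// Defer dropping `value`. The `Send + 'static` bound is the bare
    /// minimum guarantee: the destructor may run later, on any thread.
    fn defer_drop<T: Send + 'static>(&mut self, value: T) {
        self.deferred.push(Box::new(move || drop(value)));
    }

    /// Run all deferred destructors (in a real GC this would happen only
    /// after the epoch has advanced far enough). Returns how many ran.
    fn collect(&mut self) -> usize {
        let n = self.deferred.len();
        for f in self.deferred.drain(..) {
            f();
        }
        n
    }
}
```

Beyond this bound, as noted above, documenting the remaining drop-interaction hazards would be left to the data structure using the GC.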
Some more thoughts: +1 on style 4 tracking and thread-local garbage buffers for the reasons stated in the RFC. I give a +1 on basic delayed GC since it's pretty simple to implement and is quite useful. You can just pin without doing any local/global GC (or at least no allocations, deallocations, or extra atomics). +1 on incremental GC as well. I have an implementation here that can easily be updated for any sane thread-local buffer scheme. At first we should probably just consider 'time' to be some limit on allocations and later think about advanced knobs and tuning. There needs to be a mechanism to enforce that threads can't indefinitely pile up garbage. Preventing garbage from getting stuck locally could be done at first with a global queue structure like the one used in Coco. It's probably worth adding some checks so that threads don't pointlessly push garbage into the global queue. On thread registration, I think there might be a way to do this with a vector of pointers instead of a linked list. That would be great for reducing the latency of the epoch check, although it would complicate thread registration.
A little while ago I was talking with @stjepang over IRC and I suggested that each thread track the amount of garbage it generates. By doing that, each thread can estimate the minimum amount it needs to collect and how urgent it is. As a side effect, it prevents unbounded garbage pile-up and helps with work distribution. The tracking can be done in several ways: it can be as simple as the number of deallocations, or a bit better by considering the size of each deallocation; maybe by number of bags; maybe only counting stuff pushed to the global pool to amortize the cost, etc. If thread A produced X units of garbage, it tries to collect something like min(Y * X, MIN) units, starting with the local and then the global garbage. Auto collection can be triggered every P pins or after X exceeds a limit. Edits: fixed broken english
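The per-thread accounting described above could look roughly like the following sketch. All names and constants (`GarbageCounter`, `COLLECT_RATIO` for Y, `PIN_PERIOD` for P, `UNIT_LIMIT`) are illustrative assumptions, not anything from crossbeam:

```rust
// Hypothetical per-thread garbage accounting: track units produced since
// the last collection, and trigger collection either every P pins or once
// the produced units exceed a limit.
struct GarbageCounter {
    units: usize, // garbage units produced since the last collection
    pins: usize,  // pin() calls since the last collection
}

const COLLECT_RATIO: usize = 2; // "Y" in the comment above
const UNIT_LIMIT: usize = 128;  // trigger when produced garbage exceeds this
const PIN_PERIOD: usize = 64;   // "P": also trigger every P pins

impl GarbageCounter {
    fn new() -> Self {
        GarbageCounter { units: 0, pins: 0 }
    }

    /// Record `units` of freshly produced garbage.
    fn record_garbage(&mut self, units: usize) {
        self.units += units;
    }

    /// Called on every pin; returns how many units the thread should try
    /// to collect now (0 means no collection this time).
    fn on_pin(&mut self) -> usize {
        self.pins += 1;
        if self.units > UNIT_LIMIT || self.pins >= PIN_PERIOD {
            let goal = COLLECT_RATIO * self.units;
            self.units = 0;
            self.pins = 0;
            goal
        } else {
            0
        }
    }
}
```

A thread would then collect from its local buffer first, falling back to the global queue if the goal isn't met, matching the local-then-global order suggested above.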
Cool, glad you like this idea! You expressed interest (over IRC) in using Crossbeam for low/predictable latency stuff. Would something like
I like your implementation - that's a good way of measuring garbage and time spent on collection. But I think we should have even more precision when dealing with garbage. Not all garbage is created equal, as explained in the "Garbage objects" section. For example, if we're destroying a queue node or skiplist node, that's worth one unit. However, if we're destroying a B-tree node containing 100 keys, it's worth 100 units. Likewise, destroying a large array of length 10K backing a Chase-Lev deque would be worth 10K units. These units are not perfect. For example, destroying a skiplist node containing an

The function that adds garbage might have the following signature:

```rust
// Array of `T`s stored at address `object`, containing `length` elements.
// This garbage is worth `units`.
unsafe fn defer_free<T>(&self, object: *const T, length: usize, units: usize, scope: &Scope);
```

This piece of garbage is pushed into the current thread-local bag.
That may work, but it's tricky to do. The problem is that we must load the current vector, clone it, push a new entry, and then use CAS to install the new vector. But someone may destroy the current vector while we're cloning it! With linked lists this is easier. We only need to get the address of the current head and use CAS to install a new node as the head. We don't read the contents of the nodes, so other threads can't wreak havoc by destroying them.
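The linked-list scheme described above can be sketched as follows: a new thread entry is installed at the head with a single CAS, and readers walk the list without ever needing to clone it. `Entry` and `Registry` are illustrative names, and the sketch deliberately leaks nodes rather than dealing with reclamation:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Minimal sketch of lock-free head insertion into a list of thread
// entries. Nodes are intentionally never freed here, sidestepping the
// very reclamation problem the surrounding discussion is about.
struct Entry {
    id: usize,
    next: *mut Entry,
}

struct Registry {
    head: AtomicPtr<Entry>,
}

impl Registry {
    fn new() -> Self {
        Registry { head: AtomicPtr::new(ptr::null_mut()) }
    }

    /// Register a new entry by CAS-ing it in as the new head.
    fn register(&self, id: usize) {
        let node = Box::into_raw(Box::new(Entry { id, next: ptr::null_mut() }));
        loop {
            let head = self.head.load(Ordering::Acquire);
            unsafe { (*node).next = head; }
            if self
                .head
                .compare_exchange(head, node, Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                break;
            }
        }
    }

    /// Walk the list, collecting entry ids (newest first).
    fn ids(&self) -> Vec<usize> {
        let mut out = Vec::new();
        let mut cur = self.head.load(Ordering::Acquire);
        while !cur.is_null() {
            unsafe {
                out.push((*cur).id);
                cur = (*cur).next;
            }
        }
        out
    }
}
```

Note that `register` only reads the head pointer's value, never the contents of existing nodes, which is exactly why concurrent destruction of other entries can't affect it.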
Fair enough. We certainly don't need fine control over GC right now, but I'd still like to at least think about these problems in order to see what kind of flexibility we need to build into the core GC design.
What do you mean by "there's no way to make it completely safe"? If we restrict deferred drops to types
Yep, that sounds great! If we are going to use units as described above (1 unit = 1 "element") for deciding when to seal bags and move them into the global queue, then we could simply count sealings to decide how much garbage to destroy. For example, if the thread has sealed 5 bags, then we can destroy Y * 5 bags of garbage in order to make up for it.
Fully agree - we want both triggers.
I like and agree with your ideas, but for a first cut it's too much on the complicated side.
Thank you for your efforts on this PR. It's huge :) Here are my thoughts:
I thought about that before, but my impression was that it would be a performance hit in most cases. It's more expensive to fetch the corresponding struct (there are some tricks we can use to avoid calling sched_getcpu often) and it may require extra synchronization (the preemption unit is still the thread). Once you take into account that crossbeam's pin() is measured in low nanoseconds, it starts to look really hard.
Style 1 is current Crossbeam (three distinct vectors of garbage), and style 4 is Coco (just one long queue of garbage).
The CAS trick definitely helps a lot, but is not the only contributor to better performance. Without it pinning takes 20 ns, which is still quite a bit faster than pinning in Crossbeam.
Valid point. That CAS is definitely a bit sketchy. :) We'd probably be safe if we used inline assembly to directly call
Hmm, I think we have two different kinds of actors:
Users of Crossbeam would simply call
Yes, absolutely. We need some measure of priority/urgency/size. I'll try to write a concrete proposal on that.
The problem mostly stems from the fact that a pinned thread might get preempted and prevent epoch advancement until it gets resumed. This may block garbage collection for a long time. Having per-processor garbage wouldn't help much. What would help, though, is if we could tell the OS: "If possible, please don't preempt the thread while executing this critical section."
Certainly. I think my main gripe with signals is that they reserve a signal (e.g.
Unless I misunderstood something, that vector would simply be a vector of pointers to thread entries, which is no better than a linked list of thread entries. Perhaps you had in mind a long vector with adjacent entries in memory? Registration would then involve iterating over the vector and reserving a vacant entry. If so, yeah, that would probably make checking epochs faster. This week I'll make another iteration over the whole text, try to clean it up, and make a more concrete proposal.
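The flat-vector idea above, with registration reserving a vacant entry, could be sketched like this. The slot count, names, and the plain `occupied` flag are all illustrative assumptions; a real entry would also hold the thread's epoch state:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical flat registry: a fixed array of adjacent slots, each with
// an "occupied" flag. Registration scans for a vacant slot and claims it
// with a CAS; the epoch checker can then iterate slots contiguously.
const MAX_THREADS: usize = 64;

struct Slot {
    occupied: AtomicBool,
    // In a real implementation, the thread's epoch/pin state lives here.
}

struct FlatRegistry {
    slots: Vec<Slot>,
}

impl FlatRegistry {
    fn new() -> Self {
        FlatRegistry {
            slots: (0..MAX_THREADS)
                .map(|_| Slot { occupied: AtomicBool::new(false) })
                .collect(),
        }
    }

    /// Claim a vacant slot; returns its index, or None if full.
    fn register(&self) -> Option<usize> {
        for (i, slot) in self.slots.iter().enumerate() {
            if slot
                .occupied
                .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
                .is_ok()
            {
                return Some(i);
            }
        }
        None
    }

    /// Release a previously claimed slot.
    fn unregister(&self, index: usize) {
        self.slots[index].occupied.store(false, Ordering::Release);
    }
}
```

Compared to a linked list, iterating such a vector has no pointer-chasing data dependency between steps, at the cost of a fixed capacity (or a more complicated growth scheme).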
I've got a high-level comment on this - this seems like another mega-RFC. Many of the items seem orthogonal to each other (pinning method, priority tracking, object holding methods).

@stjepang a vector of pointers is significantly better than a linked list in terms of iteration time! With a linked list, each iteration of the loop carries a data dependency on the last, while in a vector there is no data dependency. This means that any out-of-order processor can schedule multiple iterations of the loop in parallel and as a result have multiple cache misses in flight. Since this list isn't being read frequently and almost every element is being written to by a different thread than the reader, the case where each element is in the shared cache or main memory is the normal use case.

I have another gripe with signals - they can come at inconvenient times! With crossbeam as-is, you only incur epoch+GC costs when you enter an epoch. A signal might hit at any time, including latency-sensitive periods.

I'm not sure if forcing a priority in the default unlinked is a good idea. Users probably don't want to try to decide how to prioritize garbage amongst stuff from other unknown data structures. We could add a different unlinked_priority function, but I expect that in normal use it would almost always prioritize by size.

@stjepang @jeehoonkang is right about the atomics - relaxed atomics compile to standard loads/stores, and the performance gain is from merging the addition and the fence on x86. There may be a minor performance gain from merging the load of the in_critical counter and the thread-local epoch counter, but I don't think that's enough to show in benchmarks.

@stjepang I think I misunderstood the difference between GC #1 and GC #4. They appear extremely similar, with some small implementation details in how garbage is internally tracked and pinning is announced to the global GC. They both collect garbage locally, announce the current epoch by recording the current global one, and pin by fencing. We should delegate small implementation details there to another RFC, because I don't actually think it's clear which is better.
First, apologies for the late reply. I'll try to be more responsive in the future. @schets is right - this is too much material to digest in a single RFC. I've simplified the text and would like us to start with a reasonably good and simple design. Something that just supports simple drops, incremental collection, limited thread-local garbage, has faster pinning, and can urgently reclaim large garbage. What is proposed here is the first step towards better garbage collection. I'm open to future alterations of the design, API, and implementation details.

@schets Regarding the GC variant we should use - I see GC variant #4 as basically a generalization of variant #1: instead of having three vectors where two of them are blocked, we have an infinite number of vectors. I'm leaning towards #4, but you're right that it's not a simple decision - let's just try both and see what works best. shrug. :) Implementation details are up in the air.

Regarding deferred drops - let's keep it simple for now. There will be time in the future to explore this direction in more detail. For urgent reclamation (e.g. a deque must be able to quickly reclaim old buffers), let's just use a flush-based approach (just like in Coco) for now. Ideally I'd prefer to have something like @schets' incremental branch in Crossbeam.
Very nice! Great, all-around understandable RFC with a clear direction forward. I agree with the layout for the new design. I'm not quite sure I got the section on "Destructors and local GCs" though.
This is mentioned, but the following section only outlines solutions to the problem of elements containing non-static references, rather than the problem of returning references from the structure:
To expand on returning references from a structure: since the reference is now in user code, it might live as long as the structure. For this reason, we cannot destroy any element before the structure's destructor runs - this is very bad, since a lot of garbage will accumulate until then. Combined with the problem of non-static references in elements, we would have to destroy everything that the structure ever touched in the structure's destructor. For this reason, I think we really shouldn't allow this. A crossbeam issue outlines two solutions - either return some kind of

If we restrict ourselves to just allowing some non-static references inside elements, then we have to make sure everything is destroyed before the structure, but there isn't the problem of not being able to destroy anything until the structure itself is destroyed.

Another thing that we discussed on IRC some time ago is that we should make it clear that even with non-static references in deferred drop, not all kinds of them are allowed - specifically, only references that outlive the structure (or references to the structure itself) are valid. If, for example, we were to reference a method's local stack, that would crash horribly after the method finishes and we are left with a structure that still references some local variable from that method. This is important, so I feel like it should be mentioned.
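In safe Rust, the "references must outlive the structure" constraint above is exactly what a lifetime parameter on the container expresses. Here is a minimal, non-concurrent sketch (`Stack` is an illustrative stand-in, not a crossbeam type): elements borrow data with lifetime `'a`, so the structure cannot outlive any borrowed data, and pushing a reference to a shorter-lived local is rejected at compile time once the structure escapes that scope.

```rust
// Illustrative container whose elements may borrow non-'static data.
// The `'a` parameter ties the structure's lifetime to the borrowed data:
// the structure cannot outlive anything its elements reference.
struct Stack<'a, T> {
    items: Vec<&'a T>,
}

impl<'a, T> Stack<'a, T> {
    fn new() -> Self {
        Stack { items: Vec::new() }
    }

    /// Push a reference; the borrow checker enforces that the referent
    /// lives at least as long as `'a` (and hence the whole stack).
    fn push(&mut self, item: &'a T) {
        self.items.push(item);
    }

    fn len(&self) -> usize {
        self.items.len()
    }
}
```

With deferred destruction the compiler can no longer see when elements actually die, which is why the discussion above has to impose this outlives rule as a documented safety contract rather than relying on the borrow checker alone.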
@Vtec234 Good suggestion - I've added a paragraph or two that expand on the topic.
Has anyone else had the time to review the recent changes to the text? I'd really like us to push this forward, and soon start implementing the basic features that are missing in the current GC. Then we can improve it over time and try all the competing implementation details (like how to track epochs, how to store garbage in memory, how to represent a list of registered threads, etc.).
Great RFC. My only comment is: what are the semantics for the case that a
If I got that right, your question is basically "where are the boundaries of unsafe code"? It's really difficult to say - I believe we still don't have a fully conclusive answer to that question. We've always had this rule of thumb: "if you have an unsafe block anywhere in your code, then the whole surrounding module is already polluted". Gankro talks about that in his talk (around 10:00). Niko also published a great deal of blog posts on this topic (e.g. this one).
@stjepang FYI, on the semantics of
I really like the new text overall. I see no reason to block it at this point.
This looks great, and we can (and should) hash out the more minor differences in more specialized RFCs and PRs.
This is a proposal for improvements to the epoch-based garbage collector.
Rendered