Making Generational GC Thread-Safe #10317
Cc: @carnaval
Yep seems ok. I don't think there will be much trouble porting the write barrier, it should already be robust to duplication (except some heuristic counting code, not very important).
This would avoid inserting mostly-useless branches in every loop of every function.
(By the way, if you run into memory corruption and you suspect the write barrier, run the program under GC_VERIFY, which should check that no trigger was missed.)
That approach seems pretty extreme. It would be nice not to depend on modifying code.
It's not modifying code. It's in-place re-JITting :-).

If a conservative stack scan is used, why is stopping at safepoints still required? Won't the conservative scan find all possible roots?

The current threading branch, if I understand it correctly, uses memory allocation calls as safe points, which is simple, though it could make threads stall for a long time and gives weak progress guarantees. The Go flagship compiler (gc) currently uses function entry as its safepoints, piggybacking on a conditional branch that used to deal with stack overflow. Go's procedure prolog checks whether the stack is big enough and, if not, calls a routine that copies the entire stack. It looks like an attractive scheme for running many tasks without worrying about stack overflow, something we will have to worry about if parallelizing tasking. To stop all threads at a safepoint, the runtime modifies the stack limit so that all threads jump to the overflow handler on their next call. Of course it doesn't help for long loops.
Wouldn't manually gc-ing also be a valid first approach? I know it's not that flexible, but performance-critical code should ideally not allocate any memory. When I worked on the initial threading code this worked quite well. Combined with manual gc-ing after a threading barrier, even allocations should be possible.
@ArchRobison You're right, with a conservative stack scan we don't even need safepoints. Another huge argument in favor of the conservative scanner IMO. |
@carnaval: Yes you are right. I just would be extremely happy if we could get the threading branch opt-in for 0.4 and could live with a solution that is "experts only". |
@carnaval Why does a thread have multiple heaps? E.g., why isn't
Yes, what I call a heap in the code (well, a region; I've not been very consistent there) is a fixed mmap'd zone. The pool-allocating code uses this to request (and release) aligned 16k chunks of memory via malloc_page() and free_page(). Those should be fairly low-contention, so locking around them should be ok.
Not sure one can call what I did with the memory pool thread-local. I generated https://github.com/tknopp/julia/blob/jlthreading/src/gc.c#L974 In my tests this pool was a huge improvement. But has this been entirely removed from the threading branch? I thought the experts replaced this with a proper TLS solution.
Thanks for the comments and history. @carnaval: We're still working out what sort of scheduler the threading branch will use. Both work-stealing and more loop-centric approaches have merit. I'm still pondering hybrids that the parallel run-times (OpenMP, TBB, Cilk Plus) teams have kicked around. @tknopp The threading branch was using one pool per thread as far as I could tell. The "TLS" was rolled by hand via an array indexed by thread index. |
@ArchRobison: Re thread scheduler: I've begun to use Qthreads https://github.com/Qthreads/qthreads in another project recently. The main difference between Qthreads and, well, almost everything else (OpenMP, TBB, Cilk Plus) is that Qthreads doesn't introduce dependencies between threads. In the other threading architectures, threads can only wait for their children, or a core running a particular thread can only switch to running a child, or threads cannot easily switch between cores. This usually has to do with stack layout. Qthreads introduces a separate stack for each thread, leading to its flexibility. Qthreads also introduces the notion of a "shepherd", probably corresponding to NUMA domains. Threads can move between shepherds, but that happens more rarely than moving between cores that are managed by the same shepherd. Anyway, since Qthreads seems to be more powerful than most other systems, I thought I'd mention it here. Qthreads is used e.g. for Chapel.
See the thread safety tracker in #10421 |
It's good to bring up Qthreads. The one-stack-per-OS-thread model has surely done a lot of damage to enabling widespread use of parallelism. Indeed, Cilk was an early system to break the one-linear-stack-per-OS-thread association. It used heap-allocated procedure frames to simulate a cactus stack. Regrettably, that didn't play well in the marketplace since it required a different calling convention, and so the Intel version of Cilk worked out a trick to map a cactus stack onto multiple linear stacks. The dependencies between parent and child tasks are a two-edged sword. For certain common patterns, they enable very efficient implementation with provable space guarantees, specifically O(S*P) space on P threads if the program takes O(S) space on one thread. The bound sounds trivial, but it's easy to show space blowup in less constrained schedulers. On the other hand, the dependencies are annoying in other contexts. I think the overall lesson is to give careful consideration to the options while the calling convention and threading model still have some flexibility.
I don't think so. The naming may be a bit misleading, but those are not the lower/upper bound of the same quantity. We store a lower bound of the first free page and an upper bound of the last non-free page.
Thanks for the clarifications. |
As of 2fc8d97 on the jn/threading branch, this is now basically accomplished. |
can this be closed? |
We should have at least one travis target that builds with multi-threading enabled and runs known passing tests. I don't think this should hold up closing of this particular issue, but it would be nice to have it enabled. |
same as with #9336, it's rather difficult to build a newer llvm on travis right now. have to decide exactly which branch or release will be used, then package it. |
I guess we can do this once we move to our own branch of llvm 3.7, and use the same branch for threading, gallium, and everything else. Perhaps this is not the right place to discuss it. |
We've got a Travis Linux build with threading enabled passing =) https://travis-ci.org/JuliaLang/julia/builds/96090255
👏 |
that's what you said above should be the close criteria for this issue (we have newer ones open tracking this better anyways) |
Err, the branch tested is based on #14190 |
Bravo, @yuyichao. Still, it seems premature to close since this hasn't been merged. |
This issue is for discussing how to make the new GC (#8699) thread-safe. A GC expert should check the arguments here. For background, see the elegant state transition diagram in `src/gc.c`.

I'm assuming that garbage collection operates in stop-the-world mode and, for now, that the mark/sweep phase is single-threaded. Each thread can have its own thread-local `remset`, with occasional duplication across the sets because of the race described below.

Key question: what state transitions can race against each other? If I understand the GC correctly, the only racy transitions arise from a write barrier for the same parent, and that transition is `GC_MARKED --> GC_QUEUED`, which changes the low two bits of the memory location from 01 to 10. If this is indeed the only racy transition, then a plain load, modify, plain store sequence will suffice, since all racing participants will be attempting to write the same value to the location. To be portable C11, we should use relaxed atomic loads/stores instead of the current assignments into bitfields, but that's a purist nitpick.

The possibility of a race will require lightening up on assertions that check for duplicate enqueuing of roots. There will need to be a `memory_order_acq_rel` fence between normal operation and garbage collection, but that's probably already implied by the synchronization between normal operation and garbage collection.

Possible problem: `gc_wb_buf` appears to engage in other transitions. If so, which of these might race against each other? Are these transitions always on different `gc_bits` than the ones affected by `gc_wb`?