Add a function `jl_gc_notify_thread_yield` before a thread yields for GC #57297

qinsoon · 2025-02-07T04:36:40Z

This PR adds a function jl_gc_notify_thread_yield to the GC interface. This will be used by MMTk to save the thread context (registers) for conservative scanning.

Keno · 2025-02-07T04:56:13Z

src/gc-interface.h

@@ -108,6 +108,9 @@ JL_DLLEXPORT const char* jl_gc_active_impl(void);
 // each GC should implement it but it will most likely not be used by other code in the runtime.
 // It still needs to be annotated with JL_DLLEXPORT since it is called from Rust by MMTk.
 JL_DLLEXPORT void jl_gc_sweep_stack_pools_and_mtarraylist_buffers(jl_ptls_t ptls) JL_NOTSAFEPOINT;
+// Notifies the GC that the given thread is about to yield for a GC. ctx is the ucontext for the thread
+// if it is already fetched by the caller, otherwise it is NULL.
+JL_DLLEXPORT void jl_gc_notify_thread_yield(jl_ptls_t ptls, void* ctx);


I don't like the term yield here, because it conflicts with the usage of the same term in the task subsystem and has a different meaning.

Do you have any suggestion that won't cause confusion in the context?

jl_gc_thread_will_join_collection or sth?

Keno · 2025-02-07T04:57:40Z

This will be used by MMTk to save the thread context (registers) for conservative scanning.

I don't know if I have mentioned this elsewhere, but I would recommend against this as a long-term strategy. We make no semantic guarantees that conservative scanning will be able to find all pointers.

qinsoon · 2025-02-07T05:03:09Z

This will be used by MMTk to save the thread context (registers) for conservative scanning.

I don't know if I have mentioned this elsewhere, but I would recommend against this as a long-term strategy. We make no semantic guarantees that conservative scanning will be able to find all pointers.

What is the concern here?

We had two concerns before we chose to use conservative scanning:

LLVM may partition pointers (you do not see the pointer on the stack or registers at all). But it seems that optimization is rare, and only used for a small number of platforms (and those are not common platforms). So we decided to ignore this concern.
There may be internal/derived pointers on the stack. We handled the case. We can find the base pointer from an internal pointer.

Keno · 2025-02-07T05:07:00Z

LLVM may partition pointers (you do not see the pointer on the stack or registers at all). But it seems that optimization is rare, and only used for a small number of platforms (and those are not common platforms). So we decided to ignore this concern.

Yes, this one. I agree that it is rare currently, I'm just warning that LLVM does not guarantee this, and you only need one to start crashing. I old the GAP.jl people the same thing when they started stack scanning julia.

vtjnash

This is not sufficient in the presence of threads. I started a corrected implementation that takes into account threads at #56477, but the cost seemed too absurdly high, especially given that we already know LLVM will miscompile code in the presence of a conservative scanner, which will necessarily lead to unsound GC results. (without an explicit leak of the object to the shadow stack, it might not bother to assign NULLs to the slots or lengths to the arrays or types to the objects)

qinsoon · 2025-02-07T05:25:21Z

LLVM may partition pointers (you do not see the pointer on the stack or registers at all). But it seems that optimization is rare, and only used for a small number of platforms (and those are not common platforms). So we decided to ignore this concern.

Yes, this one. I agree that it is rare currently, I'm just warning that LLVM does not guarantee this, and you only need one to start crashing. I old the GAP.jl people the same thing when they started stack scanning julia.

I agree.

When we started prototyping for a moving GC, x86_64 is the only target we cared about. We found these two discussions and knew from them that this optimization was not enabled on x86_64 and it wouldn't be a show stopper for us to use conservative stack scanning:

Now we should assess how many targets Julia supports may be affected by the optimization, and whether the LLVM maintainers would encourage this optimization for more platforms. I don't have an answer to this. Insights are welcome.

At this point, conservative stack scanning is a key part of our prototype to support moving GCs in Julia. Conservative stack scanning helps us identify references on the stack, and we only need to worry about references in the native heap, code, or globals. If we cannot use conservative stack scanning, we would have to identify references on the stack through other approaches.

vtjnash · 2025-02-07T05:26:08Z

The GAP people generally get away with this since they implement their object creation using C functions, so Julia cannot optimize them. That isn't true for any Julia objects however.

Keno · 2025-02-07T05:26:27Z

This is not sufficient in the presence of threads

Are you thinking about threads running in gc-unsafe mode in particular? Or is there a problem with the gf-safe threads also that I'm not seeing?

vtjnash · 2025-02-07T05:27:56Z

Yes, specifically with gc-safe. Those are regions of code that cannot be stopped (would deadlock if you tried) but must be scanned.

qinsoon · 2025-02-07T05:35:47Z

Yes, specifically with gc-safe. Those are regions of code that cannot be stopped (would deadlock if you tried) but must be scanned.

Can you explain a bit more on this? I am not familiar with gc-safe/unsafe threads in Julia. But it sounds like you are talking about some threads that are allowed to run when a GC is happening. Generally in those cases, those threads are allowed to run, as GC does not care about those threads. Why do those threads need to be scanned?

Keno · 2025-02-07T05:41:19Z

We permit threads to transition into regions where they run concurrently with the GC as long as they do not modify gc-tracked memory while they are in those regions. However, their stacks may still have references to gc-tracked objects that need to be scanned. That said, could you explain what you're trying to use conservative stack scanning for? You're not gonna be able to move the objects you find anyway, so I don't see how it helps you.

oscardssmith · 2025-02-07T05:50:27Z

The plan is to pin everything directly referenced by the stack (as well as anything that gets GC.@preserved or pointered, but to allow moving other objects.

Keno · 2025-02-07T05:56:16Z

Ok, but in that case why not extend the ordinary GC stack and just pin everything there?

qinsoon · 2025-02-07T05:59:47Z

Ok, but in that case why not extend the ordinary GC stack and just pin everything there?

The GC stack does not capture all the references. If an object is rooted (and pushed to the GC stack), any object reachable from the object does not need to be rooted and may not appear on the GC stack.

Keno · 2025-02-07T06:09:28Z

Ok, sure, but we could just extend the macros to also specify the objects to be pinned - we already have the static analysis to check liveness across a safepoint for you, so that seems like a relatively easy thing to do in comparison to doing a correct conservative stack scan.

qinsoon · 2025-02-07T06:22:50Z

We permit threads to transition into regions where they run concurrently with the GC as long as they do not modify gc-tracked memory while they are in those regions. However, their stacks may still have references to gc-tracked objects that need to be scanned.

This actually might be a problem.

We currently do conservative scanning for two places: 1. native stacks, and 2. registers. We go through native stacks by iterating all the _jl_tls_states_t structs which should include all the threads (running or stopped). We go through registers using the hooks added by this PR to capture registers in the threads that are stopped. So we are not scanning registers for running threads. It is possible that a running thread may still have references in the regsiters, even though they are not allowed to use heap references.

We would need a way to let those running threads to spill all the registers to the stack before they transition into the GC-safe regions. We may use the same way to spill registers for the stopped threads as well (and may not need this hook in that case).

Keno · 2025-02-07T06:24:29Z

We would need a way to let those running threads to spill all the registers to the stack before they transition into the GC-safe regions. We may use the same way to spill registers for the stopped threads as well (and may not need this hook in that case).

This is insufficient - the threads are still allowed to read GC memory, just not write to it.

qinsoon · 2025-02-07T06:33:59Z

Ok, sure, but we could just extend the macros to also specify the objects to be pinned - we already have the static analysis to check liveness across a safepoint for you, so that seems like a relatively easy thing to do in comparison to doing a correct conservative stack scan.

It is possible. But it would be a lot of work to mark all the objects to be pinned through the code base. I had attempted something here for only two source files. All those PIN macros are the objects that are not in the GC stack, but used in the code.

We could extend the JL_GC_PUSH macro to add a list of objects to be pinned, like JL_GC_PUSH(x, [x.a, x.b, y]), and do not bother about unpin. This would require us to extend the static analyzer (if possible) to get it work.

Keno · 2025-02-07T06:36:29Z

and do not bother about unpin.

Presumably they could be unpinned together with the GC_POP.

This would require us to extend the static analyzer (if possible) to get it work.

Seems easier than a correct stack scan in the presence of the complications discussed above.

qinsoon · 2025-02-07T07:11:43Z

This would require us to extend the static analyzer (if possible) to get it work.

Seems easier than a correct stack scan in the presence of the complications discussed above.

With my limited knowledge on the GCChecker, I doubt it is easier to get a correct static analyzer than getting correct conservative scanning. I had attempts with GCChecker, but I found there were many assumptions wired in for the Julia's rooting semantics. And maintaining a pinning state along with the rooting state was not as easy as I thought. Besides, conservative scanning is a one-time effort. Once it is working, it works no matter how the runtime code changes. And explicitly pinning objects is a continuing effort that every runtime developer needs to be aware of.

With all that said, I am still open to the idea of explict pinning with macros. Just in our attempts, it did not work out, and conservative scanning worked.

Keno · 2025-02-07T07:37:18Z

I'm fine to proceed with conservative stack scanning if you can work out the correctness issues identified. That said, I don't think that it's an acceptable long-term solution, so we'll have to do something else before we can consider enabling mmtk by default. At this point, I think it's a question of your prioritization.

vchuravy · 2025-02-07T08:59:09Z

Keno mentioned that we allow threads to run concurrently to GC, and that they are allowed to read Julia memory.
#49933 will make that a more common occurrence and make it happen from user-code as well.

We would need a way to let those running threads to spill all the registers to the stack before they transition into the GC-safe regions. We may use the same way to spill registers for the stopped threads as well (and may not need this hook in that case).

That would be incredibly expensive and would stop us from using GC_safe regions for latency critical operations.

Besides, conservative scanning is a one-time effort. Once it is working, it works no matter how the runtime code changes. And explicitly pinning objects is a continuing effort that every runtime developer needs to be aware of.

That's a false dichotomy. We already have a precise GC and runtime developers and Julia language developers have been able to use it correctly (while assisted by things like the static analyzer). Conservative scanning has issues, and we don't know that it is possible to be correct given codegen.

vtjnash · 2025-02-07T15:41:41Z

That would be incredibly expensive and would stop us from using GC_safe regions for latency critical operations.

On further reflection, we do have this ability already (to support profiling of async thread operation; natively on most kernels, emulated on Linux), so while that is a high-cost operation, only a conservative GC scan must pay it, not the whole runtime. We could revive my PR therefore using that GetThreadContext implementation we already have there

udesou · 2025-02-07T20:39:01Z

The other thing to add is that the collectors we support so far do judicious copying, meaning that they don't try to copy everything at every GC, but use heuristics to decide what/when to copy. From a correctness perspective, we could turn off copying for GC runs in which not all threads have stopped but some are operating in a gc-safe region. It might not be ideal if you never hit a point in which all threads stop for GC, but we could have that to guarantee correctness until we think about a better solution.

Add a hook before yielding for GC to save context

d219282

qinsoon added the GC: MMTK MMTK GC integration label Feb 7, 2025

Keno reviewed Feb 7, 2025

View reviewed changes

vtjnash reviewed Feb 7, 2025

View reviewed changes

qinsoon mentioned this pull request Feb 7, 2025

Support moving GCs in MMTk #57294

Draft

5 tasks

vchuravy mentioned this pull request Feb 7, 2025

Support copying garbage collectors #56819

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a function `jl_gc_notify_thread_yield` before a thread yields for GC #57297

Add a function `jl_gc_notify_thread_yield` before a thread yields for GC #57297

qinsoon commented Feb 7, 2025

Keno Feb 7, 2025

qinsoon Feb 7, 2025

Keno Feb 7, 2025

Keno commented Feb 7, 2025

qinsoon commented Feb 7, 2025

Keno commented Feb 7, 2025

vtjnash left a comment

qinsoon commented Feb 7, 2025 •

edited

Loading

vtjnash commented Feb 7, 2025

Keno commented Feb 7, 2025

vtjnash commented Feb 7, 2025

qinsoon commented Feb 7, 2025

Keno commented Feb 7, 2025

oscardssmith commented Feb 7, 2025

Keno commented Feb 7, 2025

qinsoon commented Feb 7, 2025 •

edited

Loading

Keno commented Feb 7, 2025

qinsoon commented Feb 7, 2025

Keno commented Feb 7, 2025

qinsoon commented Feb 7, 2025

Keno commented Feb 7, 2025

qinsoon commented Feb 7, 2025

Keno commented Feb 7, 2025

vchuravy commented Feb 7, 2025 •

edited

Loading

vtjnash commented Feb 7, 2025

udesou commented Feb 7, 2025

Add a function jl_gc_notify_thread_yield before a thread yields for GC #57297

Are you sure you want to change the base?

Add a function jl_gc_notify_thread_yield before a thread yields for GC #57297

Conversation

qinsoon commented Feb 7, 2025

Keno Feb 7, 2025

Choose a reason for hiding this comment

qinsoon Feb 7, 2025

Choose a reason for hiding this comment

Keno Feb 7, 2025

Choose a reason for hiding this comment

Keno commented Feb 7, 2025

qinsoon commented Feb 7, 2025

Keno commented Feb 7, 2025

vtjnash left a comment

Choose a reason for hiding this comment

qinsoon commented Feb 7, 2025 • edited Loading

vtjnash commented Feb 7, 2025

Keno commented Feb 7, 2025

vtjnash commented Feb 7, 2025

qinsoon commented Feb 7, 2025

Keno commented Feb 7, 2025

oscardssmith commented Feb 7, 2025

Keno commented Feb 7, 2025

qinsoon commented Feb 7, 2025 • edited Loading

Keno commented Feb 7, 2025

qinsoon commented Feb 7, 2025

Keno commented Feb 7, 2025

qinsoon commented Feb 7, 2025

Keno commented Feb 7, 2025

qinsoon commented Feb 7, 2025

Keno commented Feb 7, 2025

vchuravy commented Feb 7, 2025 • edited Loading

vtjnash commented Feb 7, 2025

udesou commented Feb 7, 2025

Add a function `jl_gc_notify_thread_yield` before a thread yields for GC #57297

Add a function `jl_gc_notify_thread_yield` before a thread yields for GC #57297

qinsoon commented Feb 7, 2025 •

edited

Loading

qinsoon commented Feb 7, 2025 •

edited

Loading

vchuravy commented Feb 7, 2025 •

edited

Loading