Garbage collection thread safety issues on 1.11 #56871

MilesCranmer · 2024-12-19T21:16:19Z

I think the garbage collection might have some thread safety issues on 1.11?

After running into various GC issues with SymbolicRegression.jl and PySR (reported in other issues #56735 #56759), I've been trying to break the GC with more minimal examples in the hopes of generating fixes.
Here is one such example I found, which results in the GC freezing completely.

The idea is to spawn multiple tasks that allocate and occasionally trigger GC.
This pushes the GC into running sweeps concurrently and potentially reveals data races in the parts of the code that modify the allocation map or page metadata.

@sync for t in 1:100
    Threads.@spawn begin
        # Each thread/task does a bunch of allocations
        for i in 1:10000
            # Allocate arrays
            A = Vector{Any}(undef, 1000)
            # Occasionally force a GC collection
            if (i % 1000) == 0
                GC.gc()
            end
        end
    end
end

If you run this loop ~2-3 times in a REPL, you should hit a freeze.

I wonder if this is from alloc map and page metadata being accessed by multiple GC threads (??). Or maybe it's just from the malloc memory leak identified in the other thread.
I didn't see obvious locking here, so I think there might be thread races? If multiple collector threads modify metadata concurrently (like changing page states), this may cause corruption.

Now, the good news is that this seems to be fixed on nightly, though it doesn't seem to be fixed yet on the release-1.11 branch. I'm not sure what caused the issue. Maybe someone can point me to an issue I missed that is now fixed.

In any case, it might be useful to have some of these tests in the CI, so such issues will not show up again?
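
As a minimal sketch of what such a CI test could look like: run the reproducer in a child process and fail if it does not exit within a time limit, since the frozen process can't be interrupted from inside. The helper name `finishes_within`, the 120-second limit, and the thread count are assumptions chosen for illustration, not anything from Julia's existing test suite.

```julia
# Hypothetical helper: run a Julia snippet in a child process and report
# whether it exited successfully within `timeout` seconds.
function finishes_within(code::AbstractString; timeout::Real=120, nthreads::Int=4)
    cmd = `$(Base.julia_cmd()) --startup-file=no --threads=$nthreads -e $code`
    # With wait=false the child runs asynchronously and its I/O goes to devnull.
    proc = run(cmd; wait=false)
    # timedwait polls the callback until it returns true or the timeout elapses.
    status = timedwait(() -> process_exited(proc), Float64(timeout))
    if status === :ok
        return success(proc)
    end
    # The hang ignores Ctrl-C, so use SIGKILL rather than SIGINT.
    kill(proc, Base.SIGKILL)
    return false
end

# The reproducer from above, repeated a few times because the freeze
# does not necessarily show up on the first run.
reproducer = """
for rep in 1:3
    @sync for t in 1:100
        Threads.@spawn begin
            for i in 1:10000
                A = Vector{Any}(undef, 1000)
                if i % 1000 == 0
                    GC.gc()
                end
            end
        end
    end
end
"""
```

In a test file this would be used as `@test finishes_within(reproducer)`. Running it in a separate process means a hang shows up as a test failure instead of stalling the whole CI job.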


MilesCranmer · 2024-12-19T21:27:02Z

If I run the same code on 1.10, I don’t see any hanging, so I think this is a regression. (When I say “hang”, I mean it completely freezes up and I can’t even <ctrl-c> out of it)

MilesCranmer · 2024-12-19T21:49:48Z

> Or maybe it's just from the malloc memory leak identified in the other thread.

Just tried backporting the simpler fix for #56801, but it doesn’t seem to fix this issue on 1.11, so this error seems independent.

d-netto · 2025-01-03T15:46:37Z

Did some preliminary investigation.

Looks like it's hanging in the "wait-for-the-world" phase.

Patch:

diff --git a/src/gc.c b/src/gc.c
index fd9ad71d8a..5e2f34c81c 100644
--- a/src/gc.c
+++ b/src/gc.c
@@ -3880,7 +3880,9 @@ JL_DLLEXPORT void jl_gc_collect(jl_gc_collection_t collection)
     jl_fence();
     gc_n_threads = jl_atomic_load_acquire(&jl_n_threads);
     gc_all_tls_states = jl_atomic_load_relaxed(&jl_all_tls_states);
+    jl_safe_printf("Before STW\n");
     jl_gc_wait_for_the_world(gc_all_tls_states, gc_n_threads);
+    jl_safe_printf("After STW\n");
     JL_PROBE_GC_STOP_THE_WORLD();
 
     uint64_t t1 = jl_hrtime();

Results:

...
Before STW
After STW
Before STW
After STW
Before STW
After STW
Before STW
After STW
Before STW
After STW
Before STW
After STW
Before STW
After STW
Before STW
<Stuck>

CC: @vtjnash, @gbaraldi.

vchuravy · 2025-01-06T12:59:59Z

> If you run this loop ~2-3 times in a REPL, you should hit a freeze.

What OS/chip are you running on? I am having a hard time reproducing on Linux + AMD x86.

d-netto · 2025-01-06T13:08:26Z

I could reproduce reliably on a stock M2, at least (macOS+aarch64).

vtjnash · 2025-01-06T13:12:21Z

The bug was in darwin-specific code paths (#53868)

gbaraldi · 2025-01-06T16:45:38Z

~~I haven't checked for sure yet.~~
Seems to be it

MilesCranmer · 2025-01-06T19:54:24Z

P.S., could we add this as a test? I can do a PR if so

giordano added the GC (Garbage collector) and regression 1.11 (Regression in the 1.11 release) labels Dec 19, 2024

ViralBShah added the multithreading (Base.Threads and related functionality) label Dec 23, 2024

This was referenced Dec 31, 2024

Task switch error on Enzyme v0.13 EnzymeAD/Enzyme.jl#2081 (Open)

fix: patch changed behavior of setproperty! for modules JuliaPy/PythonCall.jl#583 (Merged)

d-netto self-assigned this Jan 3, 2025

gbaraldi mentioned this issue Jan 3, 2025

Utilize bitshifts correctly in signals-mach.c when storing/reading the previous GC state #53868 (Merged)

vtjnash closed this as completed Jan 6, 2025
