gc: improve mallocarrays locality #56801

Merged: vtjnash merged 1 commit into master from jn/mallocarrays-list on Dec 13, 2024

Conversation

@vtjnash (Member) commented Dec 11, 2024

small_arraylist_t has much better memory locality and space utilization than a linked list with individually malloc'd elements
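For a rough picture of the difference, here is an illustrative sketch with made-up names (a simplification, not the actual definitions in the Julia sources):

```c
#include <stdint.h>

/* Illustrative sketch only; names and sizes are approximations, not the
 * actual Julia definitions. */

/* Old scheme: one individually malloc'd node per tracked allocation,
 * chained into a linked list. Nodes end up scattered across the heap, so
 * walking the list during sweeping has poor cache locality, and every
 * node pays malloc overhead for two pointers. */
typedef struct _malloc_node_t {
    void *a;                      /* the tracked buffer */
    struct _malloc_node_t *next;  /* next node, allocated separately */
} malloc_node_t;

/* New scheme: a growable array of pointers. All entries live in one
 * contiguous block (with a small inline buffer for the common case), so
 * sweeping is a linear scan over consecutive memory. */
typedef struct {
    uint32_t len;       /* number of tracked buffers */
    uint32_t max;       /* capacity of items */
    void **items;       /* contiguous storage, grown as needed */
    void *_space[6];    /* small inline buffer avoids a malloc when only
                           a few items are tracked */
} small_ptr_list_t;
```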

@vtjnash vtjnash added the GC Garbage collector label Dec 11, 2024
@vtjnash vtjnash requested a review from gbaraldi December 11, 2024 14:19
Review comment on src/gc-stock.c (outdated, resolved)
@vtjnash vtjnash force-pushed the jn/mallocarrays-list branch from 8afa0e6 to 725adba Compare December 11, 2024 20:17
@gbaraldi (Member):

This seems to fix #56759 for me. So we probably want to backport this

@gbaraldi gbaraldi added the backport 1.11 Change should be backported to release-1.11 label Dec 11, 2024
@d-netto (Member) commented Dec 11, 2024

> This seems to fix #56759 for me. So we probably want to backport this

IIRC we're still using this data structure in 1.10.

If this is indeed a source of a memory leak, then we should consider backporting it to 1.10 as well.

@vtjnash vtjnash force-pushed the jn/mallocarrays-list branch 3 times, most recently from ed028da to b02629a Compare December 12, 2024 21:30
small_arraylist_t has much better memory locality and space utilization
than a linked list with individually malloc'd elements. However, it seemed
that small_arraylist_t needed extra alignment, both to hold all the elements
the user might create and to avoid gcc bugs.
@vtjnash vtjnash force-pushed the jn/mallocarrays-list branch from b02629a to 6e2c0cb Compare December 13, 2024 00:52
@MilesCranmer (Member):

Can confirm this fixes #56759. Nice work!

@vtjnash vtjnash merged commit 03a0247 into master Dec 13, 2024
5 of 7 checks passed
@vtjnash vtjnash deleted the jn/mallocarrays-list branch December 13, 2024 02:54
@vchuravy (Member):

Do we know why it fixes the memory leak? Were we simply not freeing the elements of the linked list?

@oscardssmith (Member):

yes

@vtjnash (Member, Author) commented Dec 14, 2024

No idea

@MilesCranmer (Member):

Could this also be backported to 1.10?

@oscardssmith oscardssmith added the backport 1.10 Change should be backported to the 1.10 release label Dec 14, 2024
@alior101:

Thanks for this commit! I've been banging my head for the last few days over an ever-increasing MaxRSS in something whose memory use should have stayed fixed.
Can you please share more details on why it solves the memory leak? I looked at the commit changes but it's not very clear.
Also, when will this be backported to 1.11.3?
Thanks!

@MilesCranmer (Member) commented Dec 19, 2024

Ok, I read through things a bit more carefully. My understanding is that the old code (and current 1.10 and 1.11) managed allocated blocks using this mallocmemory_t type, which forms a manually maintained linked list. Nodes in this list are not properly being freed, only reused, resulting in a leak. (The buffers tracked by the nodes are freed, but not the nodes themselves.) The reason it was only noticed on 1.11 is that this type started being used for more things (such as all arrays). My library SymbolicRegression.jl does tons of small array allocations, which I suppose helped surface this issue.

I think @vtjnash switched to small_arraylist_t (and deleted mallocmemory_t) because small_arraylist_t is more widely used and thus has more robust memory management in place, which removes the issue entirely.

I think the actual underlying issue is here, in sweep_malloced_memory:

```c
*pma = nxt;
int isaligned = (uintptr_t)ma->a & 1;
jl_gc_free_memory(a, isaligned);
ma->next = ptls2->gc_tls_common.heap.mafreelist;
ptls2->gc_tls_common.heap.mafreelist = ma;
```

So, jl_gc_free_memory frees the buffer inside the mallocmemory_t (the ->a), but not the mallocmemory_t node itself. So I think the issue is that the mallocmemory_t nodes just accumulate without being tracked?

Switching to small_arraylist_t for tracking allocations and deleting mallocmemory_t (which was rarely used before 1.11; perhaps that's why nobody noticed the bug) is another way to get around this issue. I guess it might be better overall (even if the above is the underlying issue) because it means the GC is simpler and uses fewer data structures, so it is less likely to have this class of bug in the future.
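Roughly, my mental model of the arraylist-based approach (an illustrative sketch with made-up names, not the actual code in gc-stock.c):

```c
#include <stdlib.h>

/* Sketch: track malloc'd buffers in a growable pointer array and sweep it
 * by compaction. Names and details are hypothetical. */
typedef struct {
    void **items;
    size_t len, max;
} ptr_list_t;

static void track_buffer(ptr_list_t *l, void *buf) {
    if (l->len == l->max) {
        size_t newmax = l->max ? 2 * l->max : 8;
        void **p = (void **)realloc(l->items, newmax * sizeof(void *));
        if (!p)
            abort();                /* keep the sketch simple */
        l->items = p;
        l->max = newmax;
    }
    l->items[l->len++] = buf;       /* no per-entry bookkeeping node to malloc */
}

static void sweep_buffers(ptr_list_t *l, int (*is_live)(void *)) {
    size_t kept = 0;
    for (size_t i = 0; i < l->len; i++) {
        void *buf = l->items[i];
        if (is_live(buf))
            l->items[kept++] = buf; /* survivor: compact in place */
        else
            free(buf);              /* dead: free the buffer itself */
    }
    l->len = kept;                  /* nothing else to recycle or leak */
}
```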

I think an alternate fix would be to just do:

```diff
  *pma = nxt;
  int isaligned = (uintptr_t)ma->a & 1;
  jl_gc_free_memory(a, isaligned);
- ma->next = ptls2->gc_tls_common.heap.mafreelist;
- ptls2->gc_tls_common.heap.mafreelist = ma;
+ free(ma);
```

without further changes or needing to replace this mallocmemory_t type. But as I said, I think it's better to have fewer data structures and aim for simplicity like this PR does, so it's probably better this way!

(Hoping someone corrects me if I am wrong)

@MilesCranmer (Member) commented Dec 19, 2024

Cool. Just confirmed that adding that single change to sweep_malloced_memory is enough to fix the leak. That might be an easier and more minimal backport for 1.10.8/1.11.3? @KristofferC

With @gbaraldi's example, on 1.11.2:

```julia
julia> using StatsBase

julia> function Simulate()
           Simulations = Int(1e7)
           Size = 1000
           result = Array{Float64}(undef, Simulations, 1)
           Threads.@threads for i = 1:Simulations
               x = randn(Size)
               s = sort(x)
               result[i, 1] = s[1]
           end
           println(median(result))
       end
Simulate (generic function with 1 method)

julia> for i in 1:1000
           println(i)
           Simulate()
           GC.gc(true)
           # Print live_bytes
           println("live_bytes in MB: ", Base.gc_live_bytes() / 1024^2)
           sleep(10) # sleep for 10 seconds
       end
1
-3.1975019113594954
live_bytes in MB: 2182.0746631622314
2
-3.197639840642432
live_bytes in MB: 4273.654802322388
3
-3.1976844965361293
live_bytes in MB: 6365.400560379028
4
-3.1975014405024504
live_bytes in MB: 8457.508516311646
5
-3.1978872129155835
live_bytes in MB: 10549.042078018188
```

But if I just add that single free(ma), I get:

```
1
-3.1975903921733435
live_bytes in MB: 92.49524688720703
2
-3.1973693558307046
live_bytes in MB: 92.49922943115234
3
-3.1975461857988234
live_bytes in MB: 92.49922943115234
4
-3.1974292157057222
live_bytes in MB: 92.49922943115234
5
-3.197542860188766
live_bytes in MB: 92.49922943115234
```

@alior101:

Thanks for the explanation!

@KristofferC (Member):

On 1.10 I get:

```
1
-3.1976480507706193
live_bytes in MB: 129.57639598846436
2
-3.197508883859199
live_bytes in MB: 129.50681591033936
3
```

so I think it is fine there?

@KristofferC KristofferC removed the backport 1.10 Change should be backported to the 1.10 release label Jan 2, 2025
@MilesCranmer (Member):

mallocmemory_t isn't used for arrays on 1.10, which is what that test script does allocations for, but I think the leak should still be there.

@KristofferC (Member):

Trying to run the example from #56741 with 2adfacc applied, I get:

```
1
-3.197463115148186
live_bytes in MB: 438.03278064727783
2
-3.1975149403064047
live_bytes in MB: 850.664532661438
```

so it doesn't seem to be fixed with that commit?

stevengj pushed a commit that referenced this pull request Jan 2, 2025
@MilesCranmer (Member):

Weird, it worked for me, unless I ran the wrong binary. Maybe run it a bit longer to see if it saturates? I think I had set --heap-size-hint pretty low to force it to garbage collect earlier.

@MilesCranmer (Member):

Very weird. When I compile that specific commit I also see the behavior indicating the leak wasn't fixed. Maybe I just ran the wrong binary or something equally dumb.

I'm going to see if applying that patch to v1.11.2 fixes it, because I think that's what I originally tried. Anyway, it looks like it was already backported in 8001392, so maybe just go for that.

On 1.10 it doesn't look like this data structure was actually used (@d-netto?), so perhaps we can leave it off the backport list.

@MilesCranmer (Member):

Yep the patch doesn't fix v1.11.2 either 😵‍💫. Cosmic ray must have hit the PC. Sorry about that. Mystery is back on. Going to go grab some booze...

@vtjnash (Member, Author) commented Jan 2, 2025

jl_gc_track_malloced_array was present in v1.10, and just renamed in v1.11

@MilesCranmer (Member):

Thanks. I guess it would be helpful to understand the root cause in order to fix a potential leak on 1.10, unless it's OK to apply a similar replacement patch to the LTS as well? (I wasn't sure whether the data structure change could introduce other issues we haven't thought about yet, which might be too large a change for an LTS version.)

My understanding is that the lines above are a recycling mechanism for the linked list of mallocmemory_t nodes, and these nodes are either accumulating faster than they can be recycled, or perhaps failing to free their referenced allocation somewhere?
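For reference, a rough sketch of the recycling mechanism I mean (illustrative names and layout, not the actual 1.10/1.11 GC code):

```c
#include <stdlib.h>

/* Hypothetical sketch of freelist-based node recycling. */
typedef struct _node_t {
    void *a;                        /* the tracked buffer */
    struct _node_t *next;
} node_t;

static node_t *mafreelist = NULL;   /* spare nodes kept around for reuse */

/* Track a newly malloc'd buffer: pop a spare node off the freelist if one
 * exists, otherwise malloc a fresh node, then push it onto the live list. */
static void track(void *buf, node_t **live) {
    node_t *ma = mafreelist;
    if (ma)
        mafreelist = ma->next;
    else
        ma = (node_t *)malloc(sizeof(node_t));
    ma->a = buf;
    ma->next = *live;
    *live = ma;
}

/* Sweep one dead entry: unlink the node, free the tracked buffer, and put
 * the node back on the freelist instead of free()ing it, so nodes are only
 * ever reused, never released back to the system. */
static void sweep_one(node_t **pma) {
    node_t *ma = *pma;
    *pma = ma->next;
    free(ma->a);
    ma->next = mafreelist;
    mafreelist = ma;
}
```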

Maybe the first step is creating an equivalent to Simulate that actually hits this type of allocation in 1.10? Maybe there isn’t even a leak there

@alior101 commented Jan 3, 2025

I'm on nightly and can see a flat GC live_bytes but increasing RSS and MaxRSS until OOM a few hours later (the app is allocating ~50 MB every few seconds inside a function)... Running with a heap-size-hint of 600M in a container with a memory limit of 2GB, but that doesn't seem to help...

@d-netto (Member) commented Jan 3, 2025

> I'm on nightly and can see a flat GC live_bytes but increasing RSS and MaxRSS until OOM a few hours later (the app is allocating ~50 MB every few seconds inside a function)... Running with a heap-size-hint of 600M in a container with a memory limit of 2GB, but that doesn't seem to help...

Do you have an open-source reproducer? If so, could you please open an issue?

@alior101 commented Jan 3, 2025

Unfortunately not. But I'll try to come up with a minimal working example and post it here. BTW, reverting to 1.10.7 fixed the issue completely.

@alior101 commented Jan 5, 2025

Ok, the OOM in the container was due to the Linux kernel being lazy about clearing physical memory used by cached log files. When I stopped writing logs, the RSS stopped growing. It seems this growing RSS is not directly related to a leak in the GC but to some caching and OOM-killer race.
