-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gc: improve mallocarrays locality #56801
Conversation
8afa0e6
to
725adba
Compare
This seems to fix #56759 for me. So we probably want to backport this |
IIRC we're still using this data structure in 1.10. If this is indeed a source of a memory leak, then we should consider backporting it to 1.10 as well. |
ed028da
to
b02629a
Compare
small_arraylist_t has much better memory locality and space utilization than a linked list with individually malloc'd elements. However, seemed that it needed alignment of small_arraylist_t to both hold all elements the user might create and avoid gcc bugs.
b02629a
to
6e2c0cb
Compare
Can confirm this fixes #56759. Nice work! |
Do we know why it fixes the memory leak? Were we simply not freeing the elements of the linked list? |
yes |
No idea |
Could this also be backported to 1.10? |
Thanks for this commit! I'v been banging my head for the last few days for an ever increasing MaxRSS in something that should be fixed. |
Ok I read through things a bit more carefully. My understanding is that the old code (and current 1.10 and 1.11) managed allocated blocks using this type I think the switch to use I think the actual underlying issue is here, in *pma = nxt;
int isaligned = (uintptr_t)ma->a & 1;
jl_gc_free_memory(a, isaligned);
ma->next = ptls2->gc_tls_common.heap.mafreelist;
ptls2->gc_tls_common.heap.mafreelist = ma; So, the Switching to I think an alternate fix would be to just do: *pma = nxt;
int isaligned = (uintptr_t)ma->a & 1;
jl_gc_free_memory(a, isaligned);
- ma->next = ptls2->gc_tls_common.heap.mafreelist;
- ptls2->gc_tls_common.heap.mafreelist = ma;
+ free(ma); without further changes or needing to replace this (Hoping someone corrects me if I am wrong) |
Cool. Just confirmed that just adding that single change to With @gbaraldi's example, on 1.11.2: julia> using StatsBase
julia> function Simulate()
Simulations=Int(1e7)
Size=1000
result = Array{Float64}(undef, Simulations, 1)
Threads.@threads for i = 1:Simulations
x = randn(Size)
s = sort(x)
result[i, 1] = s[1]
end
println(median(result))
end
Simulate (generic function with 1 method)
julia> for i in 1:1000
println(i)
Simulate()
GC.gc(true)
# Print live_bytes
println("live_bytes in MB: ", Base.gc_live_bytes() / 1024^2)
sleep(10) # sleep for 10 seconds
end
1
-3.1975019113594954
live_bytes in MB: 2182.0746631622314
2
-3.197639840642432
live_bytes in MB: 4273.654802322388
3
-3.1976844965361293
live_bytes in MB: 6365.400560379028
4
-3.1975014405024504
live_bytes in MB: 8457.508516311646
5
-3.1978872129155835
live_bytes in MB: 10549.042078018188 But if I just add that single 1
-3.1975903921733435
live_bytes in MB: 92.49524688720703
2
-3.1973693558307046
live_bytes in MB: 92.49922943115234
3
-3.1975461857988234
live_bytes in MB: 92.49922943115234
4
-3.1974292157057222
live_bytes in MB: 92.49922943115234
5
-3.197542860188766
live_bytes in MB: 92.49922943115234 |
Thanks for the explanation! |
On 1.10 I get:
so I think it is fine there? |
|
Weird, it worked for me, unless I ran the wrong binary. Maybe run it a bit longer to see if it saturates? I think I had set |
Very weird. When I compile that specific commit I also see the behavior indicating the leak wasn't fixed. Maybe I just ran the wrong binary or something equally dumb. I'm going to see if I just apply that patch to On 1.10 it doesn't look like this data structure was actually used @d-netto (?) so perhaps we can leave it off the backport list. |
Yep the patch doesn't fix v1.11.2 either 😵💫. Cosmic ray must have hit the PC. Sorry about that. Mystery is back on. Going to go grab some booze... |
|
Thanks. I guess it would be helpful to understand the root cause for fixing a potential leak on 1.10, unless it’s ok to apply a similar replacement patch to LTS as well? (I wasn’t sure if the structure change could introduce other issues we might not have thought about yet, which might not be a small enough for an LTS version?) My understanding is that the lines above are a recycling mechanism in the linked list of Maybe the first step is creating an equivalent to |
I'm on nightly and can see a fixed GC Live bytes but increasing RSS and MaxRSS untill OOM few hours later (app is allocating ~50 MB every few seconds inside a function) ... Running with size-hint of 600M for a container with mem limit of 2GB but that doesnt seem to help... |
Do you have an open-source reproducer? If so, could you please open an issue? |
unfortunatly not. But I'll try to come up with a minimal worklng example and post it here. BTW, reverting to 1.10.7 fixed the issue completly. |
Ok, the OOM in the container was due to a lazy linux kernel clearing physical memory from cached log files. When I stopped writing logs the RSS stopped growing. It seems this growing RSS is not directly related to a leak in GC but to some caching and OOM killer race |
small_arraylist_t has much better memory locality and space utilization than a linked list with individually malloc'd elements