Hybrid RC and mark+sweep GC #118

nascheme · 2021-07-20T20:16:20Z

I'm hesitant to post this idea since it probably belongs in a "slower-cpython" rather than a "faster-cpython" project. However, something like it is a necessary pre-condition to removing the GIL and taking better advantage of multiple CPU cores. Making incref/decref thread-safe (e.g. with atomic instructions) seems like a non-starter because the overhead is too high.

The basic idea is as follows: rather than having refcnt > 0 be the only condition for an object being alive, make the condition that either refcnt > 0 or the object is reachable from a GC root. Make use of tp_traverse to implement the reachability check. Provide a way to explicitly define the GC roots. Once we have that, we can start removing incref/decref pairs wherever we can ensure that the object is "properly rooted". We could look for incref/decref hot spots and work on those first. E.g. inside ceval, if we know the object is rooted, we can avoid the incref/decref. dictobject would also be a hot spot.
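To make the liveness rule concrete, here is a minimal toy model in Python. It is only a sketch: `Obj`, `traverse`, `reachable_from` and `is_alive` are hypothetical names invented for illustration, and `traverse` merely stands in for what tp_traverse does in C.

```python
class Obj:
    """Toy heap object: an explicit refcount plus outgoing references."""

    def __init__(self, name):
        self.name = name
        self.refcnt = 0      # counts only references that were incref'ed
        self.children = []   # references held by this object (uncounted here)

    def traverse(self, visit):
        # Stand-in for tp_traverse: report every object this one refers to.
        for child in self.children:
            visit(child)


def reachable_from(roots):
    """Return the set of objects reachable from the given GC roots."""
    seen = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        obj.traverse(stack.append)
    return seen


def is_alive(obj, roots):
    # Hybrid condition: a counted reference OR reachability from a root.
    return obj.refcnt > 0 or obj in reachable_from(roots)


root = Obj("root")
rooted = Obj("rooted")      # stored in root without an incref
counted = Obj("counted")    # held only through an explicit refcount
garbage = Obj("garbage")    # neither counted nor rooted

root.children.append(rooted)
counted.refcnt += 1
```

With `[root]` as the root set, `rooted` and `counted` are both alive and `garbage` is not. Note that `is_alive` walks the whole reachable graph on every call, which is exactly the cost a real design would have to avoid.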

A simple implementation of this idea would be very slow. The problem is that when decref is called, we can no longer immediately free the object if refcnt == 0. The object might be reachable from a GC root. Since Python allocates objects at a fantastic rate, this would kill performance. What we need is a way to run the mark and sweep pass on only a subset of the heap. One way to do that is with a "write barrier". My idea for that is to basically copy the Caml Light GC design: http://pauillac.inria.fr/~doligez/caml-guts/Sestoft94.txt
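The deferred-free problem can be illustrated with a small Python sketch. Everything here is an assumption made for illustration (the `Runtime` class, the `pending` list, the `uncounted` attribute); it is not how CPython works today, and it ignores details such as releasing the references a freed object holds.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.refcnt = 1          # the creator holds one counted reference
        self.uncounted = []      # references stored without an incref
        self.freed = False


class Runtime:
    def __init__(self):
        self.roots = []          # explicitly declared GC roots
        self.pending = []        # objects whose refcnt reached zero

    def decref(self, obj):
        obj.refcnt -= 1
        if obj.refcnt == 0:
            # Cannot free yet: a root may still reach obj through an
            # uncounted reference, so the decision is left to sweep().
            self.pending.append(obj)

    def sweep(self):
        # Naive version: mark everything reachable from the roots.
        reachable, stack = set(), list(self.roots)
        while stack:
            obj = stack.pop()
            if obj not in reachable:
                reachable.add(obj)
                stack.extend(obj.uncounted)
        survivors = []
        for obj in self.pending:
            if obj in reachable:
                survivors.append(obj)   # still rooted; check again later
            else:
                obj.freed = True
        self.pending = survivors


rt = Runtime()
root = Obj("root")
rt.roots.append(root)
a = Obj("a")
b = Obj("b")
root.uncounted.append(a)   # a is rooted without a counted reference
rt.decref(a)
rt.decref(b)
rt.sweep()
```

After the sweep, `b` has been freed but `a` has not, because the root still reaches it. Every `sweep()` here walks everything reachable from the roots, which is why a full pass per batch of short-lived objects would be too slow.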

For the mark and sweep, there would be a major and a minor collection. A major collection would look at all objects. A minor collection would look only at "young" objects. Young objects would be defined as objects created since the last GC pass. This can be tracked with a bitmap in the memory manager arenas. In addition to traversing the young objects, we also have to traverse any references from the old generation to those young objects. Those references can be tracked by explicitly calling a function like Caml's Modify(). That function adds the object pointer to a table that gets treated as roots for the next minor collection.
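A rough Python sketch of a minor collection with a Modify()-style write barrier follows. The names (`Heap`, `modify`, `remembered`, `minor_collect`) are hypothetical, Python sets stand in for the arena bitmap, and recording the old holder rather than the exact slot is a simplifying assumption rather than the Caml Light design.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.refs = []


class Heap:
    def __init__(self):
        self.roots = []          # explicitly registered GC roots
        self.young = set()       # allocated since the last GC pass
        self.old = set()         # survived at least one GC pass
        self.remembered = set()  # old objects that gained a young reference

    def alloc(self, name):
        obj = Obj(name)
        self.young.add(obj)
        return obj

    def modify(self, holder, target):
        """Write barrier: store the reference, record old -> young stores."""
        holder.refs.append(target)
        if holder in self.old and target in self.young:
            self.remembered.add(holder)

    def minor_collect(self):
        # Start from young roots plus young objects referenced by
        # remembered old objects; never traverse into the old generation.
        stack = [obj for obj in self.roots if obj in self.young]
        for holder in self.remembered:
            stack.extend(ref for ref in holder.refs if ref in self.young)
        marked = set()
        while stack:
            obj = stack.pop()
            if obj in marked:
                continue
            marked.add(obj)
            stack.extend(ref for ref in obj.refs if ref in self.young)
        freed = self.young - marked
        self.old |= marked       # survivors are promoted to the old generation
        self.young = set()
        self.remembered.clear()
        return freed


heap = Heap()
cache = heap.alloc("cache")
heap.roots.append(cache)
heap.minor_collect()             # cache survives and becomes old

kept = heap.alloc("kept")
temp = heap.alloc("temp")
heap.modify(cache, kept)         # old -> young store goes through the barrier
freed = heap.minor_collect()
```

In the second collection only `kept` and `temp` are examined; `kept` survives because the barrier recorded `cache`, and `temp` is freed. If the store into `cache` bypassed `modify()`, `kept` would be freed incorrectly, which is why every such store has to go through the barrier. A major collection is not shown.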

A big advantage of this approach is that it could be implemented incrementally. In the first step, all objects would still have explicit reference counts. The only difference is that objects would no longer be deallocated as soon as refcnt == 0. In the following steps, we would carefully remove some incref/decref instructions. Where we do that, we have to be sure the object is reachable from GC roots and if the object or memory holding the reference is changed, we call the Modify() function.

gvanrossum (Maintainer) · 2021-07-20T20:25:37Z

> necessary pre-condition to removing the GIL

I thought the strategy for GIL removal was multiple interpreters?

nascheme (Author) · 2021-07-20T23:31:01Z

What I am talking about is multi-threading similar to what's done in modern Java. Having a thread-safe GC is a pre-condition for that, I think. Making Java-style threading work reliably would be a huge project since you would need to solve a bunch of memory model issues. Supporting the C API makes it extra difficult. The Linux kernel did something similar when it removed the "big kernel lock", so I think it could be done. Should we do it? I'm not sure because it is so much work. Also, performance might end up unacceptably slower than single-threaded CPython. Perhaps sub-interpreters can give us multi-core performance that's good enough.

I worry that the sub-interpreter approach will struggle to find a way to cheaply pass data between threads. If it's isolated threads with essentially IPC copying data between them, isn't that what the multiprocessing module does already? If we want cheap data passing, I think we will need some mechanism for thread-safe memory ownership (i.e. likely some kind of thread-safe GC). It would seem you still run into memory model questions as well. Maybe those are easier to solve at the interpreter-to-interpreter level, but I worry that the problem has just been moved to a different layer and not solved.

This hybrid GC idea is only worth pursuing if we decide we want Java-style threading. I understood that's what people meant when talking about removing the GIL.

gvanrossum (Maintainer) · 2021-07-20T23:38:12Z

Everything I know says that we should not do Java-style threading.

Yes, multiple interpreters require a way to pass data between them, but that can be a dedicated API that won't make regular objects slower.

And yes, that's what people are thinking of when they talk about removing the GIL -- but it's basically so fraught with issues that I don't want to touch it.

h-vetinari · 2021-07-21T06:54:39Z

Speaking of Java threads, Java itself is looking to improve/iterate on that with Project Loom. Here's an overview

gvanrossum (Maintainer) · 2021-07-21T18:22:36Z

Hm, that sounds more like a diatribe against async IO. :-)

markshannon (Collaborator) · 2021-07-27T10:03:40Z

@nascheme

Multiple interpreters will allow sharing of mutable arrays of primitive data (ints, floats, etc.), and of the backing data for immutable objects like strings, array.array, etc. There will be some copying, but some things become a lot more efficient than using multiprocessing. Communication can be entirely in user space, which will allow more third-party approaches.

Any data passed from one processor to another has to pass through L3 cache, or main memory, so the cost of copying is not as bad as it might first appear. If the copied data fits into L2 cache, then the extra cost of copying is relatively small.
Well-designed, cache-aware algorithms are more important. Which is why, IMO, we need to focus on offering fast and robust low-level features and let library authors build on that.

Regarding ownership, because there cannot be any cycles through shared memory, simple (atomic) refcounting works perfectly.
Each interpreter has its own handle to the shared memory, which is subject to normal (non-atomic) refcounting. When the handle is freed, then the atomic refcount is decremented.
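The two-level ownership scheme described above might look roughly like this in Python. This is only a sketch: `SharedBuffer` and `Handle` are hypothetical names, and a `threading.Lock` stands in for the atomic increment/decrement a C implementation would use.

```python
import threading


class SharedBuffer:
    """Shared, cycle-free block of memory with a thread-safe refcount."""

    def __init__(self, data):
        self.data = bytes(data)
        self._lock = threading.Lock()   # stand-in for atomic inc/dec
        self.refcnt = 0
        self.released = False

    def acquire(self):
        with self._lock:
            self.refcnt += 1

    def release(self):
        with self._lock:
            self.refcnt -= 1
            if self.refcnt == 0:
                self.released = True    # a real runtime would free memory here


class Handle:
    """Per-interpreter handle; its own count is plain (non-atomic)."""

    def __init__(self, shared):
        self.shared = shared
        self.local_refcnt = 1
        shared.acquire()                # one atomic increment per handle

    def incref(self):
        self.local_refcnt += 1          # cheap: only one interpreter touches it

    def decref(self):
        self.local_refcnt -= 1
        if self.local_refcnt == 0:
            self.shared.release()       # atomic decrement when the handle dies


buf = SharedBuffer(b"immutable payload")
h1 = Handle(buf)    # handle owned by interpreter 1
h2 = Handle(buf)    # handle owned by interpreter 2
h1.incref()         # local references do not touch the shared count
shared_count_while_both_alive = buf.refcnt
h1.decref()
h1.decref()
h2.decref()
```

The shared count only changes when a whole handle is created or destroyed, so the synchronized operations stay off the hot path of ordinary reference counting inside each interpreter.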

Multiple interpreters also have (some) resilience to hard crashes. Shared memory threads do not.

> I understood that's what people meant when talking about removing the GIL.

I suspect it is, but it doesn't mean that is what they really want 🙂

I suspect that "Remove the GIL" really means "I want to use all my cores without doing any extra work".
I think that switching to using multiple interpreters will be much less work than tracking down every last race, deadlock and livelock in code that was written to be protected by a GIL.

This is much like the refrain "Python should have a JIT", when all they mean is that "Python should be faster", which we are happy to oblige 😄
