Releasing the GIL #128

Open
nalgeon opened this issue Jan 18, 2025 · 6 comments

nalgeon commented Jan 18, 2025

Hey Jason, thank you for this package, it's crazy fast!

Do I understand correctly that the library does not release the GIL when reading a PDF, so to parallelize it I need to use multiprocessing and not threading? Are there any reasons that prevent it from releasing the GIL?

Owner

jalan commented Jan 18, 2025

> Hey Jason, thank you for this package, it's crazy fast!

Thanks! That was my use case: dumping text out of PDFs as fast as possible.

> Do I understand correctly that the library does not release the GIL when reading a PDF, so to parallelize it I need to use multiprocessing and not threading? Are there any reasons that prevent it from releasing the GIL?

I made no special effort in this department, so you are correct. I will have to read up on how to manage the GIL in C extensions and see what improvements can be made.

Author

nalgeon commented Jan 18, 2025

I think releasing the GIL essentially means wrapping calls that don't touch Python objects in Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS, for example:

+    Py_BEGIN_ALLOW_THREADS
     page = self->doc->create_page(page_number);
+    Py_END_ALLOW_THREADS

I'm not sure where exactly to put them in this case, though :)
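To make the expected effect concrete, here is a minimal Python sketch of what callers would see once a C-level call sits between those macros. It is not the package's code: `fake_create_page` is a hypothetical stand-in, and `time.sleep` is used only because it is a call that releases the GIL while it waits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_create_page(page_number, seconds=0.1):
    # Hypothetical stand-in for a C-level call wrapped in
    # Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS.
    time.sleep(seconds)  # time.sleep releases the GIL while waiting
    return page_number


def timed(fn):
    # Run fn() and return (result, elapsed seconds).
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def run_sequential(pages):
    return [fake_create_page(p) for p in pages]


def run_threaded(pages):
    # One worker per page, so all calls can wait at the same time.
    with ThreadPoolExecutor(max_workers=len(pages)) as pool:
        return list(pool.map(fake_create_page, pages))


pages = [0, 1, 2, 3]
seq_result, seq_time = timed(lambda: run_sequential(pages))
thr_result, thr_time = timed(lambda: run_threaded(pages))
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

Because the GIL is released during the call, the threaded run overlaps the waits instead of queueing them. A call that held the GIL for its whole duration would show little or no gain from threads on a regular CPython build.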

Anyway, thanks again for the package! It's very refreshing to see a library that has a clear focus and does its job really well. I've tried other PDF libraries and they don't even stand a chance.

Author

nalgeon commented Jan 18, 2025

Oh, and one more thing. To safely release the GIL, the Poppler library itself should be thread-safe. If it's not, there's probably nothing to do here.

Author

nalgeon commented Jan 19, 2025

Come to think of it, you can still release the GIL even if Poppler is not thread-safe. In that case, the PDF object will not be thread-safe, but that's probably fine (if you mention it in the docs). I can't imagine anyone actually wanting to access the same PDF from multiple threads.
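A rough sketch of that usage pattern, assuming the document object must stay on the thread that created it. `FakePDF` is a hypothetical stand-in, not the package's class; its ownership check exists only to make misuse visible in the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakePDF:
    """Hypothetical stand-in for a document object that is not thread-safe."""

    def __init__(self, path):
        self.path = path
        # Remember which thread created this object.
        self._owner = threading.get_ident()

    def page_text(self, number):
        if threading.get_ident() != self._owner:
            raise RuntimeError("FakePDF used from a thread that did not create it")
        return f"{self.path}: page {number}"


def extract(path):
    # Each task opens its own document, so no object crosses a thread boundary.
    pdf = FakePDF(path)
    return [pdf.page_text(n) for n in range(2)]


paths = ["a.pdf", "b.pdf", "c.pdf"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(extract, paths))

print(results[0])
```

The point of the design is that threads share nothing: parallelism comes from processing different files at once, and the docs would only need to say "do not share a PDF object between threads".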

Owner

jalan commented Jan 25, 2025

> Oh, and one more thing. To safely release the GIL, the Poppler library itself should be thread-safe. If it's not, there's probably nothing to do here.

> I can't imagine anyone actually wanting to access the same PDF from multiple threads.

I believe Poppler supports accessing multiple pages of the same document concurrently from different threads, but I'll make sure. At least some PDF readers that use Poppler probably take advantage of this to pre-render multiple pages at a time.

I'll try to set up some tests to see what sort of boosts I can get by releasing the GIL in certain places.
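Until concurrent page access is confirmed, one conservative fallback is to serialize access to a shared document with a lock. The sketch below only illustrates that idea: `SharedDocument` and `_read_page` are hypothetical names, and the `max_active` counter is there purely to show that no two threads are ever inside the guarded call together.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedDocument:
    """Hypothetical wrapper around a document whose page access may not be thread-safe."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0      # threads currently inside _read_page
        self.max_active = 0   # highest concurrency ever observed

    def _read_page(self, number):
        # Stand-in for the real, possibly unsafe, page access.
        self._active += 1
        self.max_active = max(self.max_active, self._active)
        text = f"page {number}"
        self._active -= 1
        return text

    def page_text(self, number):
        # Only one thread at a time may touch the underlying document.
        with self._lock:
            return self._read_page(number)


doc = SharedDocument()
with ThreadPoolExecutor(max_workers=4) as pool:
    texts = list(pool.map(doc.page_text, range(8)))

print(texts[:2], doc.max_active)
```

A lock like this gives up parallelism within a single document, so it only makes sense if the tests show that shared access is unsafe. Threads working on separate documents would be unaffected.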

Owner

jalan commented Feb 3, 2025

This guide seems to have a lot of what I need: https://py-free-threading.github.io/
