Sync/Async API design feedback #4
What's the use of |
I just use |
Why impose this restriction? |
I guess providing a |
I think I understand -- I'm mentally weighing the pros and cons of the |
Exactly.
The problem with |
Given that you will need syntax that differs from NumPy for async anyway, it seems to me that from a user perspective there is little benefit in having a separate async Array class. Instead, I would say option 2 or 3 is better. That is what is done in tensorstore as well. However, if you allow both sync and async through the same class, then that kind of implies that sync and async methods might be called concurrently, and async methods might be called concurrently from multiple event loops. That would require that the class (especially any underlying key-value stores that it uses) not hold any references to any particular event loop, and be designed to support this type of concurrent use. That would probably make it challenging to re-use existing asyncio libraries for I/O. You could just ban use of the same Array from multiple event loops and ban mixing sync/async operations, but that may be undesirable. In general, because array processing often involves CPU-heavy operations that release the GIL, supporting multi-threaded access would be useful, though. tensorstore does not have this problem because it internally uses its own thread pools and merely interoperates with asyncio. It is convenient to support the normal NumPy syntax for reading and writing, because then in some cases it can just work with existing libraries that expect an Array-like object. However, you could also support that use case with an adapter object available through a property.
Style-wise, my personal favorites are 3 and 2. 3 might be confusing to those not used to async, but it is the cleanest API IMO, unifying both worlds nicely. Also, futures allow some degree of parallelism from non-async code in Python. Additionally, I find it nice that slices of an array can be represented, and actual get or set access is done explicitly. However, this would be a major difference to the current zarr-python implementation. If the barrier to entry should be as low as possible (for beginners or people coming from zarr-python), I'd tend to option 2a or 2b.
This might be circumvented by allowing the event loop for sync access to be set directly on the array. Then any async access should also happen within the same event loop (which can be asserted). This is also a useful feature on its own IMO, just to have more control over the event loop zarrita would be using internally with synchronous access.
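That event-loop pinning idea could look roughly like the sketch below. This is a minimal illustration, not zarrita's implementation: the `Array` class, its `get`/`get_async` methods, and the placeholder return value are all hypothetical.

```python
import asyncio
import threading

class Array:
    """Hypothetical sketch: an array pinned to a single event loop."""

    def __init__(self):
        # Dedicated loop on a daemon thread; all access goes through it.
        self._loop = asyncio.new_event_loop()
        threading.Thread(target=self._loop.run_forever, daemon=True).start()

    async def get_async(self, selection):
        # Assert that async access happens on the pinned loop, as suggested.
        assert asyncio.get_running_loop() is self._loop, \
            "async access must use the array's pinned event loop"
        return f"data for {selection!r}"  # placeholder for a real chunk read

    def get(self, selection):
        # Sync access blocks the caller while the pinned loop does the work.
        fut = asyncio.run_coroutine_threadsafe(self.get_async(selection), self._loop)
        return fut.result()

arr = Array()
print(arr.get((0, 1)))  # runs on the pinned loop, blocks this thread
```

Because the store only ever sees one loop, it can safely cache loop-bound resources (connections, sessions) without the multi-loop hazards described above.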
I did some playing with this and zarr-python in https://github.com/martindurant/async-zarr |
@martindurant async-zarr looks great! Have you also thought about how to design the "write" interface in async? |
To be sure: it was just an experimental POC, and a while ago now. I think "write" would look very much the same, though - of course assuming you already have the data in memory. I had no thoughts at all about streaming in either direction.
I am not convinced about the suitability of asyncio for zarr at all. I'm not a heavy asyncio user; I've made a number of attempts to incorporate it into my workflows, but if my understanding of it is correct (not a given), I have always run into issues when trying to reframe my problems in an asyncio context. Mainly the issues are:
The stdlib docs recommend running blocking IO-bound tasks (like file access) in a thread pool, and blocking CPU-bound tasks (like heavy maths) in a process pool, alongside the asyncio you're doing. But transferring data in and out of a separate interpreter for |
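The stdlib recommendation above is usually expressed with `loop.run_in_executor`. A minimal sketch (the temp file and the `math.factorial` stand-in for "heavy maths" are just placeholders):

```python
import asyncio
import math
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor

def blocking_read(path: str) -> bytes:
    # Stand-in for blocking file IO (e.g. reading a chunk from disk).
    with open(path, "rb") as f:
        return f.read()

async def main(path: str) -> int:
    loop = asyncio.get_running_loop()
    # IO-bound: passing None selects the default ThreadPoolExecutor.
    data = await loop.run_in_executor(None, blocking_read, path)
    # CPU-bound: a process pool sidesteps the GIL, but arguments and
    # results must be pickled across the interpreter boundary.
    with ProcessPoolExecutor() as pool:
        return await loop.run_in_executor(pool, math.factorial, len(data))

fd, path = tempfile.mkstemp()
os.write(fd, b"0123456789")
os.close(fd)
result = asyncio.run(main(path))
os.remove(path)
```

The pickling step on the process-pool path is exactly the data-transfer cost the comment above is worried about for large arrays.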
@clbarnes : I think there may be some misunderstanding here. zarr-python already allows for async operation on reading/writing batches of chunks via fsspec, using only blocking calls in the outward-facing API - there is no poisoning of the stack. All of it works exceedingly well with thread and process parallelism, particularly as implemented by dask. It took some trickery in fsspec to make this happen (or, look at the async implementations in rfsspec, which operate without asyncio in a rust event loop), and there is no asyncio at all for local file access. However, zarr is fundamentally an IO library more than an encoding/compute library. It is reasonable for people to want to call it from within an async context and to want the various metadata operations on remote files to operate async (since they are extremely latency-bottlenecked). Even a compute library like dask might reasonably want to be able to say "wait on these IO operations and in the meantime I can use this thread to do other real work". My POC shows that it is relatively simple to provide the normal blocking API as usual in zarr-python and also provide an alternative async class hierarchy for those that want it. This is (part of) one of the options being discussed in this issue.
Agreed, although you've got to do something with the data you're IO-ing, and zarr is designed to IO a lot of data, which does imply a lot of "somethings" to do!
Yes, rust does sidestep some of the issues with python's asyncio because you can freely hand data between processes, spin them up and down without needing to fork an interpreter, and so on.
I want to mention that there is |
Well, also remember that holding the GIL means other (thread) parallelism, for instance with dask, can't happen, but if you release it then it can. Do we have any estimate of how much zarr interaction happens via dask?? |
I've been working to make `zarrita` work with async stores and to use concurrency throughout the library. However, I think that many users will still want to use synchronous methods. So, the implementation uses only async methods internally and provides sync methods by running the async methods on an event loop in a separate thread and blocking the main thread (inspired by fsspec).

I am looking for feedback on how to design the API to accommodate both sync and async functions. Here are the main options that came to my mind:
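The sync-over-async pattern described above (one event loop on a background thread, with sync methods blocking on it, as in fsspec) can be sketched like this; `get_chunk_async` and the chunk-key format are placeholders, not zarrita's actual API:

```python
import asyncio
import threading

# One background event loop shared by all synchronous wrappers,
# running on its own daemon thread (the fsspec-inspired pattern).
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()

def sync(coro):
    """Run a coroutine on the background loop, blocking the caller."""
    return asyncio.run_coroutine_threadsafe(coro, _loop).result()

async def get_chunk_async(key: str) -> bytes:
    # Placeholder for a real async store read.
    await asyncio.sleep(0)
    return f"bytes for {key}".encode()

def get_chunk(key: str) -> bytes:
    # The sync method is a thin blocking shim over the async one.
    return sync(get_chunk_async(key))

print(get_chunk("c/0/0"))
```

With this in place, the remaining question is purely one of API surface, which is what the options below explore.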
1. Separate classes

The class methods create either sync or async variants of the `Array` class. Users need to decide upfront whether to use async or sync methods.

2. Separate methods and properties

Both sync and async methods are available through the same class. There are still separate `create` and `create_async` class methods, because the creation of an array is async under the hood (i.e. writing metadata to storage).

2a. Property-based async

This is a sync-first API, with the async methods available through the `async_` property.

2b. Async methods

Similar to 2a, but with `_async`-suffixed async methods. This feels unpleasant, because the slice syntax `[:, :]` cannot be used.

3. Async-first API

Implemented through `future` objects. Inspired by tensorstore.
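As a rough illustration of the future-based, async-first style (option 3), indexing could return a lazy view whose `read()` yields a future usable from both sync and async code. Everything here is hypothetical and only loosely modeled on tensorstore; a real implementation would schedule chunk fetches instead of a thread-pool lambda:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor()

class ArrayView:
    """Hypothetical lazy view over a selection; no IO until read()."""

    def __init__(self, selection):
        self._selection = selection

    def read(self):
        # A real store would schedule chunk fetches here; this just
        # computes a placeholder result on a thread pool.
        return _pool.submit(lambda: f"data for {self._selection!r}")

class Array:
    def __getitem__(self, selection):
        return ArrayView(selection)

arr = Array()

# Sync usage: block on the future.
sync_result = arr[0, 1].read().result()

# Async usage: wrap the concurrent future so it can be awaited.
async def main():
    return await asyncio.wrap_future(arr[0, 1].read())

async_result = asyncio.run(main())
```

The appeal of this shape is that one API serves both worlds: sync callers use `.result()`, async callers `await` the same future.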