-
Notifications
You must be signed in to change notification settings - Fork 76
Physical layer interface
In this wiki, I'll document the interface that Uproot expects from Source
classes, such as HTTPSource and XRootDSource. This document only describes reading because only reading has been implemented (Nov 2020).
When a user opens a file with uproot4.open, the URL scheme determines which Source to use:
- non-URL or
"file://"
defaults to MemmapSource, but MultithreadedFileSource is also a good option, -
"http://"
and"https://"
default to HTTPSource, -
"root://"
defaults to XRootDSource, - a non-string, non-Path object with
read
andseek
methods is handled by ObjectSource.
The uproot4.open returns a ReadOnlyDirectory, which has a file property pointing to a ReadOnlyFile, which has a source property pointing to the actual Source. The Source may be stateful, with open file handles and associated threads. When any object derived from uproot4.open exits a with
statement (through its __exit__
) or is explicitly closed, __exit__
calls are propagated all the way down to the Source, so that it can close or shutdown whatever it needs to.
The job of a Source is to deliver Chunk objects on demand. A Chunk represents physical bytes of a file, uninterpreted and (if directly from a Source), possibly compressed. The data in a Chunk might not have been read yet, but they have been requested. A Chunk is defined by:
- a pointer back to the Source,
-
start and stop, the
inclusive:exclusive
byte positions in the file, - a future, which adheres to a subset of the Python 3 Future protocol. (It only has to have a result method.)
The rest of Uproot interfaces with Chunk objects through get and remainder to get the raw data from the file (through the future) as a numpy.ndarray
of dtype numpy.uint8
. The act of requesting data from a Chunk blocks until its future actually delivers.
The interpretation of those bytes is out of scope for the physical layer: the physical layer only needs to deliver bytes (in futures) on demand.
The two chunk-delivering methods that a Source must implement are
-
chunk (singular), which takes a
start:stop
interval and returns one Chunk, and -
chunks (plural), which takes a list of
(int, int)
pairs and anotifications
queue.Queue.
The first method, chunk (singular), doesn't need much explanation. The Chunk it returns may be synchronous (its future is a trivial NoFuture, recently renamed as TrivialFuture
) or asynchronous, with some background thread delivering its value. This method is used to extract items from a ReadOnlyDirectory, so each __getitem__
is a single request-response cycle. (Perhaps values, items, itervalues, iteritems should be updated to use chunks (plural), to efficiently read all histograms from a directory, for instance, but that would be a future improvement.)
The second method, chunks (plural), takes a list of intervals to read and returns a list of Chunk objects, which, again, may be synchronous or asynchronous. Depending on implementation, this interface may get a large set of discontiguous byte intervals in one request (as HTTPSource and XRootDSource do), or spawn many concurrent requests (as MultithreadedHTTPSource and MultithreadedXRootDSource do), or something else. This method is called by array-fetching functions (TBranch.array, HasBranches.arrays, HasBranches.iterate, uproot4.iterate, uproot4.concatenate, and uproot4.lazy) to batch requests for every TBasket
needed to generate arrays.
The possibly asynchronous results are provided in the return value, the list of Chunk objects, but remember that reading a value from a Chunk means waiting for the source to respond. In principle, that could happen in any order, so reading the Chunk objects from first to last could be the wrong order: if they're filled from last to first, the downstream code would end up waiting when it could be decompressing and interpreting arrays while the remote source streams them. For this reason, they are also returned by adding them to a supplied notifications
queue.Queue. Both the return value and the notifications
queue.Queue get the same Chunk output, but the notifications
get them after they are ready to be read. Calling get on the notifications
guarantees an optimal order. Uproot only uses the notifications
output; the return value of this method is ignored.
In principle, the interface can be simplified. ("Further simplified," since it used to be more complex, with more options.) In principle, the chunks (plural) method could return nothing and only fill the notifications
queue.Queue, and it could fill it with numpy.ndarray
objects, rather than Chunk objects, since they're known to be ready for reading if they appear on the notifications
. The reason I didn't remove this structure is to allow for flexibility in the future. To be future-proof, a Source must return Chunk objects in both the return value and the notifications
.