
Physical layer interface

Jim Pivarski edited this page Nov 3, 2020 · 8 revisions

In this wiki, I'll document the interface that Uproot expects from Source classes, such as HTTPSource and XRootDSource. This document describes only reading because, as of November 2020, only reading has been implemented.

When a user opens a file with uproot4.open, the URL scheme determines which Source to use:

uproot4.open returns a ReadOnlyDirectory, which has a file property pointing to a ReadOnlyFile, which in turn has a source property pointing to the actual Source. The Source may be stateful, with open file handles and associated threads. When any object derived from uproot4.open exits a with statement (through its __exit__) or is explicitly closed, the __exit__ calls are propagated all the way down to the Source, so that it can close or shut down whatever it needs to.
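The propagation of __exit__ calls can be illustrated with a minimal sketch. The classes ToySource, ToyFile, and ToyDirectory are hypothetical stand-ins invented for this example; they are not Uproot's classes and only mimic the directory → file → source chain described above.

```python
class ToySource:
    """Hypothetical stand-in for a Source that owns some resource."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # A real Source would close file handles or stop threads here.
        self.closed = True


class ToyFile:
    """Hypothetical stand-in for a file object that holds a source."""

    def __init__(self, source):
        self.source = source

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Pass the exit call down to the source.
        self.source.__exit__(exc_type, exc_value, traceback)


class ToyDirectory:
    """Hypothetical stand-in for a directory object that holds a file."""

    def __init__(self, file):
        self.file = file

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Pass the exit call down to the file.
        self.file.__exit__(exc_type, exc_value, traceback)

    def close(self):
        # Explicit closing takes the same path as leaving a with block.
        self.__exit__(None, None, None)


source = ToySource()
with ToyDirectory(ToyFile(source)) as directory:
    open_inside_with = not directory.file.source.closed
closed_after_with = source.closed
```

In this sketch, leaving the with block on the outermost object is enough to release the resource held by the innermost one.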

The job of a Source is to deliver Chunk objects on demand. A Chunk represents physical bytes of a file, uninterpreted and, if it comes directly from a Source, possibly compressed. The data in a Chunk might not have been read yet, but they have been requested. A Chunk is defined by:

The rest of Uproot interfaces with Chunk objects through get and remainder to get the raw data from the file (through the future) as a numpy.ndarray of dtype numpy.uint8. The act of requesting data from a Chunk blocks until its future actually delivers.

The interpretation of those bytes is out of scope for the physical layer: the physical layer only needs to deliver bytes (in futures) on demand.
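To make the blocking behavior concrete, here is a minimal sketch of a chunk backed by a future. ToyChunk and slow_read are hypothetical names, the get and remainder signatures are simplified assumptions rather than Uproot's actual ones, and plain bytes stand in for the numpy.ndarray of dtype numpy.uint8 so that the example needs only the standard library.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ToyChunk:
    """Hypothetical chunk: a byte interval plus a future that delivers it."""

    def __init__(self, start, stop, future):
        self.start = start
        self.stop = stop
        self.future = future

    def get(self, start, stop):
        # Blocks here until the future has delivered the bytes.
        data = self.future.result()
        # Positions are absolute file offsets, so shift by the chunk's start.
        return data[start - self.start:stop - self.start]

    def remainder(self, start):
        # Everything from `start` to the end of this chunk.
        return self.get(start, self.stop)


FILE_BYTES = bytes(range(100))  # pretend file contents


def slow_read(start, stop):
    time.sleep(0.05)  # simulate I/O latency
    return FILE_BYTES[start:stop]


with ThreadPoolExecutor(max_workers=1) as executor:
    # The chunk exists immediately, although its data have not arrived yet.
    chunk = ToyChunk(10, 20, executor.submit(slow_read, 10, 20))
    first = chunk.get(12, 15)   # waits for the background read
    rest = chunk.remainder(15)  # already delivered, returns at once
```

The caller never touches the future directly; it only asks the chunk for bytes and waits if they have not arrived.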

The two chunk-delivering methods that a Source must implement are

  • chunk (singular), which takes a start:stop interval and returns one Chunk, and
  • chunks (plural), which takes a list of (int, int) pairs and a notifications queue.Queue.

The first method, chunk (singular), doesn't need much explanation. The Chunk it returns may be synchronous (its future is a trivial NoFuture, recently renamed TrivialFuture) or asynchronous, with some background thread delivering its value. This method is used to extract items from a ReadOnlyDirectory, so each __getitem__ is a single request-response cycle. (Perhaps values, items, itervalues, and iteritems should be updated to use chunks (plural), to efficiently read all histograms from a directory, for instance, but that would be a future improvement.)
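The synchronous case can be sketched as follows. ToyChunk, ToyTrivialFuture, and ToyInMemorySource are hypothetical names made up for this illustration, and the namedtuple is a deliberately stripped-down stand-in for a Chunk; the only point is that a future whose value is already known never blocks.

```python
from collections import namedtuple

# Hypothetical, simplified chunk: just an interval and a future.
ToyChunk = namedtuple("ToyChunk", ["start", "stop", "future"])


class ToyTrivialFuture:
    """Hypothetical future whose result is known at construction time."""

    def __init__(self, value):
        self._value = value

    def result(self, timeout=None):
        # Nothing to wait for: the value was read synchronously.
        return self._value


class ToyInMemorySource:
    """Hypothetical synchronous source that reads from a bytes object."""

    def __init__(self, data):
        self._data = data

    def chunk(self, start, stop):
        # Read right away, then wrap the result in a trivial future.
        return ToyChunk(start, stop, ToyTrivialFuture(self._data[start:stop]))


source = ToyInMemorySource(b"0123456789")
chunk = source.chunk(2, 6)
payload = chunk.future.result()
```

An asynchronous source would return the same kind of object, but with a future that some background thread fills in later.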

The second method, chunks (plural), takes a list of intervals to read and returns a list of Chunk objects, which, again, may be synchronous or asynchronous. Depending on implementation, this interface may get a large set of discontiguous byte intervals in one request (as HTTPSource and XRootDSource do), or spawn many concurrent requests (as MultithreadedHTTPSource and MultithreadedXRootDSource do), or something else. This method is called by array-fetching functions (TBranch.array, HasBranches.arrays, HasBranches.iterate, uproot4.iterate, uproot4.concatenate, and uproot4.lazy) to batch requests for every TBasket needed to generate arrays.

The possibly asynchronous results are provided in the return value, the list of Chunk objects, but remember that reading a value from a Chunk means waiting for the source to respond. In principle, that could happen in any order, so reading the Chunk objects from first to last could be the wrong order: if they're filled from last to first, the downstream code would end up waiting when it could be decompressing and interpreting arrays while the remote source streams them. For this reason, they are also returned by adding them to a supplied notifications queue.Queue. Both the return value and the notifications queue.Queue get the same Chunk output, but the notifications get them after they are ready to be read. Calling get on the notifications guarantees an optimal order. Uproot only uses the notifications output; the return value of this method is ignored.
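The difference between the returned list and the notifications can be shown with a small sketch. ToyChunk and ToyBackwardSource are hypothetical; this toy source deliberately fills its chunks from last to first on a single background thread, which is the worst case for a consumer that reads the returned list in order.

```python
import queue
import threading
from concurrent.futures import Future


class ToyChunk:
    """Hypothetical chunk: a byte interval plus a future that delivers it."""

    def __init__(self, start, stop, future):
        self.start = start
        self.stop = stop
        self.future = future


class ToyBackwardSource:
    """Hypothetical source that delivers requested ranges last to first."""

    def __init__(self, data):
        self._data = data

    def chunks(self, ranges, notifications):
        # Create every chunk up front, in request order, with empty futures.
        chunks = [ToyChunk(start, stop, Future()) for start, stop in ranges]

        def fill():
            for chunk in reversed(chunks):
                chunk.future.set_result(self._data[chunk.start:chunk.stop])
                # Notify only after the data are ready to be read.
                notifications.put(chunk)

        threading.Thread(target=fill, daemon=True).start()
        return chunks


notifications = queue.Queue()
source = ToyBackwardSource(bytes(range(100)))
ranges = [(0, 10), (40, 50), (90, 100)]
returned = source.chunks(ranges, notifications)

arrival_order = []
for _ in ranges:
    chunk = notifications.get()  # blocks until some chunk is ready
    chunk.future.result()        # already filled, so this does not wait
    arrival_order.append((chunk.start, chunk.stop))
```

Here the returned list is in request order, while arrival_order is the reverse; a consumer that pulls from the queue can start working on the last range while the first one is still pending.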

In principle, the interface could be simplified further (it used to be more complex, with more options). The chunks (plural) method could return nothing and only fill the notifications queue.Queue, and it could fill it with numpy.ndarray objects rather than Chunk objects, since anything that appears on the notifications is known to be ready for reading. I kept this structure to allow for flexibility in the future. To be future-proof, a Source must return Chunk objects in both the return value and the notifications.
