
Thoughts on eventID and trackID

mjkramer edited this page Apr 7, 2023 · 2 revisions

Note: This does not necessarily describe what is currently implemented. This is just being kept around as a record of the reasoning that went into our event/track numbering. The current scheme is described on another page.

Problem: The EventId field of TG4Event is 32 bits. For 2.5E20 POT, we should be able to fit all of the events into 32 bits. But as we scale up to 1E22 POT, we will run out of bits.

1E22 POT ~ 2E8 spills; at, say, 100 events (mostly rock muons) per spill, that gives 2E10 events. 32 bits can fit only ~4E9. Even if 100 events/spill is excessive, there's also the inefficiency of bit packing. So in the end, we need more bits.
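The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the POT per spill (5E13) is an assumption inferred from "1E22 POT ~ 2E8 spills", and 100 events/spill is the rough guess from the text, not a measured rate.

```python
# Back-of-the-envelope check of the event counts quoted above.
POT_PER_SPILL = 5e13      # ASSUMPTION: implied by "1E22 POT ~ 2E8 spills"
EVENTS_PER_SPILL = 100    # the rough "say 100 events per spill" guess
U32_CAPACITY = 2**32      # number of distinct values in a 32-bit field


def n_events(pot_total):
    """Estimated number of events for a given total POT."""
    n_spills = pot_total / POT_PER_SPILL
    return n_spills * EVENTS_PER_SPILL


for pot in (2.5e20, 1e22):
    n = n_events(pot)
    verdict = "fits" if n < U32_CAPACITY else "does NOT fit"
    print(f"{pot:.1e} POT -> {n:.1e} events: {verdict} in 32 bits")
```

With these inputs, 2.5E20 POT stays below the 32-bit limit while 1E22 POT exceeds it by a factor of a few, even before accounting for any bit-packing inefficiency.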

In the edep-sim ROOT files, we could change the format of TG4Event so that the EventId is 64 bits. However, this would be an incompatible change to the file format, which everyone who wants to read the files would then have to accommodate. A better solution is to make use of the RunId field. We're already doing this with the singles files, where the RunId encodes the flux file ID and the random seed.

On the other hand, for the HDF5 files, it is much more convenient to have a globally unique event ID in the truth datasets. This allows arrays to be event-sliced with just one condition (on eventID) instead of two (on eventID and runID). Fortunately, it is easy for us and for e.g. mlreco to just change one entry in a dtype from 'u4' to 'u8'.
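To illustrate the difference in slicing, here is a toy example with NumPy structured arrays. The dtypes are hypothetical and heavily trimmed (the field `E` and all the ID values are made up for illustration); only the `'u4'` to `'u8'` change on `eventID` comes from the text above.

```python
import numpy as np

# HYPOTHETICAL trimmed-down truth dtypes; real datasets have many more fields.
two_key_dtype = np.dtype([("eventID", "u4"), ("runID", "u4"), ("E", "f4")])
one_key_dtype = np.dtype([("eventID", "u8"), ("E", "f4")])

# With 32-bit eventIDs, the same eventID can show up under different runIDs,
# so selecting one event needs two conditions.
two_key = np.array([(0, 7, 1.0), (1, 7, 2.0), (0, 8, 3.0)], dtype=two_key_dtype)
sel_two = two_key[(two_key["eventID"] == 0) & (two_key["runID"] == 8)]

# With a globally unique 64-bit eventID, one condition is enough.
# (The ID values here are arbitrary placeholders.)
one_key = np.array([(100, 1.0), (101, 2.0), (200, 3.0)], dtype=one_key_dtype)
sel_one = one_key[one_key["eventID"] == 200]

print(sel_two["E"], sel_one["E"])
```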

So the proposal is as follows. Note that "global" here means "across all files in the production":

  • As previously planned (and already implemented), each pre-hadd singles file will have monotonic zero-based eventIDs and a constant RunId calculated from the flux file and random seed. After hadding, the singles files will contain repeated eventIDs, but (globally*) unique tuples of (runID, eventID). *Note that each "globally unique" tuple will actually appear twice, once in the nu sample and once in the rock sample.
  • During spill building, the eventID will remain unchanged, while the RunId will be updated (by adding 1E9, if rock) to indicate whether the event came from a nu file or a rock file. Thus, within the spill files, the tuples of (runID, eventID) are globally unique. Each spill file will be assigned a global SpillFileID (zero-based monotonic), and global SpillIDs will be calculated as (1E3*SpillFileID + LocalSpillID).
    • This will work for 1E22 POT as long as each spill file contains < 1000 spills (< 5E16 POT). E.g., for 1E22 POT total and 1E16 POT per spill file, we have 1E6 spill files; then max(SpillID) ~ 1E9, which fits into an int.
  • During HDF5 conversion, the SpillID is copied over unchanged, while the RunID and (previous) EventID are combined into a 64-bit global EventID.
  • Also during HDF5 conversion, the trackID (which, in edep-sim format, counts from zero within each event) is rewritten to be unique within the file (and monotonic within each spill). The same update is made to all references to trackIDs, including the parent of each track and the contributors to each hit. A local_trackID field is added containing the original trackID, for ease of future backtracking.
    • Reason: Having a unique trackID (within each file) makes it easier to do cross-referencing. A globally unique trackID would also be possible, but would require some cleverness in order to select the right bits from the 64-bit EventID and (event-local) TrackID.
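The ID bookkeeping in the proposal above can be sketched in plain Python. This is not the actual conversion code. The RunId offset and the SpillID formula are taken from the bullets. Two things are assumptions made only for illustration: the bit layout of the 64-bit EventID (RunID in the high 32 bits, EventID in the low 32 bits), since the proposal only says the two are "combined", and the convention that a parent ID of -1 marks a primary track. All function names are hypothetical.

```python
ROCK_RUN_OFFSET = 1_000_000_000  # the 1E9 added to RunId for rock events
SPILLS_PER_FILE_MAX = 1_000      # the 1E3 multiplier in the SpillID formula


def spill_run_id(run_id, is_rock):
    """RunId as stored in the spill file: offset by 1E9 if from a rock file."""
    return run_id + ROCK_RUN_OFFSET if is_rock else run_id


def global_spill_id(spill_file_id, local_spill_id):
    """Global SpillID = 1E3 * SpillFileID + LocalSpillID."""
    if not 0 <= local_spill_id < SPILLS_PER_FILE_MAX:
        raise ValueError("each spill file must hold fewer than 1000 spills")
    return SPILLS_PER_FILE_MAX * spill_file_id + local_spill_id


def global_event_id(run_id, event_id):
    """ASSUMED layout: RunID in the high 32 bits, EventID in the low 32."""
    if not (0 <= run_id < 2**32 and 0 <= event_id < 2**32):
        raise ValueError("run_id and event_id must each fit in 32 bits")
    return (run_id << 32) | event_id


def renumber_tracks(events):
    """Rewrite event-local trackIDs so they are unique within one file.

    `events` is a list (in spill order) of events, each a list of
    (track_id, parent_id) pairs with track_id counting from zero.
    ASSUMPTION: parent_id == -1 marks a primary and is left untouched.
    """
    out, offset = [], 0
    for tracks in events:
        for track_id, parent_id in tracks:
            out.append({
                "trackID": track_id + offset,
                "parentID": parent_id + offset if parent_id >= 0 else parent_id,
                "local_trackID": track_id,  # original ID, kept for backtracking
            })
        if tracks:
            # Advance past the largest local ID used by this event.
            offset += max(tid for tid, _ in tracks) + 1
    return out


# Largest SpillID for 1E6 spill files of up to 1000 spills each.
print(global_spill_id(999_999, 999))
print(renumber_tracks([[(0, -1), (1, 0)], [(0, -1), (1, 0), (2, 1)]]))
```

Because the offset only ever grows as events are processed in order, the rewritten trackIDs come out monotonic within each spill, and the same offset is applied to parent references so cross-references stay consistent. References from hits to their contributing tracks would need the same per-event offset.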