How to handle data related by id #229

DraTeots · 2020-12-16T06:25:34Z

We have data structured like this:

hit_count            | uint64_t                        | AsDtype('>u8')
hit_id               | std::vector<uint64_t>    | AsJagged(AsDtype('>u8'), he...
hit_trk_id           | std::vector<uint64_t>    | AsJagged(AsDtype('>u8'), he...
hit_x                | std::vector<double>      | AsJagged(AsDtype('>f8'), he...
hit_y                | std::vector<double>      | AsJagged(AsDtype('>f8'), he...
hit_z                | std::vector<double>      | AsJagged(AsDtype('>f8'), he...
...

trk_count            | uint64_t                 | AsDtype('>u8')
trk_id               | std::vector<uint64_t>    | AsJagged(AsDtype('>u8'), he...
trk_pdg              | std::vector<int64_t>     | AsJagged(AsDtype('>i8'), he...
trk_mom              | std::vector<double>      | AsJagged(AsDtype('>f8'), he...
...

First, the data is stored in aligned arrays: within each event, all hit_* arrays have the same length, hit_count, and all trk_* arrays have the same length, trk_count.

Each hit also has a hit_trk_id, which corresponds to the trk_id of the track that made the hit.

Now for my task: I want to build a histogram of the momenta of tracks that have hits in a certain subdetector (a certain z range).

for batch in tree.iterate(['hit_z', 'hit_trk_id', 'trk_mom', 'trk_id'], step_size=5, entry_stop=10):
   # then I want all hits with certain z, so I do something like
   good_tracks_id = batch.hit_trk_id[batch.hit_z>100]
   # now, having good_tracks_id how do I select tracks momentum with ids in good_tracks_id?

tamasgal · 2020-12-16T08:19:42Z

You are almost there! I'd propose giving batch.hit_z > 100 a name, using that boolean mask to select the track ids, and then using those ids to pick the momentums. The mask has one entry per hit, so it can only index arrays with the same per-event lengths as hit_z; applying it to trk_mom directly generally fails when an event has different numbers of hits and tracks.
Since you are building a histogram, you are probably only interested in the bare momentum values, which you get by flattening the nested (per-event) array that the selection returns.

import awkward as ak

for batch in tree.iterate(['hit_z', 'hit_trk_id', 'trk_mom', 'trk_id'], step_size=5, entry_stop=10):
   # save your cut as a mask (one boolean per hit)
   mask = batch.hit_z > 100
   good_tracks_id = batch.hit_trk_id[mask]
   # assumes the ids are positions within each event's trk_* arrays
   momentums = ak.flatten(batch.trk_mom[good_tracks_id])  # ak.flatten() removes the per-event nesting
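
Because the loop yields one batch at a time, the histogram has to be accumulated across batches. Here is a minimal sketch of one way to do that with NumPy, assuming fixed bin edges chosen up front. The bin range and the `batches` list are made-up stand-ins for the flattened `momentums` array produced in each iteration, not values from the original data.

```python
import numpy as np

# Hypothetical bin edges; choose a range that suits your momenta.
edges = np.linspace(0.0, 10.0, 11)
counts = np.zeros(len(edges) - 1, dtype=np.int64)

# Stand-ins for the flattened, selected momenta of each batch
# (inside the real loop this would be the `momentums` array).
batches = [
    np.array([1.1, 1.1, 2.2]),
    np.array([]),
    np.array([4.4, 3.3, 5.5, 5.5, 8.8, 7.7]),
]

for momentums in batches:
    # Histogram this batch with the shared edges and add it to the running total.
    batch_counts, _ = np.histogram(momentums, bins=edges)
    counts += batch_counts

print(counts)
```

Keeping the edges fixed is what makes the per-batch counts addable; letting each batch pick its own binning would not give a consistent total.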

jpivarski · 2020-12-16T15:58:57Z

Maintainer

Adding to @tamasgal's answer, I'll give some explicit examples. Suppose your data looks like this (hit_id here plays the role of your hit_trk_id):

>>> trk_mom = ak.Array([[1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8]])
>>> hit_id = ak.Array([[0, 0, 0, 1, 1], [], [1, 1, 0], [0, 0, 0], [2, 2, 1]])
>>> hit_z = ak.Array([[50, 100, 100, 50, 100], [], [100, 50, 100], [100, 100, 50], [50, 100, 100]])

>>> # hit_id and hit_z are aligned
>>> ak.num(hit_id)
<Array [5, 0, 3, 3, 3] type='5 * int64'>
>>> ak.num(hit_z)
<Array [5, 0, 3, 3, 3] type='5 * int64'>

As you've seen, you can get a masking array of booleans from hit_z:

>>> print(hit_z >= 100)
[[False, True, True, False, True], [], ... [True, True, False], [False, True, True]]

and you can use it to select hit_ids:

>>> print(hit_id[hit_z >= 100])
[[0, 0, 1], [], [1, 0], [0, 0], [2, 1]]

If these ids are the positions of trk values within their arrays, that is, if a 0 in the above means that it's the first track in a track list, then you can just pass this array of integers into square brackets, like this:

>>> print(trk_mom[hit_id[hit_z >= 100]])
[[1.1, 1.1, 2.2], [], [4.4, 3.3], [5.5, 5.5], [8.8, 7.7]]

If the trk ids are not positions in the trk_* arrays, they need to be translated somehow to those positions before you can use it this way. (For more, see the full documentation.)
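
One way to picture that translation is a per-event lookup from track id to position. The following is a plain-Python sketch rather than a vectorized Awkward solution; `ids_to_positions` is a hypothetical helper and the ids are made up for illustration. It assumes every hit's track id actually appears in that event's track list.

```python
def ids_to_positions(hit_trk_ids, trk_ids):
    """Translate per-event track ids into positions within that event's track list."""
    positions = []
    for event_hit_ids, event_trk_ids in zip(hit_trk_ids, trk_ids):
        # Map each track id to its position within this event.
        lookup = {tid: pos for pos, tid in enumerate(event_trk_ids)}
        # Raises KeyError if a hit refers to a track id missing from the event.
        positions.append([lookup[tid] for tid in event_hit_ids])
    return positions


# Made-up data: track ids are arbitrary labels, not positions.
trk_id = [[10, 20], [], [7, 3]]
hit_trk_id = [[20, 10, 20], [], [3, 3, 7]]

positions = ids_to_positions(hit_trk_id, trk_id)
print(positions)  # [[1, 0, 1], [], [1, 1, 0]]
```

The resulting nested lists could then be wrapped in ak.Array(...) and used inside square brackets as in the trk_mom example above. A Python loop like this will be slow on large files, so treat it as an illustration of the idea rather than a recommended implementation.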

Another thing that could be useful to know is that there's a way to apply cuts that does not change the shape of the array. If you use "mask":

>>> print(hit_id.mask[hit_z >= 100])
[[None, 0, 0, None, 1], [], [1, None, 0], [0, 0, None], [None, 2, 1]]

the True values in hit_z >= 100 pass hit_ids through and the False values put in a placeholder "None". You can use this array of integers and Nones like an array of integers; the Nones pass through as None.

>>> print(trk_mom[hit_id.mask[hit_z >= 100]])
[[None, 1.1, 1.1, None, 2.2], [], ... None, 3.3], [5.5, 5.5, None], [None, 8.8, 7.7]]

As @tamasgal said, you can use ak.flatten to get rid of the nested list structure. Its default is axis=1, which flattens only the first level: a None that stands in for a whole list is dropped (it is treated like an empty list), but None values inside the lists survive, as the first output below shows. To flatten all levels at once and drop every None, use axis=None:

>>> print(ak.flatten(trk_mom[hit_id.mask[hit_z >= 100]], axis=1))
[None, 1.1, 1.1, None, 2.2, 4.4, None, 3.3, 5.5, 5.5, None, None, 8.8, 7.7]
>>> print(ak.flatten(trk_mom[hit_id.mask[hit_z >= 100]], axis=None))
[1.1, 1.1, 2.2, 4.4, 3.3, 5.5, 5.5, 8.8, 7.7]
