Hi experts,

Let's say I am trying to read multiple trees from the same set of files, call it files, and that everything fits into memory. The documentation suggests that uproot.concatenate is the way to go for reading sets of files. I can think of several ways to do this:
1. Simply make two file lists, files_tree1 and files_tree2, where files_treeX = [f+":treeX" for f in files], and pass each list to uproot.concatenate in a separate call. I think this is inefficient, since the same files will be opened twice. Maybe caching would help, but it feels like it can be done better.
2. Naively make a combined file list, files_all = files_tree1 + files_tree2, and a combined branch list, all_branches = tree1_branches + tree2_branches, and pass these to one uproot call, hoping uproot magic will know what to do. This surprisingly (at least to me) did not crash. Uproot just produced an awkward array that is a union of the two awkward arrays, tree1_branches (N entries) and tree2_branches (M entries), with ak.type(data) giving:
N+M * union[{"tree1_branch1": var * float32, "tree1_branch2": var * float32}, {"tree2_branch1": int32}]
so data[0] gives the first entry of tree1 and data[N+M-1] gives the last entry of tree2. I guess this could be expected behaviour given the global_index that uproot.concatenate() seems to keep track of (or maybe I'm completely off)? Anyway, I think we still open each file twice, which is non-ideal.
3. Loop manually over the files, call uproot.open on each of them, and then access the keys from the object we get back. This way each file is opened once.
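To make the first and third options concrete, here is a minimal sketch. The helper names tree_paths and read_trees_once are made up for illustration and are not part of uproot; the sketch assumes that every file contains every requested tree and that all of the data fits in memory, as stated above.

```python
from collections import defaultdict


def tree_paths(files, tree_name):
    """Build "file.root:tree" strings, as used in the first option."""
    return [f"{f}:{tree_name}" for f in files]


def read_trees_once(files, branches_by_tree):
    """Third option: open each file once and read every requested tree.

    branches_by_tree maps a tree name to the list of branch names to read.
    Returns a dict with one concatenated awkward array per tree.
    """
    # Imported here so that tree_paths stays usable without these packages.
    import awkward as ak
    import uproot

    chunks = defaultdict(list)
    for path in files:
        # A single open per file; both trees are read from the same handle.
        with uproot.open(path) as f:
            for tree_name, branches in branches_by_tree.items():
                chunks[tree_name].append(
                    f[tree_name].arrays(branches, library="ak")
                )
    # One array per tree, so no union type is produced.
    return {name: ak.concatenate(parts) for name, parts in chunks.items()}
```

With these helpers, the first option would be uproot.concatenate(tree_paths(files, "tree1"), expressions=tree1_branches) followed by the same call for tree2, while the third would be read_trees_once(files, {"tree1": tree1_branches, "tree2": tree2_branches}). Whether the loop is actually faster than two concatenate calls is exactly what I am unsure about.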
My questions are:
1. What does uproot.concatenate do in the background that makes it more performant (if that's even true) than calling uproot.open inside a loop over files? From a quick skim of the source code, concatenate seems to loop over the files one by one, opening each as a ReadOnlyFile and then grabbing the data, but I am probably missing something subtle in the steps.
2. What do you recommend as best practice for reading multiple trees from many files?