Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset merge issue #487

Merged
merged 5 commits into from
Nov 6, 2023
Merged

Dataset merge issue #487

merged 5 commits into from
Nov 6, 2023

Conversation

juanangp
Copy link
Member

@juanangp juanangp commented Oct 5, 2023

juanangp Ok: 65 Powered by Pull Request Badge

Implementation of merging of metadata while importing more than one dataset, in addition any file inside the list which is not a valid dataSet will be skipped.

Address issue #486

@juanangp juanangp requested review from jgalan and lobis October 5, 2023 17:40
@juanangp juanangp requested a review from a team October 6, 2023 07:13

if (REST_StringHelper::StringToTimeStamp(fFilterStartTime) >
REST_StringHelper::StringToTimeStamp(dS.GetFilterStartTime()))
fFilterStartTime = dS.GetFilterStartTime();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit tricky because a generated dataset 1 with a given startTime filter, and dataset 2 with filter 2, then to be rigorous I need to know the startTime filter of both.

So that I know that the dataset 1 was produced with a given filter, and the other was produced with a different time filter. Idem for other filters.

Perhaps a new class TRestDataSetMerged could help. We keep the filters only for TRestDataSet while in TRestDataSetMerged we define as metadata members the dataset filenames used for the merge?

And perhaps introduce a hash-id to identify/guarantee the TRestDataSet used to create the merged dataset?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that previous post was me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just removed the timeStamp filters, I think that to make a new class TRestDataSetMerged is an overkill.

However, I don't know what is the best strategy here because apart of the filters there is plenty of metadata info which is not saved when we merge several dataSets. Note that before just only the metadata info from the first file was saved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, for me a dataset does not need to contain all the metadata information that is available to the run files. The dataset must guarantee that we can identify the run files used to generate the dataset. I guess, having the selected files vector is already a good starting point. Then we need to know if any filters have been applied for event selection. But the final user working with the dataset needs to know also other conditions applied to generate the dataset, e.g. the detector pressure, the data taking days, the voltage on the mesh, etc, etc. Then, if those filters were used to generate the datasets, it is relevant for the end-user.

The metadata information transferred to the dataset in the form of relevant quantities is to provide the dataset with physical meaning, and allow to have all the analysis required information into one single entity.

Indeed, if we added new columns to the dataset, we need to add also that metadata information, because if I generate a new column using <addColumn then I need to now the formula used to calculate that column. Not sure if we are doing this right now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see now that the columns added should be in the form fColumnNameExpressions. Just last time I couldn't find them in my dataset.

Copy link
Member

@jgalan jgalan Oct 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that before just only the metadata info from the first file was saved.

In my opinion this should never happen, as I already stated many times I am not agree on merging runs. Merging datasets is different issue, but runs should be considered as read-only (or end-product of event data processing) entities, that can be used later on to generate a data compilation or dataset.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that before just only the metadata info from the first file was saved.

In my opinion this should never happen, as I already stated many times I am not agree on merging runs. Merging datasets is different issue, but runs should be considered as read-only (or end-product of event data processing) entities, that can be used later on to generate a data compilation or dataset.

I am talking about datasets, not TRestRuns the issue is that if the metadata info of the dataset is not properly handled the results are wrong when I compute e.g. the rate. In this case the duration of the dataset has to be added, but other metadata info can be also wrong or misleading. I don't know what is the proper way to handle metadata info of the dataset while merging, but this issue was present before this PR.

@juanangp juanangp requested a review from lobis November 6, 2023 16:07
@juanangp juanangp merged commit 8dc9f22 into master Nov 6, 2023
@juanangp juanangp deleted the dSMergeIssue branch November 6, 2023 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants