Missing datasets in latest inventory - in release 3 #15
@agstephens The inventory from 11/03/2021 is from the 2nd dataset release - https://github.com/cp4cds/c3s_34g_qc_results/blob/release2/QC_Results/QC_passed_dataset_ids_latest.txt The dataset is in the 2nd release but not in the 3rd. When Ruth passed us the list she said it contained all the data from the previous release, so I'm not sure why it isn't there. |
@agstephens Searching for I will run through and compare the 2 releases to see what the differences are. |
There are 481 datasets in the 2nd release that aren't in the latest 3rd release |
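A comparison like this can be sketched as a set difference. This is a minimal sketch, assuming each release provides a plain-text list with one dataset ID per line (as QC_passed_dataset_ids_latest.txt appears to). The function names are illustrative and are not part of the project's tooling.

```python
def parse_dataset_ids(lines):
    """Return the set of non-empty, whitespace-stripped dataset IDs.

    `lines` is any iterable of strings, e.g. an open text file.
    Assumes one dataset ID per line (an assumption about the list format).
    """
    return {line.strip() for line in lines if line.strip()}


def missing_from_new(old_ids, new_ids):
    """Return the sorted dataset IDs present in old_ids but absent from new_ids."""
    return sorted(set(old_ids) - set(new_ids))
```

To use it, open the release2 and release3 ID lists, pass each file object to `parse_dataset_ids`, and call `missing_from_new(release2_ids, release3_ids)`. The length of the result is the number of datasets dropped between releases.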
I have picked 3 datasets at random:
These are all included in the release, dated 2021-03-09: However, they are all missing, in a different branch ( |
64 of these appear to have been failed by the range check.
If we could clean up the template we might be able to resolve this (posted in parallel with above). |
On the
|
@martinjuckes can you direct me to the file that you are reading. I think I'm looking at different versions/branches, because I'm seeing other content. |
@agstephens : I hope this is the right place: https://github.com/cp4cds/c3s_34g_qc_results/tree/release3 . And you have to be careful about which files you pick up there. The latest results might be in zip files rather than being present as json. Some cleaning up would be good .... |
Thanks @martinjuckes: I also see there is this repo: https://github.com/cp4cds/cmip6_qc Which has some of the same files. We do need to clean up so that we have a provenance trail we can keep track of. Regarding the release above: I get the right result if I use the
@martinjuckes: Do you know how I can work out which are the right files to trust regarding the most recent QC? |
I think https://github.com/cp4cds/cmip6_qc only covers the CF checks run by Ruth. https://github.com/cp4cds/c3s_34g_qc_results/tree/release3 combines results from Ruth's tests with results from myself, Guillaume and Fabi. |
Thanks @martinjuckes |
@agstephens : on your last question: I hope Ruth explained to Fran the process of constructing the manifest from the reports in https://github.com/cp4cds/c3s_34g_qc_results/tree/release3 . |
Thanks @martinjuckes: I am digging into it and reading up so that I feel equipped to pick this up with Fran. |
@martinjuckes: based on the 6 checks you have listed, I am parsing what I believe are the latest results from the JSON files (or
So that explains 131 of the 481 datasets that were in release2 but are not in release3. Where should I look next? |
Good question. Have you been able to locate the software that Ruth used to generate her manifest? Another possibility is that there has been confusion about the interpretation of |
@martinjuckes: I'll take a look at that with Fran tomorrow. |
Actual number of datasets in our latest inventory/catalog is: 9953
|
My count for the expected number of datasets is:
There are 9,953 in the intake catalog. There were originally 99 that we couldn't scan, which is 9,949. But then, I think there were the 4 netCDF read errors that went away magically (due to Quobyte). Is that correct? If so, at least our numbers are matching - and there are 9,953 known/published datasets. Yes, that's good, 95 are listed in your errors files. I can add them to my investigations. |
It looks like the datasets that could not be scanned were not in the missing-in-release3 list:
|
@martinjuckes: I spoke to @feggleton yesterday but we are still trying to understand which part of our process has rejected the 350 datasets. Here is some more analysis of data in this category:
The majority of models in this category are: |
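A per-model breakdown of this kind can be sketched by splitting each dataset ID on dots and counting one field. This assumes the IDs follow the dotted form seen in this thread (for example c3s-cmip6.ScenarioMIP.CCCma.CanESM5.ssp585.r1i1p1f1.Amon.ts.gn.v20190429), with the model name as the fourth component. Both `MODEL_FIELD` and `count_by_model` are hypothetical names.

```python
from collections import Counter

# Assumption: the model name is the 4th dot-separated component of the ID,
# i.e. project.activity.institute.model.experiment...
MODEL_FIELD = 3


def count_by_model(dataset_ids):
    """Count dataset IDs per model; IDs with too few components are skipped."""
    counts = Counter()
    for ds_id in dataset_ids:
        parts = ds_id.strip().split(".")
        if len(parts) > MODEL_FIELD:
            counts[parts[MODEL_FIELD]] += 1
    return counts
```

Calling `count_by_model(missing_ids).most_common(10)` would then list the models that contribute most to the missing set.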
Digging deeper: There is a problem in the |
@feggleton Please can you take a look at the QC cfchecker code and let me know if it is doing any checking of the number of expected files and marking datasets as failed if they are wrong? Thanks |
@agstephens That's strange. I've been looking through all the different scripts but can't find anything that checks the number of files. It does check whether the files are in the archive. Here is the location of the cf checker module: The workflow runs the following scripts, so I have checked them all as far as I can (and understand): https://github.com/cp4cds/cmip6_qc/blob/master/src/cfchecker_run_all.py
In cfchecker_run_unit.py, under the run_unit function, there is some code to check files, but it's commented out and I'm not sure it would affect this. It looks like it was just there for testing.
In complete_json_release_template.py there is a section that determines the overall dataset QC status, if that's helpful to see:
Determine the overall dataset qc status based on aggregate of all file records
There is also code that sets the status to fail if the dataframe is empty (which, I assume, means there is no result):
Fill in the individual file records
Finally, there is some code to concatenate multiple errors:
Where a file has multiple errors returned these are concatenated here
These are the only places I can find where the fail status is set by a condition that isn't the cf checker. Would it help to check the logs I have for an example file? |
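The aggregation described above could look roughly like the following. This is only a sketch of one plausible reading and is not the code in complete_json_release_template.py: the "pass"/"fail" strings, the all-files-must-pass rule, and the function name are all assumptions.

```python
def dataset_qc_status(file_statuses):
    """Aggregate per-file QC statuses into one dataset-level status.

    Assumptions (not confirmed by the real code):
      * each file record carries the string "pass" or "fail";
      * a dataset passes only if every file passes;
      * a dataset with no file records at all is treated as failed.
    """
    if not file_statuses:
        return "fail"
    if all(status == "pass" for status in file_statuses):
        return "pass"
    return "fail"
```

Under this reading, a dataset with no file results would be marked as failed even though no individual CF check failed. That makes the empty-result branch a candidate for failures that do not come from the CF checker itself.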
Thanks @feggleton, I'll keep looking... |
In the previous inventory, this dataset exists:
c3s-cmip6.ScenarioMIP.CCCma.CanESM5.ssp585.r1i1p1f1.Amon.ts.gn.v20190429
See: https://raw.githubusercontent.com/cp4cds/c3s_34g_manifests/master/inventories/c3s-cmip6/c3s-cmip6_v20210311.yml
It is not in the latest version in intake:
@ellesmith88 please can you check whether this is an error we have introduced or something we do know about. Thanks