Missing datasets in latest inventory - in release 3 #15

Open
agstephens opened this issue Jun 21, 2021 · 24 comments

@agstephens
Contributor

In the previous inventory, this dataset exists:

c3s-cmip6.ScenarioMIP.CCCma.CanESM5.ssp585.r1i1p1f1.Amon.ts.gn.v20190429

See: https://raw.githubusercontent.com/cp4cds/c3s_34g_manifests/master/inventories/c3s-cmip6/c3s-cmip6_v20210311.yml

It is not in the latest version of the intake catalogue:

$ wget -O c3s-cmip6_v20210611.csv.gz  "https://github.com/cp4cds/c3s_34g_manifests/blob/master/intake/catalogs/c3s-cmip6/c3s-cmip6_v20210611.csv.gz?raw=true"

$ python
>>> import pandas as pd
>>> df = pd.read_csv("c3s-cmip6_v20210611.csv.gz")
>>> df[df["ds_id"] == "c3s-cmip6.ScenarioMIP.CCCma.CanESM5.ssp585.r1i1p1f1.Amon.ts.gn.v20190429"]
Empty DataFrame
Columns: [ds_id, path, size, mip_era, activity_id, institution_id, source_id, experiment_id, member_id, table_id, variable_id, grid_label, version, start_time, end_time, bbox, level]
Index: []

@ellesmith88 please can you check whether this is an error we have introduced or something we already know about? Thanks

@ellesmith88
Collaborator

@agstephens The inventory from 11/03/2021 is from the 2nd dataset release - https://github.com/cp4cds/c3s_34g_qc_results/blob/release2/QC_Results/QC_passed_dataset_ids_latest.txt
The most recent intake catalogue (11/06/2021) is from the 3rd dataset release - https://github.com/cp4cds/c3s_34g_qc_results/blob/release3/QC_Results/QC_passed_dataset_ids_latest.txt

The dataset is in the 2nd release but not in the 3rd. When Ruth passed us the list she said it contained all the data from the previous release, so I'm not sure why it isn't there.

@ellesmith88
Collaborator

@agstephens Searching for ScenarioMIP.CCCma.CanESM5.ssp585 in the 2nd release shows there are 30 datasets, but in the 3rd release only 1, which itself doesn't exist in the 2nd release.

I will run through and compare the 2 releases to see what the differences are.

@ellesmith88
Collaborator

ellesmith88 commented Jun 22, 2021

There are 481 datasets in the 2nd release that aren't in the latest (3rd) release:
present_in_2.txt
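
A comparison like this can be sketched as a set difference between the two lists of passed dataset IDs. This is only an illustration, not the script that produced present_in_2.txt; the function names and the toy IDs are made up for the example.

```python
def parse_ids(lines):
    """Return the set of non-empty dataset IDs, one per input line."""
    return {line.strip() for line in lines if line.strip()}


def only_in_first(first_ids, second_ids):
    """Return, sorted, the IDs that are in first_ids but not in second_ids."""
    return sorted(set(first_ids) - set(second_ids))


# Toy stand-ins for the release 2 and release 3 ID lists (hypothetical IDs).
release2 = parse_ids(["dataset.a.v1\n", "dataset.b.v1\n", "\n"])
release3 = parse_ids(["dataset.b.v1\n", "dataset.c.v1\n"])

missing = only_in_first(release2, release3)
print(f"{len(missing)} dataset(s) in release 2 but not in release 3: {missing}")
```

With the real files, `parse_ids` could be fed the lines of each branch's QC_passed_dataset_ids_latest.txt.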

@agstephens
Contributor Author

I have picked 3 datasets at random:

CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429
CMIP6.CMIP.BCC.BCC-CSM2-MR.historical.r1i1p1f1.Amon.pr.gn.v20181126
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp585.r1i1p1f1.Ofx.deptho.gn.v20190429

These are all included in the release, dated 2021-03-09:

https://github.com/cp4cds/c3s_34g_qc_results/blob/master/QC_Results/QC_passed_dataset_ids_2021-03-09.txt

However, they are all missing from the list on a different branch (release3), dated 2021-04-29:

https://github.com/cp4cds/c3s_34g_qc_results/blob/release3/QC_Results/QC_passed_dataset_ids_2021-04-29.txt

@martinjuckes

64 of these appear to have been failed by the range check.
I checked two of those, and both failed because Ruth's listing of datasets includes files which are not present at CEDA ... and which look like they should not be part of the dataset.
e.g.

       "hdl:21.14100/70e3f7fe-a0c9-3fbf-be31-da68fe7a1f36": {
            "dset_error_message": "na 1 file(s) missing",
            "dset_error_severity": "major",
            "dset_id": "CMIP6.ScenarioMIP.MRI.MRI-ESM2-0.ssp126.r1i1p1f1.Amon.rlut.gn.v20191108",
            "dset_qc_status": "fail",
            "files": {
                "hdl:21.14100/5bc824f0-7a3f-4b5a-b38a-b207f2385477": {
                    "file_error_message": "na",
                    "file_error_severity": "na",
                    "file_qc_status": "pass",
                    "filename": "rlut_Amon_MRI-ESM2-0_ssp126_r1i1p1f1_gn_201501-210012.nc"
                },
                "hdl:21.14100/e6d0bfc6-3934-401a-a298-b1e5bff2038a": {
                    "file_error_message": "",
                    "file_error_severity": "",
                    "file_qc_status": "",
                    "filename": "rlut_Amon_MRI-ESM2-0_ssp126_r1i1p1f1_gn_210101-230012.nc"
                }
            }
        },

If we could clean up the template we might be able to resolve this (posted in parallel with above).
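
To find every dataset affected in this way, one option is to look for file entries whose `file_qc_status` is empty. The sketch below assumes each dataset record has the layout shown in the excerpt above; the function name is hypothetical.

```python
def files_without_result(dset_record):
    """Return the filenames in one dataset record that have no QC status."""
    return [
        entry.get("filename", pid)
        for pid, entry in dset_record.get("files", {}).items()
        if not entry.get("file_qc_status")
    ]


# The record quoted above, reduced to the fields the check needs.
record = {
    "dset_id": "CMIP6.ScenarioMIP.MRI.MRI-ESM2-0.ssp126.r1i1p1f1.Amon.rlut.gn.v20191108",
    "dset_qc_status": "fail",
    "files": {
        "hdl:21.14100/5bc824f0-7a3f-4b5a-b38a-b207f2385477": {
            "file_qc_status": "pass",
            "filename": "rlut_Amon_MRI-ESM2-0_ssp126_r1i1p1f1_gn_201501-210012.nc",
        },
        "hdl:21.14100/e6d0bfc6-3934-401a-a298-b1e5bff2038a": {
            "file_qc_status": "",
            "filename": "rlut_Amon_MRI-ESM2-0_ssp126_r1i1p1f1_gn_210101-230012.nc",
        },
    },
}

print(files_without_result(record))
```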

@agstephens
Contributor Author

On the release3 branch, we have the following matches for the first dset_id:

$ grep "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429" *
QC_cfchecker.json:            "dset_id": "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429",
QC_handle.json:            "dset_id": "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429",
QC_passed_all_sites_20210426.txt:CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429
QC_passed_dataset_ids_2021-02-25.txt:CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429
QC_passed_dataset_ids_2021-03-09.txt:CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429
QC_template_v2.json:            "dset_id": "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429",
QC_template_v3_20210317.json:            "dset_id": "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429",
QC_template_v4_2021-03-25.json:            "dset_id": "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429",
QC_template_v5_2021-03-25.json:            "dset_id": "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429",

@agstephens
Contributor Author

@martinjuckes can you direct me to the file that you are reading? I think I'm looking at different versions/branches, because I'm seeing other content.

@martinjuckes

@agstephens : I hope this is the right place: https://github.com/cp4cds/c3s_34g_qc_results/tree/release3 . And you have to be careful about which files you pick up there. The latest results might be in zip files rather than being present as json. Some cleaning up would be good ....

@agstephens
Contributor Author

Thanks @martinjuckes: I also see there is this repo: https://github.com/cp4cds/cmip6_qc

Which has some of the same files. We do need to clean up so that we have a provenance trail we can keep track of.

Regarding the release above: I get the right result if I use the QC_ranges.json.gz file:

$ cat QC_ranges.json.gz | gunzip | grep -B 3 -A 6 "CMIP6.ScenarioMIP.MRI.MRI-ESM2-0.ssp126.r1i1p1f1.Amon.rlut.gn.v20191108"
        "hdl:21.14100/70e3f7fe-a0c9-3fbf-be31-da68fe7a1f36": {
            "dset_error_message": "na 1 file(s) missing",
            "dset_error_severity": "major",
            "dset_id": "CMIP6.ScenarioMIP.MRI.MRI-ESM2-0.ssp126.r1i1p1f1.Amon.rlut.gn.v20191108",
            "dset_qc_status": "fail",
            "files": {
                "hdl:21.14100/5bc824f0-7a3f-4b5a-b38a-b207f2385477": {
                    "file_error_message": "na",
                    "file_error_severity": "na",
                    "file_qc_status": "pass",

@martinjuckes: Do you know how I can work out which are the right files to trust regarding the most recent QC?

@martinjuckes

I think https://github.com/cp4cds/cmip6_qc only covers the CF checks run by Ruth. https://github.com/cp4cds/c3s_34g_qc_results/tree/release3 combines results from Ruth's tests with results from myself, Guillaume and Fabi.

@agstephens
Contributor Author

Thanks @martinjuckes

@martinjuckes

@agstephens : on your last question: I hope Ruth explained to Fran the process of constructing the manifest from the reports in https://github.com/cp4cds/c3s_34g_qc_results/tree/release3 .
I think you need to take the latest file matching each of "ranges", "cfchecker", "errata", "nctime", "prepare", "handle". Ruth tried to get these done with standard file names, but that agreement broke when people needed to compress files rather than just uploading xxxx.json
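
One way to pick "the latest file matching each check" is sketched below. It rests on assumptions the thread does not confirm: that the check name appears somewhere in the file name, and that "latest" can be judged from a date embedded in the name (undated names are treated as oldest). The example file names are hypothetical.

```python
import re

CHECKS = ["ranges", "cfchecker", "errata", "nctime", "prepare", "handle"]

# Matches dates written as YYYY-MM-DD or YYYYMMDD anywhere in a file name.
DATE_RE = re.compile(r"(\d{4})-?(\d{2})-?(\d{2})")


def embedded_date(name):
    """Return the first date in the name as 'YYYYMMDD', or '' if there is none."""
    match = DATE_RE.search(name)
    return "".join(match.groups()) if match else ""


def latest_per_check(filenames):
    """Map each check name to the matching file name with the latest embedded date."""
    latest = {}
    for check in CHECKS:
        candidates = [name for name in filenames if check in name.lower()]
        if candidates:
            latest[check] = max(candidates, key=lambda n: (embedded_date(n), n))
    return latest


# Hypothetical directory listing mixing plain and compressed reports.
listing = ["QC_ranges.json", "QC_ranges_2021-04-20.json.gz", "QC_cfchecker.json"]
print(latest_per_check(listing))
```

Because an undated file may in fact be the newest one, the result would still need checking against the commit history.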

@agstephens
Contributor Author

> @agstephens : on your last question: I hope Ruth explained to Fran the process of constructing the manifest from the reports in https://github.com/cp4cds/c3s_34g_qc_results/tree/release3 .
> I think you need to take the latest file matching each of "ranges", "cfchecker", "errata", "nctime", "prepare", "handle". Ruth tried to get these done with standard file names, but that agreement broke when people needed to compress files rather than just uploading xxxx.json

Thanks @martinjuckes: I am digging into it and reading up so that I feel equipped to pick this up with Fran.

@agstephens
Contributor Author

@martinjuckes: based on the 6 checks you have listed, I am parsing what I believe are the latest results from the JSON files (or .tgz, .gz, tar.gz) and also the lists of files not present at different sites, and I get the following results:

Removing due to errors with: cfchecker
        Removed: 71

Removing due to errors with: errata
        Removed: 0

Removing due to errors with: handle
        Removed: 0

Removing due to errors with: nctime
        Removed: 1

Removing due to errors with: prepare
        Removed: 0

Removing due to errors with: ranges
        Removed: 59

Removing due to: incomplete_ipsl
        Removed: 0

Removing due to: inconsistent_winds
        Removed: 0

Removing due to: missing_ceda
        Removed: 0

Removing due to: missing_dkrz
        Removed: 0

Removing due to: missing_ipsl
        Removed: 0

All done: remaining dsets: 350

So that explains 131 of the 481 datasets that were in release2 but are not in release3. Where should I look next?
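
The counts above add up as 71 + 1 + 59 = 131, and 481 - 131 = 350. The filtering itself can be sketched as follows. This is not identify_missing_dset_ids.py; it assumes each QC report has a top-level "datasets" mapping whose records carry `dset_id` and `dset_qc_status`, and it treats only a status of "fail" as a rejection. The toy reports are invented for the example.

```python
def failed_dset_ids(report):
    """Return the dataset IDs that one QC report marks as 'fail'."""
    return {
        record["dset_id"]
        for record in report.get("datasets", {}).values()
        if record.get("dset_qc_status") == "fail"
    }


def remove_explained(missing_ids, reports_by_check):
    """Remove IDs that failed any check; return what remains and per-check counts."""
    remaining = set(missing_ids)
    removed_counts = {}
    for check, report in reports_by_check.items():
        explained = remaining & failed_dset_ids(report)
        removed_counts[check] = len(explained)
        remaining -= explained
    return remaining, removed_counts


# Toy data: three missing datasets and two hypothetical QC reports.
missing = {"ds.a", "ds.b", "ds.c"}
reports = {
    "cfchecker": {"datasets": {
        "pid-1": {"dset_id": "ds.a", "dset_qc_status": "fail"},
        "pid-2": {"dset_id": "ds.b", "dset_qc_status": "pass"},
    }},
    "ranges": {"datasets": {
        "pid-3": {"dset_id": "ds.a", "dset_qc_status": "fail"},
        "pid-4": {"dset_id": "ds.c", "dset_qc_status": "fail"},
    }},
}

remaining, counts = remove_explained(missing, reports)
for check, count in counts.items():
    print(f"Removing due to errors with: {check}\n        Removed: {count}")
print(f"All done: remaining dsets: {len(remaining)}")
```

A dataset that fails several checks is counted only against the first check that removes it, so the per-check counts depend on the order in which the reports are processed.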

@martinjuckes

Good question. Have you been able to locate the software that Ruth used to generate her manifest? Another possibility is that there has been confusion about the interpretation of minor errors.

@agstephens
Contributor Author

@martinjuckes: I'll take a look at that with Fran tomorrow.

@agstephens
Contributor Author

The actual number of datasets in our latest inventory/catalog is: 9953

$ grep v2  results/c3s-cmip6_v20210611.csv | cut -d, -f1 | sort -u | wc -l
9953

@agstephens
Contributor Author

My count for the expected number of datasets is:

$ wc -l QC_passed_dataset_ids_latest.txt QC_passed_dataset_ids_2021-04-29_historical.txt
  10048 QC_passed_dataset_ids_latest.txt
  10048 QC_passed_dataset_ids_2021-04-29_historical.txt

There are 9,953 in the intake catalog.

There were originally 99 that we couldn't scan, which leaves 9,949 (10,048 - 99). But then, I think there were the 4 netCDF read errors that went away magically (due to Quobyte), which brings the total back up to 9,953. Is that correct?

If so, at least our numbers are matching - and there are 9,953 known/published datasets.

Yes, that's good, 95 are listed in your errors files. I can add them to my investigations.

@agstephens
Contributor Author

It looks like the datasets that could not be scanned were not in the missing-in-release3 list:

$ python3 identify_missing_dset_ids.py
Removing due to errors with: cfchecker
        Removed: 71

Removing due to errors with: errata
        Removed: 0

Removing due to errors with: handle
        Removed: 0

Removing due to errors with: nctime
        Removed: 1

Removing due to errors with: prepare
        Removed: 0

Removing due to errors with: ranges
        Removed: 59

Removing due to: incomplete_ipsl
        Removed: 0

Removing due to: inconsistent_winds
        Removed: 0

Removing due to: missing_ceda
        Removed: 0

Removing due to: missing_dkrz
        Removed: 0

Removing due to: missing_ipsl
        Removed: 0

Removing due to: missing_34e_scan
        Removed: 0

All done: remaining dsets: 350

@agstephens
Contributor Author

@martinjuckes: I spoke to @feggleton yesterday but we are still trying to understand which part of our process has rejected the 350 datasets. Here is some more analysis of data in this category:

=============
missing_1-5_counts.txt
=============
CMIP6.CMIP.BCC.BCC-CSM2-MR.historical: 27
CMIP6.CMIP.CAMS.CAMS-CSM1-0.historical: 19
CMIP6.CMIP.CCCma.CanESM5.historical: 42
CMIP6.CMIP.NIMS-KMA.UKESM1-0-LL.historical: 20
CMIP6.ScenarioMIP.CAMS.CAMS-CSM1-0.ssp119: 19
CMIP6.ScenarioMIP.CAMS.CAMS-CSM1-0.ssp126: 19
CMIP6.ScenarioMIP.CAMS.CAMS-CSM1-0.ssp245: 19
CMIP6.ScenarioMIP.CAMS.CAMS-CSM1-0.ssp370: 19
CMIP6.ScenarioMIP.CAMS.CAMS-CSM1-0.ssp585: 19
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126: 33
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245: 27
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp370: 35
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp585: 29
CMIP6.ScenarioMIP.NASA-GISS.GISS-E2-1-G.ssp119: 1
CMIP6.ScenarioMIP.NASA-GISS.GISS-E2-1-G.ssp245: 1
CMIP6.ScenarioMIP.NIMS-KMA.KACE-1-0-G.ssp126: 21

=============
missing_mip_counts.txt
=============
CMIP: 108
ScenarioMIP: 242

=============
missing_model_counts.txt
=============
BCC-CSM2-MR: 27
CAMS-CSM1-0: 114
CanESM5: 166
GISS-E2-1-G: 2
KACE-1-0-G: 21
UKESM1-0-LL: 20

=============
missing_var_counts.txt
=============
areacello: 2
clt: 14
deptho: 5
evspsbl: 8
hfls: 13
hfss: 13
hurs: 5
hus: 15
huss: 11
mrro: 3
mrsos: 3
orog: 2
pr: 19
prsn: 1
ps: 42
psl: 20
rlds: 13
rlus: 8
rlut: 14
rsds: 14
rsus: 10
rsut: 14
sfcWind: 1
sftgif: 4
sftlf: 5
sftof: 5
siconc: 10
snd: 5
snw: 2
sos: 7
ta: 44
tas: 27
tasmax: 6
tasmin: 5
tauu: 6
tauv: 5
tos: 12
ts: 13
uas: 13
vas: 13
zos: 10

Most of the datasets in this category come from two models:
CAMS-CSM1-0: 114
CanESM5: 166

agstephens changed the title from "Missing dataset in latest inventory?" to "Missing datasets in latest inventory - in release 3" on Jun 24, 2021

@agstephens
Contributor Author

Digging deeper:

There is a problem in the QC_cfchecker.json results: 67 datasets include files that belong to a different dataset identifier (differing by grid). These may have been rejected because the file count was wrong. Some of them match our missing datasets.
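
A check for this kind of mismatch could compare the grid label in each file name with the grid label in the dataset ID. The sketch below assumes the record layout seen in the QC JSON earlier in this thread and the standard CMIP6 file name order (variable, table, model, experiment, member, grid, optional time range); the model name and files in the example are hypothetical.

```python
def grid_from_filename(filename):
    """Return the grid label, the sixth underscore-separated field of the file name."""
    stem = filename[:-3] if filename.endswith(".nc") else filename
    return stem.split("_")[5]


def grid_from_dset_id(dset_id):
    """Return the grid label, the ninth dot-separated field of the dataset ID."""
    return dset_id.split(".")[8]


def files_with_other_grid(dset_record):
    """List the files whose grid label differs from that of the dataset ID."""
    expected = grid_from_dset_id(dset_record["dset_id"])
    return [
        entry["filename"]
        for entry in dset_record.get("files", {}).values()
        if grid_from_filename(entry["filename"]) != expected
    ]


# Hypothetical record: a 'gn' dataset that also lists a 'gr' file.
record = {
    "dset_id": "CMIP6.CMIP.EXAMPLE.EXAMPLE-MODEL.historical.r1i1p1f1.Amon.tas.gn.v20200101",
    "files": {
        "pid-1": {"filename": "tas_Amon_EXAMPLE-MODEL_historical_r1i1p1f1_gn_185001-201412.nc"},
        "pid-2": {"filename": "tas_Amon_EXAMPLE-MODEL_historical_r1i1p1f1_gr_185001-201412.nc"},
    },
}

print(files_with_other_grid(record))
```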

@agstephens
Contributor Author

@feggleton Please can you take a look at the QC cfchecker code and let me know whether it checks the number of expected files and marks datasets as failed if the count is wrong? Thanks

@feggleton

@agstephens That's strange. I've been looking through all the different scripts but can't find anything that checks the number of files. It does check if the files are in the archive. Here is the location of the cf checker module:

https://github.com/cp4cds/cmip6_qc/blob/097b602842313376225c8f5261cee8324fe414fe/src/simple_cfcheck.py

The workflow runs the following scripts so I have checked them all as much as I can (and understand):

https://github.com/cp4cds/cmip6_qc/blob/master/src/cfchecker_run_all.py
https://github.com/cp4cds/cmip6_qc/blob/master/src/cfchecker_run_unit.py
https://github.com/cp4cds/cmip6_qc/blob/master/src/create_expt_psvs.sh
https://github.com/cp4cds/cmip6_qc/blob/master/src/create_model_psvs.sh
https://github.com/cp4cds/cmip6_qc/blob/master/src/generate_c3s-34g_dataframe.py
https://github.com/cp4cds/cmip6_qc/blob/master/src/complete_json_release_template.py

In cfchecker_run_unit.py, under the run_unit function, there is some code to check files, but it's commented out and I'm not sure it would affect this. It looks like it was just there for testing.

In complete_json_release_template.py there is a section to determine the overall dataset qc status, if that's helpful to see.

    # Determine the overall dataset qc status based on aggregate of all file records
    dataset_qc_result = ds_results.isin(PASSES).all()
    if dataset_qc_result:
        qc_template["datasets"][ds_pid]["dset_qc_status"] = 'pass'
    else:
        qc_template["datasets"][ds_pid]["dset_qc_status"] = 'fail'

There is also code that sets the dataset status to fail if the dataframe for a file is empty (which, I assume, means there is no result for it).

    # Fill in the individual file records
    for fpid in f_pids:
        logging.debug(f'fpid {fpid}')
        fres_df = dataset_df[dataset_df['pid'] == fpid]
        if fres_df.empty:
            logging.info(f'FILE DATAFRAME EMPTY PID MISMATCH {ds_id, ds_pid, fpid, }')
            qc_template["datasets"][ds_pid]["dset_qc_status"] = 'fail'
            continue

Finally there is some code to concatenate multiple errors:

            # Where a file has multiple errors returned these are concatenated here
            template_entry["file_error_severity"] = '; '.join(str(_) for _ in severities) # display as list severities
            template_entry["file_error_message"] = '; '.join(str(_) for _ in err_msgs) # display as list err_msgs
            if all(x in PASSES for x in severities):
                template_entry["file_qc_status"] = 'pass'
            else:
                template_entry["file_qc_status"] = 'fail'

These are the only places I can find where the fail status is set by a condition that isn't the cf checker. Would it help to check the logs I have for an example file?
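
To make the effect of the empty-dataframe branch concrete, here is a simplified plain-Python sketch of the same idea (the real code uses pandas). The contents of `PASSES` are not shown in this thread, so the value below is an assumption, and the function and variable names are hypothetical. It illustrates how a single file PID listed in the template but absent from the results would fail the whole dataset, even without an explicit file-count check.

```python
# Assumption: the real PASSES constant is not shown in the thread.
PASSES = {"pass", "na"}


def dataset_status(template_file_pids, severities_by_pid):
    """Return 'pass' only if every listed file has results and all are passes."""
    for pid in template_file_pids:
        severities = severities_by_pid.get(pid)
        if not severities:
            # Mirrors the "FILE DATAFRAME EMPTY PID MISMATCH" branch:
            # a file listed in the template has no QC result at all.
            return "fail"
        if not all(severity in PASSES for severity in severities):
            return "fail"
    return "pass"


# Hypothetical case: the template lists two files, but results exist for only one.
template_pids = ["pid-a", "pid-b"]
results = {"pid-a": ["na"]}

print(dataset_status(template_pids, results))
```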

@agstephens
Contributor Author

Thanks @feggleton, I'll keep looking...
