Missing datasets in latest inventory - in release 3 #15

agstephens · 2021-06-21T15:40:38Z

In the previous inventory, this dataset exists:

c3s-cmip6.ScenarioMIP.CCCma.CanESM5.ssp585.r1i1p1f1.Amon.ts.gn.v20190429

See: https://raw.githubusercontent.com/cp4cds/c3s_34g_manifests/master/inventories/c3s-cmip6/c3s-cmip6_v20210311.yml

It is not in the latest version of the intake catalogue:

$ wget -O c3s-cmip6_v20210611.csv.gz  "https://github.com/cp4cds/c3s_34g_manifests/blob/master/intake/catalogs/c3s-cmip6/c3s-cmip6_v20210611.csv.gz?raw=true"

$ python
>>> import pandas as pd
>>> df = pd.read_csv("c3s-cmip6_v20210611.csv.gz")
>>> df[df["ds_id"] == "c3s-cmip6.ScenarioMIP.CCCma.CanESM5.ssp585.r1i1p1f1.Amon.ts.gn.v20190429"]
Empty DataFrame
Columns: [ds_id, path, size, mip_era, activity_id, institution_id, source_id, experiment_id, member_id, table_id, variable_id, grid_label, version, start_time, end_time, bbox, level]
Index: []
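To run the same check for several identifiers at once, something like the following could work. It is a minimal sketch using only the standard library, assuming the catalogue has the `ds_id` column shown above; `catalogue_ids` and `missing_from_catalogue` are hypothetical helper names, and the in-memory CSV is an illustrative stand-in for the real file.

```python
import csv
import io


def catalogue_ids(csv_file):
    """Collect the ds_id column from an open catalogue CSV file object."""
    return {row["ds_id"] for row in csv.DictReader(csv_file)}


def missing_from_catalogue(csv_file, wanted):
    """Return the wanted dataset ids that have no row in the catalogue."""
    present = catalogue_ids(csv_file)
    return [ds_id for ds_id in wanted if ds_id not in present]


# Tiny stand-in for the real file, which could be opened with
# gzip.open("c3s-cmip6_v20210611.csv.gz", "rt", newline="").
# The second id below is made up for illustration.
sample = io.StringIO("ds_id,path\nc3s-cmip6.example.v1,example/path.nc\n")
wanted = [
    "c3s-cmip6.ScenarioMIP.CCCma.CanESM5.ssp585.r1i1p1f1.Amon.ts.gn.v20190429",
    "c3s-cmip6.example.v1",
]
print(missing_from_catalogue(sample, wanted))
```

Only the first identifier is reported, because the stand-in catalogue contains the second one.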

@ellesmith88 please can you check whether this is an error we have introduced or something we do know about. Thanks

ellesmith88 · 2021-06-22T07:27:18Z

@agstephens The inventory from 11/03/2021 is from the 2nd dataset release - https://github.com/cp4cds/c3s_34g_qc_results/blob/release2/QC_Results/QC_passed_dataset_ids_latest.txt
The most recent intake catalogue (11/06/2021) is from the 3rd dataset release - https://github.com/cp4cds/c3s_34g_qc_results/blob/release3/QC_Results/QC_passed_dataset_ids_latest.txt

The dataset is in the 2nd release but not in the 3rd. When Ruth passed us the list she said it contained all the data from the previous release, so I'm not sure why it isn't there.

ellesmith88 · 2021-06-22T07:34:41Z

@agstephens Searching for ScenarioMIP.CCCma.CanESM5.ssp585 in the 2nd release shows there are 30 datasets, but in the 3rd release only 1, which itself doesn't exist in the 2nd release.

I will run through and compare the 2 releases to see what the differences are.

ellesmith88 · 2021-06-22T07:55:55Z

There are 481 datasets in the 2nd release that aren't in the 3rd (latest) release. The list is attached:
present_in_2.txt
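For reference, a comparison like this can be done with a plain set difference. This is only a sketch of the idea, not the script that produced the attached file; `read_ids` and `only_in_first` are hypothetical names and the ids are made up.

```python
def read_ids(lines):
    """Strip whitespace and drop blank lines from an iterable of id lines."""
    return {line.strip() for line in lines if line.strip()}


def only_in_first(first_lines, second_lines):
    """Return the sorted ids present in the first list but not the second."""
    return sorted(read_ids(first_lines) - read_ids(second_lines))


# Illustrative ids; with real data, pass open file objects instead, e.g. the
# QC_passed_dataset_ids_latest.txt file checked out from each release branch.
release2 = ["CMIP6.a.v1\n", "CMIP6.b.v1\n", "CMIP6.c.v1\n"]
release3 = ["CMIP6.b.v1\n", "CMIP6.d.v1\n"]
print(only_in_first(release2, release3))
```

Swapping the arguments gives the ids that are new in the later release.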

agstephens · 2021-06-22T11:36:31Z

I have picked 3 datasets at random:

CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429
CMIP6.CMIP.BCC.BCC-CSM2-MR.historical.r1i1p1f1.Amon.pr.gn.v20181126
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp585.r1i1p1f1.Ofx.deptho.gn.v20190429

These are all included in the release, dated 2021-03-09:

https://github.com/cp4cds/c3s_34g_qc_results/blob/master/QC_Results/QC_passed_dataset_ids_2021-03-09.txt

However, they are all missing from the list on a different branch (release3), dated 2021-04-29:

https://github.com/cp4cds/c3s_34g_qc_results/blob/release3/QC_Results/QC_passed_dataset_ids_2021-04-29.txt

martinjuckes · 2021-06-22T11:41:22Z

64 of these appear to have been failed by the range check.
I checked two of those, and both failed because Ruth's listing of datasets includes files which are not present at CEDA ... and look like they should not be part of the dataset.
e.g.

       "hdl:21.14100/70e3f7fe-a0c9-3fbf-be31-da68fe7a1f36": {
            "dset_error_message": "na 1 file(s) missing",
            "dset_error_severity": "major",
            "dset_id": "CMIP6.ScenarioMIP.MRI.MRI-ESM2-0.ssp126.r1i1p1f1.Amon.rlut.gn.v20191108",
            "dset_qc_status": "fail",
            "files": {
                "hdl:21.14100/5bc824f0-7a3f-4b5a-b38a-b207f2385477": {
                    "file_error_message": "na",
                    "file_error_severity": "na",
                    "file_qc_status": "pass",
                    "filename": "rlut_Amon_MRI-ESM2-0_ssp126_r1i1p1f1_gn_201501-210012.nc"
                },
                "hdl:21.14100/e6d0bfc6-3934-401a-a298-b1e5bff2038a": {
                    "file_error_message": "",
                    "file_error_severity": "",
                    "file_qc_status": "",
                    "filename": "rlut_Amon_MRI-ESM2-0_ssp126_r1i1p1f1_gn_210101-230012.nc"
                }
            }
        },

If we could clean up the template we might be able to resolve this (posted in parallel with above).

agstephens · 2021-06-22T11:50:32Z

On the release3 branch, we have the following matches for the first dset_id:

$ grep "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429" *
QC_cfchecker.json:            "dset_id": "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429",
QC_handle.json:            "dset_id": "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429",
QC_passed_all_sites_20210426.txt:CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429
QC_passed_dataset_ids_2021-02-25.txt:CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429
QC_passed_dataset_ids_2021-03-09.txt:CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429
QC_template_v2.json:            "dset_id": "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429",
QC_template_v3_20210317.json:            "dset_id": "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429",
QC_template_v4_2021-03-25.json:            "dset_id": "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429",
QC_template_v5_2021-03-25.json:            "dset_id": "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429",

agstephens · 2021-06-22T11:51:29Z

@martinjuckes can you direct me to the file that you are reading. I think I'm looking at different versions/branches, because I'm seeing other content.

martinjuckes · 2021-06-22T12:00:29Z

@agstephens : I hope this is the right place: https://github.com/cp4cds/c3s_34g_qc_results/tree/release3 . And you have to be careful about which files you pick up there. The latest results might be in zip files rather than being present as json. Some cleaning up would be good ....

agstephens · 2021-06-22T12:06:50Z

Thanks @martinjuckes: I also see there is this repo: https://github.com/cp4cds/cmip6_qc

Which has some of the same files. We do need to clean up so that we have a provenance trail we can keep track of.

Regarding the release above: I get the right result if I use the QC_ranges.json.gz file:

$ cat QC_ranges.json.gz | gunzip | grep -B 3 -A 6 "CMIP6.ScenarioMIP.MRI.MRI-ESM2-0.ssp126.r1i1p1f1.Amon.rlut.gn.v20191108"
        "hdl:21.14100/70e3f7fe-a0c9-3fbf-be31-da68fe7a1f36": {
            "dset_error_message": "na 1 file(s) missing",
            "dset_error_severity": "major",
            "dset_id": "CMIP6.ScenarioMIP.MRI.MRI-ESM2-0.ssp126.r1i1p1f1.Amon.rlut.gn.v20191108",
            "dset_qc_status": "fail",
            "files": {
                "hdl:21.14100/5bc824f0-7a3f-4b5a-b38a-b207f2385477": {
                    "file_error_message": "na",
                    "file_error_severity": "na",
                    "file_qc_status": "pass",

@martinjuckes: Do you know how I can work out which are the right files to trust regarding the most recent QC?
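The same lookup can be done in Python rather than with grep, which avoids depending on the line layout of the JSON. This is a rough sketch, assuming the report has a top-level "datasets" mapping keyed by handle (as the records above suggest); `load_qc_report` and `find_dataset` are hypothetical names, and tar archives (.tgz, .tar.gz) are not handled here.

```python
import gzip
import json
import os
import tempfile


def load_qc_report(path):
    """Load a QC report from a plain .json file or a gzip-compressed one."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def find_dataset(report, dset_id):
    """Return (pid, record) pairs whose dset_id matches exactly."""
    return [
        (pid, rec)
        for pid, rec in report["datasets"].items()
        if rec.get("dset_id") == dset_id
    ]


# Round-trip a tiny made-up report through a temporary .json.gz file.
demo = {"datasets": {"hdl:demo/1": {"dset_id": "CMIP6.demo.v1",
                                    "dset_qc_status": "fail"}}}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "QC_demo.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(demo, handle)
    report = load_qc_report(path)

print(find_dataset(report, "CMIP6.demo.v1"))
```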

martinjuckes · 2021-06-22T12:22:34Z

I think https://github.com/cp4cds/cmip6_qc only covers the CF checks run by Ruth. https://github.com/cp4cds/c3s_34g_qc_results/tree/release3 combines results from Ruth's tests with results from myself, Guillaume and Fabi.

agstephens · 2021-06-22T12:31:55Z

Thanks @martinjuckes

martinjuckes · 2021-06-22T12:37:31Z

@agstephens : on your last question: I hope Ruth explained to Fran the process of constructing the manifest from the reports in https://github.com/cp4cds/c3s_34g_qc_results/tree/release3 .
I think you need to take the latest file matching each of "ranges", "cfchecker", "errata", "nctime", "prepare", "handle". Ruth tried to get these done with standard file names, but that agreement broke when people needed to compress files rather than just uploading xxxx.json
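One way to pick "the latest file matching each check" is sketched below. It assumes that "latest" can be decided from a sortable key per file (for example a modification time or a date parsed from the name), which is a guess on my part; `latest_per_check` is a hypothetical helper, the plain `QC_ranges.json` name is illustrative, and the sort keys are made up.

```python
CHECKS = ("ranges", "cfchecker", "errata", "nctime", "prepare", "handle")


def latest_per_check(files):
    """Map each check name to the newest file whose name contains it.

    `files` is an iterable of (filename, sort_key) pairs; the sort key could
    be a modification time or a date parsed from the file name.
    """
    latest = {}
    for name, key in files:
        for check in CHECKS:
            if check in name.lower():
                if check not in latest or key > latest[check][1]:
                    latest[check] = (name, key)
    return {check: name for check, (name, _key) in latest.items()}


# Made-up sort keys: a larger number means a more recent file.
files = [
    ("QC_ranges.json", 1),
    ("QC_ranges.json.gz", 2),
    ("QC_cfchecker.json", 1),
    ("QC_handle.json", 3),
]
print(latest_per_check(files))
```

Matching on a substring means compressed and uncompressed variants of the same report compete with each other, which is the situation described above.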

agstephens · 2021-06-22T13:20:36Z

Thanks @martinjuckes: I am digging into it and reading up so that I feel equipped to pick this up with Fran.

agstephens · 2021-06-22T14:31:34Z

@martinjuckes: based on the 6 checks you have listed, I am parsing what I believe are the latest results from the JSON files (or .tgz, .gz, tar.gz) and also the lists of files not present at different sites, and I get the following results:

Removing due to errors with: cfchecker
        Removed: 71

Removing due to errors with: errata
        Removed: 0

Removing due to errors with: handle
        Removed: 0

Removing due to errors with: nctime
        Removed: 1

Removing due to errors with: prepare
        Removed: 0

Removing due to errors with: ranges
        Removed: 59

Removing due to: incomplete_ipsl
        Removed: 0

Removing due to: inconsistent_winds
        Removed: 0

Removing due to: missing_ceda
        Removed: 0

Removing due to: missing_dkrz
        Removed: 0

Removing due to: missing_ipsl
        Removed: 0

All done: remaining dsets: 350

So that explains 131 of the 481 datasets that were in release2 but are not in release3. Where should I look next?
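The counting above works roughly as in the following sketch. It is not the actual script: `failed_ids` and `explain_missing` are hypothetical names, the report layout (a "datasets" mapping with `dset_id` and `dset_qc_status`) is assumed from the JSON excerpts earlier in the thread, and the data is made up.

```python
def failed_ids(report):
    """Dataset ids whose dset_qc_status is 'fail' in one QC report."""
    return {
        rec["dset_id"]
        for rec in report["datasets"].values()
        if rec.get("dset_qc_status") == "fail"
    }


def explain_missing(missing, reports):
    """Remove ids explained by each report in turn; return counts and the rest."""
    remaining = set(missing)
    counts = {}
    for check, report in reports.items():
        explained = remaining & failed_ids(report)
        counts[check] = len(explained)
        remaining -= explained
    return counts, remaining


# Made-up ids and reports, for illustration only.
missing = {"A", "B", "C", "D"}
reports = {
    "cfchecker": {"datasets": {
        "p1": {"dset_id": "A", "dset_qc_status": "fail"},
        "p2": {"dset_id": "B", "dset_qc_status": "pass"},
    }},
    "ranges": {"datasets": {
        "p3": {"dset_id": "A", "dset_qc_status": "fail"},
        "p4": {"dset_id": "C", "dset_qc_status": "fail"},
    }},
}
counts, remaining = explain_missing(missing, reports)
print(counts, sorted(remaining))
```

Because ids are removed as soon as one check explains them, a dataset that fails two checks is counted only against the first, so the per-check numbers depend on the order of the checks.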

martinjuckes · 2021-06-22T15:04:12Z

Good question. Have you been able to locate the software that Ruth used to generate her manifest? Another possibility is that there has been confusion about the interpretation of minor errors.

agstephens · 2021-06-22T15:37:03Z

@martinjuckes: I'll take a look at that with Fran tomorrow.

agstephens · 2021-06-23T15:19:25Z

The actual number of datasets in our latest inventory/catalog is 9953:

$ grep v2  results/c3s-cmip6_v20210611.csv | cut -d, -f1 | sort -u | wc -l
9953

agstephens · 2021-06-23T15:39:29Z

My count for the expected number of datasets is:

$ wc -l QC_passed_dataset_ids_latest.txt QC_passed_dataset_ids_2021-04-29_historical.txt
  10048 QC_passed_dataset_ids_latest.txt
  10048 QC_passed_dataset_ids_2021-04-29_historical.txt

There are 9,953 in the intake catalog.

There were originally 99 that we couldn't scan, which leaves 9,949 (10,048 - 99). But then, I think there were the 4 netCDF read errors that went away magically (due to Quobyte), which brings it back up to 9,953. Is that correct?

If so, at least our numbers are matching - and there are 9,953 known/published datasets.

Yes, that's good, 95 are listed in your errors files. I can add them to my investigations.
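Spelling out the reconciliation as arithmetic, using only the numbers quoted in this comment (the variable names are just labels):

```python
expected = 10048    # lines in QC_passed_dataset_ids_latest.txt
unscannable = 99    # datasets that originally could not be scanned
recovered = 4       # netCDF read errors that later went away

in_catalogue = expected - unscannable + recovered
still_in_error = unscannable - recovered
print(in_catalogue, still_in_error)
```

This gives 9953 for the catalogue and 95 for the errors files, matching both figures above.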

agstephens · 2021-06-23T15:45:37Z

It looks like the datasets that could not be scanned were not in the missing-in-release3 list:

$ python3 identify_missing_dset_ids.py
Removing due to errors with: cfchecker
        Removed: 71

Removing due to errors with: errata
        Removed: 0

Removing due to errors with: handle
        Removed: 0

Removing due to errors with: nctime
        Removed: 1

Removing due to errors with: prepare
        Removed: 0

Removing due to errors with: ranges
        Removed: 59

Removing due to: incomplete_ipsl
        Removed: 0

Removing due to: inconsistent_winds
        Removed: 0

Removing due to: missing_ceda
        Removed: 0

Removing due to: missing_dkrz
        Removed: 0

Removing due to: missing_ipsl
        Removed: 0

Removing due to: missing_34e_scan
        Removed: 0

All done: remaining dsets: 350

agstephens · 2021-06-24T08:55:08Z

@martinjuckes: I spoke to @feggleton yesterday but we are still trying to understand which part of our process has rejected the 350 datasets. Here is some more analysis of data in this category:

=============
missing_1-5_counts.txt
=============
CMIP6.CMIP.BCC.BCC-CSM2-MR.historical: 27
CMIP6.CMIP.CAMS.CAMS-CSM1-0.historical: 19
CMIP6.CMIP.CCCma.CanESM5.historical: 42
CMIP6.CMIP.NIMS-KMA.UKESM1-0-LL.historical: 20
CMIP6.ScenarioMIP.CAMS.CAMS-CSM1-0.ssp119: 19
CMIP6.ScenarioMIP.CAMS.CAMS-CSM1-0.ssp126: 19
CMIP6.ScenarioMIP.CAMS.CAMS-CSM1-0.ssp245: 19
CMIP6.ScenarioMIP.CAMS.CAMS-CSM1-0.ssp370: 19
CMIP6.ScenarioMIP.CAMS.CAMS-CSM1-0.ssp585: 19
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126: 33
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp245: 27
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp370: 35
CMIP6.ScenarioMIP.CCCma.CanESM5.ssp585: 29
CMIP6.ScenarioMIP.NASA-GISS.GISS-E2-1-G.ssp119: 1
CMIP6.ScenarioMIP.NASA-GISS.GISS-E2-1-G.ssp245: 1
CMIP6.ScenarioMIP.NIMS-KMA.KACE-1-0-G.ssp126: 21

=============
missing_mip_counts.txt
=============
CMIP: 108
ScenarioMIP: 242

=============
missing_model_counts.txt
=============
BCC-CSM2-MR: 27
CAMS-CSM1-0: 114
CanESM5: 166
GISS-E2-1-G: 2
KACE-1-0-G: 21
UKESM1-0-LL: 20

=============
missing_var_counts.txt
=============
areacello: 2
clt: 14
deptho: 5
evspsbl: 8
hfls: 13
hfss: 13
hurs: 5
hus: 15
huss: 11
mrro: 3
mrsos: 3
orog: 2
pr: 19
prsn: 1
ps: 42
psl: 20
rlds: 13
rlus: 8
rlut: 14
rsds: 14
rsus: 10
rsut: 14
sfcWind: 1
sftgif: 4
sftlf: 5
sftof: 5
siconc: 10
snd: 5
snw: 2
sos: 7
ta: 44
tas: 27
tasmax: 6
tasmin: 5
tauu: 6
tauv: 5
tos: 12
ts: 13
uas: 13
vas: 13
zos: 10

Most of the datasets in this category come from two models:
CAMS-CSM1-0: 114
CanESM5: 166
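Breakdowns like these can be produced by splitting each dataset id on "." and counting one facet. The sketch below assumes the facet order matches the catalogue columns listed earlier (mip_era through version); `facet_counts` and `prefix_counts` are hypothetical names, and the three ids are the ones picked at random earlier in the thread.

```python
from collections import Counter

FACETS = ("mip_era", "activity_id", "institution_id", "source_id",
          "experiment_id", "member_id", "table_id", "variable_id",
          "grid_label", "version")


def facet_counts(ds_ids, facet):
    """Count dataset ids by the value of a single facet."""
    index = FACETS.index(facet)
    return Counter(ds_id.split(".")[index] for ds_id in ds_ids)


def prefix_counts(ds_ids, n_facets=5):
    """Count dataset ids by their first n facets (the '1-5' grouping)."""
    return Counter(".".join(ds_id.split(".")[:n_facets]) for ds_id in ds_ids)


ds_ids = [
    "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp126.r1i1p1f1.fx.sftlf.gn.v20190429",
    "CMIP6.CMIP.BCC.BCC-CSM2-MR.historical.r1i1p1f1.Amon.pr.gn.v20181126",
    "CMIP6.ScenarioMIP.CCCma.CanESM5.ssp585.r1i1p1f1.Ofx.deptho.gn.v20190429",
]
print(facet_counts(ds_ids, "source_id"))
print(prefix_counts(ds_ids))
```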

agstephens · 2021-06-24T12:01:09Z

Digging deeper:

There is a problem in the QC_cfchecker.json results: 67 datasets include files from a different dataset identifier (differing by grid). These may have been rejected because the file count was wrong. Some of them match datasets on our missing list.
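A check along these lines could flag such datasets. It is a sketch under assumptions: the report layout is taken from the JSON excerpts earlier in the thread, file names are assumed to follow the usual CMIP6 pattern with the grid label as the sixth underscore-separated field, `grid_from_filename` and `grid_mismatches` are hypothetical names, and the "gr" file below is invented for illustration.

```python
def grid_from_filename(filename):
    """Extract the grid label (sixth '_' field) from a CMIP6-style file name."""
    stem = filename[:-3] if filename.endswith(".nc") else filename
    return stem.split("_")[5]


def grid_mismatches(report):
    """Return {dset_id: [filenames]} where a file's grid differs from the dataset's."""
    bad = {}
    for rec in report["datasets"].values():
        ds_grid = rec["dset_id"].split(".")[8]  # grid_label facet of the id
        wrong = [
            frec["filename"]
            for frec in rec.get("files", {}).values()
            if grid_from_filename(frec["filename"]) != ds_grid
        ]
        if wrong:
            bad[rec["dset_id"]] = wrong
    return bad


# One real-looking record from the thread plus an invented 'gr' file.
report = {"datasets": {"hdl:demo/1": {
    "dset_id": "CMIP6.ScenarioMIP.MRI.MRI-ESM2-0.ssp126.r1i1p1f1.Amon.rlut.gn.v20191108",
    "files": {
        "f1": {"filename": "rlut_Amon_MRI-ESM2-0_ssp126_r1i1p1f1_gn_201501-210012.nc"},
        "f2": {"filename": "rlut_Amon_MRI-ESM2-0_ssp126_r1i1p1f1_gr_201501-210012.nc"},
    },
}}}
print(grid_mismatches(report))
```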

agstephens · 2021-06-24T12:05:09Z

@feggleton Please can you take a look at the QC cfchecker code and let me know if it is doing any checking of the number of expected files and marking datasets as failed if they are wrong? Thanks

feggleton · 2021-06-24T15:03:08Z

@agstephens That's strange. I've been looking through all the different scripts but can't find anything that checks the number of files. It does check if the files are in the archive. Here is the location of the cf checker module:

https://github.com/cp4cds/cmip6_qc/blob/097b602842313376225c8f5261cee8324fe414fe/src/simple_cfcheck.py

The workflow runs the following scripts so I have checked them all as much as I can (and understand):

https://github.com/cp4cds/cmip6_qc/blob/master/src/cfchecker_run_all.py
https://github.com/cp4cds/cmip6_qc/blob/master/src/cfchecker_run_unit.py
https://github.com/cp4cds/cmip6_qc/blob/master/src/create_expt_psvs.sh
https://github.com/cp4cds/cmip6_qc/blob/master/src/create_model_psvs.sh
https://github.com/cp4cds/cmip6_qc/blob/master/src/generate_c3s-34g_dataframe.py
https://github.com/cp4cds/cmip6_qc/blob/master/src/complete_json_release_template.py

In cfchecker_run_unit.py, under the run_unit function, there is some code to check files, but it's commented out and I'm not sure it would affect this. It looks like it was just there for testing.

In complete_json_release_template.py there is a section to determine the overall dataset qc status, if that's helpful to see.

    # Determine the overall dataset qc status based on aggregate of all file records
    dataset_qc_result = ds_results.isin(PASSES).all()
    if dataset_qc_result:
        qc_template["datasets"][ds_pid]["dset_qc_status"] = 'pass'
    else:
        qc_template["datasets"][ds_pid]["dset_qc_status"] = 'fail'

There is also code to say if the dataframe is empty and (I assume) there is no result, then set to fail.

    # Fill in the individual file records
    for fpid in f_pids:
        logging.debug(f'fpid {fpid}')
        fres_df = dataset_df[dataset_df['pid'] == fpid]
        if fres_df.empty:
            logging.info(f'FILE DATAFRAME EMPTY PID MISMATCH {ds_id, ds_pid, fpid, }')
            qc_template["datasets"][ds_pid]["dset_qc_status"] = 'fail'
            continue

Finally there is some code to concatenate multiple errors:

            # Where a file has multiple errors returned these are concatenated here
            template_entry["file_error_severity"] = '; '.join(str(_) for _ in severities) # display as list severities
            template_entry["file_error_message"] = '; '.join(str(_) for _ in err_msgs) # display as list err_msgs
            if all(x in PASSES for x in severities):
                template_entry["file_qc_status"] = 'pass'
            else:
                template_entry["file_qc_status"] = 'fail'

These are the only places I can find where the fail status is set by a condition that isn't the cf checker. Would it help to check the logs I have for an example file?
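To make the aggregation rule concrete, here is a plain-Python stand-in for the `isin(PASSES).all()` expression. The real contents of PASSES are not shown in this thread, so the tuple below is an assumption, and `dataset_status` is a hypothetical name.

```python
# Assumption: the real PASSES list is not quoted above; these values are a guess.
PASSES = ("pass", "na")


def dataset_status(file_statuses):
    """A dataset passes only if every file status is one of PASSES."""
    return "pass" if all(status in PASSES for status in file_statuses) else "fail"


print(dataset_status(["pass", "pass"]))  # every file passed
print(dataset_status(["pass", ""]))      # one file record has a blank status
print(dataset_status([]))                # no file records at all
```

Under this rule a single file record with a blank status is enough to fail the whole dataset, while an empty list of file records passes vacuously. The `all()` method of an empty pandas Series behaves the same way and returns True.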

agstephens · 2021-06-25T09:27:21Z

Thanks @feggleton, I'll keep looking...

agstephens assigned ellesmith88 Jun 21, 2021

agstephens changed the title ~~Missing dataset in latest inventory?~~ Missing datasets in latest inventory - in release 3 Jun 24, 2021
