GEFS Re-forecast files missing #1994

Open
chiaral opened this issue Sep 14, 2023 · 24 comments

Comments

@chiaral

chiaral commented Sep 14, 2023

This is not necessarily an exhaustive list of missing files,
but 2004102400/p01/Days:1-10 is missing a lot of files. The idx files are there, but not the actual grib files.
Here I have 76 items, here instead I have 122.
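
A listing of a prefix can be screened for this pattern (index present, data file absent). The following is a minimal sketch that works on a plain list of object keys, however they were obtained. The function name, the sample keys, and the assumption that the index for `foo.grib2` is stored as `foo.grib2.idx` are illustrative and should be checked against the bucket.

```python
def orphaned_idx_keys(keys):
    """Return .idx keys whose data file is absent from the same listing.

    Assumption: the index for `foo.grib2` is stored as `foo.grib2.idx`.
    """
    present = set(keys)
    return sorted(
        key for key in keys
        if key.endswith(".idx") and key[: -len(".idx")] not in present
    )


# Hypothetical listing: the apcp index has no matching data file.
listing = [
    "apcp_sfc_2004102400_p01.grib2.idx",
    "cape_sfc_2004102400_p01.grib2",
    "cape_sfc_2004102400_p01.grib2.idx",
]
print(orphaned_idx_keys(listing))
```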

Thanks!

@Patrick-Keown
Contributor

Patrick-Keown commented Sep 14, 2023 via email

@chiaral
Author

chiaral commented Sep 19, 2023

I'll keep adding more as I go through the data. I found 2 other issues:

The easier one, in
s3://noaa-gefs-retrospective/GEFSv12/reforecast/2004/2004101700/p04/Days:1-10/
the apcp file is missing; we only have the idx file.

But the true easter egg 🤣 is the following one:

s3://noaa-gefs-retrospective/GEFSv12/reforecast/2006/2006033000/c00/Days:1-10/ugrd_hgt_2006033000_c00.grib2
and
s3://noaa-gefs-retrospective/GEFSv12/reforecast/2006/2006033000/c00/Days:1-10/vgrd_hgt_2006033000_c00.grib2

have the wrong valid_time coordinates

!aws s3 cp s3://noaa-gefs-retrospective/GEFSv12/reforecast/2006/2006033000/c00/Days:1-10/ugrd_hgt_2006033000_c00.grib2 ufromaws.grib2
!wgrib2 -v ufromaws.grib2
1:0:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:3 hour fcst:ENS=low-res ctl
2:806524:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:3 hour fcst:ENS=low-res ctl
3:1623430:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:6 hour fcst:ENS=low-res ctl
4:2428635:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:6 hour fcst:ENS=low-res ctl
5:3247497:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:9 hour fcst:ENS=low-res ctl
6:4058812:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:9 hour fcst:ENS=low-res ctl
7:4883679:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:12 hour fcst:ENS=low-res ctl
8:5705130:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:12 hour fcst:ENS=low-res ctl
9:6537556:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:15 hour fcst:ENS=low-res ctl
10:7364475:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:15 hour fcst:ENS=low-res ctl
11:8199647:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:18 hour fcst:ENS=low-res ctl
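
A check like this can be automated by scanning the inventory text for its `d=` fields. The sketch below assumes the inventory has already been captured as a string (for example from the `wgrib2 -v` call above) and that the expected stamp is supplied by the caller; the function names are hypothetical.

```python
import re

# Matches the reference-date field of an inventory line, e.g. ":d=2004033000:".
_DATE_FIELD = re.compile(r":d=(\d{10}):")


def reference_dates(inventory_text):
    """Return the set of distinct YYYYMMDDHH stamps found in the inventory."""
    return set(_DATE_FIELD.findall(inventory_text))


def mismatched_dates(inventory_text, expected_stamp):
    """Return the sorted stamps that differ from the expected one."""
    return sorted(reference_dates(inventory_text) - {expected_stamp})


# Two lines copied from the inventory of ugrd_hgt_2006033000_c00.grib2.
inventory = (
    "1:0:d=2004033000:UGRD U-Component of Wind [m/s]:10 m above ground:3 hour fcst:ENS=low-res ctl\n"
    "2:806524:d=2004033000:UGRD U-Component of Wind [m/s]:100 m above ground:3 hour fcst:ENS=low-res ctl\n"
)
print(mismatched_dates(inventory, "2006033000"))  # ['2004033000']
```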

Also with xarray/cfgrib

import cfgrib
import xarray as xr

# Keep only the 10 m u-wind messages ("10u") from the file.
u = xr.open_dataset('ufromaws.grib2', engine="cfgrib",
                backend_kwargs={"filter_by_keys": {"shortName": "10u"}},
                )
u.valid_time.values
array(['2004-03-30T03:00:00.000000000', '2004-03-30T06:00:00.000000000',
       '2004-03-30T09:00:00.000000000', '2004-03-30T12:00:00.000000000',
       '2004-03-30T15:00:00.000000000', '2004-03-30T18:00:00.000000000',
       '2004-03-30T21:00:00.000000000', '2004-03-31T00:00:00.000000000',
       '2004-03-31T03:00:00.000000000', '2004-03-31T06:00:00.000000000',
       '2004-03-31T09:00:00.000000000', '2004-03-31T12:00:00.000000000',
       '2004-03-31T15:00:00.000000000', '2004-03-31T18:00:00.000000000',
       '2004-03-31T21:00:00.000000000', '2004-04-01T00:00:00.000000000',
       '2004-04-01T03:00:00.000000000', '2004-04-01T06:00:00.000000000',
       '2004-04-01T09:00:00.000000000', '2004-04-01T12:00:00.000000000',
       '2004-04-01T15:00:00.000000000', '2004-04-01T18:00:00.000000000',
       '2004-04-01T21:00:00.000000000', '2004-04-02T00:00:00.000000000',
       '2004-04-02T03:00:00.000000000', '2004-04-02T06:00:00.000000000',
       '2004-04-02T09:00:00.000000000', '2004-04-02T12:00:00.000000000',
       '2004-04-02T15:00:00.000000000', '2004-04-02T18:00:00.000000000',
       '2004-04-02T21:00:00.000000000', '2004-04-03T00:00:00.000000000',

@Patrick-Keown
Contributor

Thank you for the additional information. We have a scientist looking into this. Our team will reach back out once we have a resolution.

Thank you

@chiaral
Author

chiaral commented Sep 22, 2023

Hello! I found something that is not missing but erroneous in the precipitation (it appears in both tp and acpcp), for one month so far.
I have not done an exhaustive analysis; I bumped into this by pure luck.

In the following figures, each column is one of the 5 ensemble members ('c00' to 'p04') and each row is a 3-hourly interval counted from the start of the run (i.e. 00z).

For May 30th 2006 - all good (this is precipitation truncated to 10 mm for the first 3-hourly steps: 0-3, 0-6, 6-9, and so on)
image

For June 1st 2006 🙃
image

For June 10th
image

then July 1st goes back to normal
image

For the whole month of June 2006, tp and acpcp are off for the first 2 time steps (0-3 and 0-6).
The issue, though, is only in the 0-3 accumulation, because if I compute 0-6 minus 0-3 I get
image
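
The "0-6 minus 0-3" step generalizes to the whole accumulation sequence (0-3, 0-6, 6-9, and so on). Below is a minimal sketch on plain tuples rather than on the gridded fields; the function name and the sample amounts are made up for illustration, and it assumes the records are in order.

```python
def to_interval_amounts(accumulations):
    """Convert (start_h, end_h, amount) accumulations to non-overlapping amounts.

    Assumption: records are ordered, and a record that starts at the same hour
    as the previous one (0-3 followed by 0-6) extends that accumulation.
    """
    amounts = []
    previous = None
    for start, end, amount in accumulations:
        if previous is not None and previous[0] == start:
            # Same start hour: subtract the shorter accumulation from the longer one.
            amounts.append((previous[1], end, amount - previous[2]))
        else:
            amounts.append((start, end, amount))
        previous = (start, end, amount)
    return amounts


# Illustrative values only, not taken from the dataset.
records = [(0, 3, 1.0), (0, 6, 2.5), (6, 9, 0.5), (6, 12, 0.75)]
print(to_interval_amounts(records))
```

If the 0-3 record is corrupted, the derived 3-6 amount is corrupted too, even when the 0-6 record itself is fine.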

I have looked at a handful of other variables and they all seem OK, but in all honesty I have not looked at all of them.
Also, I picked 2006-06 by chance, so I am not sure how pervasive this is. I will do a few more random checks, but maybe you are already aware of this issue?

@Patrick-Keown
Contributor

Patrick-Keown commented Sep 22, 2023 via email

@Patrick-Keown
Contributor

Thank you for bringing these data issues to our attention. We are working on fixing the issues you brought up on GitHub (#1994). My coworker is fixing and sending the data for 2004102400, 2004101700 and 2006033000 to AWS. I believe she has mostly completed this process, but I will confirm with her when she returns from vacation.

Meanwhile, we are verifying and sending this data to our FTP server (ftp://ftp.emc.ncep.noaa.gov/GEFSv12). Please note that this FTP data cannot be accessed through any modern internet browser, but it can be publicly accessed using tools such as the ftp command (e.g. ftp ftp.emc.ncep.noaa.gov).

  1. The missing data from 2004102400 is now available on this FTP: ftp://ftp.emc.ncep.noaa.gov/GEFSv12/reforecast/2004/10/24/
  2. We are working on fixing 20041017 and 2006033000.
  3. Regarding the erroneous precipitation values for f03 and f06, this is a known issue and a fix has been applied to most of the cases in the reforecast dataset. We are looking into June 2006 and will work on fixing this.
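
For anyone checking a fix in both places, the two locations for a given initialization can be derived from its date stamp. This is a sketch only: the function name is hypothetical, and both directory layouts are inferred from the URLs quoted in this thread rather than from any official specification.

```python
def reforecast_locations(init_stamp, member):
    """Build the S3 prefix and FTP directory for one reforecast initialization.

    Assumption: both layouts are inferred from the URLs quoted in this thread.
    """
    if len(init_stamp) != 10 or not init_stamp.isdigit():
        raise ValueError(f"expected YYYYMMDDHH, got {init_stamp!r}")
    year, month, day = init_stamp[:4], init_stamp[4:6], init_stamp[6:8]
    s3_prefix = (
        "s3://noaa-gefs-retrospective/GEFSv12/reforecast/"
        f"{year}/{init_stamp}/{member}/Days:1-10/"
    )
    ftp_dir = f"ftp://ftp.emc.ncep.noaa.gov/GEFSv12/reforecast/{year}/{month}/{day}/"
    return s3_prefix, ftp_dir


s3_prefix, ftp_dir = reforecast_locations("2004101700", "p04")
print(s3_prefix)
print(ftp_dir)
```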

@EricSinsky-NOAA

Hi Chiara,

We are continuing to fix the data issues that you have found.

  1. The missing data from 2004101700 is now available on the EMC FTP: ftp://ftp.emc.ncep.noaa.gov/GEFSv12/reforecast/2004/10/17/. I believe this data is also on AWS, but I will confirm with my coworker that she finished processing this data when she returns from vacation.
  2. Thanks to my coworker, the time coordinates for 2006033000 have been corrected for ugrd_hgt and vgrd_hgt. These can be found on AWS: https://noaa-gefs-retrospective.s3.amazonaws.com/index.html#GEFSv12/reforecast/2006/2006033000/c00/Days:1-10/
  3. We are working on making those corrections to f03 and f06 for June 2006.

Thank you.

@EricSinsky-NOAA

Hi Chiara,

For June 1 2006, it looks like the f03 and f06 data has already been fixed on the EMC FTP: ftp://ftp.emc.ncep.noaa.gov/GEFSv12/reforecast/2006/06/01/
If you also see this same issue with the data on the EMC FTP for June 2006, please feel free to let me know.
The fixes for June 2006 may not all have been carried over to AWS. We will work on bringing these f03 and f06 fixes to AWS.

Thank you.

@chiaral
Author

chiaral commented Sep 29, 2023

Thanks for the update!! I only access the data through AWS, so I will wait for that for sure!

@chiaral
Author

chiaral commented Nov 15, 2023

Hello! Adding a new small issue,

For some days (so far I have identified only one, 2001 11 15, for all ensemble members, i.e. this folder), the files appear twice, and one copy has a digit missing from the date in the filename.

(correct date 2001 11 15)
acpcp_sfc_2001111500_p01.grib2 | 3 years ago | 2021-02-17 11:19:41 | 28 MB
(wrong date 2001 11 5)
acpcp_sfc_200111500_p01.grib2 | 3 years ago | 2021-02-17 11:19:26 | 30 MB

The problem is that the two copies differ in the first two time steps for some variables.

This is for accumulated precipitation:

import cfgrib

one  = cfgrib.open_dataset('acpcp_sfc_2001111500_c00.grib2')
two  = cfgrib.open_dataset('acpcp_sfc_200111500_c00.grib2')

(two.acpcp- one.acpcp).sum(dim=['latitude', 'longitude'])

array([997719.06, 997308.06,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ,      0.  ,      0.  ,      0.  ,      0.  ,
            0.  ,      0.  ], dtype=float32)

with

(one.isel(step=slice(0,2))-two.isel(step=slice(0,2))).acpcp.plot(col='step')

image

Surface pressure seems to be identical in both files, and helicity too.

I can't check them all, so I was wondering if you have any guidance.
In the case of acpcp, the differences are such that the file with the wrong date is much wetter (probably the first step is problematic and the second one carries the value in the accumulation).
But I just thought I'd let you know.
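
Other folders can be screened for the same kind of filename by checking the length of the date stamp. A rough sketch follows; the naming pattern `<variable>_<YYYYMMDDHH>_<member>.grib2` is an assumption based on the filenames in this thread, and the function name is hypothetical.

```python
import re

# Assumed naming pattern: <variable>_<YYYYMMDDHH>_<member>.grib2
_NAME = re.compile(r"^(?P<var>.+)_(?P<stamp>\d+)_(?P<member>[cp]\d{2})\.grib2$")


def malformed_stamp_files(filenames):
    """Return filenames whose date stamp is not exactly 10 digits long."""
    flagged = []
    for name in filenames:
        match = _NAME.match(name)
        if match and len(match.group("stamp")) != 10:
            flagged.append(name)
    return flagged


names = ["acpcp_sfc_2001111500_p01.grib2", "acpcp_sfc_200111500_p01.grib2"]
print(malformed_stamp_files(names))  # ['acpcp_sfc_200111500_p01.grib2']
```

Names that do not follow the assumed pattern at all are skipped rather than flagged, so this only catches stamps of the wrong length.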

@EricSinsky-NOAA

@chiaral Thank you for bringing this to our attention. We are investigating 2001111500.

@EricSinsky-NOAA

The f03 and f06 fixes for June 2006 have recently been sent to AWS.

@EricSinsky-NOAA

The files with the incorrect date ("200111500") in the filename have been removed from AWS for 20011115. The corrected f03 and f06 data has also been sent to AWS. Many thanks to my co-worker for managing the data on AWS.

@chiaral
Author

chiaral commented Jan 10, 2024

Hello!

The wrong valid_time (2004 instead of 2006) that I had identified for ugrd_hgt_2006033000_c00 and vgrd_hgt_2006033000_c00 also appears in cape_sfc and spfh_2m (same date and ensemble member).

@EricSinsky-NOAA

Hi @chiaral, we are working on correcting the valid_time for cape_sfc and spfh_2m.

@chiaral
Author

chiaral commented Jan 10, 2024

(EDITED) After more hiccups here and there, I realized that all the other ensemble members, not just c00, have the same issue of using the wrong year (2004 instead of 2006) that I found for ugrd_hgt, vgrd_hgt, cape_sfc, and spfh_2m. I also found that the u/vgrd_pres_abv700mb_2006033000 files have it, so I'd probably check other variables as well.

@EricSinsky-NOAA

@chiaral Thank you for bringing this to our attention. We are investigating and correcting the incorrect valid times for 2006033000.

@EricSinsky-NOAA

@chiaral The issue regarding the incorrect valid times in the 2006033000 grib2 files has been resolved. After further investigation, we found that this issue occurred because 2004033000 data was being mislabeled as "2006033000" in the grib2 filename for days 1-10. The correct 2006033000 data is now being used in the 2006033000 grib2 files. The actual 2006033000 data, however, contains incomplete records in the "abv" files for days 1-10. Unfortunately, we are unable to recover this missing 2006033000 data in the "abv" files for days 1-10.

@chiaral
Author

chiaral commented Jan 31, 2024

Thanks. Just to understand better: should I update only the 20060330 data, or should I also refresh the 20040330 data? It's unclear to me. And is this being propagated to AWS, or only to the FTP?
It's OK about the missing data, thanks.

@EricSinsky-NOAA

@chiaral The changes that were explained in my previous message have been propagated to AWS. You should update the 2006033000 data only. Previously, the data labelled as "2006033000" in the filename was actually 2004033000 data for days 1-10. There is no need to update the 2004033000 data.

@chiaral
Author

chiaral commented Feb 2, 2024

Hello

I am now looking at the files after 2010.
The file apcp_sfc_2012051700_c00 has two different start times (I think this is true for multiple variables, because it was failing across many of them).
In particular:

import cfgrib
il = 'apcp_sfc_2012051700_c00.grib2'
dclist = cfgrib.open_datasets(
            il,  backend_kwargs={"extra_coords": {"stepRange": "step"}}
        )
dclist

[<xarray.Dataset>
 Dimensions:     (time: 2, step: 80, latitude: 721, longitude: 1440)
 Coordinates:
     number      int64 0
   * time        (time) datetime64[ns] 2008-05-17 2012-05-17
   * step        (step) timedelta64[ns] 0 days 03:00:00 ... 10 days 00:00:00
     surface     float64 0.0
   * latitude    (latitude) float64 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
   * longitude   (longitude) float64 0.0 0.25 0.5 0.75 ... 359.2 359.5 359.8
     valid_time  (time, step) datetime64[ns] dask.array<chunksize=(2, 80), meta=np.ndarray>
     stepRange   (step) <U7 dask.array<chunksize=(80,), meta=np.ndarray>
 Data variables:
     tp          (time, step, latitude, longitude) float32 dask.array<chunksize=(2, 80, 721, 1440), meta=np.ndarray>
 Attributes:
     GRIB_edition:            2
     GRIB_centre:             kwbc
     GRIB_centreDescription:  US National Weather Service - NCEP
     GRIB_subCentre:          2
     Conventions:             CF-1.7
     institution:             US National Weather Service - NCEP]

The problem is

   * time        (time) datetime64[ns] 2008-05-17 2012-05-17

If your pipeline gets the date from the filename, it won't have issues; if it assumes that the file contains a single time value, it will break.
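
A pipeline can guard against this by comparing the decoded start times with the stamp in the filename before using the data. The sketch below assumes the `time` coordinate has already been converted to plain `datetime` objects (that conversion is left out), and both function names are hypothetical.

```python
import re
from datetime import datetime


def init_time_from_filename(filename):
    """Parse the YYYYMMDDHH stamp embedded in a reforecast filename."""
    match = re.search(r"_(\d{10})_", filename)
    if match is None:
        raise ValueError(f"no 10-digit stamp in {filename!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H")


def unexpected_init_times(filename, times):
    """Return decoded initialization times that disagree with the filename."""
    expected = init_time_from_filename(filename)
    return [t for t in times if t != expected]


# The two start times reported for apcp_sfc_2012051700_c00.grib2.
decoded = [datetime(2008, 5, 17), datetime(2012, 5, 17)]
print(unexpected_init_times("apcp_sfc_2012051700_c00.grib2", decoded))
```

An empty result means every start time in the file agrees with the filename.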

@EricSinsky-NOAA

Hi @chiaral,
Thank you for providing details regarding this issue. We are investigating.

@EricSinsky-NOAA

Hi @chiaral, the issue that you found in 2012051700 has been corrected and sent to AWS.

@chiaral
Author

chiaral commented Feb 16, 2024

Fantastic! thanks so much for your work!
