5 JCSDA mpasjedi tests (out of 57) failed in the latest RDASApp #158

Open
guoqing-noaa opened this issue Sep 9, 2024 · 20 comments

@guoqing-noaa
Collaborator

Error message:

91% tests passed, 5 tests failed out of 57                                         

Label Time Summary:                      
executable    = 622.34 sec*proc (13 tests)                                         
mpasjedi      = 1115.24 sec*proc (57 tests)                                        
mpi           = 1113.37 sec*proc (56 tests)                                        
script        = 492.90 sec*proc (44 tests)                                         

Total Test time (real) = 1115.41 sec     

The following tests FAILED:              
         36 - test_mpasjedi_3denvar_amsua_allsky (Failed)                          
         37 - test_mpasjedi_3denvar_amsua_bc (Failed)                              
         42 - test_mpasjedi_4denvar_VarBC (Failed)                                 
         43 - test_mpasjedi_4denvar_VarBC_nonpar (Failed)                          
         52 - test_mpasjedi_lgetkf_height_vloc (Failed)  

The test was done on Hera at /scratch1/BMC/wrfruc/gge/tmp/rdas_build_test/RDASApp_NOAA-EMC_develop
The error log is at

/scratch1/BMC/wrfruc/gge/tmp/rdas_build_test/RDASApp_NOAA-EMC_develop/build/mpas-jedi/Testing/Temporary/LastTest.log

Anyone can reproduce the error: clone the latest RDASApp, build it, cd into build/mpas-jedi, source ../../ush/load_rdas.sh, export SLURM_ACCOUNT=your_account, and then run ctest.
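
The reproduction steps can be sketched as a small shell helper. This is a minimal sketch, assuming RDASApp has already been cloned and built; the function name run_mpasjedi_ctests and its two arguments are hypothetical and not part of RDASApp.

```shell
# Hypothetical helper (not part of RDASApp): run the mpas-jedi ctests in a built tree.
# $1: path to an RDASApp checkout that has already been built (assumption)
# $2: Slurm account to charge
run_mpasjedi_ctests() (
  # The body runs in a subshell, so the caller's directory and environment are untouched.
  # Directory name follows the log paths quoted in this issue (build/mpas-jedi).
  cd "$1/build/mpas-jedi" || exit 1
  . ../../ush/load_rdas.sh      # load the RDASApp environment
  export SLURM_ACCOUNT="$2"
  ctest
)
```

Usage would look like `run_mpasjedi_ctests /path/to/RDASApp your_account`.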

@guoqing-noaa
Collaborator Author

@SamuelDegelia-NOAA Could you help look at this? Thanks!

@SamuelDegelia-NOAA
Contributor

Sure @guoqing-noaa, I will take a closer look at this today.

@SamuelDegelia-NOAA
Contributor

At least the first four tests fail due to discrepancies in the satbias data. Recent updates to UFO changed how varBC is handled and it now expects variables named BiasCoefficientErrors. But the ufo test data we are using is older (copied from @CoryMartin-NOAA's staged data instead of the ufo-data repo) and the satbias file has the variable named bias_coeff_errors. So we could either update the build script to clone from https://github.com/JCSDA-internal/ufo-data, or we could have @CoryMartin-NOAA update his test data.
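
One quick way to tell which naming a given satbias file uses is to look at its header. The following is a minimal sketch, assuming ncdump from the NetCDF utilities is available; the function name satbias_naming and the file name in the usage line are hypothetical.

```shell
# Hypothetical helper: classify a satbias file by the variable name found in its header.
# Reads `ncdump -h` output on stdin, so it can be tried without a NetCDF file at hand.
satbias_naming() {
  local header
  header="$(cat)"
  if printf '%s\n' "$header" | grep -qw 'BiasCoefficientErrors'; then
    echo new        # name expected by the updated UFO
  elif printf '%s\n' "$header" | grep -qw 'bias_coeff_errors'; then
    echo old        # name used in the older staged test data
  else
    echo unknown
  fi
}
```

Usage would look like `ncdump -h satbias_file.nc | satbias_naming`.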

@SamuelDegelia-NOAA
Contributor

This may be a question for @TingLei-NOAA and/or @CoryMartin-NOAA: do you know why we use Cory's local copy instead of cloning ufo-data?

@TingLei-NOAA
Contributor

@SamuelDegelia-NOAA That is so we can use a fast local disk copy instead of transferring the data over the internet.

@guoqing-noaa
Collaborator Author

@SamuelDegelia-NOAA Thanks a lot for pinpointing the reason!
This ufo-data issue can be easily fixed in my PR #147
I will update after my testing.

@guoqing-noaa
Collaborator Author

@SamuelDegelia-NOAA I updated the ufo-data on Jet/Hera/Hercules(Orion) and the missing BiasCoefficientErrors issue is resolved.
But there is now a new error:

!!! NetCDF error in netcdf_open_file on task #000001: No such file or directory
!!! Hint: file ./Data/bump/mpas_parametersbump_loc_nicas_local_000001-000001.nc
!!! ABORT in netcdf_strerror on task #000001: cannot find this child ID in registry

You will need my new branch reduce_copying2 to repeat this.

Here is what I did:
Assume we have cloned the rdas_build_test tool (git clone [email protected]:rrfsx/rdas_build_test.git)

cd rdas_build_test/
./build_test.sh guoqing-noaa reduce_copying2 rtrr

The test log files are at:

/lfs5/BMC/wrfruc/gge/tmp/rdas_build_test/RDASApp_guoqing-noaa_reduce_copying2/build/mpas-jedi/Testing/Temporary/LastTest.log
/work/noaa/wrfruc/gge/rdas_build_test_hecules/RDASApp_guoqing-noaa_reduce_copying2/build/mpas-jedi/Testing/Temporary/LastTest.log

On Hera, the bump files somehow exist, and we get a different error:

terminate called after throwing an instance of 'oops::TestReferenceFloatMismatchError'
  what():  Test reference Float mismatch @ Line:2

and here is the log file on Hera:

/scratch1/BMC/wrfruc/gge/tmp/rdas_build_test/RDASApp_guoqing-noaa_reduce_copying2/build/mpas-jedi/Testing/Temporary/LastTest.log

@SamuelDegelia-NOAA
Contributor

Interesting, will clone your branch and keep checking!

@SamuelDegelia-NOAA
Contributor

@guoqing-noaa I did not get any netcdf errors when running with your branch on Hera. Are those files not cloned when running on Jet?

Regarding the oops::TestReferenceFloatMismatchError error for the other ctests, I think that this is due to our Saber version lagging behind mpas-jedi. The reference files for the mpas-jedi ctests were recently updated due to changes in BUMP (which are compared against here), but we are waiting for Saber #928 to be merged before updating it in RDASApp. Once that PR is merged, I can update the submodules for RDASApp and then rerun these ctests. I think that PR is meant to be merged sometime this week.

@guoqing-noaa
Collaborator Author

@SamuelDegelia-NOAA Can you access Hercules or Jet? You should be able to reproduce the missing bump files issue on either platform.

@guoqing-noaa
Collaborator Author

Thanks for finding the reason. I think we may want to merge PR #147 without waiting for the fix from Saber #928.

@SamuelDegelia-NOAA
Contributor

I can check for the error on Jet tomorrow.

@SamuelDegelia-NOAA
Contributor

SamuelDegelia-NOAA commented Sep 10, 2024

@guoqing-noaa I did a fresh clone of your branch on Jet and the five tests fail with the same error as on Hera (oops::TestReferenceFloatMismatchError). I did not see the netCDF error you reported. I did not run ./build_test.sh since the version I have only runs the rrfs-test ctests, so I just went into build/mpas-jedi manually and ran ctest. My test log on Jet is at /lfs5/BMC/wrfruc/Samuel.Degelia/RDASApp_dev/RDASApp/build/mpas-jedi/Testing/Temporary/LastTestsFailed.log.

Since the oops::TestReferenceFloatMismatchError will likely be resolved by the BUMP update, I think that these ctest failures are okay for now.

Note: Saber #928 was merged today so I can work on updating Saber in RDASApp and testing if it resolves these ctest failures.

@guoqing-noaa
Collaborator Author

@SamuelDegelia-NOAA Thanks for testing on Jet. I can repeat my test on Jet to see what I will get.

Glad that Saber #928 was merged. But I don't think we need to address that issue in the PR. It works better if you create a new issue and PR for that. Thanks!

@SamuelDegelia-NOAA
Contributor

@guoqing-noaa Yes that was my plan - I will have a new issue and PR to update Saber.

@guoqing-noaa
Collaborator Author

@SamuelDegelia-NOAA
My new run on Jet is still missing the bumploc files.
/lfs5/BMC/wrfruc/gge/tmp/rdas_build_test/RDASApp_guoqing-noaa_reduce_copying/build/mpas-jedi/test/Data/bump

Did you save the log from the build process?

@guoqing-noaa
Collaborator Author

Ah, I found the reason. The bumploc files are generated by other tests, so those have to run first. I should not run the five failed tests on their own without running the other tests.
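
The ordering dependency can be guarded against with a small check before rerunning only the failed tests. This is a minimal sketch, assuming the files live under test/Data/bump inside build/mpas-jedi as in the paths above; both function names are hypothetical, and which tests actually generate the files is not established in this thread, so the sketch simply runs the whole suite when the files are absent.

```shell
# Hypothetical helper: succeed if at least one NetCDF file exists under <test_dir>/Data/bump.
bump_files_present() {
  local test_dir="$1" f
  for f in "${test_dir}"/Data/bump/*.nc; do
    [ -e "$f" ] && return 0
  done
  return 1
}

# Hypothetical helper: run the full suite first when the BUMP files are missing,
# then rerun only the tests that failed in the previous run.
rerun_failed_with_prereqs() (
  cd "$1" || exit 1                      # $1: e.g. /path/to/RDASApp/build/mpas-jedi
  bump_files_present test || ctest       # full run generates the prerequisite files
  ctest --rerun-failed --output-on-failure
)
```

Usage would look like `rerun_failed_with_prereqs /path/to/RDASApp/build/mpas-jedi`.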

@SamuelDegelia-NOAA
Contributor

That makes sense, thanks for the update!

@SamuelDegelia-NOAA
Contributor

As mentioned in #163, four of these ctests now pass after updating saber and crtm. The test_mpasjedi_lgetkf_height_vloc test still fails with a reference mismatch error. The mismatch exceeds the threshold only for the wind analysis variance; everything else is very close.

There are still some submodule differences between RDASApp and the versions used to generate the mpas-jedi ctest references, and these likely explain the small mismatch. One difference is the MPAS model version. However, I cannot build RDASApp with the latest MPAS version: it gives an error when compiling mpas_geom_mod.F90.

Since we use different library versions from the mpas-jedi repo, we should not expect all of the ctests to pass. I think we can be okay with this, as long as we understand why the ctests fail. And since this is a small reference mismatch error (not a failure due to missing files, etc.), I think we can leave this be.
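
For readers unfamiliar with this kind of failure, a relative-tolerance check is one common form of reference comparison. The sketch below is illustrative only and is not the oops implementation; the function name within_rel_tol and the tolerance in the usage line are assumptions.

```shell
# Hypothetical helper: succeed when |test - ref| <= tol * |ref|.
# $1: test value, $2: reference value, $3: relative tolerance
within_rel_tol() {
  awk -v t="$1" -v r="$2" -v tol="$3" 'BEGIN {
    d = t - r; if (d < 0) d = -d      # absolute difference
    a = r;     if (a < 0) a = -a      # magnitude of the reference
    exit !(d <= tol * a)              # exit status 0 means within tolerance
  }'
}
```

Usage would look like `within_rel_tol "$test_value" "$ref_value" 1e-6 && echo close`. A value that is "very close" passes, while one that drifts past the tolerance fails even though the run itself completed.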

@guoqing-noaa
Collaborator Author

Thanks, @SamuelDegelia-NOAA! Great progress!

We can leave this issue open since we will update all subcomponents soon to use mpasjedi v3.0. We can revisit this issue at that time.
