System Testing Guide
If you are new to system testing with create_test
, we recommend you read this whole
guide linearly. You may find that you never run create_test
directly, instead relying
on CTSM's run_sys_tests
wrapper script. However, it is still helpful to know about
create_test
, since run_sys_tests
is just a wrapper to that underlying script.
If you want to jump right in with running a test suite, and/or if you already understand the CESM/CIME test system well, you can jump down to Running test suites with the run_sys_tests wrapper.
If you know all about how to use these testing tools, but have just been asked to act as an integrator – doing final testing before bringing a branch to master – you can jump down to Notes for integrators.
System tests are useful for:
- Verifying various requirements
  - Runs to completion
  - Restarts bit-for-bit
  - Results independent of processor count
  - Threading
  - Compilation with debug flags, e.g., to pick up:
    - Array bounds problems
    - Floating point errors
  - And other specialty tests (e.g., init_interp)
- Verifying those requirements across a wide range of model configurations (e.g., making sure CTSM still works when you turn on prognostic crops)
- Making sure that you haven't introduced a bug that changes answers in some configurations, when answer changes are unexpected
  - This is one of the most powerful aspects of the system tests
  - For this to work best, you should try to separate your changes into:
    - Bit-for-bit refactoring
    - Answer-changing modifications that are as small as possible, and so can be carefully reviewed
The cime test system runs tests that involve:
- Doing one or more runs of the model and verifying that they run to completion. (If there was more than one run in a single test, then there is some change in configuration between the different runs.)
- If there was more than one run in a single test, then comparing those runs to ensure they were bit-for-bit identical as expected.
- If desired, comparing results with existing baselines (to ensure results are bit-for-bit the same as before), and/or generating new baselines.
- Providing final test results: An overall PASS/FAIL as well as PASS/FAIL status for individual parts of the test.
A test name looks like this; bracketed components are optional:
Testtype[_Testopt].Resolution.Compset.Machine_Compiler[.Testmod]
(There may be more than one Testopt, separated by underscores.)
Notice that this string specifies all required options to
create_newcase
.
An example is:
SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default
Testtype
: code specifying the type of test to run; common test types
are given in the table below. The SMS
test in the above example is a
basic smoke test.
Testopt
: One or more options modifying some high-level configuration
options. In the above example, we are compiling in debug mode (_D
),
and running for 3 days (_Ld3
).
Resolution
: Any resolution that you would typically specify via the
--res
argument to create_newcase
. In the above example, we are
using the f10_f10_musgs
resolution, which is a coarse-resolution
global grid that is good for testing.
Compset
: Any compset that you would typically specify via the
--compset
option to create_newcase
. In the above example, we are
using the I1850Clm50BgcCrop
compset.
Machine
: The name of the machine you are running on (cheyenne
in
the above example).
Compiler
: The name of the compiler to use (intel
in the above
example).
Testmod
: A directory containing arbitrary user_nl_*
contents and
xmlchange
commands. See below for more details.
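As a rough illustration of the naming scheme described above, the following sketch splits a test name into its components. The function name parse_test_name is hypothetical (it is not part of CIME), and the sketch assumes that no component contains a dot and that the compiler name follows the last underscore of the Machine_Compiler component.

```shell
# Split a test name of the form
#   Testtype[_Testopt].Resolution.Compset.Machine_Compiler[.Testmod]
# into its parts. Hypothetical helper, not part of CIME.
# Assumes no component contains a dot and that the compiler name follows
# the last underscore of the Machine_Compiler component.
# The body runs in a subshell so the IFS and globbing changes stay local.
parse_test_name() (
    full=$1
    IFS=.
    set -f            # no filename expansion while word-splitting
    set -- $full      # split on dots into $1..$N
    if [ "$#" -lt 4 ] || [ "$#" -gt 5 ]; then
        echo "unexpected test name: $full" >&2
        exit 1
    fi
    case $1 in
        *_*) testtype=${1%%_*}; testopts=${1#*_} ;;
        *)   testtype=$1;       testopts= ;;
    esac
    printf 'testtype=%s\n'   "$testtype"
    printf 'testopts=%s\n'   "$testopts"
    printf 'resolution=%s\n' "$2"
    printf 'compset=%s\n'    "$3"
    printf 'machine=%s\n'    "${4%_*}"
    printf 'compiler=%s\n'   "${4##*_}"
    printf 'testmod=%s\n'    "${5:-}"
)

parse_test_name SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default
```

For the example name, this reports SMS as the test type, D_Ld3 as the test options, cheyenne as the machine, intel as the compiler and clm-default as the testmod.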
The following are the most commonly used test types and their meaning:
SMS
: Basic smoke test: Just does a single run
ERS
: Exact restart test: Compares two runs, ensuring that they give
bit-for-bit identical results:
1. Straight-through run, which writes a restart file just over half-way through
2. Restart run starting from the restart file written by (1)
ERP
: Exact restart with changed processor count: This covers the
exact restart functionality of the ERS
test, and also halves the
processor count in run (2). In addition, if multiple threads are used,
it also halves the thread count in run (2). Thus, in addition to
ensuring that restarts are bit-for-bit, it also ensures that answers do
not depend on processor count, and optionally that answers do not depend
on threading. This is nice in that a single test can verify a few of our
most important system requirements. However, when the test fails, it can
sometimes be harder to track down the cause for the problem. (To debug a
failed ERP test, you can run the same configuration in an ERS, PEM
and/or PET test.)
The following are the most commonly used test options (optional strings
appearing after the test type, separated by _
):
_D
: Compile in debug mode. Exactly what this does depends on the
compiler. Typically, this turns on checks for array bounds and various
floating point traps. The model will run significantly slower with this
option.
_L
: Specifies the length of the run. The default for most tests is 5
days. Examples are _Ld3
(3 days), _Lm6
(6 months), and _Ly5
(5 years).
_P
: Specifies the processor count of the run. Syntax is _PNxM
where N
is the number of tasks and M
is the number of threads
per task. For example, _P32x2
runs with 32 tasks and 2 threads per
task. Default layouts of standalone CTSM all have just 1 thread per
task, but the ability to run with threading (and get bit-for-bit
identical answers) is an important requirement. Thus, many of our tests
(and particularly ERP tests) specify processor layouts that use 2
threads per task.
Few CTSM tests simply run an out-of-the-box compset without any other modifications. Testmods provide a facility to make arbitrary changes to xml and namelist variables for this particular test. They typically serve two purposes:
- Adding more frequent history output, additional history streams, and/or additional history variables. The more frequent history output is particularly important, since otherwise a short (e.g., 5-day) test would not produce any CTSM diagnostic output (since the default output frequency is monthly).
- Making configuration changes specific to this test, such as turning on a non-default parameterization option.
Testmods directories are assumed to be in
cime_config/testdefs/testmods_dirs
. Dashes are used in place of
slashes in the path relative to that directory. So a testmod of
clm-default
is found in
cime_config/testdefs/testmods_dirs/clm/default/
.
Testmods directories can contain three types of files:
- user_nl_* files: The contents of these files are copied into the appropriate user_nl file (e.g., user_nl_clm) in the case directory. This allows you to set namelist options.
- shell_commands: This file can contain xmlchange commands that change the values of xml variables in the case.
- include_user_mods: Often you want a testmod that is basically the same as some other testmod, but with a few extra changes. For example, many of our testmods use the default testmod as a starting point, then add a few things on top of that. include_user_mods allows you to set up these relationships without resorting to unmaintainable copy & paste. This file contains the relative path to another testmod directory to include; for example, its contents may be: ../default
  First, the user_nl_* and shell_commands contents from the included testmod are applied, then the contents from the current testmod are applied. (So changes from the current testmod take precedence in case of conflicts.)
  These includes are applied recursively, if you include a directory that itself has an include_user_mods file. Also, in principle, an include_user_mods file can include multiple testmods (one per line), but in practice we rarely do that, because it tends to be more confusing than helpful.
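The name-to-directory mapping and the include ordering described above can be sketched as follows. Both helper functions and the testmod name clm-myExample are hypothetical, made up for illustration; this is not how CIME itself implements the lookup, only a small model of the documented behavior.

```shell
# Hypothetical helper: map a testmod name such as clm-default to its
# directory under a given testmods_dirs root (dashes stand for slashes).
testmod_dir() {
    printf '%s/%s\n' "$1" "$(printf '%s\n' "$2" | sed 's|-|/|g')"
}

# Hypothetical helper: print testmod directories in the order their contents
# would be applied: included testmods first (recursively), then the testmod
# itself. The body runs in a subshell so recursion does not clobber variables.
testmod_apply_order() (
    dir=$(cd "$1" && pwd) || exit 1
    if [ -f "$dir/include_user_mods" ]; then
        while IFS= read -r rel || [ -n "$rel" ]; do
            if [ -n "$rel" ]; then
                testmod_apply_order "$dir/$rel"
            fi
        done < "$dir/include_user_mods"
    fi
    printf '%s\n' "$dir"
)

# Demo on a throwaway directory tree (the myExample testmod is made up).
demo_root=$(mktemp -d)/testmods_dirs
mkdir -p "$demo_root/clm/default" "$demo_root/clm/myExample"
printf '../default\n' > "$demo_root/clm/myExample/include_user_mods"
testmod_apply_order "$(testmod_dir "$demo_root" clm-myExample)"
```

The demo prints the clm/default directory first and clm/myExample second, matching the rule that the current testmod is applied last and so wins in case of conflicts.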
Running a single test is as simple as doing the following from
cime/scripts
:
./create_test TESTNAME
For example:
./create_test SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default
In contrast to create_newcase
, create_test
automatically runs
case.setup
, case.build
and case.submit
for you - so that
single create_test
command will build and run your case.
A full list of possible options to create_test
can be viewed by
running create_test -h
. Here are some of the most useful options:
- -r /path/to/test/root: By default, the test's case directory is placed in the directory given by CIME_OUTPUT_ROOT (e.g., /glade/scratch/$USER on cheyenne). This has the benefit that the bld and run directories are nested under the case directory. However, if your scratch space is cluttered, this can make it hard to find your test cases later. If you specify a different directory with the -r (or --test-root) option, your test cases will appear there, instead. Specifying -r . will put your test cases in the current directory (analogous to the operation of create_newcase). This option is particularly useful when running large test suites: We often find it useful to put all tests within a given test suite within a subdirectory of CIME_OUTPUT_ROOT - for example, -r /glade/scratch/$USER/HELPFULLY_NAMED_SUBDIRECTORY.
- --walltime HH:MM: By default, the maximum queue wallclock time for each test is generally the maximum allowed for the machine. Since tests are generally short, using this default may result in your jobs sitting in the queue longer than is necessary. You can use the --walltime option to specify a shorter queue wallclock time, thus allowing your jobs to get through the queue faster. However, note that all tests will use the same maximum walltime, so be sure to pick a time long enough for the longest test in a test suite. (Note: If you are running a full test suite with the xml options documented below, walltime limits may already be specified on a per-test basis. However, as of the time of this writing, this capability is not yet used for the CTSM test suites.)
As a test runs through its various phases (setup, build, run, etc.), it
updates a file named TestStatus
in the test's case directory. After
a test completes, a typical TestStatus
file will look like this:
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default CREATE_NEWCASE
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default XML
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SETUP
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SHAREDLIB_BUILD time=175
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default NLCOMP
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default MODEL_BUILD time=96
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SUBMIT
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default RUN time=606
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default COMPARE_base_rest
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default BASELINE ctsm_n11_clm4_5_16_r249
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default TPUTCOMP
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default MEMLEAK insuffiencient data for memleak test
(This is from a test that had comparisons with baselines, which we have not described yet.)
The three possible status codes you may see are:
- PASS: This phase finished successfully
- FAIL: This phase finished with an error
- PEND: This phase is currently running, or has not yet started. (If a given phase is listed as PEND, subsequent phases may not be listed yet in the TestStatus file.)
By the time a test completes, you should typically see all PASS
status values to indicate that the test completed successfully. However,
we often ignore FAIL
values for TPUTCOMP
and MEMCOMP
(which
compare throughput and memory usage with the baseline), because system
variability can cause these to fail even when there isn't a real
problem.
More detailed test output can be found in the file named
TestStatus.log
in the test's case directory. This is the first place
you should look if a test has failed.
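When scanning a TestStatus file by hand becomes tedious, a small filter can help. The sketch below is a hypothetical helper (non_pass_phases is not a CIME tool); it assumes each line has the form STATUS TESTNAME PHASE as in the example above, and it skips the TPUTCOMP and MEMCOMP phases that are often ignored.

```shell
# Hypothetical helper: list the phases in a TestStatus file whose status is
# not PASS, skipping the performance phases TPUTCOMP and MEMCOMP.
# Assumes each line has the form: STATUS TESTNAME PHASE [extra info]
non_pass_phases() {
    awk 'NF >= 3 && $1 != "PASS" && $3 != "TPUTCOMP" && $3 != "MEMCOMP" {
             print $3 ": " $1
         }' "$1"
}

# Demo on a made-up TestStatus file.
demo_status=$(mktemp)
cat > "$demo_status" <<'EOF'
PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default RUN time=606
FAIL ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default COMPARE_base_rest
FAIL ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default TPUTCOMP
EOF
non_pass_phases "$demo_status"
```

Here only the COMPARE_base_rest failure is reported; the TPUTCOMP failure is filtered out.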
Many test types perform two runs and then compare the output from the
two, expecting bit-for-bit identical output. For example, an ERS
test compares a straight-through run with a restart run. The comparison
is done by comparing the last set of history files from each run. (If,
for example, there are h0 and h1 history files, then this will compare
both the last h0 file and the last h1 file.) These comparisons are done
via a custom tool named cprnc
, which compares each field and, if
differences are found, computes various statistics on these differences.
If any one of these comparisons fails, you will see a line like:
FAIL ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default COMPARE_base_rest
As usual, more details can be found in TestStatus.log
, where you
will find output like this:
2017-09-26 10:10:24: Comparing hists for case 'ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud' dir1='/glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run', suffix1='base', dir2='/glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run' suffix2='rest'
  comparing model 'datm'
    no hist files found for model datm
  comparing model 'clm'
    /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.clm2.h0.0001-01-04-00000.nc.base did NOT match /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.clm2.h0.0001-01-04-00000.nc.rest
    cat /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.clm2.h0.0001-01-04-00000.nc.base.cprnc.out
    /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.clm2.h1.0001-01-04-00000.nc.base did NOT match /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.clm2.h1.0001-01-04-00000.nc.rest
    cat /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.clm2.h1.0001-01-04-00000.nc.base.cprnc.out
  comparing model 'sice'
    no hist files found for model sice
  comparing model 'socn'
    no hist files found for model socn
  comparing model 'mosart'
    no hist files found for model mosart
  comparing model 'cism'
    no hist files found for model cism
  comparing model 'swav'
    no hist files found for model swav
  comparing model 'cpl'
    /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.cpl.hi.0001-01-04-00000.nc.base did NOT match /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.cpl.hi.0001-01-04-00000.nc.rest
    cat /glade/scratch/sacks/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud/run/ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.cpl.hi.0001-01-04-00000.nc.base.cprnc.out
FAIL
Notice the lines that say did NOT match
. Also notice the lines
pointing you to various *.cprnc.out
files. (For convenience,
*.cprnc.out
files from failed comparisons are also copied to the
case directory.) These output files from cprnc
contain a lot of
information. Most of what you need, though, can be determined via:
- Examining the last 10 or so lines:
  $ tail -10 ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.clm2.h0.0001-01-04-00000.nc.base.cprnc.out
  SUMMARY of cprnc:
   A total number of 487 fields were compared
   of which 340 had non-zero differences
   and 0 had differences in fill patterns
   and 0 had different dimension sizes
   A total number of 2 fields could not be analyzed
   A total number of 0 fields on file 1 were not found on file2.
   diff_test: the two files seem to be DIFFERENT
- Looking for lines referencing RMS errors:
  $ grep RMS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default.20170926_095505_cqrqud.clm2.h0.0001-01-04-00000.nc.base.cprnc.out
  RMS ACTUAL_IMMOB 3.4138E-11 NORMALIZED 1.1947E-04
  RMS AGNPP 3.9135E-14 NORMALIZED 1.0836E-08
  RMS AR 1.4793E-10 NORMALIZED 1.2585E-05
  RMS BAF_PEATF 6.9713E-23 NORMALIZED 2.4249E-12
  RMS BGNPP 3.2774E-14 NORMALIZED 9.1966E-09
  RMS BTRAN2 2.5167E-07 NORMALIZED 2.7111E-07
  RMS BTRANMN 2.5532E-07 NORMALIZED 6.0307E-07
  RMS CH4PROD 1.3658E-15 NORMALIZED 7.5109E-08
  RMS CH4_SURF_AERE_SAT 6.6191E-12 NORMALIZED 1.6114E-04
  RMS CH4_SURF_AERE_UNSAT 1.2635E-22 NORMALIZED 5.1519E-13
  ...
Notice that this lists all fields that differ, along with their RMS and normalized RMS differences.
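When many fields differ, it can help to find the field with the largest normalized RMS difference first. The following sketch does that with awk; the function name largest_rms_diff is hypothetical, and the sketch assumes RMS lines of the form "RMS FIELD value NORMALIZED value" as in the output above.

```shell
# Hypothetical helper: report the field with the largest normalized RMS
# difference in a cprnc output file.
# Assumes lines of the form: RMS <field> <rms> NORMALIZED <normalized rms>
largest_rms_diff() {
    awk '$1 == "RMS" && $4 == "NORMALIZED" {
             if (!seen || $5 + 0 > max) {
                 max = $5 + 0; name = $2; raw = $5; seen = 1
             }
         }
         END { if (seen) print name, raw }' "$1"
}

# Demo on a few of the lines shown above.
demo_cprnc=$(mktemp)
cat > "$demo_cprnc" <<'EOF'
 RMS ACTUAL_IMMOB 3.4138E-11 NORMALIZED 1.1947E-04
 RMS AGNPP 3.9135E-14 NORMALIZED 1.0836E-08
 RMS CH4_SURF_AERE_SAT 6.6191E-12 NORMALIZED 1.6114E-04
EOF
largest_rms_diff "$demo_cprnc"
```

For these three lines, the function reports CH4_SURF_AERE_SAT with a normalized RMS of 1.6114E-04.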
It is often useful to run multiple tests at once (i.e., a test suite), covering different test types, different compsets, different compilers, etc.
This can be done by simply listing each test on the create_test
command-line, as in:
./create_test SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default
However, it is often more convenient to create a file listing each of the tests you want to run. This way you can easily run the same test suite again later.
To do this, simply create a text file containing your test list, with one test per line:
SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default
ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default
Then run create_test
with the -f
(or --testfile
) option:
./create_test -f TESTFILE
(where TESTFILE
gives the path to the file you just created).
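If several tests share the same resolution, compset, machine, compiler and testmod, the test list file can be generated rather than typed. This is a minimal sketch; the function name make_test_list is hypothetical, and the create_test call is left as a comment because it only makes sense inside a CIME checkout.

```shell
# Hypothetical helper: write one test name per line, combining each given
# test type (with its options) with a shared configuration string.
make_test_list() {
    config=$1
    shift
    for testtype in "$@"; do
        printf '%s.%s\n' "$testtype" "$config"
    done
}

demo_testlist=$(mktemp)
make_test_list f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default \
    SMS_D_Ld3 ERS_D_Ld3 > "$demo_testlist"
cat "$demo_testlist"

# From cime/scripts, the file could then be passed to create_test:
#   ./create_test -f "$demo_testlist"
```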
The -r
and --walltime
options described in Options to
create_test are useful here, too. The -r
option is particularly
helpful for putting all of the tests in the test suite together in their
own directory.
You can check the individual TestStatus
files in each test of your
test suite, but that gets old pretty quickly. An easier way to check the
results of a test suite is to run the cs.status.TESTID
command that
is put in your test root (where TESTID
is the unique id that was
used for this test suite).
If you run this cs.status
command, you will see output like the following:
20170926_093725_gq431o
  ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default (Overall: PASS) details:
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default CREATE_NEWCASE
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default XML
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SETUP
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SHAREDLIB_BUILD time=175
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default NLCOMP
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default MODEL_BUILD time=96
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SUBMIT
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default RUN time=606
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default COMPARE_base_rest
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default BASELINE ctsm_n11_clm4_5_16_r249
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default TPUTCOMP
    PASS ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default MEMLEAK insuffiencient data for memleak test
  SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default (Overall: PASS) details:
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default CREATE_NEWCASE
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default XML
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SETUP
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SHAREDLIB_BUILD time=16
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default NLCOMP
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default MODEL_BUILD time=202
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default SUBMIT
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default RUN time=374
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default BASELINE ctsm_n11_clm4_5_16_r249
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default TPUTCOMP
    PASS SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default MEMLEAK insuffiencient data for memleak test
This aggregates the results of all of the tests in the test suite, and also gives an Overall PASS or FAIL result for each test. Reviewing this output manually can be tedious, so some options can help you filter the results:
- The -f / --fails-only option to cs.status allows you to see only test failures
- The --count-performance-fails option suppresses line-by-line output for performance comparisons that often fail due to machine variability; instead, this just gives a count of the number of non-PASS results (FAIL or PEND) at the bottom.
- The -c PHASE / --count-fails PHASE option can be used to suppress line-by-line output for the given phase (e.g., NLCOMP or BASELINE), instead just giving a count of the number of non-PASSes (FAILs or PENDs) for that phase. This is useful when you expect failures for some phases (often, phases related to baseline comparisons). This option can be specified multiple times.
So a typical use of cs.status.TESTID
will look like this:
./cs.status.20170926_093725_gq431o -f --count-performance-fails
or, if you expect NLCOMP and BASELINE failures:
./cs.status.20170926_093725_gq431o -f --count-performance-fails -c NLCOMP -c BASELINE
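For a quick tally across a large suite, the overall results can also be counted from the cs.status output. The sketch below is a hypothetical filter (count_overall is not part of CIME); it assumes each test's summary line contains the text "(Overall: STATUS)" as in the output shown earlier.

```shell
# Hypothetical filter: count tests by overall status, reading cs.status
# output on stdin. Assumes summary lines contain "(Overall: STATUS)".
count_overall() {
    awk 'match($0, /\(Overall: [A-Z]+\)/) {
             # Skip the 10-character "(Overall: " prefix and the closing ")".
             status = substr($0, RSTART + 10, RLENGTH - 11)
             count[status]++
         }
         END { for (s in count) print s, count[s] }' | sort
}

# Demo on made-up summary lines; in practice the input would come from
#   ./cs.status.TESTID | count_overall
printf '%s\n' \
    '  ERS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default (Overall: PASS) details:' \
    '  SMS_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default (Overall: FAIL) details:' \
    '  ERP_D_Ld3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-default (Overall: PASS) details:' |
    count_overall
```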
In addition to running your own individual tests or test suites, you can
also use create_test
to run a pre-defined test suite. Most CESM
components have a policy that a particular test suite must be run before
changes can be merged back to the master branch. These test suites are
defined in xml files in each component.
To determine what pre-defined test suites are available and what tests
they contain, you can run cime/scripts/query_testlists
(run
query_testlists -h
for usage information).
Test suites are retrieved in create_test
via three selection
attributes:
- The test category, specified with --xml-category (e.g., --xml-category aux_clm; see Test categories for other options)
- The machine, specified with --xml-machine (e.g., --xml-machine cheyenne)
- The compiler, specified with --xml-compiler (e.g., --xml-compiler intel) (although it's also possible to leave this out and run all tests for this category and machine in a single test suite)
So a component's testing policy may state something like: You must run
the tests from the aux_clm
category for these machine/compiler
combinations: cheyenne/intel, cheyenne/gnu, hobart/nag and hobart/pgi.
So, for example, to run the subset of the aux_clm
test suite that
runs on cheyenne with the intel compiler, you can run:
./create_test --xml-category aux_clm --xml-machine cheyenne --xml-compiler intel
The -r
option described in Options to create_test is particularly
useful here for putting all of the tests in the test suite together in
their own directory.
create_test
uses multiple threads aggressively to speed up the process of setting up
and building all of the cases in your test suite. On a shared system, this can turn you
into a bad neighbor and get you in trouble with your system administrator. If possible,
you should submit the create_test job to a compute node rather than running it on the
login node. CTSM's run_sys_tests
command does this automatically for you on our
main test machines; see Running test suites with the run_sys_tests wrapper for
details.
If you can't build the test suite on compute nodes, here are some helpful tips on running large test suites on the login node:
- It's a good idea to run create_test with the unix nohup command in case you lose your connection.
- Run create_test with the unix nice command to give it a lower scheduling priority
- Specify a smaller number of parallel jobs via the --parallel-jobs option to create_test (the default is the number of cores available on a single node of the machine)
Putting this all together, a typical create_test
command for running
a pre-defined test suite might look like this:
nohup nice -n 19 ./create_test --xml-category aux_clm --xml-machine cheyenne --xml-compiler intel -r /glade/scratch/$USER/HELPFULLY_NAMED_SUBDIRECTORY --parallel-jobs 6
Testing that various configurations run to completion and that given variations are bit-for-bit with each other can only take you so far. The strongest tool we have for determining that your changes haven't broken anything is baseline comparison: comparing the output from the current version of the code against the output from a previous version to determine if answers have changed at all in the new version.
Depending on what you have changed, you may expect:
1. No answer changes, e.g., if you are doing an answer-preserving code refactoring, or adding a new option but not changing anything with respect to existing options
2. Answers change only for certain configurations, e.g., if you change CTSM-crop code, but don't expect any answer changes for runs without the crop model
3. Answers change for most or all configurations, but only in a few diagnostic fields that don't feed back to the rest of the system
4. Answers change for most or all configurations
You may think that most changes fall into (4). With some care, however, it is often possible to separate large changes to the model science into:
- Bit-for-bit modifications that can be tested against baselines - e.g., renaming variables and moving code around, either before or after your science changes
- Answer-changing modifications; try to make these as small as possible (in terms of lines of code changed) so that they can be more easily reviewed for correctness.
You should then run the test suite separately on these two classes of changes, ensuring that the parts of the change that you expect to be bit-for-bit truly are bit-for-bit. The effort it takes to do this separation pays off in the increased confidence that you haven't introduced bugs.
First, you need to determine what to use as a baseline. Generally this is the version of master from which you have branched, or a previous, well-tested version of your branch.
If you're comparing against a version of master and have access to the
main development machine(s) for the given component, then baselines may
already exist. (e.g., on cheyenne, baselines go in
/glade/p/cgd/tss/ctsm_baselines
by default). Otherwise, you'll need to
generate your own baselines.
If you need to generate baselines, you can do so by:
- Checking out the baseline code version
- Running create_test from the baseline code with these options:
  - --baseline-root /PATH/TO/BASELINE/ROOT: Specifies the directory in which baselines should be placed. This is optional, but is needed if you don't have write access to the default baseline location on this machine.
  - --generate GENERATE_NAME: Specifies a name for these baselines. Baselines for individual tests are placed under /PATH/TO/BASELINE/ROOT/GENERATE_NAME. For example, this could be a tag name or an abbreviated git sha-1.
- If you're generating baselines for a full test suite (as opposed to just one or a few tests of your choosing), you may have to run multiple create_test invocations, possibly on different machines, in order to generate a full set of baselines. Each component has its own policies regarding the test suite that should be run for baseline comparisons.
After the test suite finishes, you can check the results as normal. Now,
though, you should see an extra line in the TestStatus
files or the
output from cs.status
, labeled GENERATE
. A PASS
status for
this phase indicates that files were successfully copied to the baseline
directory. You can confirm this by looking through
/PATH/TO/BASELINE/ROOT/GENERATE_NAME
: There should be a directory
for each test in the test suite, containing history files, namelist
files, etc.
Comparison against baselines is done similarly to generation (as
described in Baseline comparisons step 2: Generate baselines, if
needed), but now you should use the -c / --compare COMPARE_NAME
flag to
create_test
. You should still specify --baseline-root
/PATH/TO/BASELINE/ROOT
. You can optionally specify --generate
GENERATE_NAME
, but if you do, make sure that GENERATE_NAME
differs
from COMPARE_NAME
! (In this case, create_test
will compare
against some previous baselines while also generating new baselines for
later use.)
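Because accidentally using the same name for comparison and generation is easy to do, a small guard can be placed in front of create_test. The sketch below only assembles and prints the baseline-related arguments; the function name baseline_args and the generate name in the demo are hypothetical, while the flags themselves are the create_test options described above.

```shell
# Hypothetical helper: build the baseline-related create_test arguments,
# refusing to proceed if the compare and generate names are identical.
# Usage: baseline_args BASELINE_ROOT COMPARE_NAME [GENERATE_NAME]
# (pass an empty string for COMPARE_NAME to generate only)
baseline_args() {
    root=$1
    compare=$2
    generate=${3:-}
    if [ -n "$compare" ] && [ "$compare" = "$generate" ]; then
        echo "GENERATE_NAME must differ from COMPARE_NAME" >&2
        return 1
    fi
    args="--baseline-root $root"
    if [ -n "$compare" ]; then
        args="$args --compare $compare"
    fi
    if [ -n "$generate" ]; then
        args="$args --generate $generate"
    fi
    printf '%s\n' "$args"
}

# Demo: compare against existing baselines while generating new ones.
# (my_branch_abc1234 is a made-up generate name.)
baseline_args /PATH/TO/BASELINE/ROOT ctsm_n11_clm4_5_16_r249 my_branch_abc1234
```

The printed string could then be appended to a create_test command line.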
After the test suite finishes, you can check the results as normal. Now,
though, you should see an extra line in the TestStatus
files or the
output from cs.status
, labeled BASELINE
. A PASS
status for
this phase indicates that all history file types were bit-for-bit
identical to their counterparts in the given baseline directory. (For
each history file type - e.g., cpl hi, clm h0, clm h1, etc. -
comparisons are just done for the last history file of that type.)
Checking the results of failed baseline comparisons is similar to
checking the results of failed in-test comparisons. See Finding more
details on failed comparisons for details. However, whereas failed
in-test comparisons are put in a file named *.nc.base.cprnc.out
,
failed baseline comparisons are put in a file named *.nc.cprnc.out
(without the base
; yes, this is a bit counter-intuitive).
If you expect differences in just a small number of tests or a small
number of diagnostic fields, you can confirm that the differences in the
baseline comparisons are just what you expected. The tool
cime/CIME/non_py/cprnc/summarize_cprnc_diffs
facilitates this; run
cime/CIME/non_py/cprnc/summarize_cprnc_diffs -h
for details.
In addition to the baseline comparisons of history files, comparisons are also performed for:
- Namelists (NLCOMP). For details on an NLCOMP failure, see TestStatus.log
- Model throughput (TPUTCOMP). However, note that system variability can cause this to fail even when there isn't a real problem.
It sometimes happens that you want to generate or compare baselines from an already-run test suite. Some reasons this may happen are:
- You forgot to specify --generate or --compare when you ran the test suite.
- You wanted to wait to see if the test suite was successful before generating baselines.
- You ran baseline comparisons against one set of baselines, but now want to run comparisons against a different set of baselines.
There are two complementary tools for doing this:
- cime/CIME/Tools/bless_test_results: after-the-fact baseline generation
- cime/CIME/Tools/compare_test_results: after-the-fact baseline comparison
The usage messages for these are a bit confusing, due to the different
workflows used in ACME vs. CESM. A typical usage of
compare_test_results
for CESM would look like this:
./compare_test_results -b BASELINE_NAME --baseline-root BASELINE_ROOT -r TEST_ROOT -t TEST_ID
where:
- -b BASELINE_NAME (or --baseline-name BASELINE_NAME) corresponds to --compare COMPARE_NAME for create_test
- --baseline-root corresponds to the same argument for create_test
- -r (or --test-root) corresponds to the same argument for create_test
- -t TEST_ID (or --test-id TEST_ID) is either the test-id you specified with the -t (or --test-id) argument to create_test, or the auto-generated test-id that was appended to each of your tests (a date and time stamp followed by a string of random characters)
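If you script after-the-fact comparisons, it can help to assemble the command from named values. The following is a minimal Python sketch that only builds the argument list and does not run anything. The function name and all example values (baseline name, paths, test ID) are hypothetical; the flags are the ones listed above.

```python
import shlex


def compare_test_results_cmd(baseline_name, baseline_root, test_root, test_id):
    """Build the argument list for a CESM-style compare_test_results call."""
    return [
        "./compare_test_results",
        "-b", baseline_name,               # same meaning as --compare for create_test
        "--baseline-root", baseline_root,
        "-r", test_root,
        "-t", test_id,
    ]


# Hypothetical values, for illustration only.
cmd = compare_test_results_cmd(
    "ctsm_tag_a", "/my/baselines", "/my/tests", "0101-120000ch"
)
print(shlex.join(cmd))
```

Building a list rather than a single string means the result can be passed straight to subprocess.run without shell quoting concerns.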
To make it easier and less error-prone to run a suite of system tests, we have put
together the run_sys_tests script, which can be found at the top level of a CTSM
checkout. This is a wrapper to one or more invocations of create_test, so all of the
above information still applies.
Major benefits of using this wrapper script are:
- You don't need to know the set of compilers a test suite is defined for on our main test machines: just run this wrapper script and it will run create_test on all defined compilers for the given test suite and machine.
- On our main test machines, multiple create_test invocations are submitted as separate jobs to the compute nodes.
- Sensible defaults are chosen for the testroot directory and test ID of a test suite. In addition, a symbolic link is made from the current directory to the testroot directory, making it easier to find the testroot directory later.
- Custom cs.status scripts are created that add arguments to aggregate across all tests in a full test suite and filter out pass results.
- Extra error-checking is done, such as making you explicitly state whether you want to compare against and/or generate baselines.
- Useful information is output to both the screen and a file in the test directory (SRCROOT_GIT_STATUS) giving a variety of git information about your current directory.
The primary purpose of this script is to assist with running full test suites, such as the
aux_clm and clm_short test suites (via the -s / --suite-name argument). However, it can
also be used to run individual tests (via the -t / --testname argument) or all tests
listed in a plain text file (via the -f / --testfile argument).
Typical usage of this script is simply:
./run_sys_tests -s SUITE_NAME -c COMPARE_NAME -g GENERATE_NAME [--baseline-root /PATH/TO/BASELINE/ROOT]
For example, to run the aux_clm test suite, replace SUITE_NAME with aux_clm (similarly
for clm_short, fates, etc.; see Test categories for other options). This automatically
detects the machine and launches the appropriate components of the given test suite on
that machine. A symbolic link will be created in the current directory pointing to the
testroot directory that contains all of the test directories in the test suite. (The
path to this directory is also output to the screen.)
Note that the -c / --compare and -g / --generate arguments are required, unless you
specify --skip-compare and/or --skip-generate.
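The requirement above can be captured in a small helper if you generate run_sys_tests commands from a script. This is a minimal Python sketch, not part of run_sys_tests itself: the function name is hypothetical, and treating a name and its matching skip option as mutually exclusive is an assumption of the sketch, not something the text states. It only builds the argument list.

```python
def run_sys_tests_args(suite, compare=None, generate=None,
                       skip_compare=False, skip_generate=False,
                       baseline_root=None):
    """Build a run_sys_tests argument list, insisting on an explicit choice
    for both baseline comparison and baseline generation."""
    # Assumption of this sketch: exactly one of (name, skip flag) per action.
    if bool(compare) == skip_compare:
        raise ValueError("give exactly one of compare or skip_compare")
    if bool(generate) == skip_generate:
        raise ValueError("give exactly one of generate or skip_generate")

    args = ["./run_sys_tests", "-s", suite]
    args += ["--skip-compare"] if skip_compare else ["-c", compare]
    args += ["--skip-generate"] if skip_generate else ["-g", generate]
    if baseline_root is not None:
        args += ["--baseline-root", baseline_root]
    return args


# PREVIOUS_TAG is a placeholder baseline name, as elsewhere in this guide.
print(run_sys_tests_args("clm_short", compare="PREVIOUS_TAG", skip_generate=True))
```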
The --baseline-root argument is optional, but is needed if you are generating
baselines and don't have write access to the default baseline location on this machine.
This can also be used to run tests listed in a text file (via the -f / --testfile
argument), or tests listed individually on the command line (via the -t / --testname
argument).
For any SUITE_NAME that runs the Python system tests, see an additional requirement in Pre-merge system testing.
After running the run_sys_tests command, you will see output describing a variety of git information about your current directory, ending with the git-fleximod status. Then run_sys_tests will exit. This is normal, correct operation: depending on the machine, run_sys_tests will either submit the create_test jobs to the batch queue or will run them in the background.
Run ./run_sys_tests -h for more details.
As noted in Checking the results of a test suite, you can run a cs.status.TESTID
command to see the results of all tests in a test suite. run_sys_tests also creates
two additional cs.status files to make it quicker and easier to parse the results from
a test suite:
- cs.status (only created with the -s / --suite-name argument): aggregates across all test IDs in this test suite, rather than requiring you to run a separate cs.status.TESTID command for each compiler.
- cs.status.fails (created with all modes of operation): adds options to show only test failures (-f) and to suppress line-by-line output of performance failures, instead just giving a summary of these failures at the bottom (--count-performance-fails). With the -s / --suite-name argument, cs.status.fails also aggregates across all test IDs, as for cs.status.
Both versions also include expected failure integration, as described in Expected test failures. These versions of cs.status also accept the -c / --count-fails argument described in Checking the results of a test suite.
Here are some general tips for running test suites:
- It is very important to not change anything in your CTSM directory (i.e., your git clone) once you start the test suite, until all tests in the test suite finish running.
- On cheyenne, set the PROJECT environment variable in your shell startup file, or use some other mechanism to specify a default project / account code to cime. This way, you won't need to add the --project argument every time you run create_test, run_sys_tests, or create_newcase.
This section is for those who have been asked to do final system testing on a branch before merging it into the CTSM master branch (or another tightly-controlled branch like one of the release branches).
To ensure that the Python system tests included in the system testing pass, we recommend the following steps in your ctsm directory in the same terminal where you will run run_sys_tests:
> ./py_env_create # you may not need to rerun if you have run this before
> module unload python
> module load conda
> conda activate ctsm_pylib
> module load nco
The following tests should be run:
- run_sys_tests -s aux_clm -c PREVIOUS_TAG -g NEW_TAG on cheyenne
- run_sys_tests -s aux_clm -c PREVIOUS_TAG -g NEW_TAG on izumi
These take a few hours to run, and the cheyenne test suite costs a few thousand core-hours.
If you don't have permissions to create a new directory in the baseline directory space on
a machine (/glade/p/cgd/tss/ctsm_baselines on cheyenne and /fs/cgd/csm/ccsm_baselines
on izumi), you can:
- Make your own ctsm_baselines directory in a space you control (I recommend using your scratch directory)
- Make a symbolic link to the previous tag's baselines in the above directory (e.g., ln -s /glade/p/cgd/tss/ctsm_baselines/ctsm1.0.dev010 /glade/scratch/$USER/ctsm_baselines/ctsm1.0.dev010)
- When running run_sys_tests, point to your ctsm_baselines directory via the --baseline-root argument.
- When system testing is done, ask someone with permission to copy the generated baselines to the official baseline location.
If you don't need to do baseline generation yet, then use the --skip-generate option instead of -g.
When you run run_sys_tests, the test directories will be placed in a top-level directory in your scratch space with a name that begins with tests_. The path to this top-level directory will be output to the screen when you run run_sys_tests, and a symbolic link will be placed in the directory from which you invoked this command.
To check the test results, run the ./cs.status.fails script in the top-level test directory. This will show you just the failed tests. For more information, see Parsing test suite results and Checking the results of a test suite.
Although we do our best to keep all of the system tests passing, there are typically a few
that are expected to fail at any given time. So, before you spend time looking into a
failure from a test suite, you should check to see if it is in the list of expected
failures. If you are using the cs.status or cs.status.fails scripts created by
run_sys_tests, then expected failures are noted for you in the test results. Otherwise,
read the rest of this section to learn how to find expected failures manually.
The list of expected failures is maintained under the CTSM checkout, at
cime_config/testdefs/ExpectedTestFails.xml. Search for the failing test and see if it
appears there; if so, confirm that it is failing in the same phase as before. For example,
if you see:
<test name="ERS_Lm20_Mmpi-serial.1x1_smallvilleIA.I2000Clm50BgcCropGs.cheyenne_gnu.clm-monthly">
  <phase name="RUN">
    <status>FAIL</status>
    <issue>#158</issue>
  </phase>
</test>
then a FAIL in the RUN phase for this test is acceptable, but a failure in an
earlier phase (such as during the build) would indicate a new problem.
Note that, if a test is expected to FAIL in the RUN phase, you might also see a
PEND result for another phase, like COMPARE_base_rest; this is not a problem.
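The manual lookup described above can be sketched with Python's standard xml.etree.ElementTree module. This is an illustrative sketch, not the code that cs.status uses: the function name is hypothetical, and the sample is just the single test entry quoted above rather than the full ExpectedTestFails.xml file. It searches for matching test elements at any depth, so it makes no assumption about the enclosing root element.

```python
import xml.etree.ElementTree as ET

# The single entry quoted in the text above.
SAMPLE = """
<test name="ERS_Lm20_Mmpi-serial.1x1_smallvilleIA.I2000Clm50BgcCropGs.cheyenne_gnu.clm-monthly">
  <phase name="RUN">
    <status>FAIL</status>
    <issue>#158</issue>
  </phase>
</test>
"""


def expected_status(xml_text, test_name, phase_name):
    """Return the expected status for (test, phase), or None if not listed."""
    root = ET.fromstring(xml_text)
    # iter() also visits the root itself, so a bare <test> fragment works too.
    for test in root.iter("test"):
        if test.get("name") != test_name:
            continue
        for phase in test.iter("phase"):
            if phase.get("name") == phase_name:
                return phase.findtext("status")
    return None


name = "ERS_Lm20_Mmpi-serial.1x1_smallvilleIA.I2000Clm50BgcCropGs.cheyenne_gnu.clm-monthly"
print(expected_status(SAMPLE, name, "RUN"))
print(expected_status(SAMPLE, name, "SHAREDLIB_BUILD"))
```

A None result for the phase that actually failed means the failure is not covered by the expected-fails entry and deserves a closer look.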
It's also possible that a previously-failing test is now passing. If so, this test should
probably be removed from the expected fails list (unless the issue is that this test fails
only sporadically). One way to notice tests that are newly passing is: if you see a
BFAIL for the BASELINE comparison phase for a test, and find that there are no
hist files in the baseline, see if this was in the expected fails list, since a
newly-passing test is a common cause of this test result. If a test is newly-passing, you
should consider removing the test from the ExpectedTestFails.xml list and marking the
relevant issue as resolved. Check with other integrators / reviewers if you are unsure
whether to do this.
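The heuristic above amounts to a simple cross-check, sketched here in Python. The input structure (a mapping from test name to a mapping of phase to status) and the function name are hypothetical, chosen only for illustration; the sketch does not read real test output, and it cannot confirm that the baseline actually lacks hist files, which still needs a manual look.

```python
def newly_passing_candidates(results, expected_fail_names):
    """Return tests with a BFAIL in the BASELINE phase that are also on the
    expected-fails list; these may be newly passing."""
    candidates = []
    for test_name, phases in results.items():
        if phases.get("BASELINE") == "BFAIL" and test_name in expected_fail_names:
            candidates.append(test_name)
    return sorted(candidates)


# Hypothetical results, for illustration only.
results = {
    "TEST_A": {"RUN": "PASS", "BASELINE": "BFAIL"},
    "TEST_B": {"RUN": "PASS", "BASELINE": "PASS"},
    "TEST_C": {"RUN": "PASS", "BASELINE": "BFAIL"},
}
print(newly_passing_candidates(results, {"TEST_A", "TEST_B"}))
```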
Some of the test categories used in CTSM are:
- CTSM-specific test lists
  - aux_clm: These tests should be run before merging a branch to master or a release branch.
  - clm_short: This is a small subset of aux_clm that can be run frequently in the course of working on changes.
    - All tests in this list should also appear in the aux_clm list to ensure that baselines exist for all tags.
  - fates: Additional tests run by FATES developers
- CESM test lists
  - prealpha: These tests are run before making a CESM alpha or beta tag.
    - All tests in this list should also appear in the aux_clm list (or at least have a very similar test in aux_clm) to prevent surprises in CESM alpha testing.
  - prebeta: These tests are run before making a CESM beta tag.
    - All tests in this list should also appear in the aux_clm list (or at least have a very similar test in aux_clm) to prevent surprises in CESM beta testing.
    - prealpha tests do NOT need to be repeated here, since any CESM beta tag also has the prealpha test suite run on it.
  - aux_cime_baselines: These tests are run frequently (e.g., nightly) to ensure that changes to cime do not change answers unexpectedly.
    - This should be a small list of tests (3-4 tests defined by each component). (We want this to stay small since this list is run frequently. So it should cover the most important configurations, but won't cover everything.) Because the main purpose is baseline comparisons, all tests can be basic smoke (SMS) tests. All tests should be on the same machine/compiler (currently cheyenne_intel). Because the purpose is testing cime, tests in this test list should be chosen to exercise different cime options, such as different time periods and/or datm modes.
    - It is common for people to run the prealpha test list on their cime branch to make sure they haven't broken anything before merging a big set of changes to master. Thus, to ensure that that manual testing includes any important baseline comparisons, all tests in aux_cime_baselines should have close counterparts in prealpha. (In many cases, the prealpha test will be an expanded form of the aux_cime_baselines test - e.g., an ERP_Ld10 test rather than a SMS_Ld3 test.)