CESM Troubleshooting

Our Experiments

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can I confirm that everything ran correctly?

  • How can I troubleshoot errors?

Objectives

We are working with three cases (or experiments). Let's take a look at what happened with each of them.

  1. b.day1.0
    This was our first case, which initially ran for 5 days. We then set CONTINUE_RUN=TRUE and ran it again for another 5 days. This run should have completed successfully. How can we confirm this?

Check your CaseStatus file

Go to your case directory for this case and look at the end of the CaseStatus file:

tail CaseStatus

Now let’s check out the output from this case. Remember, it is located in the DOUT_S_ROOT directory.

cases/b.day1.0> ./xmlquery DOUT_S_ROOT

	DOUT_S_ROOT: /glade/scratch/kpegion/archive/b.day1.0

We will go there and see if we now have 10 days of data.

cd /glade/scratch/kpegion/archive/b.day1.0
cd ocn/hist

We now have two output files for the ocean: b.day1.0.pop.h.nday1.0001-01-01.nc and b.day1.0.pop.h.nday1.0001-01-06.nc.

b.day1.0.pop.h.nday1.0001-01-01.nc is the same file we looked at last week; it contains the first 5 days.
b.day1.0.pop.h.nday1.0001-01-06.nc contains the next 5 days that we ran by setting CONTINUE_RUN=TRUE. Use ncview to look at these files.
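As a quick sanity check alongside ncview, the date stamp in each daily history filename can be parsed with plain Python to confirm that the second file picks up where the first left off (the helper below is ours, not part of CESM):

```python
import re
from datetime import date

def history_file_date(filename):
    """Extract the YYYY-MM-DD stamp from a CESM history filename.

    Assumes the <case>.<model>.<stream>.YYYY-MM-DD.nc pattern used by
    daily POP history files (a hypothetical helper, not a CESM tool).
    """
    m = re.search(r"\.(\d{4})-(\d{2})-(\d{2})\.nc$", filename)
    if m is None:
        raise ValueError(f"no date stamp in {filename}")
    return date(*(int(g) for g in m.groups()))

# The two ocean files from the 5-day + 5-day run:
d1 = history_file_date("b.day1.0.pop.h.nday1.0001-01-01.nc")
d2 = history_file_date("b.day1.0.pop.h.nday1.0001-01-06.nc")
print((d2 - d1).days)  # prints 5: the second file starts 5 days after the first
```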

  2. Second case
    This is our case that we are running for 4 years with daily precipitation and standard monthly output to use for Assignment #2. Assuming the configuration and namelist changes were entered correctly, this run should have completed successfully.

There should be monthly and daily output for the atmosphere. Let’s confirm:

cd /glade/scratch/kpegion/archive/test1/atm/hist
ls

The test1.cam.h0.*.nc files contain monthly averaged data.

The test1.cam.h1.*.nc contain daily averaged data.

What is in these files?

We will look at each file using ncdump -h to understand what is in the files.

What variables are in the h0 files? What variables are in the h1 files?

We set this with the namelist options

fincl2 = 'PRECC', 'PRECL'

nhtfrq = 0, -24
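As a reminder of what those nhtfrq values mean (0 is a monthly average, a negative value is output every |n| hours, a positive value is output every n timesteps), here is a small stand-alone Python sketch of the convention (the helper name is ours):

```python
def nhtfrq_meaning(n):
    """Decode one CAM nhtfrq entry: 0 = monthly average,
    negative = output every |n| hours, positive = every n timesteps."""
    if n == 0:
        return "monthly average"
    if n < 0:
        return f"every {-n} hours"
    return f"every {n} timesteps"

print(nhtfrq_meaning(0))    # h0 stream: monthly average
print(nhtfrq_meaning(-24))  # h1 stream: every 24 hours, i.e. daily
```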

How many times are in the h0 files?

How many times are in the h1 files?

We set this with the mfilt = 1,1 namelist option.

Remember, you can look up namelist options

mfilt
Array containing the maximum number of time samples written to a history file. The first value applies to the primary history file, the second through tenth to the auxiliary history files. Default: 1,30,30,30,30,30,30,30,30,30
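To see how mfilt interacts with run length, here is a small sketch (plain Python, our own helper; it assumes CESM's default 365-day no-leap calendar) of how many history files our 4-year run produces:

```python
def n_history_files(n_samples, mfilt):
    """Files needed to hold n_samples time samples when at most
    mfilt samples go into each file (ceiling division)."""
    return -(-n_samples // mfilt)

# Our 4-year case: 48 monthly h0 samples and 4*365 = 1460 daily h1
# samples, written with mfilt = 1,1 (one sample per file):
print(n_history_files(48, 1))    # 48 h0 files
print(n_history_files(1460, 1))  # 1460 h1 files

# With the default mfilt of 30 for auxiliary streams, the daily
# samples would instead be packed 30 to a file:
print(n_history_files(1460, 30))  # 49 h1 files
```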

We have lots of history files and we can look at each of them using ncview, but that is not very useful.

We can read in all the files using Python xarray, but it's a lot of data.

There are some useful tools for postprocessing the data to get timeseries files and easily take a look at some common diagnostics. We will learn about those later in this class.

  3. BRANCH case:
    This is the branch case we ran with lots of configuration changes and namelist changes. This run produces an error with the configuration provided.

We can review how we created and set up our run by looking at the first line of the README.case file:

cases/branchwrong> head -n1 README.case
2020-09-28 09:07:01: ./create_newcase --case /glade/u/home/kpegion/cases/branchwrong --res f19_g17 --compset B1850 --project UGMU0035

You can see that I initially made a mistake in my create_newcase by mistyping the project number.

We can review any changes we made to the configuration of the run by looking at CaseStatus

cases/branchwrong> more CaseStatus
2020-09-28 09:13:01: xmlchange success <command> ./xmlchange RUN_TYPE=branch,RUN_REFCASE=b.day1.0,RUN_REFDATE=0001-01-05,CLM_NAMELIST_OPTS=,GET_REFCASE=FALSE,STOP_OPTION=nmonths,STOP_N=1,RESUBMIT=1,CCSM_CO2_PPMV=569.4  </command>
 ---------------------------------------------------
2020-09-28 09:13:20: xmlchange success <command> ./xmlchange JOB_WALLCLOCK_TIME=2:00:00  </command>
 ---------------------------------------------------
2020-09-28 09:13:24: case.setup starting
 ---------------------------------------------------
2020-09-28 09:13:25: case.setup success
 ---------------------------------------------------
2020-09-28 09:29:20: case.build starting
 ---------------------------------------------------
CESM version is cesm2.1.1-exp17
Processing externals description file : Externals.cfg
Processing externals description file : Externals_CLM.cfg
Processing externals description file : Externals_POP.cfg
Processing externals description file : Externals_CISM.cfg
Checking status of externals: clm, fates, ptclm, mosart, ww3, cime, cice, pop, cvmix, marbl, cism, source_cism, rtm, cam,
    ./cime
        clean sandbox, on cime5.6.19
    ./components/cam
        clean sandbox, on cam1/release_tags/cam_cesm2_1_rel_29/components/cam
    ./components/cice
        clean sandbox, on cice5_cesm2_1_1_20190321
    ./components/cism
        clean sandbox, on release-cesm2.0.04
    ./components/cism/source_cism
        clean sandbox, on release-cism2.1.03
    ./components/clm
        clean sandbox, on release-clm5.0.25
    ./components/clm/src/fates
        clean sandbox, on fates_s1.21.0_a7.0.0_br_rev2
    ./components/clm/tools/PTCLM
        clean sandbox, on PTCLM2_180611
    ./components/mosart
        clean sandbox, on release-cesm2.0.03
    ./components/pop
        clean sandbox, on pop2_cesm2_1_rel_n06
    ./components/pop/externals/CVMix
        clean sandbox, on v0.93-beta
    ./components/pop/externals/MARBL
        clean sandbox, on cesm2.1-n00
    ./components/rtm
        clean sandbox, on release-cesm2.0.02
    ./components/ww3
        clean sandbox, on ww3_181001
2020-09-28 09:38:13: case.build success
 ---------------------------------------------------
2020-09-28 09:38:27: case.submit starting
 ---------------------------------------------------
2020-09-28 09:38:35: case.submit error
ERROR: Command: 'qsub -q regular -l walltime=2:00:00 -A UGMU0035 -v ARGS_FOR_SCRIPT='--resubmit' .case.run' failed with error 'qsub: Invalid account, available accounts:
Project, Status, Active
P05010048, Normal, True
P93300190, Overspent, True
UGMU0032, Normal, True
P93300042, Overspent, True' from dir '/glade/u/home/kpegion/cases/branchwrong'
 ---------------------------------------------------
2020-09-28 09:39:35: xmlchange success <command> ./xmlchange PROJECT=UGMU0032  </command>
 ---------------------------------------------------
2020-09-28 09:39:51: case.submit starting
 ---------------------------------------------------
2020-09-28 09:39:58: case.submit success case.run:4355823.chadmin1.ib0.cheyenne.ucar.edu, case.st_archive:4355824.chadmin1.ib0.cheyenne.ucar.edu
 ---------------------------------------------------
2020-09-28 11:01:58: case.run starting
 ---------------------------------------------------
2020-09-28 11:02:04: model execution starting
 ---------------------------------------------------
2020-09-28 11:02:07: model execution success
 ---------------------------------------------------
2020-09-28 11:02:07: case.run error
ERROR: RUN FAIL: Command 'mpiexec_mpt -p "%g:"  -np 576  omplace -tm open64  /glade/scratch/kpegion/branchwrong/bld/cesm.exe  >> cesm.log.$LID 2>&1 ' failed
See log file for details: /glade/scratch/kpegion/branchwrong/run/cesm.log.4355823.chadmin1.ib0.cheyenne.ucar.edu.200928-110158
 ---------------------------------------------------

Another thing we did, which is not documented automatically, was to copy the restart files from our b.day1.0 case to our new run directory, so that the model would have a set of restart files to start the run from.

cp /glade/scratch/kpegion/archive/b.day1.0/rest/0001-01-06-00000/* /glade/scratch/kpegion/branchwrong/run/

How do we figure out what went wrong?

Look at your log file and use grep -i to find errors.

cases/branchwrong> grep -i error /glade/scratch/kpegion/branchwrong/run/cesm.log.4355823.chadmin1.ib0.cheyenne.ucar.edu.200928-110158
16: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
4: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
29: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
33: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
34: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
3: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
22: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
18: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
...

Why is the error repeated many times?

The model runs on many processors. Each one is reporting the error.
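Because every one of the 576 MPI ranks writes the same message prefixed by its rank number, it can help to strip that prefix and deduplicate when scanning a log. A minimal stand-alone sketch (the helper name is made up):

```python
import re

def unique_errors(log_lines):
    """Collapse per-MPI-rank duplicates: strip the leading 'rank:'
    prefix each processor prepends, then keep each distinct message once."""
    seen, out = set(), []
    for line in log_lines:
        msg = re.sub(r"^\s*\d+:\s*", "", line)
        if msg not in seen:
            seen.add(msg)
            out.append(msg)
    return out

lines = [
    "16: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc",
    "4: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc",
    "29: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc",
]
print(unique_errors(lines))  # only one distinct message remains
```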

What does the error mean?

This tells us that the model is trying to read a file called b.day1.0.cam.r.0001-01-05-00000.nc and is unable to find it.

Let’s go back to our configuration and think back about how we set up this experiment. What do all the configuration changes mean? We can take a look in env_run.xml to confirm what each setting means.

RUN_TYPE=branch
This is a branch run
RUN_REFCASE=b.day1.0
Reference directory containing RUN_REFCASE data - used for hybrid or branch runs
RUN_REFDATE=0001-01-05
Reference date for hybrid or branch runs (yyyy-mm-dd)
CLM_NAMELIST_OPTS=''
CLM-specific namelist settings for -namelist option in the CLM build-namelist. CLM_NAMELIST_OPTS is normally set as a compset variable and in general should not be modified for supported compsets. It is recommended that if you want to modify this value for your experiment, you should use your own user-defined component sets via using create_newcase with a compset_file argument. This is an advanced flag and should only be used by expert users.

It seems this option was provided in the NCAR tutorial example, but is not necessary.

GET_REFCASE=FALSE
Flag for automatically prestaging the refcase restart dataset. If TRUE, then the refcase data is prestaged into the executable directory
STOP_OPTION=nmonths
Sets the run length along with STOP_N and STOP_DATE
STOP_N=1
Provides a numerical count for $STOP_OPTION.
RESUBMIT=1
If RESUBMIT is greater than 0, then the case will automatically resubmit. Since we later set our queue time to only 2 hours, there may be a need to resubmit to complete the run.
CCSM_CO2_PPMV=569.4
Mechanism for setting the CO2 value in ppmv for CLM if CLM_CO2_TYPE is constant or for POP if OCN_CO2_TYPE is constant. This is the CO2 value that gets propagated to the ocean and land models.
JOB_WALLCLOCK_TIME=2:00:00
The machine wallclock setting, i.e. how long we tell the queue we need to run. The maximum and default are 12:00:00, but our job may start sooner if we request less time.

Do you see anything in the configuration that could have led to our error?

Look at the RUN_REFDATE and the date we used for our restart files.
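To make the mismatch concrete: the restart filename the model asks GETFIL for is assembled from RUN_REFCASE and RUN_REFDATE. A sketch (the helper is hypothetical, but the pattern matches the error message above):

```python
def cam_restart_name(run_refcase, run_refdate):
    """CAM restart filename a branch run looks for, built from
    RUN_REFCASE and RUN_REFDATE (seconds-of-day taken as 00000)."""
    return f"{run_refcase}.cam.r.{run_refdate}-00000.nc"

# With our settings, the model asks for:
print(cam_restart_name("b.day1.0", "0001-01-05"))
# -> b.day1.0.cam.r.0001-01-05-00000.nc

# But the restart files we copied came from the 0001-01-06 set:
print(cam_restart_name("b.day1.0", "0001-01-06"))
# -> b.day1.0.cam.r.0001-01-06-00000.nc
```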

Solution

RUN_REFDATE=0001-01-06
cp /glade/scratch/kpegion/archive/b.day1.0/rest/0001-01-06-00000/* /glade/scratch/kpegion/branchwrong/run/

Fix it!

What did those namelist changes do?

We can look them up

In user_nl_cam

co2vmr=569.4e-6
CO2 volume mixing ratio. This is used as the time invariant surface value of CO2 if no time varying values are specified. Default: set by build-namelist.
ch4vmr = 1583.2e-9
CH4 volume mixing ratio. This is used as the time invariant surface value of CH4 if no time varying values are specified. Default: set by build-namelist.
inithist='MONTHLY'
Frequency that initial files will be output. This produces initial condition files monthly.

Nothing looks questionable there.

Resubmit your case!

Key Points


History and Setup for Diagnostics

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is in our history files?

  • How do I set up and run the postprocessing and diagnostics packages?

Objectives

We will now return to the output from our 4-year case. Let’s go to the atmospheric history directory for our case. If your 4-year case did not run to completion, you are welcome to look at mine.

cd /glade/scratch/kpegion/archive/test1/atm/hist

History vs. Timeseries Files

History files
contain all of the variables for a component at a particular frequency and are output directly by the model.
Timeseries files
usually span a number of timesteps and contain only one major variable. They are created offline.

When NCAR provides output from their model simulations publicly, they typically provide timeseries files for a select set of variables.

Examples:

A history file: f40_test.cam.h0.1993-11.nc

A timeseries file: f40_test.cam.h0.PSL.199001-199912.nc

CESM Time Variable

The time coordinate variable in CESM history and timeseries files represents the end of the averaging period for variables that are averages. As a result, the time value you see when the data are read in does not match the date in the filename: for monthly averaged data the filename names the correct month, but the time coordinate falls at the start of the following month. This can be a source of much confusion.

Example: test1.cam.h0.0001-05.nc
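A minimal stdlib sketch of that convention for monthly averages: the file named for May carries a time value at the start of June (the helper is ours, and it uses a real calendar purely for illustration):

```python
from datetime import date

def monthly_time_stamp(year, month):
    """Time stamp CESM records for a monthly-average sample: the end
    of the averaging period, i.e. the start of the following month."""
    if month == 12:
        return date(year + 1, 1, 1)
    return date(year, month + 1, 1)

# test1.cam.h0.0001-05.nc: the filename says May of year 1,
# but the time coordinate resolves to June 1 of year 1
print(monthly_time_stamp(1, 5))  # 0001-06-01
```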

Postprocessing

Postprocessing is the process of going from history files to timeseries files and of converting 3D atmospheric data from the model coordinate system to selected pressure levels. We will learn how to use the CESM Postprocessing Tools, which are primarily written in NCAR Command Language (NCL). NCL is in the process of being converted to Python, but for now we can use the prepared NCL scripts without having to know much NCL.

Diagnostics Packages

There is a large suite of postprocessing and diagnostics packages, developed by NCAR using a combination of NCL and Python scripts, that automatically generate a variety of plots from model output files and are used to evaluate a simulation. They all compute a series of pre-defined metrics and display the plots via a website.

There are five main diagnostics packages:

  1. Atmosphere
  2. Ice
  3. Land
  4. Ocean
  5. Climate Variability and Diagnostics Package (CVDP)

Postprocessing and Diagnostics Packages Setup

We will set up everything necessary for you to be able to run the postprocessing and diagnostics packages on the NCAR computers.

Set up your .profile or .tcshrc

If you have never set up a .profile or .tcshrc on Cheyenne:

cp /glade/u/home/kpegion/clim670/profile.sample ~/.profile

If you already have a .profile (bash users) or a .tcshrc (tcsh users), look at the corresponding file and add the necessary items from the sample file to your file. The sample files are located in: ~kpegion/clim670/

Copy the post-processing scripts to the correct location:

Go to your home directory:

cd

Create a scripts directory and go to it:

mkdir scripts
cd scripts

Copy all files needed:

cp -R /glade/u/home/asphilli/CESM_tutorial/* .

You may get an error about not being able to copy a particular file. You can ignore the error.

Put the configuration file into the correct location:

mv hluresfile ../.hluresfile

Set up the Python environment for the CESM diagnostics and postprocessing scripts:

cesm_pp_activate 

Create a directory for the CESM postprocessing code:

mkdir /glade/scratch/kpegion/cesm-postprocess

Set up the postprocessing using create_postprocess, telling it the name of your 4-year case:

create_postprocess --caseroot /glade/scratch/kpegion/cesm-postprocess/test1

Go to the postprocessing directory:

cd /glade/scratch/kpegion/cesm-postprocess/test1

Set the location of the model data:

./pp_config --set DOUT_S_ROOT=/glade/scratch/kpegion/archive/test1

Tell the diagnostics what kinds of grids to expect; our version uses:

./pp_config --set ATM_GRID=1.9x2.5
./pp_config --set LND_GRID=1.9x2.5
./pp_config --set ICE_GRID=gx1v7
./pp_config --set OCN_GRID=gx1v7
./pp_config --set ICE_NX=320
./pp_config --set ICE_NY=384

Key Points


Model Diagnostics Packages

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How do I run the model diagnostics packages?

Objectives

Some requirements for the diagnostics packages

Each component diagnostics package has minimum requirements for how much data must be available to run them:

Some other requirements:

Run a Diagnostics Package

Select and run a diagnostics package of interest to you:

Atmosphere Diagnostics Package

Edit the settings for the env_diags_atm.xml file using pp_config

./pp_config --set ATMDIAG_OUTPUT_ROOT_PATH=/glade/scratch/kpegion/diagnostics-output/atm
./pp_config --set ATMDIAG_test_first_yr=1
./pp_config --set ATMDIAG_test_nyrs=3

Run the monthly climatologies

qsub atm_averages -A UGMU0032

You can monitor your job status using qstat -u <username>. Check the log file in logs to make sure everything ran OK. This should run relatively quickly (only a few minutes).

Once the averages are done, you can submit the diagnostics script:

qsub atm_diagnostics -A UGMU0032

This will also run relatively quickly (a few minutes). Once it is done, you can go to the location of the diagnostics and look at the output via a webpage:

cd /glade/scratch/kpegion/diagnostics-output/atm/diag/test2-obs.1-3
firefox index.html &

It may be slow for your web browser window to launch and display depending on your bandwidth.

For more information on the Atmosphere (AMWG) Diagnostics Package: http://www.cesm.ucar.edu/working_groups/Atmosphere/amwg-diagnostics-package/

Land Diagnostics Package

Edit the settings for the env_diags_land.xml file using pp_config

./pp_config --set LNDDIAG_OUTPUT_ROOT_PATH=/glade/scratch/kpegion/diagnostics-output/lnd
./pp_config --set LNDDIAG_clim_first_yr_1=1
./pp_config --set LNDDIAG_clim_num_yrs_1=3
./pp_config --set LNDDIAG_trends_first_yr_1=1
./pp_config --set LNDDIAG_trends_num_yrs_1=3

Run the monthly climatologies

qsub lnd_averages -A UGMU0032

You can monitor your job status using qstat -u <username>. Check the log file in logs to make sure everything ran OK. This should run relatively quickly (only a few minutes).

Once the averages are done, you can submit the diagnostics script:

qsub lnd_diagnostics -A UGMU0032

This will also run relatively quickly (a few minutes). Once it is done, you can go to the location of the diagnostics and look at the output via a webpage:

cd /glade/scratch/kpegion/diagnostics-output/lnd/diag/test2-obs.1_3
firefox setsIndex.html &

It may be slow for your web browser window to launch and display depending on your bandwidth.

For more information on the Land (LMWG) Diagnostics Package: http://www.cesm.ucar.edu/models/cesm1.2/clm/clm_diagpackage.html

Ocean Diagnostics Package

Edit the settings for the env_diags_ocn.xml file using pp_config

./pp_config --set OCNDIAG_YEAR0=1
./pp_config --set OCNDIAG_YEAR1=3
./pp_config --set OCNDIAG_TSERIES_YEAR0=1
./pp_config --set OCNDIAG_TSERIES_YEAR1=3
./pp_config --set OCNDIAG_TAVGDIR=/glade/scratch/kpegion/diagnostics-output/ocn/climo/tavg.\$OCNDIAG_YEAR0.\$OCNDIAG_YEAR1
./pp_config --set OCNDIAG_WORKDIR=/glade/scratch/kpegion/diagnostics-output/ocn/diag/test2.\$OCNDIAG_YEAR0.\$OCNDIAG_YEAR1

Run the monthly climatologies

qsub ocn_averages -A UGMU0032

You can monitor your job status using qstat -u <username>. Check the log file in logs to make sure everything ran OK. This should run relatively quickly (only a few minutes).

Once the averages are done, you can submit the diagnostics script:

qsub ocn_diagnostics -A UGMU0032

This will also run relatively quickly (a few minutes). Once it is done, you can go to the location of the diagnostics and look at the output via a webpage:

cd /glade/scratch/kpegion/diagnostics-output/ocn/diag/test2.1_3
firefox index.html &

It may be slow for your web browser window to launch and display depending on your bandwidth.

Ice Diagnostics Package

Edit the settings for the env_diags_ic.xml file using pp_config

./pp_config --set ICEDIAG_BEGYR_CONT=1
./pp_config --set ICEDIAG_ENDYR_CONT=3
./pp_config --set ICEDIAG_YRS_TO_AVG=3
./pp_config --set ICEDIAG_PATH_CLIMO_CONT=/glade/scratch/kpegion/diagnostics-output/ice/climo/\$ICEDIAG_CASE_TO_CONT/
./pp_config --set ICEDIAG_DIAG_ROOT=/glade/scratch/kpegion/diagnostics-output/ice/diag/\$ICEDIAG_CASE_TO_CONT/

Run the monthly climatologies

qsub ice_averages -A UGMU0032

You can monitor your job status using qstat -u <username>. Check the log file in logs to make sure everything ran OK. This should run relatively quickly (only a few minutes).

Once the averages are done, you can submit the diagnostics script:

qsub ice_diagnostics -A UGMU0032

This will also run relatively quickly (a few minutes). Once it is done, you can go to the location of the diagnostics and look at the output via a webpage:

cd /glade/scratch/kpegion/diagnostics-output/ice/diag/test2.1_3
firefox index.html &

It may be slow for your web browser window to launch and display depending on your bandwidth.

Try the CVDP

Our runs are not long enough to run the CVDP, but you can test it on an existing long simulation, the CESM Large Ensemble.

On Cheyenne, we need to get an analysis node on Casper:

execdav --account=UGMU0032
cd ~/scripts/CVDP

Open the file namelist using your preferred text editor

The format of the file is: Run Name | Path to all data for a simulation | Analysis start year | Analysis end year

Modify the rows so that the analysis start and end years are 1979 and 2015.
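Each namelist row is pipe-delimited, so a small Python sketch can illustrate the four fields (the entry shown is a made-up example, not a real CESM Large Ensemble path):

```python
def parse_cvdp_entry(line):
    """Split one pipe-delimited CVDP namelist entry into its four fields:
    run name | data path | start year | end year (a sketch, not CVDP code)."""
    name, path, y0, y1 = (field.strip() for field in line.split("|"))
    return name, path, int(y0), int(y1)

entry = "CESM-LE #1 | /glade/path/to/lens/member1/ | 1979 | 2015"
print(parse_cvdp_entry(entry))
# ('CESM-LE #1', '/glade/path/to/lens/member1/', 1979, 2015)
```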

Open up the file driver.ncl using your preferred text editor.

On line 7, replace user with your username. On line 19, change False to True to output calculations in netCDF.

Run the CVDP by typing

ncl driver.ncl

It will take ~20 minutes. Once it is complete, go to the output directory and open a firefox window

cd /glade/scratch/kpegion/CVDP
firefox index.html &

Key Points


Postprocessing

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What are some common run time configuration changes?

  • How do I make these changes?

Objectives

Postprocessing is the process of going from history files to timeseries files and of converting 3D atmospheric data from the model coordinate system to selected pressure levels. We will learn how to use the CESM Postprocessing Tools.

The postprocessing scripts are located in your ~/scripts/ directory. You can find them with ls *create*.

Open the script in a text editor (e.g., gedit, vi, emacs).

Change the lines of the script relevant for your run; for example, in atm.create_timeseries.ncl:

run_name = "test1"
styr = 1
enyr = 4
work_dir = "/glade/scratch/kpegion/"
archive_dir = "/glade/scratch/kpegion/archive/"+run_name+"/atm/hist"

Run the postprocessing script

ncl atm.create_timeseries.ncl

The timeseries files are located in:

/glade/scratch/kpegion/processed/<case name>/

You can take a quick look at them in ncview. To do Assignment #2, you can read them in using xarray.
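Since each timeseries file holds a single variable, picking out the files for one variable is just a filename match. A sketch using the standard library (the filenames are made-up examples following the one-variable-per-file pattern):

```python
import fnmatch

def files_for_variable(filenames, var):
    """Pick out the timeseries files for one variable: in a timeseries
    filename the variable name sits between the stream and the date range
    (e.g. <case>.cam.h1.PRECC.<dates>.nc). Hypothetical helper."""
    return fnmatch.filter(filenames, f"*.{var}.*.nc")

names = [
    "test1.cam.h1.PRECC.000101-000412.nc",
    "test1.cam.h1.PRECL.000101-000412.nc",
]
print(files_for_variable(names, "PRECC"))
# ['test1.cam.h1.PRECC.000101-000412.nc']
```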

Run the post-processing for whichever component is of interest to you.

Key Points