Writing your own diagnostic script

Overview

Teaching: 20 min
Exercises: 30 min
Compatibility: ESMValTool v2.11.0

Questions

How do I write a new diagnostic in ESMValTool?

How do I use the preprocessor output in a Python diagnostic?

Objectives

Write a new Python diagnostic script.

Explain how a diagnostic script reads the preprocessor output.

Introduction

The diagnostic script is an important component of ESMValTool and it is where the scientific analysis or performance metric is implemented. With ESMValTool, you can adapt an existing diagnostic or write a new script from scratch. Diagnostics can be written in a number of open source languages such as Python, R, Julia and NCL but we will focus on understanding and writing Python diagnostics in this lesson.

In this lesson, we will explain how to find an existing diagnostic and run it. We will work with the recipe recipe_python.yml and the diagnostic script diagnostic.py that it calls, both of which we have seen in the lesson Running your first recipe.

Let’s get started!

Understanding an existing Python diagnostic

A clone of the ESMValTool repository should be available in your user folder in the nf33 scratch folder (/scratch/nf33/$USER/ESMValTool). If not, load the esmvaltool-workflow module, run the check_hackathon command and check its output for any errors.

The folder ESMValTool contains the source code of the tool. We can find the recipe recipe_python.yml and the python script diagnostic.py in these directories:

ESMValTool/esmvaltool/recipes/examples/recipe_python.yml
ESMValTool/esmvaltool/diag_scripts/examples/diagnostic.py

Let’s have a look at the code in diagnostic.py. For reference, we show the diagnostic code in the dropdown box below. There are four main sections in the script:

A description, i.e. the docstring (line 1).
Import statements (lines 2-16).
Functions that implement our analysis (lines 21-102).
A typical Python top-level script, i.e. if __name__ == '__main__' (lines 105-108).

diagnostic.py

 """Python example diagnostic."""
 import logging
 from pathlib import Path
 from pprint import pformat

 import iris

 from esmvaltool.diag_scripts.shared import (
     group_metadata,
     run_diagnostic,
     save_data,
     save_figure,
     select_metadata,
     sorted_metadata,
 )
 from esmvaltool.diag_scripts.shared.plot import quickplot

 logger = logging.getLogger(Path(__file__).stem)


 def get_provenance_record(attributes, ancestor_files):
     """Create a provenance record describing the diagnostic data and plot."""
     caption = attributes['caption'].format(**attributes)

     record = {
         'caption': caption,
         'statistics': ['mean'],
         'domains': ['global'],
         'plot_types': ['zonal'],
         'authors': [
             'andela_bouwe',
             'righi_mattia',
         ],
         'references': [
             'acknow_project',
         ],
         'ancestors': ancestor_files,
     }
     return record


 def compute_diagnostic(filename):
     """Compute an example diagnostic."""
     logger.debug("Loading %s", filename)
     cube = iris.load_cube(filename)

     logger.debug("Running example computation")
     cube = iris.util.squeeze(cube)
     return cube


 def plot_diagnostic(cube, basename, provenance_record, cfg):
     """Create diagnostic data and plot it."""

     # Save the data used for the plot
     save_data(basename, provenance_record, cfg, cube)

     if cfg.get('quickplot'):
         # Create the plot
         quickplot(cube, **cfg['quickplot'])
         # And save the plot
         save_figure(basename, provenance_record, cfg)


 def main(cfg):
     """Compute the time average for each input dataset."""
     # Get a description of the preprocessed data that we will use as input.
     input_data = cfg['input_data'].values()

     # Demonstrate use of metadata access convenience functions.
     selection = select_metadata(input_data, short_name='tas', project='CMIP5')
     logger.info("Example of how to select only CMIP5 temperature data:\n%s",
                 pformat(selection))

     selection = sorted_metadata(selection, sort='dataset')
     logger.info("Example of how to sort this selection by dataset:\n%s",
                 pformat(selection))

     grouped_input_data = group_metadata(input_data,
                                         'variable_group',
                                         sort='dataset')
     logger.info(
         "Example of how to group and sort input data by variable groups from "
         "the recipe:\n%s", pformat(grouped_input_data))

     # Example of how to loop over variables/datasets in alphabetical order
     groups = group_metadata(input_data, 'variable_group', sort='dataset')
     for group_name in groups:
         logger.info("Processing variable %s", group_name)
         for attributes in groups[group_name]:
             logger.info("Processing dataset %s", attributes['dataset'])
             input_file = attributes['filename']
             cube = compute_diagnostic(input_file)

             output_basename = Path(input_file).stem
             if group_name != attributes['short_name']:
                 output_basename = group_name + '_' + output_basename
             if "caption" not in attributes:
                 attributes['caption'] = input_file
             provenance_record = get_provenance_record(
                 attributes, ancestor_files=[input_file])
             plot_diagnostic(cube, output_basename, provenance_record, cfg)


 if __name__ == '__main__':

     with run_diagnostic() as config:
         main(config)

What is the starting point of a diagnostic?

Can you spot a function called main in the code above?

What are its input arguments?

How many times is this function mentioned?

Solution

The main function is defined in line 65 as main(cfg).

The input argument to this function is the variable cfg, a Python dictionary that holds all the information needed to run the diagnostic script, such as the location of the input data and various settings. In the main function, we read from cfg to extract the information our analysis needs (e.g. in line 68).

The main function is called near the very end, on line 108. So it is mentioned twice in our code: once where it is defined and once where it is called by the top-level script.

The function run_diagnostic

The function run_diagnostic (line 107) is a context manager provided by ESMValTool and is the main entry point for most Python diagnostics.
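
To make the idea of a context manager as an entry point more concrete, here is a minimal sketch. The function run_toy_diagnostic is a hypothetical stand-in written for this lesson, not the ESMValTool implementation. It only illustrates the pattern: some set-up runs first, the settings are handed to the `with` block, and the clean-up runs afterwards.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("toy_diagnostic")


@contextmanager
def run_toy_diagnostic(settings):
    """Hypothetical stand-in for run_diagnostic (not the ESMValTool code)."""
    # Set-up phase: runs before the body of the `with` block.
    logger.info("Starting diagnostic")
    try:
        # Hand the settings to the `with` block, where they play the role of cfg.
        yield settings
    finally:
        # Tear-down phase: runs even if the diagnostic raises an error.
        logger.info("Diagnostic finished")


def main(cfg):
    """Return the number of input files listed in cfg."""
    return len(cfg["input_data"])


# Same calling pattern as at the bottom of diagnostic.py.
with run_toy_diagnostic({"input_data": {"a.nc": {}, "b.nc": {}}}) as config:
    n_files = main(config)
```

The real run_diagnostic does considerably more during set-up, but the calling pattern in diagnostic.py is the same.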

Create a copy of the files for you to edit

You should already have a copy of recipe_python.yml from the lesson Running your first recipe, obtained by running
esmvaltool recipes get examples/recipe_python.yml
Use the version of the file you had edited by the end of that lesson.

Copy the file diagnostic.py to your working folder. This keeps the files in the repository unaltered as templates and makes the files you are editing easier to find. Then edit your recipe to point to your copy of diagnostic.py, and note the location of the recipe for when you run it.

Solution

Example of your working folder:
/scratch/nf33/$USER/Exercise_writeDiagnostic/recipe_python.yml
/scratch/nf33/$USER/Exercise_writeDiagnostic/diagnostic.py
In your recipe_python.yml, edit the path to the diagnostic script.
    script1:
      script: /scratch/nf33/$USER/Exercise_writeDiagnostic/diagnostic.py
      quickplot:
If you are not in that directory when running the recipe, pass the full path of your recipe:
esmvaltool-workflow run /scratch/nf33/$USER/Exercise_writeDiagnostic/recipe_python.yml

Preprocessor-diagnostic interface

In the previous exercise, we have seen that the variable cfg is the input argument of the main function. ESMValTool passes the diagnostic script the path to a file called settings.yml as its first argument, and the contents of that file are made available through the cfg dictionary. The ESMValTool documentation page provides an overview of what is in this file, see Diagnostic script interfaces.

What information do I need when writing a diagnostic script?

Load the module on Gadi if you haven’t already. We already know how to change the configuration settings before running a recipe: first set the option remove_preproc_dir to false in the configuration file, then run the recipe recipe_python.yml. (Alternatively, look at the output folder from your previous working run.)
module use /g/data/xp65/public/modules
module load esmvaltool-workflow

esmvaltool-workflow run <your_working_folder>/recipe_python.yml
Can you find one example of the file settings.yml in the run directory?

Open the file settings.yml and look at the input_files list. It contains paths to files named metadata.yml. What information do you think is saved in those files?

Solution

One example of settings.yml can be found in the directory: /scratch/nf33/[username]/esmvaltool_outputs/recipe_python_latest/run/map/script1/settings.yml

The metadata.yml files hold information about the preprocessed data. There is one file for each variable, with detailed information on your data, including the project (e.g., CMIP6, CMIP5), dataset names (e.g., BCC-ESM1, CanESM2), variable attributes (e.g., standard_name, units), the preprocessor applied and the time range of the data. You can use all of this information in your own diagnostic.
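
As an illustration of how this metadata appears inside a diagnostic, the sketch below builds a heavily shortened, hypothetical cfg dictionary and loops over cfg['input_data'], as diagnostic.py does. The file paths are made up, and a real cfg contains many more keys and attributes.

```python
# Hypothetical, heavily shortened cfg; the paths below are placeholders.
cfg = {
    "input_data": {
        "/path/to/preproc/tas_BCC-ESM1.nc": {
            "filename": "/path/to/preproc/tas_BCC-ESM1.nc",
            "dataset": "BCC-ESM1",
            "project": "CMIP6",
            "short_name": "tas",
            "units": "degrees_C",
        },
        "/path/to/preproc/tas_bcc-csm1-1.nc": {
            "filename": "/path/to/preproc/tas_bcc-csm1-1.nc",
            "dataset": "bcc-csm1-1",
            "project": "CMIP5",
            "short_name": "tas",
            "units": "degrees_C",
        },
    }
}

# Each value is one metadata dictionary describing one preprocessed file.
summary = [
    f"{meta['project']} {meta['dataset']}: {meta['short_name']} [{meta['units']}]"
    for meta in cfg["input_data"].values()
]
```

Each entry of summary is a one-line description of a preprocessed file, built only from attributes that also appear in the log output of the example recipe.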

Diagnostic shared functions

Looking at the code in diagnostic.py, we see that input_data is read from the cfg dictionary (line 68). Now we can group the input_data according to some criteria such as the model or experiment. To do so, ESMValTool provides functions such as select_metadata (line 71), sorted_metadata (line 75), and group_metadata (line 79). As you can see in line 8, these functions are imported from esmvaltool.diag_scripts.shared, which means they are shared across several diagnostic scripts. A list of available functions and their descriptions can be found in the ESMValTool Diagnostic API reference.

Extracting information needed for analysis

We have seen the functions used for selecting, sorting and grouping data in the script. What do these functions do?

Solution

Each use of select_metadata, sorted_metadata and group_metadata is followed by a statement that starts with logger.info (lines 72, 76 and 82). These statements write output to the log file. In the previous exercise, we ran the recipe recipe_python.yml. If you look at the log file recipe_python_#_#/run/map/script1/log.txt in your ESMValTool output directory, you can see the output from each of these functions, for example:

2023-06-28 12:47:14,038 [2548510] INFO     diagnostic,106	Example of how to
group and sort input data by variable groups from the recipe:
{'tas': [{'alias': 'CMIP5',
         'caption': 'Global map of {long_name} in January 2000 according to '
                    '{dataset}.\n',
         'dataset': 'bcc-csm1-1',
         'diagnostic': 'map',
         'end_year': 2000,
         'ensemble': 'r1i1p1',
         'exp': 'historical',
         'filename': '~/recipe_python_20230628_124639/preproc/map/tas/
               CMIP5_bcc-csm1-1_Amon_historical_r1i1p1_tas_2000-P1M.nc',
         'frequency': 'mon',
         'institute': ['BCC'],
         'long_name': 'Near-Surface Air Temperature',
         'mip': 'Amon',
         'modeling_realm': ['atmos'],
         'preprocessor': 'to_degrees_c',
         'product': ['output1', 'output2'],
         'project': 'CMIP5',
         'recipe_dataset_index': 1,
         'short_name': 'tas',
         'standard_name': 'air_temperature',
         'start_year': 2000,
         'timerange': '2000/P1M',
         'units': 'degrees_C',
         'variable_group': 'tas',
         'version': 'v1'},
        {'activity': 'CMIP',
         'alias': 'CMIP6',
         'caption': 'Global map of {long_name} in January 2000 according to '
                    '{dataset}.\n',
         'dataset': 'BCC-ESM1',
         'diagnostic': 'map',
         'end_year': 2000,
         'ensemble': 'r1i1p1f1',
         'exp': 'historical',
         'filename': '~/recipe_python_20230628_124639/preproc/map/tas/
               CMIP6_BCC-ESM1_Amon_historical_r1i1p1f1_tas_gn_2000-P1M.nc',
         'frequency': 'mon',
         'grid': 'gn',
         'institute': ['BCC'],
         'long_name': 'Near-Surface Air Temperature',
         'mip': 'Amon',
         'modeling_realm': ['atmos'],
         'preprocessor': 'to_degrees_c',
         'project': 'CMIP6',
         'recipe_dataset_index': 0,
         'short_name': 'tas',
         'standard_name': 'air_temperature',
         'start_year': 2000,
         'timerange': '2000/P1M',
         'units': 'degrees_C',
         'variable_group': 'tas',
         'version': 'v20181214'}]}

This is how we can access preprocessed data within our diagnostic.
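
To build intuition for what selecting, sorting and grouping mean here, the following sketch reimplements the three ideas in plain Python on a shortened list of metadata dictionaries. The functions toy_select, toy_sort and toy_group are hypothetical and are not the ESMValTool implementations. The case-insensitive sort is an assumption chosen because it reproduces the dataset order seen in the log above.

```python
from collections import defaultdict

# Shortened stand-ins for the metadata dictionaries in the log above.
# The 'pr' entry is hypothetical and only there to create a second group.
input_data = [
    {"dataset": "BCC-ESM1", "project": "CMIP6",
     "short_name": "tas", "variable_group": "tas"},
    {"dataset": "bcc-csm1-1", "project": "CMIP5",
     "short_name": "tas", "variable_group": "tas"},
    {"dataset": "bcc-csm1-1", "project": "CMIP5",
     "short_name": "pr", "variable_group": "pr"},
]


def toy_select(metadata, **filters):
    """Keep the entries whose attributes match every keyword filter."""
    return [m for m in metadata
            if all(m.get(key) == value for key, value in filters.items())]


def toy_sort(metadata, key):
    """Sort entries by one attribute, ignoring upper/lower case."""
    return sorted(metadata, key=lambda m: str(m[key]).lower())


def toy_group(metadata, key, sort=None):
    """Collect entries into a dict keyed by the value of one attribute."""
    groups = defaultdict(list)
    for entry in metadata:
        groups[entry[key]].append(entry)
    if sort is not None:
        groups = {name: toy_sort(items, sort) for name, items in groups.items()}
    return dict(groups)


selection = toy_select(input_data, short_name="tas", project="CMIP5")
groups = toy_group(input_data, "variable_group", sort="dataset")
tas_order = [entry["dataset"] for entry in groups["tas"]]
```

Here selection holds the single CMIP5 tas entry, and groups maps each variable group name to its list of entries, which is the same shape as the dictionary printed in the log.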

Diagnostic computation

After grouping and selecting data, we can read individual attributes (such as filename) of each item. Here, we have grouped the input data by variables, so we loop over the variables (line 88). Following this is a call to the function compute_diagnostic (line 93). Let’s look at the definition of this function in line 42, where the actual analysis of the data is done.

Note that output from the ESMValCore preprocessor is in the form of NetCDF files. Here, compute_diagnostic uses Iris to read data from a NetCDF file and applies the squeeze operation to remove any dimensions of length one. We can adapt this function to add our own analysis. As an example, here we subtract the mean of the data from every value in the Iris cube, which gives a simple bias relative to that mean.

def compute_diagnostic(filename):
    """Compute an example diagnostic."""
    logger.debug("Loading %s", filename)
    cube = iris.load_cube(filename)

    logger.debug("Running example computation")
    cube = iris.util.squeeze(cube)

    # Calculate a bias by subtracting the mean of the data from every value
    cube.data = cube.core_data() - cube.core_data().mean()
    return cube
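
The arithmetic behind the added line can be checked on a plain list of numbers. The sketch below uses made-up temperature values and the standard library only, so it runs without Iris. It shows that subtracting the mean leaves values that are positive above the mean, negative below it, and sum to zero.

```python
from statistics import fmean

# Made-up temperatures in degrees C, standing in for the cube data.
values = [14.0, 15.5, 13.5, 17.0]

# Mean over all values, analogous to cube.core_data().mean().
mean_value = fmean(values)

# Difference of every value from that mean.
anomalies = [value - mean_value for value in values]
```

With these values, mean_value is 15.0 and anomalies is [-1.0, 0.5, -1.5, 2.0].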

Iris cubes

Iris reads data from NetCDF files into data structures called cubes. The data in these cubes can be modified, combined with other cubes’ data or plotted.

Reading data using xarray

Alternatively, you can use xarray to read the data instead of Iris.

Solution

First, import the xarray package at the top of the script:
import xarray as xr
Then, change compute_diagnostic to:
def compute_diagnostic(filename):
    """Compute an example diagnostic."""
    logger.debug("Loading %s", filename)
    dataset = xr.open_dataset(filename)

    # Do your analyses on the data here

    return dataset
Caution: If you read data using xarray, remember to adapt the other functions in the diagnostic, which currently expect Iris cubes.

Reading data using the netCDF4 package

Yet another option for reading the NetCDF data is the netCDF4 package, the Python interface to the netCDF C library.

Solution

First, import the netCDF4 package at the top of the script:
import netCDF4
Then, change compute_diagnostic to:
def compute_diagnostic(filename):
    """Compute an example diagnostic."""
    logger.debug("Loading %s", filename)
    nc_data = netCDF4.Dataset(filename, 'r')

    # Do your analyses on the data here

    return nc_data
Caution: If you read data using netCDF4, remember to adapt the other functions in the diagnostic, which currently expect Iris cubes.

Diagnostic output

Plotting the output

Often, the end product of a diagnostic script is a plot or figure. The Iris cube returned from the compute_diagnostic function (line 93) is passed to the plot_diagnostic function (line 102). Let’s have a look at the definition of this function in line 52. This is where we would plug in our plotting routine in the diagnostic script.

More specifically, the quickplot function (line 60) can be replaced with the function of our choice. As can be seen, this function uses **cfg['quickplot'] as an input argument. If you look at the diagnostic section in the recipe recipe_python.yml, you see quickplot is a key there:

    script1:
      script: <path_to_script diagnostic.py>
      quickplot:
        plot_type: pcolormesh
        cmap: Reds

This way, we can pass arguments such as the type of plot (plot_type: pcolormesh) and the colormap (cmap: Reds) from the recipe to the quickplot function in the diagnostic.
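
The mechanism at work is ordinary Python keyword unpacking. In the sketch below, toy_quickplot is a hypothetical stub whose signature is an assumption made for illustration. It does not draw anything and simply reports which arguments it received from the quickplot section of a made-up cfg.

```python
def toy_quickplot(cube, plot_type, **kwargs):
    """Hypothetical plotting stub: report what it would draw."""
    return f"{plot_type}({cube}, {kwargs})"


# Mirrors the quickplot section of the recipe shown above.
cfg = {"quickplot": {"plot_type": "pcolormesh", "cmap": "Reds"}}

# ** turns each key of the dictionary into a keyword argument.
call = toy_quickplot("tas_cube", **cfg["quickplot"])
```

Any extra key added under quickplot in the recipe would arrive in kwargs in the same way, which is why no change to the diagnostic is needed when you change plot options in the recipe.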

Passing arguments from the recipe to the diagnostic

Change the type of the plot and its colormap and inspect the output figure.

Solution

In the recipe recipe_python.yml, you could change plot_type and cmap. As an example, we choose plot_type: pcolor and cmap: BuGn:
    script1:
      script: <path_to_script diagnostic.py>
      quickplot:
        plot_type: pcolor
        cmap: BuGn
The plot can be found at path_to_recipe_output/plots/map/script1/png.

ESMValTool gallery

ESMValTool makes it possible to produce a wide array of plots and figures as seen in the gallery.

Saving the output

In our example, the function save_data in line 56 is used to save the Iris cube. The saved files can be found under the work directory in NetCDF (.nc) format. The function save_figure in line 62 saves the plots under the plots directory in .png format (or the format specified in your configuration settings). Again, you may choose your own method of saving the output.

## in diagnostic.py ##
55:      # Save the data used for the plot
56:      save_data(basename, provenance_record, cfg, cube)
..
61:          # And save the plot
62:          save_figure(basename, provenance_record, cfg)

Both functions are imported from esmvaltool.diag_scripts.shared and take arguments such as cfg so that the output is saved in the appropriate location.
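
If you write your own saving routine, you need to build the output paths yourself. The sketch below shows one way this could look. The helper names toy_data_path and toy_plot_path are hypothetical, and the keys work_dir, plot_dir and output_file_type, along with the directory layout, are assumptions that you should check against your own settings.yml.

```python
from pathlib import PurePosixPath

# Hypothetical excerpt of cfg; verify the key names in your settings.yml.
cfg = {
    "work_dir": "/out/work/map/script1",
    "plot_dir": "/out/plots/map/script1",
    "output_file_type": "png",
}


def toy_data_path(basename, cfg):
    """Build the path of a NetCDF file in the work directory."""
    return PurePosixPath(cfg["work_dir"]) / f"{basename}.nc"


def toy_plot_path(basename, cfg):
    """Build the path of a figure in the plot directory."""
    suffix = cfg["output_file_type"]
    return PurePosixPath(cfg["plot_dir"]) / f"{basename}.{suffix}"


data_path = str(toy_data_path("tas_example", cfg))
plot_path = str(toy_plot_path("tas_example", cfg))
```

Taking the directories from cfg rather than hard-coding them keeps the output of your diagnostic inside the run folder that ESMValTool created for the recipe.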

Recording the provenance

When developing a diagnostic script, it is good practice to record provenance. To do so, we use the function get_provenance_record (line 100). Let us have a look at the definition of this function in line 21, where we describe the diagnostic data and plot. Using the dictionary record, we can add custom provenance to the output of our diagnostic. Provenance is stored in the W3C PROV XML format and also in an SVG file under the work and plots directories. For more information, see recording provenance. Note that the record is passed as an argument to the save_data and save_figure functions above.
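
The caption in the provenance record is built with standard Python string formatting. The sketch below repeats that step outside ESMValTool, using a shortened attributes dictionary with values taken from the log output shown earlier, to show how the placeholders in the caption are filled from the metadata of the dataset.

```python
# Shortened attributes, with values copied from the log output above.
attributes = {
    "caption": ("Global map of {long_name} in January 2000 according to "
                "{dataset}.\n"),
    "long_name": "Near-Surface Air Temperature",
    "dataset": "BCC-ESM1",
}

# Same expression as in get_provenance_record: every {name} placeholder
# is replaced by the attribute with that name; unused keys are ignored.
caption = attributes["caption"].format(**attributes)

# A reduced record; the full one in diagnostic.py has more entries.
record = {"caption": caption, "ancestors": ["/path/to/input_file.nc"]}
```

The ancestors path here is a placeholder. In diagnostic.py it is the preprocessed input file, which links each output back to the data it was computed from.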

Congratulations!

You now know the basic diagnostic script structure and some available tools for putting together your own diagnostics. Have a look at existing recipes and diagnostics in the repository for more examples of functions you can use in your diagnostics!

Key Points

ESMValTool provides helper functions to interface a Python diagnostic script with preprocessor output.

Existing diagnostics can be used as templates and modified to write new diagnostics.

Helper functions can be imported from esmvaltool.diag_scripts.shared and used in your own diagnostic script.
