Pipeline Overview#

A pipeline is organized into stages. Each top-level key in the YAML file is treated as a “stage”. The only required stage is setup. For more detailed information on stage options, see Stages.

setup Stage#

Defines the directory structure, high-level information such as the model version, and the order in which the subsequent stages are evaluated.

Output directories#

Output directories are specified relative to the input directory.

setup:

  # defines where the output files for each stage will be written.
  output_directories:
    monthly: "diags/monthly"
    daily: "diags/daily"
    rtd: "diags/rtd"
    variability: "diags/landon"

Note

If a stage is not present in output_directories, the variables created in that stage will not be written to disk.
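As a rough illustration of the two rules above (paths are relative to the input directory, and stages without an entry are not written), the lookup could be sketched as follows. The helper names and the /data/run01 input directory are hypothetical and are not part of the pipeline's API.

```python
from pathlib import Path


def resolve_output_dirs(input_dir, output_directories):
    """Join each stage's relative output path onto the input directory."""
    base = Path(input_dir)
    return {stage: base / rel for stage, rel in output_directories.items()}


# Same mapping as the setup example above.
output_directories = {
    "monthly": "diags/monthly",
    "daily": "diags/daily",
    "rtd": "diags/rtd",
    "variability": "diags/landon",
}

# "/data/run01" is a made-up input directory used only for this sketch.
paths = resolve_output_dirs("/data/run01", output_directories)


def should_write(stage):
    """A stage is only written if it has an entry in output_directories."""
    return stage in paths
```

Here paths["monthly"] points at diags/monthly under the input directory, while a stage such as annual_cycle, which has no entry, would be skipped.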

Loading Data#

Input Files#

By default, input files are expected to be in a flat directory in input_dir with filenames of {input}/{variable}.nc. This can be customized using the source keyword. As with encoding, the value of source is propagated to variables in each stage. If source is set at the stage or variable level, the lower level value will take precedence. For example, if most of our input files are organized in monthly folders, except for some land variables, we could write our configuration as:

setup:

  # defines the pattern used to find input files; {input} and {variable} will be replaced dynamically.
  source: "{input}/*/{variable}.nc"

monthly:

  variables:
    - GT
    - BEG
    - ST:
        source: "{input}/land/CLASSIC_{variable}.nc"
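The precedence rule can be sketched in a few lines of Python. This is a minimal illustration that assumes the most specific level that defines source wins and that the placeholders are filled with ordinary string formatting. The function name resolve_source and the /data/run01 directory are hypothetical.

```python
# Default pattern described in the text above.
DEFAULT_SOURCE = "{input}/{variable}.nc"


def resolve_source(variable, input_dir, setup=None, stage=None, var=None):
    """Pick the most specific source template and fill in the placeholders."""
    # Variable level beats stage level, which beats setup, which beats the default.
    template = var or stage or setup or DEFAULT_SOURCE
    return template.format(input=input_dir, variable=variable)


setup_source = "{input}/*/{variable}.nc"

# GT inherits the setup-level pattern; the "*" is left for the loader to expand.
gt_path = resolve_source("GT", "/data/run01", setup=setup_source)

# ST overrides the pattern at the variable level.
st_path = resolve_source(
    "ST",
    "/data/run01",
    setup=setup_source,
    var="{input}/land/CLASSIC_{variable}.nc",
)
```

With the configuration above, GT resolves to a pattern under the monthly folders while ST resolves to the CLASSIC file in the land folder.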

Custom Loaders#

By default, xarray’s open_mfdataset is used to open files, but this can be overridden if you would like to use another method. If the name of a function is provided, the source will be passed to this function. If additional arguments or keyword arguments are required, these should be set using args and kwargs. If a filename is expected, it should be passed as an arg named source, which will be replaced dynamically with the resolved source.

setup:

  # defines the pattern used to find input files; {input} and {variable} will be replaced dynamically.
  source: "{input}/*/{variable}.nc"

monthly:

  variables:
    - ST:
        source: "{input}/land/CLASSIC_{variable}.001"
        loader: mymodule.load_ccc
    - GT:
        loader:
          function: xr.open_mfdataset
          args: [source]  # uses the default source specified on setup
          kwargs:
            engine: netcdf4
            decode_times: false
            parallel: false
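The two loader forms above (a bare function name, or a function with args and kwargs) could be dispatched roughly as in the sketch below. This is only an assumption about the behaviour described in the text: call_loader and fake_open are hypothetical names, and callables are passed directly instead of dotted names so that the example runs without xarray.

```python
def call_loader(loader, source):
    """Call a loader, replacing the "source" placeholder with the real path."""
    if callable(loader):
        # Short form: only a function was given, so it receives the source.
        return loader(source)
    func = loader["function"]
    # Long form: the literal string "source" in args is swapped for the path.
    args = [source if arg == "source" else arg for arg in loader.get("args", [])]
    kwargs = loader.get("kwargs", {})
    return func(*args, **kwargs)


def fake_open(path, engine=None, decode_times=True):
    """Stand-in for a real loader; it just records how it was called."""
    return {"path": path, "engine": engine, "decode_times": decode_times}


short_form = call_loader(fake_open, "land/CLASSIC_ST.001")
long_form = call_loader(
    {
        "function": fake_open,
        "args": ["source"],
        "kwargs": {"engine": "netcdf4", "decode_times": False},
    },
    "2000-01/GT.nc",
)
```

A user-supplied loader such as mymodule.load_ccc would take the place of fake_open and return the opened data.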

General information#

setup:
  # general options that may affect how we process yaml->dag
  canesm_version: "6.0"

Ordering Stages#

The stages key defines the order in which stages are executed. For example, we may want to reuse data from the daily stage when computing the monthly averages. In this case we could write:

setup:
  stages:
    - daily
    - monthly

If no data is reused between stages then this section can be omitted.

Reusing Stages#

To reuse results from a previous stage, use the reuse keyword:

setup:
  stages:
    - transforms
    - daily
    - monthly

transforms:
  variables:
    - GT:
        rename: TS

daily:
  reuse: transforms
  variables:
    - GT

monthly:
  reuse: daily
  variables:
    - GT
    - ST

This tells the daily stage to use the variables from the output of the transforms stage, and the monthly stage to use the variables from the output of the daily stage. This applies to all variables in the stage in this file. Variables that are not defined in prior stages, e.g. ST here, will fall back to earlier stages, in this case the raw data loaded from disk. If multiple stages are reused, a list can be provided, e.g. reuse: [transforms, monthly]
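The fallback behaviour can be pictured as walking back through the chain of reused stages until the variable is found. The sketch below is a simplified model under stated assumptions: each stage reuses at most one parent, renames are ignored, and find_variable along with the placeholder strings are hypothetical.

```python
def find_variable(name, stage, config, outputs, raw):
    """Search the chain of reused stages, then fall back to the raw data."""
    parent = config[stage].get("reuse")
    while parent is not None:
        if name in outputs.get(parent, {}):
            return outputs[parent][name]
        # Not produced by this stage: keep walking towards earlier stages.
        parent = config[parent].get("reuse")
    return raw[name]


config = {
    "transforms": {},
    "daily": {"reuse": "transforms"},
    "monthly": {"reuse": "daily"},
}
# Placeholder strings stand in for the actual data arrays.
outputs = {
    "transforms": {"GT": "GT from transforms"},
    "daily": {"GT": "GT from daily"},
}
raw = {"GT": "GT from disk", "ST": "ST from disk"}

monthly_gt = find_variable("GT", "monthly", config, outputs, raw)
monthly_st = find_variable("ST", "monthly", config, outputs, raw)
```

In this model the monthly GT comes from the daily output, while ST, which no earlier stage produced, comes from the raw data.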

Resampling Stages#

Resampling stages take variables and aggregate them into coarser time bins. Currently the following stages are supported:

  • 3hourly

  • 6hourly

  • daily

  • monthly

  • yearly

# compute the monthly mean of `GT` and `ST` variables
monthly:
  variables:
    - GT
    - ST

Custom Resampling#

Additional resampling options can also be applied to all variables in a stage using the resample keyword. If we wanted to compute a 3-day average we could use:

custom_stage:
  resample: 3D
  variables:
    - ST
    - GT

By default this will perform a mean, but min, max and std are also supported.

custom_stage:
  resample:
    resolution: 3D
    method: std
  variables:
    - ST
    - GT
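To make the idea of time binning concrete, here is a pure-Python sketch of a monthly resample with a selectable method. It is not the pipeline's implementation, which works on xarray objects. The function name and the synthetic data are hypothetical, and the population standard deviation is used for std as an assumption.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean, pstdev

# The four reductions named in the text; pstdev is an assumed choice for "std".
METHODS = {"mean": mean, "min": min, "max": max, "std": pstdev}


def resample_monthly(times, values, method="mean"):
    """Group values into (year, month) bins and reduce each bin."""
    bins = defaultdict(list)
    for t, v in zip(times, values):
        bins[(t.year, t.month)].append(v)
    reduce = METHODS[method]
    return {key: reduce(vals) for key, vals in sorted(bins.items())}


# 60 daily samples covering January and February 2000 (a leap year).
times = [date(2000, 1, 1) + timedelta(days=i) for i in range(60)]
# Synthetic data: each value is simply the month number.
values = [float(t.month) for t in times]

monthly_mean = resample_monthly(times, values)
monthly_std = resample_monthly(times, values, method="std")
```

Each output bin keeps its year, so two Januaries from different years stay separate.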

Cycle Stages#

Cycle stages group variables by a repeating calendar period, such as the month of the year, and aggregate across all years. Currently the following stages are supported:

  • annual_cycle

# compute the monthly annual cycle of `GT` and `ST` variables
annual_cycle:
  variables:
    - GT
    - ST

Custom Cycles#

Additional cycle options can also be applied to all variables in a stage using the cycle keyword. If we wanted to compute a daily annual cycle we could use:

custom_stage:
  cycle: dayofyear
  variables:
    - ST
    - GT

By default this will perform a mean, but min, max and std are also supported.

custom_stage:
  cycle:
    group: dayofyear
    method: std
  variables:
    - ST
    - GT
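A cycle differs from a resample in that samples from different years land in the same bin. The pure-Python sketch below illustrates this grouping. It is an assumption about the general technique, not the pipeline's code, and the function name and synthetic data are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def annual_cycle(times, values, group="month"):
    """Average all samples that share the same month (or day of year)."""
    bins = defaultdict(list)
    for t, v in zip(times, values):
        key = t.month if group == "month" else t.timetuple().tm_yday
        bins[key].append(v)
    return {key: mean(vals) for key, vals in sorted(bins.items())}


# Two years of mid-month samples; the second year is offset by 10.
years = (2000, 2001)
times = [date(y, m, 15) for y in years for m in range(1, 13)]
values = [m + 10.0 * (y - 2000) for y in years for m in range(1, 13)]

cycle = annual_cycle(times, values)
```

The result has twelve entries regardless of how many years of input are supplied.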

rtd Stage#

A default RTD stage that converts variables to yearly global average values.

# compute the global, annual mean of `GT` and `ST` variables
rtd:
  variables:
    - GT
    - ST

Custom Stages#

Users can create their own stages. These do not perform any operations by default except saving the output to a file. Instead, users can provide function names, arguments and keyword arguments that are assembled into a DAG. Most parameters are optional, but in the complete form:

# compute monthly standard deviation of the `GT` variable
variability:
  variables:
    - GT:
        dag:
          dag:
            - name: resampled
              function: xr.self.resample
              args: [GT]
              kwargs:
                time: MS
            - name: monthly_std
              function: xr.self.std
              args: [resampled]
          output: monthly_std
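One way to picture how such a list of named steps is evaluated is shown below. This is a simplified sketch, assuming the steps are listed in dependency order and that string arguments naming an input or an earlier step are replaced by the corresponding result. The run_dag function and the toy functions are hypothetical.

```python
def run_dag(steps, output, inputs, functions):
    """Evaluate steps in order and return the result named by `output`."""
    results = dict(inputs)
    for step in steps:
        func = functions[step["function"]]
        # String args that match a known name are replaced by that result.
        args = [
            results.get(arg, arg) if isinstance(arg, str) else arg
            for arg in step.get("args", [])
        ]
        results[step["name"]] = func(*args, **step.get("kwargs", {}))
    return results[output]


# Toy functions standing in for real operations such as resample and std.
functions = {
    "scale": lambda data, factor=1: [v * factor for v in data],
    "total": sum,
}

steps = [
    {"name": "scaled", "function": "scale", "args": ["GT"], "kwargs": {"factor": 2}},
    {"name": "summed", "function": "total", "args": ["scaled"]},
]

result = run_dag(steps, "summed", {"GT": [1, 2, 3]}, functions)
```

The second step refers to the first by its name, in the same way that monthly_std refers to resampled in the configuration above.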

If you would like to call your own functions in a pipeline, see User functions.

NetCDF4 Encoding#

If you want to write the NetCDF files using a particular encoding, this can be set at the variable, stage or setup level, depending on the scope you would like it to apply to. In the example below we specify the default encoding as float32 with a _FillValue of 1.0e20. Unless otherwise specified, variables will be written with this encoding (e.g. the daily ST variable). The monthly stage overrides this and sets a new default, so the monthly variables (e.g. ST) will have this encoding. Lastly, if we want a specific encoding for the monthly variable GT, we can set this at the variable level.

setup:
  ...
  encoding:
    dtype: float32
    _FillValue: 1.0E+20  # note yaml format requires both a "." and a "+" to be read as a float

monthly:
  reuse: daily
  encoding:
    dtype: float64
    _FillValue: -999
  variables:
    - ST
    - GT:
        encoding:
          dtype: float64
          _FillValue: 1.0E+20

daily:
  variables:
    - ST
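The scoping in this example can be summarised as "the most specific level wins". The sketch below assumes that a lower-level encoding block replaces the higher-level one as a whole rather than being merged key by key, which the text does not spell out. The resolve_encoding helper is hypothetical.

```python
def resolve_encoding(setup, stage, variable):
    """Return the encoding from the most specific level that defines one."""
    for level in (variable, stage, setup):
        if level.get("encoding") is not None:
            return level["encoding"]
    return {}


setup = {"encoding": {"dtype": "float32", "_FillValue": 1.0e20}}
monthly = {"encoding": {"dtype": "float64", "_FillValue": -999}}
daily = {}
monthly_gt = {"encoding": {"dtype": "float64", "_FillValue": 1.0e20}}
plain_variable = {}  # a variable listed without its own encoding, e.g. ST

daily_st_enc = resolve_encoding(setup, daily, plain_variable)
monthly_st_enc = resolve_encoding(setup, monthly, plain_variable)
monthly_gt_enc = resolve_encoding(setup, monthly, monthly_gt)
```

This reproduces the three cases described above: the daily ST uses the setup default, the monthly ST uses the stage default, and the monthly GT uses its own encoding.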

Variable Attributes#

By default, the output variables are assigned long_name and units attributes. You can specify the desired values by listing them in the YAML configuration; otherwise, they will be listed as “N/A”. Additional attributes can also be listed under the metadata key. The minimum and maximum values in the data array can also be added as attributes by setting the keys min and max to True.

setup:
  ...

monthly:
  reuse: daily
  variables:
    - GT:
        metadata:
          long_name: "Monthly mean ground temperature aggregated over all tiles"
          units: "K"
          min: True
          max: True
          project: CMIP
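The attribute rules can be sketched as a small function that starts from the “N/A” defaults, applies the user metadata, and computes min and max on request. This is only an illustration: build_attrs is a hypothetical helper, the sample values are made up, and storing the results under the names min and max is an assumption.

```python
def build_attrs(values, metadata):
    """Combine default attributes, user metadata and optional min/max."""
    meta = dict(metadata)
    # min/max are flags rather than literal attribute values.
    want_min = meta.pop("min", False)
    want_max = meta.pop("max", False)
    attrs = {"long_name": "N/A", "units": "N/A"}
    attrs.update(meta)
    if want_min:
        attrs["min"] = min(values)
    if want_max:
        attrs["max"] = max(values)
    return attrs


# Made-up temperatures in kelvin, used only to exercise the sketch.
attrs = build_attrs(
    [271.5, 288.0, 301.2],
    {"units": "K", "min": True, "max": True, "project": "CMIP"},
)
```

Because long_name was not supplied in this call, it keeps the “N/A” default, while units and project come from the metadata.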