Pipeline Overview
A pipeline is organized into stages. Each key at the top
level of the YAML file is treated as a “stage”. The only required
stage is setup. For more detailed information on stage options,
see Stages.
setup Stage
Defines the directory structure, high-level information such as the model version,
and the order in which the following stages are evaluated.
Output directories
Output directories are specified relative to the input directory.
setup:
  # defines where the output files for each stage will be written to.
  output_directories:
    monthly: "diags/monthly"
    daily: "diags/daily"
    rtd: "diags/rtd"
    variability: "diags/landon"
Note
If a stage is not present in output_directories, the variables created in that stage will not be written to disk.
Loading Data
Input Files
By default, input files are expected to be in a flat directory in input_dir, with filenames of the form {input}/{variable}.nc.
This can be customized using the source keyword. As with encoding, a value of source set in setup is propagated to the variables in each stage.
If source is also set at the stage or variable level, the more specific (lower-level) value takes precedence.
For example, if most of our input files are organized in monthly folders, except for some land variables, we could write our configuration as:
setup:
  # defines the format to look for input files {input} and {variable} will be replaced dynamically.
  source: "{input}/*/{variable}.nc"
monthly:
  variables:
    - GT
    - BEG
    - ST:
        source: "{input}/land/CLASSIC_{variable}.nc"
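The precedence rule described above can be sketched in a few lines of Python. This is only an illustration of "most specific level wins", not the package's actual implementation; the function name resolve_option, the plain dictionaries, and the input directory /data/run01 are all hypothetical.

```python
def resolve_option(key, setup, stage, variable, default=None):
    """Return the most specific value of `key`: variable, then stage, then setup."""
    for level in (variable, stage, setup):
        if key in level:
            return level[key]
    return default


# Hypothetical in-memory form of the configuration above.
setup = {"source": "{input}/*/{variable}.nc"}
monthly = {}  # no stage-level source
gt = {}  # no variable-level source, so GT inherits from setup
st = {"source": "{input}/land/CLASSIC_{variable}.nc"}

gt_source = resolve_option("source", setup, monthly, gt)
st_source = resolve_option("source", setup, monthly, st)

# Fill in the placeholders for a hypothetical input directory.
gt_path = gt_source.format(input="/data/run01", variable="GT")
st_path = st_source.format(input="/data/run01", variable="ST")
print(gt_path)
print(st_path)
```

Here GT picks up the setup-level pattern while ST uses its own variable-level pattern. The same lookup order would apply to any option that is propagated from setup, such as encoding.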
Custom Loaders
By default, xarray’s open_mfdataset is used to open files, but this can be overridden if you would like to use another method.
If only the name of a function is provided, the source is passed to that function. If additional arguments or keyword arguments are required,
they should be set using args and kwargs. If the function expects a filename, pass source as an arg; it will
be replaced dynamically with the resolved source.
setup:
  # defines the format to look for input files {input} and {variable} will be replaced dynamically.
  source: "{input}/*/{variable}.nc"
monthly:
  variables:
    - ST:
        source: "{input}/land/CLASSIC_{variable}.001"
        loader: mymodule.load_ccc
    - GT:
        loader:
          function: xr.open_mfdataset
          args: [source] # uses the default source specified on setup
          kwargs:
            engine: netcdf4
            decode_times: false
            parallel: false
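To make the two loader forms concrete, the following sketch shows one way a loader entry could be turned into a function call. It is an assumption about the mechanism, not the package's code: call_loader, the registry dictionary, and fake_open are hypothetical, and fake_open merely records how it was called instead of reading any file.

```python
def call_loader(loader, source_path, registry):
    """Call a loader spec, replacing the literal "source" argument with the path."""
    if isinstance(loader, str):
        # Short form: only a function name, so the source is the single argument.
        loader = {"function": loader, "args": ["source"]}
    func = registry[loader["function"]]
    args = [source_path if a == "source" else a for a in loader.get("args", [])]
    return func(*args, **loader.get("kwargs", {}))


def fake_open(path, **kwargs):
    """Stand-in for a real reader; it only records the call it received."""
    return {"path": path, "kwargs": kwargs}


# Hypothetical lookup table from dotted names to callables.
registry = {"xr.open_mfdataset": fake_open, "mymodule.load_ccc": fake_open}

long_form = {
    "function": "xr.open_mfdataset",
    "args": ["source"],
    "kwargs": {"engine": "netcdf4", "decode_times": False, "parallel": False},
}
gt_call = call_loader(long_form, "/data/run01/*/GT.nc", registry)
st_call = call_loader("mymodule.load_ccc", "/data/run01/land/CLASSIC_ST.001", registry)
print(gt_call)
print(st_call)
```

A real pipeline would presumably import the function from its dotted name rather than use a lookup table; the table keeps the sketch self-contained.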
General information
setup:
  # general options that may affect how we process yaml->dag
  canesm_version: "6.0"
Ordering Stages
This defines the order in which stages are executed. For example, we may want to reuse data from the daily stage when
computing the monthly averages. In this case we could write:
setup:
  stages:
    - daily
    - monthly
If no data is reused between stages then this section can be omitted.
Reusing Stages
To reuse results from a previous stage, use the reuse keyword:
setup:
  stages:
    - transforms
    - daily
    - monthly
transforms:
  variables:
    - GT:
        rename: TS
daily:
  reuse: transforms
  variables:
    - GT
monthly:
  reuse: daily
  variables:
    - GT
    - ST
This tells the daily stage to use the variables from the output of
the transforms stage, and the monthly stage to use the variables from
the output of the daily stage. The reuse setting applies to all variables in
the stage in this file. Variables that are not defined in prior stages, e.g. ST here,
fall back to earlier stages, in this case the raw data loaded from disk. If multiple stages
are reused, a list can be provided, e.g. reuse: [transforms, monthly]
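The fallback behaviour can be pictured with a small lookup function. This is a sketch under stated assumptions: the name find_input, the dictionary layout, and the search order (reused stages in the order listed, then raw input) are illustrative guesses, not the documented algorithm.

```python
def find_input(variable, reuse, stage_outputs, raw):
    """Look for `variable` in the reused stages, falling back to the raw input."""
    for stage in reuse:
        if variable in stage_outputs.get(stage, {}):
            return stage, stage_outputs[stage][variable]
    return "raw", raw[variable]


# Hypothetical stand-ins for the data produced by each stage.
stage_outputs = {
    "transforms": {"GT": "GT from transforms"},
    "daily": {"GT": "GT from daily"},
}
raw = {"GT": "raw GT", "ST": "raw ST"}

# The monthly stage reuses daily: GT is found there, ST is not.
gt_origin = find_input("GT", ["daily"], stage_outputs, raw)
st_origin = find_input("ST", ["daily"], stage_outputs, raw)
print(gt_origin)
print(st_origin)
```

In this sketch GT is taken from the daily output, while ST, which no reused stage defines, comes from the raw data.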
Resampling Stages
Resampling stages take variables and aggregate them into coarser time bins. Currently the following stages are supported:
3hourly
6hourly
daily
monthly
yearly
# compute the monthly mean of `GT` and `ST` variables
monthly:
  variables:
    - GT
    - ST
Custom Resampling
Additional resampling options can also be applied to all variables in a stage using the resample keyword.
For example, to compute a 3-day average we could use:
custom_stage:
  resample: 3D
  variables:
    - ST
    - GT
By default this will perform a mean, but min, max and std are also supported.
custom_stage:
  resample:
    resolution: 3D
    method: std
  variables:
    - ST
    - GT
Cycle Stages
Cycle stages take variables and aggregate them over a repeating cycle, so that the same part of each cycle (e.g. the same month of every year) is combined. Currently the following stages are supported:
annual_cycle
# compute the monthly annual cycle of `GT` and `ST` variables
annual_cycle:
  variables:
    - GT
    - ST
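One way to read the difference between a resampling stage and a cycle stage is sketched below in plain Python. This is an interpretation of the examples on this page, not a description of the implementation; the helper group_mean and the two-year sample series are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical two-year series with one sample per month.
samples = [
    (date(year, month, 1), float(month + 10 * (year - 2000)))
    for year in (2000, 2001)
    for month in range(1, 13)
]


def group_mean(samples, key):
    """Average the values of all samples that share the same key."""
    groups = defaultdict(list)
    for when, value in samples:
        groups[key(when)].append(value)
    return {k: mean(v) for k, v in groups.items()}


# Resampling: bins are consecutive in time, so each year keeps its own value.
yearly = group_mean(samples, key=lambda d: d.year)

# Cycle: bins repeat, so January 2000 and January 2001 are averaged together.
annual_cycle = group_mean(samples, key=lambda d: d.month)

print(yearly)
print(annual_cycle[1], annual_cycle[12])
```

The yearly result has one entry per year in the record, whereas the annual cycle always has twelve entries regardless of how many years are present.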
Custom Cycles
Additional cycle options can also be applied to all variables in a stage using the cycle keyword.
For example, to compute a daily annual cycle we could use:
custom_stage:
  cycle: dayofyear
  variables:
    - ST
    - GT
By default this will perform a mean, but min, max and std are also supported.
custom_stage:
  cycle:
    group: dayofyear
    method: std
  variables:
    - ST
    - GT
rtd Stage
A default RTD stage that converts variables to yearly global average values.
# compute the global, annual mean of `GT` and `ST` variables
rtd:
  variables:
    - GT
    - ST
Custom Stages
Users can create their own stages. These do not perform any operations by default except saving the output to a file.
Instead, users can provide function names, arguments and keyword arguments that are assembled into a DAG.
Most parameters are optional, but in the complete form:
# compute monthly standard deviation of the `GT` variable
variability:
  variables:
    - GT:
        dag:
          dag:
            - name: resampled
              function: xr.self.resample
              args: [GT]
              kwargs:
                time: MS
            - name: monthly_std
              function: xr.self.std
              args: [resampled]
          output: monthly_std
If you would like to call your own functions in a pipeline, see User functions.
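The idea of chaining named steps, where the args of one step refer to the name of an earlier result, can be sketched as follows. This is not the package's DAG engine: run_dag, chunk and chunk_std are hypothetical, functions are passed as Python callables rather than dotted names, and the steps are assumed to be listed in dependency order (a real scheduler would sort them).

```python
from statistics import pstdev


def run_dag(steps, output, inputs):
    """Evaluate steps in order; string args naming a known result are replaced by it."""
    results = dict(inputs)
    for step in steps:
        args = [
            results.get(a, a) if isinstance(a, str) else a
            for a in step.get("args", [])
        ]
        results[step["name"]] = step["function"](*args, **step.get("kwargs", {}))
    return results[output]


def chunk(values, size):
    """Split a list into consecutive chunks of `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def chunk_std(chunks):
    """Population standard deviation of each chunk."""
    return [pstdev(c) for c in chunks]


steps = [
    {"name": "resampled", "function": chunk, "args": ["GT"], "kwargs": {"size": 3}},
    {"name": "chunk_std", "function": chunk_std, "args": ["resampled"]},
]
result = run_dag(steps, output="chunk_std", inputs={"GT": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0]})
print(result)
```

The first step plays the role of the resampling step and the second the role of the standard deviation step; only the value named by output is returned.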
NetCDF4 Encoding
If you want to write the netCDF files using a particular encoding, this can be done at the variable, stage or
setup level, depending on the scope you would like it to apply to. In the example below we specify the default encoding
as float32 with a _FillValue of 1.0e20. Unless otherwise specified, variables will be written
with this encoding (e.g. the daily ST variable). The monthly stage
overrides this and sets a new default, so the monthly variables (e.g. ST) will have that encoding. Lastly, if we want a
specific encoding for the monthly GT variable, we can set this at the variable level.
setup:
  ...
  encoding:
    dtype: float32
    _FillValue: 1.0E+20 # note yaml format requires both a "." and a "+" to be read as a float
monthly:
  reuse: daily
  encoding:
    dtype: float64
    _FillValue: -999
  variables:
    - ST
    - GT:
        encoding:
          dtype: float64
          _FillValue: 1.0E+20
daily:
  variables:
    - ST
Variable Attributes
By default, the output variables are assigned a long_name and a units attribute. You can specify the desired values in the YAML configuration; otherwise, they will be set to “N/A”. Additional attributes can also be listed under the metadata key. The minimum and maximum values in the data array can also be added as attributes by setting the keys min and max to True.
setup:
  ...
monthly:
  reuse: daily
  variables:
    - GT:
        metadata:
          long_name: "Monthly mean ground temperature aggregated over all tiles"
          units: "K"
          min: True
          max: True
          project: CMIP
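The attribute rules described above (defaults of “N/A”, user overrides, optional min and max) can be sketched as follows. The function build_attrs and the sample values are hypothetical, and the sketch works on a plain list rather than a data array, so it only illustrates the rules and not the actual code.

```python
def build_attrs(metadata, values):
    """Combine default attributes with user metadata and optional min/max."""
    attrs = {"long_name": "N/A", "units": "N/A"}
    for key, value in metadata.items():
        if key in ("min", "max"):
            if value:  # only record the statistic when the flag is true
                attrs[key] = min(values) if key == "min" else max(values)
        else:
            attrs[key] = value
    return attrs


# Hypothetical metadata that leaves long_name unset, and a few sample values in K.
metadata = {"units": "K", "min": True, "max": True, "project": "CMIP"}
attrs = build_attrs(metadata, [271.5, 288.0, 301.25])
print(attrs)
```

Because long_name is not given in this sketch, it keeps the “N/A” default, while units and project are copied through and min and max are computed from the data.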