.. pipelines


Pipeline Overview
-----------------

A pipeline is organized into :code:`stages`. Each option at the base 
level of a :code:`YAML` file is treated as a "stage". The only required 
stage is :code:`setup`. For more detailed information on stage options 
see :ref:`stages`


:code:`setup` Stage
*******************
Defines the directory structure, high level information like model version 
and the order in which the following :code:`stages` are evaluated.


Output directories
^^^^^^^^^^^^^^^^^^

Output directories are specified relative to the input directory.

.. code-block:: YAML

    setup:

      # defines where the output files for each stage will be written too.
      output_directories:
        monthly: "diags/monthly"
        daily: "diags/daily"
        rtd: "diags/rtd"
        variability: "diags/landon"


.. note::

   If a stage is not present in :code:`output_directories` the variables created in that stage will not be written to disc.


Loading Data
^^^^^^^^^^^^

Input Files
"""""""""""
By default input files are expected to be in a flat directory in ``input_dir`` with filenames of ``{input}/{variable}.nc``. 
This can be customized using the ``source`` keyword. As with ``encoding``, the value of ``source`` is propagated to variables in each stage.
If ``source`` is set at the ``stage`` or ``variable`` level, the lower level value will take precedent.
For example, if most of our input files are organized in monthly folders, expect for some land variables we could write our configuration as:

.. code-block:: YAML

    setup:

      # defines the format to look for input files {input} and {variable} will be replaced dynamically.
      source: "{input}/*/{variable}.nc"

    monthly:

       variables:
         - GT
         - BEG
         - ST:
             source: "{input}/land/CLASSIC_{variable}.nc"


Custom Loaders
""""""""""""""
By default ``xarray``'s ``open_mfdataset`` is used to open files, but if you would like to use other methods this can be overwritten. 
If a name of a function is provided then the source will be passed to this function. If additional arguments or kwargs are required 
these should be set using ``args`` and ``kwargs``. If a filename is expected then it should be passed as an arg and ``source`` will 
be dynamically replaced.

.. code-block:: YAML

    setup:

      # defines the format to look for input files {input} and {variable} will be replaced dynamically.
      source: "{input}/*/{variable}.nc"

    monthly:

       variables:
         - ST:
             source: "{input}/land/CLASSIC_{variable}.001"
             loader: mymodule.load_ccc
         - GT: 
             loader:
               function: xr.open_mfdataset
               args: [source]  # uses the default source specified on setup
               kwargs:
                 engine: netcdf4
                 decode_times: false
                 parallel: false


General information
^^^^^^^^^^^^^^^^^^^

.. code-block:: YAML

    setup:
      # general options that may affect how we process yaml->dag
      canesm_version: "6.0"


Ordering Stages
^^^^^^^^^^^^^^^
This defines the order in which :code:`stages` are executed. For example, we may want to reuse data from the daily stage when
computing the monthly averages, in this case we could write:

.. code-block:: YAML

    setup:
      stages:
        - daily
        - monthly


If no data is reused between stages then this section can be omitted.


Reusing Stages
^^^^^^^^^^^^^^
To reuse results from a previous stage, the `reuse` keyword can be used

.. code-block:: YAML

    setup:
      stages:
        - transforms
        - daily
        - monthly

    transforms:
      variables:
        - GT:
            rename: TS

    daily:
      reuse: transforms
      variables:
        - GT

    monthly:
      reuse: daily
      variables:
        - GT
        - ST


This will tell the :code:`daily` stage to use the variables from the output of 
the :code:`transforms` stage and the :code:`monthly` stage to use the variables from 
the output of the :code:`daily` stage. This will be applied to all variables in 
the stage in this file. Variables that are not defined in prior stages, e.g. :code:`ST` here,
will fallback to earlier stages, in this case the raw data loaded from disc. If multiple stages
are reused a list can be provided e.g.: :code:`reuse: [transforms, monthly]`


Resampling Stages
*****************

Resampling stages take variables and aggregrates them into coarser time bins. Currently the following stages are supported:

 - 3hourly
 - 6hourly
 - daily
 - monthly
 - yearly


.. code-block:: YAML
    
    # compute the monthly mean of `GT` and `ST` variables
    monthly:
      variables:
        - GT
        - ST


Custom Resampling
^^^^^^^^^^^^^^^^^

Additional resampling options can also be applied to all variables in a stage using the :code:`resample` keyword.
If we wanted to do a 3-day average we could use

.. code-block:: YAML

    custom_stage:
      resample: 3D
      variables:
        - ST
        - GT

By default this will peform a mean, but :code:`min`, :code:`max` or :code:`std` are also supported.

.. code-block:: YAML

    custom_stage:
      resample:
        resolution: 3D
        method: std
      variables:
        - ST
        - GT


Cycle Stages
************

Cycling stages take variables and aggregrates them into coarser time bins. Currently the following stages are supported:

 - annual_cycle


.. code-block:: YAML

    # compute the monthly annual cycle of `GT` and `ST` variables
    annual_cycle:
      variables:
        - GT
        - ST


Custom Cycles
^^^^^^^^^^^^^

Additional cycle options can also be applied to all variables in a stage using the :code:`cycle` keyword.
If we wanted to do a daily annual cycle we could use

.. code-block:: YAML

    custom_stage:
      cycle: dayofyear
      variables:
        - ST
        - GT

By default this will peform a mean, but :code:`min`, :code:`max` or :code:`std` are also supported.

.. code-block:: YAML

    custom_stage:
      cycle:
        group: dayofyear
        method: std
      variables:
        - ST
        - GT


:code:`rtd` Stage
*****************
A default RTD stage that converts variables to yearly global average values.

.. code-block:: YAML

    # compute the global, annual mean of `GT` and `ST` variables
    rtd:
      variables:
        - GT
        - ST


Custom Stages
*************
Users can create their own stages. These do not perform any operations by default except saving the ouptut to a file.
Instead, users can provide function names, arguments and keyword arguments that are constructed into a :code:`DAG`. 
Most parameters are optional, but in the complete form:

.. code-block:: YAML

    # compute monthly standard deviation of the `GT` variable
    variability:
      variables:
        - GT:
            dag:
              dag:
                - name: resampled
                  function: xr.self.resample
                  args: [GT]
                  kwargs:
                    time: MS
                - name: monthly_std
                  function: xr.self.std
                  args: [resampled]
              output: monthly_std


If you would like to call your own functions in a pipeline, see :ref:`custom_functions`.


NetCDF4 Encoding
****************

If you want to write the netcdf files using a particular encoding this can be done at the variable, stage or 
setup level, depending on the scope you would like it to apply. In the example below we specify the default encoding
as :code:`float32` with a :code:`_FillValue` of :code:`1.0e20`. Unless otherwise specified variables will be written
with this encoding (e.g. the daily :code:`ST` variable). The :code:`monthly` stage 
overwrites this and sets a new default, so the monthly variables (e.g. :code:`ST`) will have this encoding. Lastly, if we want a
specific encoding for the monthly, variable, :code:`GT` we can set this at the variable level.

.. code-block:: YAML

    setup:
      ...
      encoding:
        dtype: float32
        _FillValue: 1.0E+20  # note yaml format requires both a "." and a "+" to be read as a float

    monthly:
      reuse: daily
      encoding:
        dtype: float64
        _FillValue: -999
      variables:
        - ST
        - GT:
            encoding: 
              dtype: float64
              _FillValue: 1.0E+20

    daily:
      variables:
        - ST


Variable Attributes
*******************

By default, the output variables are assigned a `long_name` and `units` attribute. You can specify the desired values
by listing them in the YAML configuration; otherwise, they will be listed as "N/A". Additional attributes can also be listed
under the `metadata` key. The minimum and maximum values in the data array can also be added as an attribute by adding
the keys `min/max: True`.

.. code-block:: YAML

    setup:
      ...

    monthly:
      reuse: daily
      variables:
        - GT:
            metadata:
              long_name: "Monthly mean ground temperature aggregated over all tiles"
              units: "K"
              min: True
              max: True
              project: CMIP