.. _running:

Running a DAG
-------------

The DAG defines the processes that need to be run, their inputs and outputs,
and the relationships between them, but it is agnostic about how they are
executed. For that we use a ``Runner``. A simple runner that works well with
NetCDF, xarray, and NumPy is ``dask``. To process our DAG using dask:

.. code-block:: python

    from canproc.runners import DaskRunner

    runner = DaskRunner()
    output = runner.run(dag)

For larger jobs with many nodes that can run in parallel, the
``DaskDistributedRunner`` provides more control over the number of workers and
threads. Be aware, however, that memory is divided equally among the workers,
so chunk sizes may need to be adjusted accordingly.

.. code-block:: python

    from canproc.runners import DaskDistributedRunner

    runner = DaskDistributedRunner(workers=40, threads_per_worker=1)
    output = runner.run(dag)

If you have a large pipeline with many independent branches, you can split the
DAG into smaller sub-DAGs and run them separately. This can reduce memory usage
and make better use of the available resources. To split a DAG into sub-DAGs,
use the ``target_size`` argument, which tries to group nodes so that each
sub-DAG contains approximately the specified number of nodes.

.. code-block:: python

    from canproc.runners import DaskDistributedRunner

    runner = DaskDistributedRunner(workers=40, threads_per_worker=1, target_size=20)
    output = runner.run(dag)

.. note::

    DAGs will only be split if there are independent branches that can be run
    separately. If the DAG is fully connected, this option will have no effect.
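Under the hood, a dask-based runner ultimately hands a task graph to one of
dask's schedulers, which walks the dependencies and executes each node. As a
minimal sketch of that execution model (this uses dask directly and is not
part of the canproc API):

.. code-block:: python

    from operator import add, mul
    from dask.threaded import get  # dask's multithreaded scheduler

    # A dask task graph: keys map to data or to (callable, *args) tuples,
    # where string arguments refer to other keys in the graph.
    dsk = {
        "x": 1,
        "y": 2,
        "sum": (add, "x", "y"),      # sum = x + y
        "result": (mul, "sum", 10),  # result = sum * 10
    }

    print(get(dsk, "result"))  # prints 30

The scheduler resolves ``"sum"`` before ``"result"`` because the graph encodes
that dependency; independent keys can be evaluated in parallel.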