.. _running:

Running a DAG
-------------

The DAG defines the processes that need to be run, their inputs and outputs,
and the relationships between them, but it is agnostic about how they are
executed. For that we use a ``Runner``. A simple runner that works well with
NetCDF, xarray, and NumPy is ``dask``. To process our DAG using dask:

.. code-block:: python

    from canproc.runners import DaskRunner

    runner = DaskRunner()
    output = runner.run(dag)

For larger jobs with many nodes that can run in parallel, the
``DaskDistributedRunner`` provides more control over the number of workers and
threads. Be aware, however, that memory is divided equally among the workers,
so chunk sizes may need to be adjusted accordingly.

.. code-block:: python

    from canproc.runners import DaskDistributedRunner

    runner = DaskDistributedRunner(workers=40, threads_per_worker=1)
    output = runner.run(dag)

If you have a large pipeline with many independent branches, you can split the
DAG into smaller sub-DAGs and run them separately. This can reduce memory usage
and make better use of the available resources. To split a DAG into sub-DAGs,
use the ``target_size`` argument, which tries to group nodes so that each
sub-DAG contains approximately the specified number of nodes.

.. code-block:: python

    from canproc.runners import DaskDistributedRunner

    runner = DaskDistributedRunner(workers=40, threads_per_worker=1, target_size=20)
    output = runner.run(dag)

.. note::

    DAGs will only be split if there are independent branches that can be run
    separately. If the DAG is fully connected, this option will have no effect.
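Under the hood, a dask-based runner ultimately hands a task graph to one of
dask's schedulers, which walks the dependencies and executes each node. As a
minimal sketch of that execution model (this uses dask directly and is not
part of the canproc API):

.. code-block:: python

    from operator import add, mul
    from dask.threaded import get  # dask's multithreaded scheduler

    # A dask task graph: keys map to data or to (callable, *args) tuples,
    # where string arguments refer to other keys in the graph.
    dsk = {
        "x": 1,
        "y": 2,
        "sum": (add, "x", "y"),      # sum = x + y
        "result": (mul, "sum", 10),  # result = sum * 10
    }

    print(get(dsk, "result"))  # prints 30

The scheduler resolves ``"sum"`` before ``"result"`` because the graph encodes
that dependency; independent keys can be evaluated in parallel.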