Command Line Interface#
To run a pipeline you can use the command line utility canproc-pipeline:
canproc-pipeline#
Run a data pipeline
canproc-pipeline "config.yaml" "/space/hall5/sitestore/..."
canproc-pipeline [OPTIONS] CONFIG INPUT [OUTPUT]
Options
- -s, --scheduler <scheduler>#
The Dask scheduler to use: threads, processes, distributed, single-threaded, or synchronous
- Options:
threads | processes | distributed | single-threaded | synchronous
- -w, --workers <workers>#
Number of workers to use with the distributed runner
- -t, --threads_per_worker <threads_per_worker>#
Number of threads per worker when using the distributed runner
- --dry-run#
Print the constructed DAG without running it; useful for debugging
Arguments
- CONFIG#
Required argument
- INPUT#
Required argument
- OUTPUT#
Optional argument
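Before launching a long run, the pipeline can be previewed with --dry-run. A minimal sketch (the input and output paths below are placeholders, not real directories):

canproc-pipeline config.yaml /path/to/input /path/to/output --dry-run

This prints the DAG that would be executed, so mistakes in the config can be caught before committing compute resources.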
Runner Considerations#
For large pipelines, such as those used to process CanESM output, the distributed
scheduler is often the fastest. On ppp machines, where a full node is used to process
the data, it is often beneficial to set a large number of workers with one thread per worker.
canproc-pipeline config.yaml /space/hall5... /space/hall5/... -w 40 -t 1