Command Line Interface


To run a pipeline you can use the command line utility canproc-pipeline:

canproc-pipeline#

Run a data pipeline

CONFIG: Path to the config file
INPUT: Directory containing the input files
OUTPUT: Directory where output files will be written

Example:

canproc-pipeline "config.yaml" "/space/hall5/sitestore/..."

Usage:

canproc-pipeline [OPTIONS] CONFIG INPUT [OUTPUT]

Options

-s, --scheduler <scheduler>#

The Dask scheduler to use: threads, processes, distributed, or single-threaded.

Options:

threads | processes | distributed | single-threaded | synchronous

-w, --workers <workers>#

Number of workers used by the distributed runner.

-t, --threads_per_worker <threads_per_worker>#

Number of threads per worker when using the distributed runner.

--dry-run#

Print the created DAG without running it; useful for debugging.
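
A dry run can serve as a quick check before a long job is launched. The following is a minimal Python sketch, not part of canproc itself: the helper names `dry_run_command` and `check_pipeline` are hypothetical, only the `--dry-run` flag and the CONFIG and INPUT arguments documented here are used, and it assumes that `canproc-pipeline` is on the PATH and exits with a non-zero status when the DAG cannot be built.

```python
import subprocess


def dry_run_command(config, input_dir):
    """Build the argument list for a dry run (hypothetical helper)."""
    # Only the documented --dry-run flag and positional arguments are used.
    return ["canproc-pipeline", "--dry-run", str(config), str(input_dir)]


def check_pipeline(config, input_dir):
    """Run the dry run and return (succeeded, captured stdout).

    Assumes canproc-pipeline is installed and signals failure
    through a non-zero exit status.
    """
    result = subprocess.run(
        dry_run_command(config, input_dir),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout
```

Calling `check_pipeline("config.yaml", "/path/to/input")` returns the printed DAG as text, so it can be inspected or logged before the real run starts.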

Arguments

CONFIG#

Required argument

INPUT#

Required argument

OUTPUT#

Optional argument

Runner Considerations#

For large pipelines, such as those used to process CanESM output, the distributed scheduler is often the fastest. On ppp machines, where a full node is used to process the data, it is often beneficial to use a large number of workers with one thread per worker, for example:

canproc-pipeline config.yaml /space/hall5... /space/hall5/... -w 40 -t 1
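
When the node size varies between machines, a command like the one above can be assembled programmatically. The sketch below is illustrative only: `distributed_command` is a hypothetical helper, the paths are placeholders, and the default of one worker per CPU reported by `os.cpu_count()` is an assumption rather than a canproc recommendation. It also passes `-s distributed` explicitly, on the assumption that the distributed scheduler may not be the default.

```python
import os
import shlex


def distributed_command(config, input_dir, output_dir,
                        workers=None, threads_per_worker=1):
    """Build a canproc-pipeline call for the distributed scheduler."""
    if workers is None:
        # Assumption: one worker per CPU visible to this process.
        workers = os.cpu_count() or 1
    return [
        "canproc-pipeline", str(config), str(input_dir), str(output_dir),
        "-s", "distributed",            # select the scheduler explicitly
        "-w", str(workers),             # number of workers
        "-t", str(threads_per_worker),  # threads per worker
    ]


if __name__ == "__main__":
    # Placeholder paths; print the command so it can be pasted into a job script.
    cmd = distributed_command("config.yaml", "/path/to/input", "/path/to/output")
    print(shlex.join(cmd))
```

Passing `workers=40` reproduces the worker and thread settings of the example above. On shared nodes `os.cpu_count()` may report more CPUs than the job was allocated, so an explicit worker count is the safer choice there.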