Command Line Interface#

To run a pipeline you can use the command line utility canproc-pipeline:

canproc-pipeline#

Run a data pipeline

CONFIG: Path to the config file
INPUT: Directory containing the input files
OUTPUT: Directory where output files will be written

Example

canproc-pipeline "config.yaml" "/space/hall5/sitestore/..."

canproc-pipeline [OPTIONS] CONFIG INPUT [OUTPUT]

Options

-s, --scheduler <scheduler>#

The dask scheduler used to run the pipeline.

Options: threads | processes | distributed | single-threaded | syncronous

-w, --workers <workers>#

Number of workers to use with the distributed scheduler.

-t, --threads_per_worker <threads_per_worker>#

Number of threads per worker when using the distributed scheduler.

--dry-run#

Print the generated DAG without running it; useful for debugging.

Arguments

CONFIG#

Required argument

INPUT#

Required argument

OUTPUT#

Optional argument
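
When the pipeline is launched from a Python script rather than typed at a shell, the usage line above can be assembled programmatically. The following is a minimal sketch and not part of canproc: the helpers `build_command` and `run_pipeline` are hypothetical, and only the options and arguments documented above are used.

```python
import subprocess
from typing import List, Optional


def build_command(
    config: str,
    input_dir: str,
    output_dir: Optional[str] = None,
    scheduler: Optional[str] = None,
    workers: Optional[int] = None,
    threads_per_worker: Optional[int] = None,
    dry_run: bool = False,
) -> List[str]:
    """Assemble: canproc-pipeline [OPTIONS] CONFIG INPUT [OUTPUT]."""
    cmd = ["canproc-pipeline"]
    # Options are only added when explicitly requested, so the CLI's own
    # defaults apply otherwise.
    if scheduler is not None:
        cmd += ["--scheduler", scheduler]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if threads_per_worker is not None:
        cmd += ["--threads_per_worker", str(threads_per_worker)]
    if dry_run:
        cmd.append("--dry-run")
    # CONFIG and INPUT are required; OUTPUT is optional.
    cmd += [config, input_dir]
    if output_dir is not None:
        cmd.append(output_dir)
    return cmd


def run_pipeline(*args, **kwargs) -> None:
    """Run the assembled command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(*args, **kwargs), check=True)
```

For instance, `build_command("config.yaml", "/path/to/input", dry_run=True)` (with a placeholder input path) returns the argument list for a dry run, which can be inspected before anything is executed.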

Runner Considerations#

For large pipelines, such as those used to process CanESM output, the distributed scheduler is often the fastest. On ppp machines, where a full node is used to process the data, it is often beneficial to use a large number of workers with 1 thread per worker.

canproc-pipeline config.yaml /space/hall5... /space/hall5/... -w 40 -t 1
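
The worker count for a full node can also be derived from the number of cores instead of being hard-coded. This is a minimal sketch, assuming one single-threaded worker per core is wanted: `full_node_options` is a hypothetical helper, passing `-s distributed` explicitly is an assumption about how the scheduler is selected, and `os.cpu_count()` may not match the cores actually allocated to a batch job.

```python
import os
from typing import List, Optional


def full_node_options(n_cores: Optional[int] = None) -> List[str]:
    """Build options for many single-threaded workers on one node."""
    if n_cores is None:
        # os.cpu_count() can return None; fall back to a single worker.
        n_cores = os.cpu_count() or 1
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    # One thread per worker, as suggested for full-node runs.
    return ["-s", "distributed", "-w", str(n_cores), "-t", "1"]


print(" ".join(full_node_options(40)))  # -s distributed -w 40 -t 1
```

The returned options are placed before the CONFIG, INPUT and OUTPUT arguments on the command line.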
