.. quickstart

The Directed Acyclic Graph (DAG)
--------------------------------

A directed acyclic graph defines the program flow. Each node in the graph is a process to be run, and each edge (a connection between nodes) carries the output of one process to the input of another. Importantly, a DAG need not be a simple linear pipeline: it may include parallel branching and execution, so long as it does not include a cycle (as this would cause infinite recursion).
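
The acyclicity requirement can be illustrated with Python's standard-library ``graphlib`` module. This is a minimal sketch that is independent of ``canesm-processor``: the node names are hypothetical, and the dictionary simply maps each node to the nodes it depends on.

.. code-block:: python

    from graphlib import CycleError, TopologicalSorter

    # Each key is a node; its value is the set of nodes it depends on.
    deps = {
        "arr1": set(),
        "arr2": set(),
        "concat": {"arr1", "arr2"},
        "mean": {"concat"},
    }

    # Nodes with no unmet dependencies become ready together,
    # so independent branches could be executed in parallel.
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    first_batch = set(sorter.get_ready())

    # A valid execution order exists because the graph has no cycle.
    order = list(TopologicalSorter(deps).static_order())

    # Making "arr1" depend on "mean" closes a loop, so no order exists.
    cyclic = dict(deps, arr1={"mean"})
    try:
        list(TopologicalSorter(cyclic).static_order())
        has_cycle = False
    except CycleError:
        has_cycle = True

Here ``first_batch`` contains both ``arr1`` and ``arr2``, since neither depends on the other, while the cyclic variant is rejected before anything would run.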

A DAG Node
**********

As mentioned, a DAG node is simply a function we want to run on a set of inputs. As a simple example, let's say we want to load an array of numbers into memory. If we use ``np.arange`` for this, our only input is the length of the array.

.. code-block:: python
   
   import numpy as np
   from canproc import DAGProcess
   # One node: calls np.arange(8); its result is referred to as 'make_array'
   proc = DAGProcess(name='make_array', function=np.arange, args=[8])

The process ``name`` is important for defining the relation to other nodes, as we'll see later. Generally, a ``function`` can be either a ``Callable`` object, such as a Python function, or a ``str`` that can be used to generate a function, e.g. ``function="np.arange"`` would also work. As long as there is a one-to-one mapping between function names and functions, the ``DAGProcess`` is serializable, allowing for easy storage and reuse.
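
To see why a one-to-one mapping between names and functions makes a node serializable, consider the following sketch. It does not show how ``canesm-processor`` serializes a ``DAGProcess``: the ``REGISTRY`` dictionary and the ``to_json`` and ``from_json`` helpers are hypothetical, and standard-library functions stand in for NumPy.

.. code-block:: python

    import json
    import operator

    # Hypothetical registry: exactly one name per function.
    REGISTRY = {"operator.add": operator.add, "builtins.sum": sum}
    # The reverse lookup is only unambiguous because the mapping is one-to-one.
    NAMES = {func: name for name, func in REGISTRY.items()}

    def to_json(name, function, args):
        """Store the function by name so the node becomes plain JSON."""
        return json.dumps({"name": name, "function": NAMES[function], "args": args})

    def from_json(text):
        """Rebuild the node, looking the function up by its stored name."""
        node = json.loads(text)
        node["function"] = REGISTRY[node["function"]]
        return node

    text = to_json("total", sum, [[1, 2, 3]])
    node = from_json(text)
    result = node["function"](*node["args"])

If two different functions shared a name, or one function had two names, the round trip through ``text`` could not be guaranteed to restore the original node.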

Combining Nodes for Data Pipelines
**********************************


.. grid:: 2

    .. grid-item::
        
        Nodes are linked via their inputs and outputs. The ``name`` of a process 
        identifies its output, which can be used as input to other nodes. As a toy 
        example, let's create two arrays using NumPy, concatenate them, and then 
        take the average. The Python code might look like this:

        .. code-block:: python

            import numpy as np
            arr1 = np.arange(8)
            arr2 = np.arange(0, 4)
            concat = np.concatenate([arr1, arr2])
            mean = np.mean(concat)


        This works fine, but the Python code has drawbacks that ``canesm-processor`` aims to address:

        #. :code:`arr1` does not depend on :code:`arr2`, but the two are created sequentially.
        #. The process will run locally on the CPU, which might not scale for large arrays.

    .. grid-item::

        .. mermaid::
            
            %%{init: {'theme':'neutral', 'look': 'handDrawn'}}%%
            graph
                S1[ ] --> |8| A["np.arange(8)"]
                S2[ ] --> |0, 4| B["np.arange(0, 4)"]
                B --> |arr2| C["np.concatenate([arr1, arr2])"]
                A --> |arr1| C
                C --> |concat| D["np.mean(concat)"]
                D --> |mean| E[ ]
                style E fill:#FFFFFF00, stroke:#FFFFFF00;
                style S1 fill:#FFFFFF00, stroke:#FFFFFF00;
                style S2 fill:#FFFFFF00, stroke:#FFFFFF00;


To create this structure in ``canesm-processor`` we would write:

.. code-block:: python

    import numpy as np
    from canproc import DAGProcess
    graph = [
        DAGProcess(name='arr1', function=np.arange, args=[8]),
        DAGProcess(name='arr2', function=np.arange, args=[0, 4]),
        DAGProcess(name='concat', function=np.concatenate, args=[['arr1', 'arr2']]),
        DAGProcess(name='mean', function=np.mean, args=['concat'])
    ]


Lastly, we define the ``output`` of the ``DAG``. This specifies which edges of the graph we want returned. Typically this is the final result, but it could also be a list, in which case intermediate steps are returned as well. Putting this all together, we have the full DAG:

.. code-block:: python
   
   from canproc import DAG
   dag = DAG(dag=graph, output='mean')
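
To make the roles of ``name``, ``args`` and ``output`` concrete, the following sketch evaluates an equivalent graph in plain Python. It is not how ``canesm-processor`` executes a DAG: the ``Node`` class and the ``resolve`` and ``evaluate`` helpers are hypothetical stand-ins, standard-library functions replace the NumPy calls, and the sketch assumes that any string argument matching a node name refers to that node's result.

.. code-block:: python

    from dataclasses import dataclass, field
    from statistics import mean
    from typing import Callable

    @dataclass
    class Node:
        """Hypothetical stand-in for a graph node."""
        name: str
        function: Callable
        args: list = field(default_factory=list)

    def concatenate(parts):
        """Join several sequences into one list."""
        out = []
        for part in parts:
            out.extend(part)
        return out

    def resolve(arg, nodes, cache):
        """Replace node names (also inside lists) with computed results."""
        if isinstance(arg, str) and arg in nodes:
            return evaluate(arg, nodes, cache)
        if isinstance(arg, list):
            return [resolve(item, nodes, cache) for item in arg]
        return arg

    def evaluate(name, nodes, cache):
        """Compute a node once, evaluating its dependencies first."""
        if name not in cache:
            node = nodes[name]
            args = [resolve(arg, nodes, cache) for arg in node.args]
            cache[name] = node.function(*args)
        return cache[name]

    graph = [
        Node("arr1", range, [8]),
        Node("arr2", range, [0, 4]),
        Node("concat", concatenate, [["arr1", "arr2"]]),
        Node("mean", mean, ["concat"]),
    ]
    nodes = {node.name: node for node in graph}
    cache = {}
    result = evaluate("mean", nodes, cache)

Requesting ``"mean"`` pulls in ``concat``, which in turn pulls in ``arr1`` and ``arr2``. Requesting ``"concat"`` instead would return the intermediate list, which mirrors the idea of choosing which edges of the graph to output.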