Task parameter section upgrade.
hjoliver committed Nov 17, 2021
1 parent 6c29d7b commit f922e05
Showing 1 changed file with 110 additions and 76 deletions.
186 changes: 110 additions & 76 deletions src/user-guide/writing-workflows/parameterized-tasks.rst
Original file line number Diff line number Diff line change
Expand Up @@ -7,19 +7,22 @@ Cylc can automatically generate tasks and dependencies by expanding
:term:`parameterized <parameterisation>` task names over lists of parameter
values. Uses for this include:

- generating an ensemble of similar model runs
- generating chains of tasks to process similar datasets
- replicating an entire workflow, or part thereof, over several runs
- splitting a long model run into smaller steps or ``chunks``
(parameterized cycling)
- Generating an ensemble of similar model runs
- Generating chains of tasks to process similar datasets
- Replicating an entire workflow, or part thereof, over several runs
- Splitting a long model run into smaller chunks
- Parameterized cycling

.. note::

This can be done with Jinja2 loops too (:ref:`User Guide Jinja2`)
but parameterization is much cleaner (nested loops can seriously reduce
the clarity of a workflow configuration).

Cylc supports use of :ref:`Jinja2 <User Guide Jinja2>` and :ref:`Empy
<User Guide Empy>` templating for programmatic generation of workflow
configurations. The built-in parameterization system described here
is a cleaner and easier alternative *for generating tasks and families
over a range of parameters*, but unlike general templating it can only be
used for that specific purpose.


Parameter Expansion
-------------------

Expand Down Expand Up @@ -225,8 +228,8 @@ should be overridden to remove the initial underscore. For example:
"""
Passing Parameter Values To Tasks
---------------------------------
Passing Values To Tasks
-----------------------

Parameter values are passed as environment variables to tasks generated by
parameter expansion. For example, if we have:
Expand Down Expand Up @@ -273,8 +276,8 @@ environment variables:
export MYFILE=/path/to/run002/ship
Selecting Specific Parameter Values
-----------------------------------
Selecting Specific Values
-------------------------

Specific parameter values can be singled out in the graph and under
:cylc:conf:`[runtime]` with the notation ``<p=5>`` (for example).
Expand All @@ -299,8 +302,8 @@ set of model runs:
#...
Selecting Partial Parameter Ranges
----------------------------------
Selecting Partial Ranges
------------------------

The parameter notation does not currently support partial range selection such
as ``foo<p=5..10>``, but you can achieve the same result by defining a
Expand All @@ -325,8 +328,8 @@ template as the full-range parameter. For example:
#...
Parameter Offsets In The Graph
------------------------------
Offsets in the Graph
---------------------

A negative offset notation ``<NAME-1>`` is interpreted as the previous
value in the ordered list of parameter values, while a positive offset is
Expand Down Expand Up @@ -367,8 +370,8 @@ expands to:
proc_small => proc_big => proc_huge
Task Families And Parameterization
----------------------------------
Task Families and Parameters
----------------------------

Task family members can be generated by parameter expansion:

Expand Down Expand Up @@ -459,15 +462,11 @@ expands to:
Parameterized Cycling
---------------------

For most purposes use of
a proper :term:`cycling` workflow is recommended, wherein Cylc incrementally
generates the datetime sequence and extends the workflow, potentially
indefinitely, at run time. For smaller systems of finite duration, however,
parameter expansion can be used to generate a sequence of pre-defined tasks
as a proxy for cycling.
For smaller workflows of finite duration, parameter expansion can be used to
generate a sequence of pre-defined tasks as a proxy for cycling.

Here's a cycling workflow of two-monthly model runs for one year,
with previous-instance model dependence (e.g. for model restart files):
Here's a cycling workflow of two-monthly model runs for one year, with
previous-instance model dependence:

.. code-block:: cylc
Expand All @@ -483,12 +482,13 @@ with previous-instance model dependence (e.g. for model restart files):
[[model]]
script = "run-model $CYLC_TASK_CYCLE_POINT"
And here's how to do the same thing with parameterized tasks:
And here's how to do the same thing with parameterized tasks instead of cycling:

.. code-block:: cylc
[task parameters]
chunk = 1..6
chunk = 1..6
[scheduling]
[[graph]]
R1 = """
Expand All @@ -499,19 +499,17 @@ And here's how to do the same thing with parameterized tasks:
[runtime]
[[model<chunk>]]
script = """
# Compute start date from chunk index and interval, then run the model.
INITIAL_POINT=2020-01
INTERVAL_MONTHS=2
OFFSET_MONTHS=$(( (CYLC_TASK_PARAM_chunk - 1)*INTERVAL_MONTHS ))
OFFSET=P${OFFSET_MONTHS}M # e.g. P4M for chunk=3
run-model $(cylc cyclepoint --offset=$OFFSET $INITIAL_POINT)"""
The two workflows are shown together below. They both achieve the same
result, and both can include special tasks at the start, end, or
anywhere in between. But as noted earlier the parameterized version has
several disadvantages: it must be finite in extent and not too large; the
datetime arithmetic has to be done by the user; and the full extent of the
workflow will be visible at all times as the workflow runs.
# Compute start date from chunk index and interval.
INITIAL_POINT=2020-01
INTERVAL_MONTHS=2
OFFSET_MONTHS=$(( (CYLC_TASK_PARAM_chunk - 1)*INTERVAL_MONTHS ))
OFFSET=P${OFFSET_MONTHS}M # e.g. P4M for chunk=3
# Run the model.
run-model $(cylc cyclepoint --offset=$OFFSET $INITIAL_POINT)
"""
The two workflows achieve the same result, and both can include special
behaviour at the start, end, or anywhere in between.
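
The offset arithmetic in the parameterized script can be checked outside a
workflow. The following is a minimal sketch, assuming a two-month interval as
in the example above; ``chunk_offset`` is a hypothetical helper name, not part
of Cylc, and in a real task the index would come from
``$CYLC_TASK_PARAM_chunk`` rather than a loop:

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helper (not a Cylc command): print the ISO 8601 month
   # offset for a 1-based chunk index. The interval defaults to 2 months.
   chunk_offset() {
       local chunk=$1
       local interval_months=${2:-2}
       echo "P$(( (chunk - 1) * interval_months ))M"
   }

   # Print the offset for each of the six chunks in the example.
   for chunk in 1 2 3 4 5 6; do
       echo "chunk=${chunk} offset=$(chunk_offset "${chunk}")"
   done

The resulting string is what would be passed as the ``--offset`` value in the
task script.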

.. todo
Create sub-figures if possible: for now hacked as separate figures with
Expand All @@ -527,15 +525,36 @@ workflow will be visible at all times as the workflow runs.

Parameterized (top) and cycling (bottom) versions of the same
workflow. The first three cycle points are shown in the
cycling case. The parameterized case does not have "cycle points".
cycling case. The parameterized case does not have cycle points (technically
all of its tasks have the cycle point 1).

The parameterized version has several disadvantages, however:

- The workflow must be finite in extent and not too large because every
parameterized task generates a new task definition

Here's a yearly-cycling workflow with four parameterized chunks in each cycle
point:
- (In a cycling workflow a single task definition acts as a template for
all cycle point instances of a task)
- Datetime arithmetic has to be done manually

- (This doesn't apply if it's not a datetime sequence; parameterized
integer cycling is straightforward.)


Parameterized Sub-Cycles
^^^^^^^^^^^^^^^^^^^^^^^^

A workflow can have multiple main cycling sequences, but sub-cycles within each
main cycle point have to be parameterized. A typical use case for this is
incremental processing of files generated sequentially during a long model run.

Here's a workflow that uses parameters to split a long model run in each
datetime cycle point into four smaller runs:

.. code-block:: cylc
[task parameters]
chunk = 1..4
chunk = 1..4
[scheduling]
initial cycle point = 2020-01
[[graph]]
Expand All @@ -544,31 +563,46 @@ point:
model<chunk=4>[-P1Y] => model<chunk=1>
"""
.. note::
The inter-cycle trigger connects the first chunk in each cycle point to the
last chunk in the previous cycle point. However, in this particular case it
might be simpler to use a 3-monthly datetime cycle instead:

.. code-block:: cylc
The inter-cycle trigger connects the first chunk in each cycle point
to the last chunk in the previous cycle point. Of course it would be simpler
to just use 3-monthly cycling:
[scheduling]
initial cycle point = 2020-01
[[graph]]
P3M = "model[-P3M] => model"
.. code-block:: cylc
[scheduling]
initial cycle point = 2020-01
[[graph]]
P3M = "model[-P3M] => model"
For another example, here task ``model`` generates 10 files in sequence as it
runs. Task ``proc_file0`` triggers when the model starts running, to wait for
and process the first file; when that is done, ``proc_file1`` triggers to wait
for the second file; and so on.

Here's a possible valid use-case for mixed cycling: consider a portable
datetime cycling workflow of model jobs that can each take too long to run on
some supported platforms. This could be handled without changing the cycling
structure of the workflow by splitting the run (at each cycle point) into a
variable number of shorter steps, using more steps on less powerful hosts.
.. code-block:: cylc
[task parameters]
file = 0..9
[scheduling]
initial cycle point = 2020-01
[[graph]]
P1Y = """
model:start => proc<file=0>
proc<file-1> => proc<file>
proc<file=9> => upload_products
"""
[runtime]
[[model]]
[[proc<file>]]
[[upload_products]]
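
The ``proc<file>`` tasks above have no script. One possible shape for the
"wait for the file" step is sketched below, purely as an assumption about how
such a task might be written: ``wait_for_file`` is a hypothetical helper, and
the output directory and file naming in the usage comment are invented for
illustration.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helper (not part of Cylc): poll once per second until a
   # file exists, giving up after a timeout in seconds (default 60).
   # Returns 0 if the file appeared, 1 if the timeout expired.
   wait_for_file() {
       local path=$1
       local timeout=${2:-60}
       local waited=0
       until [[ -e "${path}" ]]; do
           if (( waited >= timeout )); then
               return 1
           fi
           sleep 1
           waited=$(( waited + 1 ))
       done
       return 0
   }

   # Possible use inside a proc<file> task script (the path is an assumption):
   #   wait_for_file "${MODEL_OUTPUT_DIR}/file${CYLC_TASK_PARAM_file}.nc" 3600
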
Cycle Point And Parameter Offsets At Start-Up
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Offsets at Sequence Start
^^^^^^^^^^^^^^^^^^^^^^^^^

In cycling workflows cylc ignores anything earlier than the workflow initial
cycle point. So this graph:
In cycling workflows dependence on tasks prior to the start cycle point is
ignored [1]_. So this graph:

.. code-block:: cylc
Expand All @@ -580,26 +614,26 @@ simplifies at the initial cycle point to this:
P1D = "model"
Similarly, parameter offsets are ignored if they extend beyond the start
of the parameter value list. So this graph:
(Note that this is a convenient way to bootstrap into an infinite cycle, but
special behaviour at the start point can be configured explicitly if desired.)

Similarly, parameter offsets that go out of range are ignored. So this graph:

.. code-block:: cylc
# for chunk = 1..4
R1 = "model<chunk-1> => model<chunk>"
simplifies for ``chunk=1`` to this:

.. code-block:: cylc
R1 = "model_chunk1"
R1 = "model_chunk1"
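
The out-of-range rule can be illustrated with a short script. This is only a
sketch of the behaviour described above, not how Cylc implements expansion;
``expand_chain`` is a hypothetical name, and the ``model_chunk<N>`` task names
follow the default naming used in this section.

.. code-block:: bash

   #!/usr/bin/env bash
   # Illustration only: expand "model<chunk-1> => model<chunk>" over an
   # integer range, dropping any offset that falls before the first value.
   expand_chain() {
       local first=$1
       local last=$2
       local chunk prev
       for (( chunk = first; chunk <= last; chunk++ )); do
           prev=$(( chunk - 1 ))
           if (( prev < first )); then
               # Offset is out of range, so the left-hand side is ignored.
               echo "model_chunk${chunk}"
           else
               echo "model_chunk${prev} => model_chunk${chunk}"
           fi
       done
   }

   # Expand for chunk = 1..4.
   expand_chain 1 4

For ``chunk = 1..4`` the first line printed is the lone task for ``chunk=1``,
followed by one dependency line for each remaining value.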
.. note::
The initial cut-off applies to every parameter list, but only
to cycle point sequences that start at the workflow initial cycle point.
Therefore it may be somewhat easier to use parameterized cycling if you
need multiple datetime sequences *with different start points* in the
same workflow. We plan to allow this sequence-start simplification for any
datetime sequence in the future, not just at the workflow initial point,
but it needs to be optional because delayed-start cycling tasks
sometimes need to trigger off earlier cycling tasks.
.. [1] Currently this only applies to the unique workflow start cycle point, so
it may be easier to use parameterized cycling if you have multiple
(finite) sequences starting at different points. We plan to extend this
convenience to all sequences regardless of start point, but use will be
optional because delayed-start cycling tasks may need to trigger off of
earlier cycles.
