Merge pull request #109 from cbegeman/reorg-for-shared-steps
New shared steps capability design doc
xylar authored Sep 22, 2023
2 parents fe7dcf3 + bf7ac3a commit 683cfac
Showing 2 changed files with 381 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/design_docs/index.md
@@ -5,5 +5,6 @@
```{toctree}
:titlesonly: true
shared_steps
template
```
380 changes: 380 additions & 0 deletions docs/design_docs/shared_steps.md
@@ -0,0 +1,380 @@
# Shared steps

date: 2023/08/18

Contributors: Carolyn Begeman, Xylar Asay-Davis

## Summary

The capability designed here is the ability to share steps across tasks.
In this design document, "shared steps" refers to any step which may be used by
multiple tasks that are available in polaris.

The main motivation behind this capability is the computational expense of
running steps multiple times when they could be shared across tasks. To make it
clear to the user that steps are shared, we present a new design for
the working directory structure. The design is successful insofar as it
guarantees that shared steps are run once per Slurm job and that the role of
shared steps is clear to users.

## Requirements

### Requirement: Shared steps are run once.

Shared steps should be run once per invocation of `polaris serial` or
`polaris run`.

### Requirement: Shared steps are run before steps that depend on their output.

### Requirement: Shared steps are not daughters of a task

A shared step's class attributes do not include any task-related information
such as a task it belongs to.

### Requirement: Working directory structure is intuitive.

Shared step directories should be located at the highest level in the working
directory structure where all tasks that use that step are run at or below that
level.
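
One reading of this requirement is that a shared step lives directly under the
deepest directory common to all tasks that use it. The sketch below illustrates
that reading with `posixpath.commonpath`; the helper name `shared_step_parent`
is hypothetical and is not part of polaris.

```python
import posixpath


def shared_step_parent(task_subdirs):
    """Return the deepest directory containing every given task subdir.

    Hypothetical helper for illustration only; not a polaris function.
    """
    return posixpath.commonpath(task_subdirs)


# Task paths taken from the examples in this document
tasks = [
    'ocean/spherical/qu/cosine_bell',
    'ocean/spherical/qu/cosine_bell/with_viz',
]
print(shared_step_parent(tasks))  # ocean/spherical/qu/cosine_bell
```

Under this reading, the `init` and `forward` steps shared by these two tasks
would sit below `ocean/spherical/qu/cosine_bell`, whereas a step that is also
used by tasks elsewhere under `qu` would sit one level higher.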

### Requirement: Working directory step paths are easily discoverable by users.

There should be a way to list the paths within the work directory of all steps
in each task. There should also be a way for a user to find the steps
in a task from the task's work directory.

### Requirement: The output of shared steps may be used by multiple tasks.

A step may only be shared across multiple tasks if its output would be
identical for each task.

### Requirement: tasks do not rely on outputs from steps in other tasks

All tasks are self-contained and rely only on either shared steps or steps they
contain.



## Implementation

### Implementation: Shared steps are set up once.

As before, setup of either a list of tasks or a suite proceeds by iterating
through the tasks and then through the steps in each task. An attribute
`setup_complete` has been added to `Step` and is initialized to `False`.
In the `setup_task()` function, setup is skipped for any step where
`step.setup_complete` is `True`, and this attribute is set to `True` once a
step's setup has been completed.
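
The skip logic can be sketched as follows. This is a minimal illustration, not
the polaris source: the `Step` class here is a stand-in, the `setup_count`
counter exists only to make the behavior visible, and the signature of
`setup_task()` is an assumption.

```python
class Step:
    """Stand-in for the polaris Step class (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.setup_complete = False
        self.setup_count = 0  # hypothetical counter, for demonstration

    def setup(self):
        self.setup_count += 1


def setup_task(steps):
    """Set up each step of a task, skipping steps already set up.

    The real setup_task() takes different arguments; this signature is
    an assumption made for the sketch.
    """
    for step in steps:
        if step.setup_complete:
            continue
        step.setup()
        step.setup_complete = True


# One shared step used by two tasks, plus one step unique to each task
base_mesh = Step('qu_base_mesh_60km')
task_a = [base_mesh, Step('analysis')]
task_b = [base_mesh, Step('qu_viz_60km')]

for task_steps in (task_a, task_b):
    setup_task(task_steps)

print(base_mesh.setup_count)  # 1
```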

### Implementation: Shared steps are run before steps that depend on their output.

This requirement is already satisfied as part of the task parallelism design,
which makes use of file dependencies. When running in task-serial mode, the
implementation will make sure that shared steps are added to the dictionary of
steps before other steps that rely on them.
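
For task-serial mode, one way to picture the ordering is a dictionary keyed by
step path that keeps only the first insertion of each step. Because Python
dictionaries preserve insertion order and each task lists its steps in run
order, a shared step then always precedes the steps that depend on it. The
function `collect_steps` and its input format are hypothetical, not the
polaris implementation.

```python
def collect_steps(tasks):
    """Build an ordered dict of unique steps across tasks.

    tasks maps a task path to that task's step paths in run order.
    The result maps each step path to the first task that added it.
    Hypothetical helper for illustration only.
    """
    ordered = {}
    for task_path, step_paths in tasks.items():
        for step_path in step_paths:
            # setdefault keeps the first insertion, so a shared step
            # retains its earliest position in the run order
            ordered.setdefault(step_path, task_path)
    return ordered


root = 'ocean/spherical/qu'
tasks = {
    f'{root}/cosine_bell': [
        f'{root}/base_mesh/60km',
        f'{root}/cosine_bell/init/60km',
        f'{root}/cosine_bell/forward/60km',
        f'{root}/cosine_bell/analysis',
    ],
    f'{root}/cosine_bell/with_viz': [
        f'{root}/base_mesh/60km',
        f'{root}/cosine_bell/init/60km',
        f'{root}/cosine_bell/forward/60km',
        f'{root}/cosine_bell/with_viz/map/60km',
        f'{root}/cosine_bell/with_viz/viz/60km',
        f'{root}/cosine_bell/analysis',
    ],
}

for step_path in collect_steps(tasks):
    print(step_path)
```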

### Implementation: Shared steps are not daughters of a task

The `task` attribute and constructor argument of the `Step` class have been
replaced by the `component` attribute. The step's `subdir` attribute is now
relative to the component's work directory, rather than a parent task's work
directory.
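
A minimal sketch of a step that knows only its component is shown below. The
`Component` and `Step` classes are simplified stand-ins, and the `path`
property is an assumption about how the work-directory-relative path might be
derived from `component` and `subdir`.

```python
import posixpath


class Component:
    """Stand-in for a polaris component such as `ocean`."""

    def __init__(self, name):
        self.name = name


class Step:
    """Stand-in for a step that belongs to a component, not a task."""

    def __init__(self, component, name, subdir):
        self.component = component
        self.name = name
        # subdir is relative to the component's work directory
        self.subdir = subdir

    @property
    def path(self):
        # Path relative to the base work directory (assumed convention)
        return posixpath.join(self.component.name, self.subdir)


ocean = Component('ocean')
step = Step(ocean, 'qu_base_mesh_60km', 'spherical/qu/base_mesh/60km')
print(step.path)  # ocean/spherical/qu/base_mesh/60km
```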

### Implementation: Working directory structure is intuitive.

Shared steps reside inside a task's work directory only when another task also
lies within that work directory.
The only such tasks at the moment are the `cosine_bell/with_viz` tasks, which
reside inside the `cosine_bell` tasks. The `cosine_bell/with_viz` tasks share
all of the steps of the `cosine_bell` tasks (base mesh, init and forward for
each resolution, and a single analysis step) and also add remapping and
visualization steps that are not shared with any other tasks:

`cosine_bell`:
* ocean
  * spherical
    * qu
      * base_mesh
        * 60km
        * 90km
        * 120km
        * 150km
        * 180km
        * 210km
        * 240km
      * cosine_bell
        * init
          * 60km
          * 90km
          * 120km
          * 150km
          * 180km
          * 210km
          * 240km
        * forward
          * 60km
          * 90km
          * 120km
          * 150km
          * 180km
          * 210km
          * 240km
        * analysis

`cosine_bell/with_viz`:
* ocean
  * spherical
    * qu
      * base_mesh
        * 60km
        * 90km
        * 120km
        * 150km
        * 180km
        * 210km
        * 240km
      * cosine_bell
        * init
          * 60km
          * 90km
          * 120km
          * 150km
          * 180km
          * 210km
          * 240km
        * forward
          * 60km
          * 90km
          * 120km
          * 150km
          * 180km
          * 210km
          * 240km
        * analysis
        * with_viz
          * map
            * 60km
            * 90km
            * 120km
            * 150km
            * 180km
            * 210km
            * 240km
          * viz
            * 60km
            * 90km
            * 120km
            * 150km
            * 180km
            * 210km
            * 240km


### Implementation: Working directory step paths are easily discoverable by users.

This is implemented in two ways.

First, `polaris list --verbose` now lists the path of each step relative to the
base work directory, rather than relative to the task's work directory:

```
$ polaris list --verbose
...
10: path: ocean/spherical/qu/cosine_bell/with_viz
    name: cosine_bell
    component: ocean
    subdir: spherical/qu/cosine_bell/with_viz
    steps:
     - qu_base_mesh_60km: ocean/spherical/qu/base_mesh/60km
     - qu_init_60km: ocean/spherical/qu/cosine_bell/init/60km
     - qu_forward_60km: ocean/spherical/qu/cosine_bell/forward/60km
     - qu_map_60km: ocean/spherical/qu/cosine_bell/with_viz/map/60km
     - qu_viz_60km: ocean/spherical/qu/cosine_bell/with_viz/viz/60km
     - qu_base_mesh_90km: ocean/spherical/qu/base_mesh/90km
     - qu_init_90km: ocean/spherical/qu/cosine_bell/init/90km
     - qu_forward_90km: ocean/spherical/qu/cosine_bell/forward/90km
     - qu_map_90km: ocean/spherical/qu/cosine_bell/with_viz/map/90km
     - qu_viz_90km: ocean/spherical/qu/cosine_bell/with_viz/viz/90km
     - qu_base_mesh_120km: ocean/spherical/qu/base_mesh/120km
     - qu_init_120km: ocean/spherical/qu/cosine_bell/init/120km
     - qu_forward_120km: ocean/spherical/qu/cosine_bell/forward/120km
     - qu_map_120km: ocean/spherical/qu/cosine_bell/with_viz/map/120km
     - qu_viz_120km: ocean/spherical/qu/cosine_bell/with_viz/viz/120km
     - qu_base_mesh_150km: ocean/spherical/qu/base_mesh/150km
     - qu_init_150km: ocean/spherical/qu/cosine_bell/init/150km
     - qu_forward_150km: ocean/spherical/qu/cosine_bell/forward/150km
     - qu_map_150km: ocean/spherical/qu/cosine_bell/with_viz/map/150km
     - qu_viz_150km: ocean/spherical/qu/cosine_bell/with_viz/viz/150km
     - qu_base_mesh_180km: ocean/spherical/qu/base_mesh/180km
     - qu_init_180km: ocean/spherical/qu/cosine_bell/init/180km
     - qu_forward_180km: ocean/spherical/qu/cosine_bell/forward/180km
     - qu_map_180km: ocean/spherical/qu/cosine_bell/with_viz/map/180km
     - qu_viz_180km: ocean/spherical/qu/cosine_bell/with_viz/viz/180km
     - qu_base_mesh_210km: ocean/spherical/qu/base_mesh/210km
     - qu_init_210km: ocean/spherical/qu/cosine_bell/init/210km
     - qu_forward_210km: ocean/spherical/qu/cosine_bell/forward/210km
     - qu_map_210km: ocean/spherical/qu/cosine_bell/with_viz/map/210km
     - qu_viz_210km: ocean/spherical/qu/cosine_bell/with_viz/viz/210km
     - qu_base_mesh_240km: ocean/spherical/qu/base_mesh/240km
     - qu_init_240km: ocean/spherical/qu/cosine_bell/init/240km
     - qu_forward_240km: ocean/spherical/qu/cosine_bell/forward/240km
     - qu_map_240km: ocean/spherical/qu/cosine_bell/with_viz/map/240km
     - qu_viz_240km: ocean/spherical/qu/cosine_bell/with_viz/viz/240km
     - analysis: ocean/spherical/qu/cosine_bell/analysis
```

Second, we add symlinks within the task to the shared steps. In what follows,
the subdirectories in bold are symlinks to shared steps that reside elsewhere,
further up the directory tree: each resolution in `base_mesh`, `init` and
`forward`, and also `analysis`.

`cosine_bell/with_viz`:
* ocean
  * spherical
    * qu
      * cosine_bell
        * with_viz
          * base_mesh
            * **60km**
            * **90km**
            * **120km**
            * **150km**
            * **180km**
            * **210km**
            * **240km**
          * init
            * **60km**
            * **90km**
            * **120km**
            * **150km**
            * **180km**
            * **210km**
            * **240km**
          * forward
            * **60km**
            * **90km**
            * **120km**
            * **150km**
            * **180km**
            * **210km**
            * **240km**
          * map
            * 60km
            * 90km
            * 120km
            * 150km
            * 180km
            * 210km
            * 240km
          * viz
            * 60km
            * 90km
            * 120km
            * 150km
            * 180km
            * 210km
            * 240km
          * **analysis**

Thus, a structure similar to what we had before shared steps is maintained
locally, which should make debugging easier.
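
Creating such a local link can be sketched with the standard library as
follows. The helper `link_shared_step` is hypothetical, and the use of a
relative link target is an assumption rather than a statement about how
polaris creates its symlinks.

```python
import os
import tempfile


def link_shared_step(work_dir, task_subdir, link_name, step_subdir):
    """Create a symlink inside a task directory to a shared step.

    Hypothetical helper for illustration only. All subdirectories are
    relative to the base work directory.
    """
    target = os.path.join(work_dir, step_subdir)
    link = os.path.join(work_dir, task_subdir, link_name)
    os.makedirs(target, exist_ok=True)
    os.makedirs(os.path.dirname(link), exist_ok=True)
    # A relative target keeps the link valid if the work dir is moved
    rel_target = os.path.relpath(target, start=os.path.dirname(link))
    os.symlink(rel_target, link)
    return link


with tempfile.TemporaryDirectory() as work_dir:
    link = link_shared_step(
        work_dir,
        task_subdir='ocean/spherical/qu/cosine_bell/with_viz',
        link_name='base_mesh/60km',
        step_subdir='ocean/spherical/qu/base_mesh/60km',
    )
    print(os.readlink(link))
```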

### Implementation: The output of shared steps may be used by multiple tasks.

Task steps that use the output of shared steps will make use of symbolic
links as before.

### Implementation: tasks do not rely on outputs from steps in other tasks

There were not any polaris tasks that relied on outputs from other tasks even
before the implementation of shared steps. There are tasks in Compass, though,
such as global ocean `mesh`, `init` and `dynamic_adjustment`, that do allow
outputs from one task to be inputs of another. As these are ported to Polaris,
we will make sure they use shared steps instead.

## Testing

### Testing And Validation: Shared steps are run once.

Output from running a series of tasks or a suite indicates when shared steps
are skipped because they already ran (`already completed`):

```
ocean/spherical/icos/cosine_bell
  * step: icos_base_mesh_60km
          execution: SUCCESS
          runtime: 0:01:00
  * step: icos_init_60km
          execution: SUCCESS
          runtime: 0:00:00
  * step: icos_forward_60km
          execution: SUCCESS
          runtime: 0:00:38
...
  * step: analysis
          execution: SUCCESS
          runtime: 0:00:02
  task execution: SUCCESS
  task runtime: 0:02:59
ocean/spherical/icos/cosine_bell/with_viz
  * step: icos_base_mesh_60km
          already completed
  * step: icos_init_60km
          already completed
  * step: icos_forward_60km
          already completed
  * step: icos_map_60km
          execution: SUCCESS
          runtime: 0:00:20
  * step: icos_viz_60km
          execution: SUCCESS
          runtime: 0:00:06
...
  * step: analysis
          already completed
  task execution: SUCCESS
  task runtime: 0:03:23
```

### Testing And Validation: Shared steps are run before steps that depend on their output.

As before, steps are added to tasks in the order they are to be run, ensuring
that shared steps run before steps that require their output when running in
task-serial mode (`polaris serial`). Task parallelism already has mechanisms to
prevent steps from running before their dependencies are available, and this
is not expected to be affected by shared steps. However, no testing with
task parallelism will be performed at this time.

### Testing And Validation: Shared steps are not daughters of a task

Steps run successfully even after we have removed the `task` attribute from
them, indicating that they no longer rely on information about a task they
formerly belonged to.

### Testing And Validation: Working directory structure is intuitive.

The intuitive work directory structure will need to be maintained by developers
as new tasks and steps are added, as this is not enforced by the framework. The
proposed implementation ensures that shared steps either reside closer to the
root of the directory structure than the tasks that use them or live inside
those tasks, which we have deemed an intuitive structure.

### Testing And Validation: Working directory step paths are easily discoverable by users.

Between `polaris list --verbose` and the local symlinks to shared steps within
each task, we think the shared steps will be discoverable by users and
developers.

### Testing And Validation: The output of shared steps may be used by multiple tasks.

We have implemented shared steps for base meshes, initial conditions and
forward runs, and shown that multiple tasks can make use of their output.

### Testing And Validation: tasks do not rely on outputs from steps in other tasks

This is not enforced; it will simply need to be maintained as the preferred
convention for future development. Currently, all tasks can be run
independently and do not rely on any other tasks.
