more drafting
samumantha committed Mar 13, 2024
1 parent 164e4fb commit 54ca947
Showing 1 changed file with 16 additions and 76 deletions.
92 changes: 16 additions & 76 deletions content/workflow-management.md
@@ -39,80 +39,16 @@ $ python statistics/count.py data/isles.txt > statistics/isles.data
$ python plot/plot.py --data-file statistics/isles.data --plot-file plot/isles.png
```


```{discussion}
We have two steps and 4 books. But **imagine having 4 steps and processing 500 books**.
Can you relate? Are you using similar setups in your research? How do you record them?
```

````{discussion} Kitchen analogy
```{figure} img/kitchen/busy.png
:alt: Busy kitchen
:width: 50%
Now we have many similar meals to prepare and possibly many chefs
present (cores) and workflow tools can help us to plan and document the steps
and run them efficiently. [Midjourney, CC-BY-NC 4.0]
```
````

**We will imagine solving this in four different ways and discuss pros and cons.**

---

## Solution 1: Graphical user interface (GUI)

Imagine we have programmed a GUI with a nice interface with icons where you can select scripts and input files by clicking:

- Click on counting script
- Select book txt file
- Give a name for the dat file
- Click on a run symbol
- Click on plotting script
- Select book dat file
- Give a name for the image file
- Click on a run symbol
- ...
- Go to next book ...
- Click on counting script
- Select book txt file
- ...

Disclaimer: not all GUIs behave this way - there exist very good GUI solutions which enable
reproducibility and automation.

---

## Solution 2: Manual steps

It is not too much work for four files:

```{code-block} console
---
emphasize-lines: 1-2, 13
---
$ python statistics/count.py data/abyss.txt > statistics/abyss.data
$ python plot/plot.py --data-file statistics/abyss.data --plot-file plot/abyss.png
$ python statistics/count.py data/isles.txt > statistics/isles.data
$ python plot/plot.py --data-file statistics/isles.data --plot-file plot/isles.png
$ python statistics/count.py data/last.txt > statistics/last.data
$ python plot/plot.py --data-file statistics/last.data --plot-file plot/last.png
$ python statistics/count.py data/sierra.txt > statistics/sierra.data
$ python plot/plot.py --data-file statistics/sierra.data --plot-file plot/sierra.png
```
This could also be implemented with a graphical user interface (GUI), where you can, for example, drag and drop files and click buttons to run the different processing steps.

This is **imperative style**: first do this, then do that, then do that, finally do ...

---

## Solution 3: Script
````{discussion}
Both of the above are tricky in terms of reproducibility. We currently have two steps and 4 books. But **imagine having 4 steps and 500 books**.
How could we deal with this?
Let's express it more compactly with a shell script (Bash). Let's call it `script.sh`:
As a first idea, we could express the workflow with a shell script. Let's call it `script.sh` (we could do this with a Python script too):
```{code-block} bash
---
emphasize-lines: 4
@@ -133,11 +69,9 @@ $ bash script.sh
```
This is still **imperative style**: we tell the script to run these
steps in precisely this order. We can do it on many files, but if we
need to re-run just one file, it's a bit of work.
steps in precisely this order.
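
As a rough illustration of the same imperative idea in Python, here is a minimal sketch. It assumes the four book names and the two commands shown earlier; the helper names `commands_for` and `run_all` are made up for this example and are not part of the lesson's repository. By default it only prints the steps (dry run) instead of executing them.

```python
import subprocess

# Book names taken from the manual commands shown earlier.
BOOKS = ["abyss", "isles", "last", "sierra"]


def commands_for(book):
    """Return the two steps for one book as (argv, stdout_path) pairs."""
    count = ["python", "statistics/count.py", f"data/{book}.txt"]
    plot = [
        "python", "plot/plot.py",
        "--data-file", f"statistics/{book}.data",
        "--plot-file", f"plot/{book}.png",
    ]
    # The counting step prints to stdout, so we record where to redirect it.
    return [(count, f"statistics/{book}.data"), (plot, None)]


def run_all(books, dry_run=True):
    """Run every step for every book, strictly in order (imperative style)."""
    for book in books:
        for argv, stdout_path in commands_for(book):
            if dry_run:
                redirect = f" > {stdout_path}" if stdout_path else ""
                print(" ".join(argv) + redirect)
            elif stdout_path:
                with open(stdout_path, "w") as out:
                    subprocess.run(argv, stdout=out, check=True)
            else:
                subprocess.run(argv, check=True)


run_all(BOOKS)  # dry run: only prints the eight commands
```

Calling `run_all(BOOKS, dry_run=False)` would execute the steps. Like the shell script, it always runs everything from the top, whether or not a result already exists.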
````{discussion}
- What are the advantages of this solution compared to processing all one by one?
- Is the scripted solution reproducible?
- Imagine adding more steps to the analysis and imagine the steps being time consuming. What problems do you anticipate
@@ -158,14 +92,21 @@ need to re-run just one file, it's a bit of work.

---

## Solution 4: Using [Snakemake](https://snakemake.readthedocs.io/en/stable/index.html)
## Workflow tools

Sometimes it helps to move from imperative to declarative style. Rather than saying "do this and then that", we describe dependencies and let
the tool figure out the series of steps needed to produce the results (targets). A workflow file



### Example tool: [Snakemake](https://snakemake.readthedocs.io/en/stable/index.html)

Snakemake is inspired by [GNU Make](https://www.gnu.org/software/make/),
but it is based on Python, is more general, and has an easier syntax.
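
To make the declarative idea concrete, here is a toy sketch in plain Python. It is not how Snakemake is implemented, and the rule table and the `plan` function are hypothetical names chosen for this example. We only state which target depends on which files, and a small function works out the order of the steps. The dependencies of the counting step are an assumption based on the commands used in this lesson.

```python
# Hypothetical rule table: target -> list of dependencies.
RULES = {
    "statistics/isles.data": ["data/isles.txt", "statistics/count.py"],
    "plot/isles.png": ["statistics/isles.data", "plot/plot.py"],
}


def plan(target, rules, done=None):
    """Return the targets that must be built, dependencies first."""
    if done is None:
        done = []
    for dep in rules.get(target, []):
        plan(dep, rules, done)
    # Only names that have a rule are build steps; the rest are source files.
    if target in rules and target not in done:
        done.append(target)
    return done


print(plan("plot/isles.png", RULES))
# prints ['statistics/isles.data', 'plot/isles.png']
```

We asked only for the plot, and the order of the steps follows from the dependencies. A real workflow tool additionally checks which outputs are already up to date and can run independent steps in parallel. This sketch does neither, and it does not detect circular dependencies.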

---

## Exercise
## Exercise - demo

````{prereq} Exercise preparation
The exercise (below) and pre-exercise discussion use a simple
@@ -232,8 +173,7 @@ rule make_plot:
shell: 'python {input.script} --data-file {input.book} --plot-file {output}'
```
Snakemake uses **declarative style**: we describe dependencies but we let
Snakemake figure out the series of steps to produce results (targets).
We can see that Snakemake uses **declarative style**:
Snakefiles contain rules that relate targets (`output`) to dependencies
(`input`) and commands (`shell`).