diff --git a/content/workflow-management.md b/content/workflow-management.md
index 2a14af9..a38638a 100644
--- a/content/workflow-management.md
+++ b/content/workflow-management.md
@@ -39,80 +39,16 @@ $ python statistics/count.py data/isles.txt > statistics/isles.data
 $ python plot/plot.py --data-file statistics/isles.data --plot-file plot/isles.png
 ```
-
-```{discussion}
-We have two steps and 4 books. But **imagine having 4 steps and processing 500 books**.
-Can you relate? Are you using similar setups in your research? How do you record them?
-```
-
-````{discussion} Kitchen analogy
-```{figure} img/kitchen/busy.png
-:alt: Busy kitchen
-:width: 50%
-
-Now we have many similar meals to prepare and possibly many chefs
-present (cores) and workflow tools can help us to plan and document the steps
-and run them efficiently. [Midjourney, CC-BY-NC 4.0]
-```
-````
-
-**We will imagine solving this in four different ways and discuss pros and cons.**
-
----
-
-## Solution 1: Graphical user interface (GUI)
-
-Imagine we have programmed a GUI with a nice interface with icons where you can select scripts and input files by clicking:
-
-- Click on counting script
-- Select book txt file
-- Give a name for the dat file
-- Click on a run symbol
-- Click on plotting script
-- Select book dat file
-- Give a name for the image file
-- Click on a run symbol
-- ...
-- Go to next book ...
-- Click on counting script
-- Select book txt file
-- ...
-
-Disclaimer: not all GUIs behave this way - there exist very good GUI solutions which enable
-reproducibility and automation.
-
----
-
-## Solution 2: Manual steps
-
-It is not too much work for four files:
-
-```{code-block} console
----
-emphasize-lines: 1-2, 13
----
-
-$ python statistics/count.py data/abyss.txt > statistics/abyss.data
-$ python plot/plot.py --data-file statistics/abyss.data --plot-file plot/abyss.png
-
-$ python statistics/count.py data/isles.txt > statistics/isles.data
-$ python plot/plot.py --data-file statistics/isles.data --plot-file plot/isles.png
-
-$ python statistics/count.py data/last.txt > statistics/last.data
-$ python plot/plot.py --data-file statistics/last.data --plot-file plot/last.png
-
-$ python statistics/count.py data/sierra.txt > statistics/sierra.data
-$ python plot/plot.py --data-file statistics/sierra.data --plot-file plot/sierra.png
-
-```
+This could also be implemented with a graphical user interface (GUI), where you can for example drag and drop files and click buttons to do the different processing steps.
 This is **imperative style**: first do this, then to that, then do that, finally do ...
----
-## Solution 3: Script
+````{discussion}
+Both of the above are tricky in terms of reproducibility. We currently have two steps and 4 books. But **imagine having 4 steps and 500 books**.
+How could we deal with this?
-Let's express it more compactly with a shell script (Bash). Let's call it `script.sh`:
+As a first idea we could express the workflow with a shell script. Let's call it `script.sh` (we could do this with a python script too):
 ```{code-block} bash
 ---
 emphasize-lines: 4
@@ -133,11 +69,9 @@ $ bash script.sh
 ```
 This is still **imperative style**: we tell the script to run these
-steps in precisely this order. We can do it on many files, but if we
-need to re-run just one file, it's a bit of work.
+steps in precisely this order.
-````{discussion}
 - What are the advantages of this solution compared to processing all one by one?
 - Is the scripted solution reproducible?
 - Imagine adding more steps to the analysis and imagine the steps being time consuming. What problems do you anticipate
@@ -158,14 +92,21 @@ need to re-run just one file, it's a bit of work.
 ---
-## Solution 4: Using [Snakemake](https://snakemake.readthedocs.io/en/stable/index.html)
+## Workflow tools
+
+Sometimes it may be helpful to go from imperative to declarative style. Rather than saying "do this and then that" we describe dependencies but we let
+the tool figure out the series of steps to produce results (targets). A workflow file
+
+
+
+### Example tool: [Snakemake](https://snakemake.readthedocs.io/en/stable/index.html)
 Snakemake is inspired by [GNU Make](https://www.gnu.org/software/make/), but based on Python and is more general and has easier syntax.
 ---
-## Exercise
+## Exercise - demo
 ````{prereq} Exercise preparation
 The exercise (below) and pre-exercise discussion uses a simple
@@ -232,8 +173,7 @@ rule make_plot:
     shell: 'python {input.script} --data-file {input.book} --plot-file {output}'
 ```
-Snakemake uses **declarative style**: we describe dependencies but we let
-Snakemake figure out the series of steps to produce results (targets).
+We can see that Snakemake uses **declarative style**: Snakefiles contain rules that relate targets (`output`) to dependencies (`input`) and commands (`shell`).
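
Reviewer note: the declarative idea described in this change (rules relate targets to dependencies and commands, and the tool works out the order of steps) can be illustrated with a small Python sketch. This is not how Snakemake is implemented; the `RULES` table and the `plan` function are hypothetical names invented for this illustration, and only the file paths and commands are taken from the lesson.

```python
# Minimal sketch of declarative dependency resolution.
# RULES and plan() are hypothetical; this is not Snakemake's data model or API.
RULES = {
    "statistics/isles.data": {
        "deps": ["data/isles.txt", "statistics/count.py"],
        "cmd": "python statistics/count.py data/isles.txt > statistics/isles.data",
    },
    "plot/isles.png": {
        "deps": ["statistics/isles.data", "plot/plot.py"],
        "cmd": "python plot/plot.py --data-file statistics/isles.data --plot-file plot/isles.png",
    },
}


def plan(target, rules, seen=None):
    """Return the commands needed to build `target`, dependencies first."""
    if seen is None:
        seen = set()
    # Skip targets that are already planned or that have no rule
    # (plain input files such as data/isles.txt).
    if target in seen or target not in rules:
        return []
    seen.add(target)
    steps = []
    for dep in rules[target]["deps"]:
        steps.extend(plan(dep, rules, seen))
    steps.append(rules[target]["cmd"])
    return steps


if __name__ == "__main__":
    # We only name the final target; the order of steps is derived.
    for cmd in plan("plot/isles.png", RULES):
        print(cmd)
```

We only ask for `plot/isles.png` and never state that counting must run before plotting; the order follows from the declared dependencies. A real workflow tool additionally checks which outputs already exist and are up to date, which this sketch leaves out.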