diff --git a/blog/softwareenvironments.md b/blog/softwareenvironments.md new file mode 100644 index 000000000..23aa8e5e9 --- /dev/null +++ b/blog/softwareenvironments.md @@ -0,0 +1,106 @@ +--- +author: Pavel Dimens +date: 2024-06-21 +category: guides +description: Deciding between using Conda or Containers +icon: container +image: https://cdn-icons-png.freepik.com/512/7115/7115168.png +--- + +# :icon-container: Choosing a software runtime method +There are two ways you can run Harpy, using a container with the necessary +software environments in it (the default), or with local conda environments +(with the `--conda` option). If software development and containerization +isn't your jam, that's great, you're in the right place! Below is a quick +explanation of what/why and the tradeoffs between either approach so you +can decide for yourself which makes more sense to use. + +### TL;DR +- container is more likely to work on all systems, but much slower +- conda is quicker and better for troubleshooting, but may have unexpected errors + +## What Harpy Provides +An conda-based installation of Harpy provides only the minimal set of +programs Harpy needs to begin a workflow. These include: python 3.12, snakemake-minimal, pandas, and the htslib programs (htslib, samtools, bcftools, tabix). Noticeably, there aren't sequence aligners, quality-assessment tools, phasers, etc. This is because some of the software +dependencies themselves have clashing dependencies and cannot be installed alongside each other, but more importantly, it keeps the Harpy installation quite small and quick. + +## How Harpy Provides the Other Stuff +Instead of a monolithic Harpy environment, which would be impossible with +the current software dependencies, there are a handful of defined conda environment recipes that Harpy workflows generate. Snakemake will make +environments of those recipes, then jump in and out of those local conda +environments as dictated by the software needs of any given job (given in +the `conda:` directive within a rule). Those local environments live inside +`.snakemake/conda/wildhashnumber`, with auto-generated names reflecting the +hash of the environment (e.g. `.snakemake/conda/21ceb8c2fe7dd21206ab90c2af8f847f_`). + +**But**, those environments need to be created at runtime if they don't +already exist in `.snakemake/`, so Harpy (technically Snakemake) will install +them before running the jobs within a workflow. On some HPC systems, this +process can move glacially slow (it might be a RAID or NAS thing) and this +might make you think a Harpy workflow is hanging at the environment +installation step before it even begins its first job. That isn't ideal. +Additionally, sysadmins aren't particularly fond of how many files are +created with conda-based installations, which leads us to containerization. + +## Harpy and Containers +==- Containers, a Primer +If you aren't sure exactly what containers are, great, we aren't either! But +here's what we do know: it's a tiny mountable file containing an entire +operating system and whatever other bits you might need. Creating containers +is done with a recipe that takes a base "image" (an established existing +container) and adds "layers" of modifications to that base image. Imagine a +simple recipe where you declare a base image of a minimal Ubuntu 22 system +and your "layer" (modification) is installing a program into it using `sudo apt install ...`. You could then use this container as the "environment" to +run particular things with the software you installed into it. +=== + +The Harpy team manages [a container on Dockerhub](https://hub.docker.com/repository/docker/pdimens/harpy/general) called, you guessed it, Harpy, that +is synchronously versioned with the Harpy software. In other words, if +you're using Harpy v1.4, it will use the container version v1.4. The +development version of Harpy uses `latest` and the versions are automagically +managed through GitHub Actions. The Harpy container actually contains all of +the conda environments **in** it. So, when Snakemake is using the container +environment method, it will pull the versioned container from Dockerhub, and +jump in and out of container instances as required by the different jobs. +When inside a container, Snakemake will automatically activate the correct +conda environment within the container! + +## What's the Catch? +While local conda enviroments at runtime or containers might seem like foolproof approaches, there are drawbacks. + +### Conda Caveats: +#### ⚠️ Conda Caveat 1: Inconsistent +Despite our and conda's best efforts, sometimes programs just don't install +correctly on some systems due to unexpected system (or conda) configurations. +This results in frustrating errors where jobs fail because software that is +absolutely installed isn't being recognized (false negative), or software that wasn't +successfully installed is being recognized (false positive). + +#### 💣 Conda Caveat 2: Troubleshooting +To manually troubleshoot many of the tasks Harpy workflows perform, you +may need to jump into one of the local conda environments in `.snakemake/conda`. That itself isn't terrible, but it's an extra step because you will +need to identify which environment is the correct one since Snakemake renames +them by their hash. An easy way to do this is to do +```bash idenify the contents of the local conda environments +cat .snakemake/cconda/hashname.yaml +``` +because Snakemake also saves the YAML recipe too. While a little annoying, +this would be the sensible way to manually troubleshoot a step from a +workflow because troubleshooting it with the container method is much, much +more involved and not recommended. + +## Container Caveats +#### 🚥 Container Caveat 1: Speed +The overhead of Snakemake creating a container instance for a job, then +cleaning it up after the job is done is not trivial and can +negatively impact runtime. + +#### 💣 Container Caveat 2: Troubleshooting +The command Snakemake secretly invokes to run a job in a container is +quite lengthy. In most cases that shouldn't matter to you, but when +something eventually goes wrong and you need to troubleshoot, it's harder +to manually rerun steps (e.g. `bwa mem genome.fa sample1.F.fq, sample1.R.fq`) +because you need a much bigger, more involved container-based command line +call to enter a container instance and run everything with the correct +directories mounted. + diff --git a/retype.yml b/retype.yml index 5732da7d4..6cdd23a1d 100644 --- a/retype.yml +++ b/retype.yml @@ -17,7 +17,7 @@ links: icon: ruby target: blank - text: Guides & Tutorials - link: https://pdimens.github.io/harpy/blog + link: /blog icon: code-of-conduct target: blank footer: