Recommended usage
These guidelines are meant to ensure that data analysis is structured, understandable and reproducible. They concern things like folder structure, naming conventions, paths, etc. Having worked in single-cell analysis for about 3 years now, there are some issues that come up repeatedly and really drive you crazy after a while, so it's good to prevent some of them by introducing a bit of structure.
Note: some items may seem like total overkill. Wait until you work together on a data-analysis project and someone else tries to run your notebooks, or until you have to re-generate your results from a year ago. These usage principles will start to make sense then.
Some of the items described below are specific to a data-analysis workflow that is centered around scanpy, AnnData, scVelo and CellRank, which are some of the theislab packages I use frequently when doing data analysis.
I generally recommend the following structure:
- notebooks: jupyter notebooks with the actual analysis
- figures*: figures saved within the notebooks
- resources*: extra files like important papers, slides that your collaborators have shared with you, etc.
- data*: raw data for this analysis
- cache*: cached intermediate results, usually as `pickle` or `h5ad` files
- utils*: things like Python or R scripts used in the analysis, or an exported conda environment
Items marked with a * are not uploaded to GitHub and only exist on your local machine.
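Laid out on disk, a project following this structure might look roughly like the sketch below (the project name is a placeholder; starred folders are the ones whose contents stay local):

my-analysis/
├── notebooks/
├── figures/      (*)
├── resources/    (*)
├── data/         (*)
├── cache/        (*)
└── utils/        (*)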
After a while, jupyter notebooks accumulate and it gets really hard to tell what's being done in which notebook. I therefore recommend settling on a naming convention early on, something like
INITIALS_DATE_SHORT_TITLE
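As a purely hypothetical example, `ML_2021-03-15_initial_qc.ipynb` follows this convention and already tells you who wrote the notebook, when, and roughly what it does.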
To ensure that results are reproducible, I definitely recommend using anaconda or miniconda with one environment per data-analysis project. The reason for this is that if you use the same environment across multiple projects, you might e.g. install a package for another project, which updates a bunch of packages, which in turn changes the results of your first analysis project.
Further, towards the beginning of every analysis notebook, I would print out the versions of the main packages used in this analysis using a command like `scanpy.logging.print_versions()`.
Once you're done with your analysis, you can export your conda environment and save it under `utils/` so that others can set up a (similar) conda environment. Note that this will never work perfectly across platforms; if you worry about having exactly the same versions across platforms, you should be using Docker.
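As a minimal sketch, the version printout at the top of a notebook looks like this (what gets printed depends on your installation):

import scanpy as sc

# record the versions of scanpy and its main dependencies in the notebook output
sc.logging.print_versions()

The environment itself can then be exported with something like `conda env export > utils/environment.yml`; the file name is just a convention.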
The notebooks are where the actual analysis happens. I usually have a workflow where one initial notebook takes the raw count matrices (often Cell Ranger output) and combines them into one nice AnnData object, possibly with annotations like cluster labels if this is published data. In that case, I want my data to be raw but to contain annotations in `.obs` and `.var`.
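A minimal sketch of such an initial notebook, assuming Cell Ranger output and hypothetical sample names and paths:

import scanpy as sc
import anndata as ad

# read the filtered Cell Ranger matrices, one AnnData per sample
samples = ["sample_1", "sample_2"]
adatas = {s: sc.read_10x_mtx(f"../data/{s}/filtered_feature_bc_matrix/") for s in samples}

# combine into a single AnnData with raw counts; the sample of origin is recorded
# in .obs["sample"], and further annotations can be added to .obs/.var here
adata = ad.concat(adatas, label="sample", index_unique="-")
adata.write("../data/combined_raw.h5ad")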
Some steps in an analysis workflow may take very long, others may be difficult to reproduce exactly across different platforms. For this reason, it often makes sense to cache intermediate results like a UMAP embedding, cluster labels, computed velocities, etc. One possibility is to simply write and read the whole AnnData object. However, this has the disadvantage of saving many AnnData objects that contain data you didn't actually intend to cache. A way around this is to write individual fields from the AnnData object to file, e.g. to save them as pickle objects. However, this has the disadvantage that it's hard to keep track of all the different fields that a single function may write. As an example, `scanpy.tl.pca` writes to both `.obsm` and `.varm`, and it's easy to forget one of them. Another example is `scvelo.tl.recover_dynamics`, which writes a whole set of fields to the AnnData object.
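To make the bookkeeping problem concrete, manually caching just the PCA results could look like this (a sketch with a hypothetical cache path; you have to remember every field yourself):

import pickle

# scanpy.tl.pca writes to (at least) .obsm["X_pca"] and .varm["PCs"]; if you
# pickle only one of them, the cache is silently incomplete
with open("../cache/pca.pickle", "wb") as f:
    pickle.dump({"X_pca": adata.obsm["X_pca"], "PCs": adata.varm["PCs"]}, f)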
To solve these issues, Michal Klein and I developed a caching extension called scachepy. Towards the beginning of your notebook, you initialise a caching object via
import scachepy
c = scachepy.Cache(<caching_directory>)
In your analysis, when you want to compute e.g. PCA, instead of calling `scanpy.tl.pca` you simply call `c.tl.pca`. That will compute the PCA using scanpy and write the results (from all AnnData fields relevant to this computation) to disk. Next time you execute that cell, the data will be loaded from disk. If you want to force recomputation or give your cached files a different name, there are parameters for that; check out the scachepy GitHub. It's that simple!
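In a notebook, that boils down to something like the following; the keyword arguments for forcing recomputation or renaming the cache file are omitted here, see the scachepy README for their exact names:

# first run: computes PCA via scanpy and writes the relevant fields to the
# caching directory; later runs: loads those fields from disk into adata
c.tl.pca(adata)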
As I mentioned above, you shouldn't upload data or figures to GitHub. Your repository would quickly become too large and really impractical to work with. For that reason, you should include data files in the `.gitignore`. However, it does make sense to expose the folder structure used within the `data/` directory, to make sure that others working on this repository use the same folder structure and hence the same paths. For this reason, in this template repository, data, figures etc. are added to the `.gitignore` via their file extension (e.g. `.h5ad`), but the actual folders are exposed by placing empty `.gitkeep` files in them. These are necessary because git does not track empty directories, so others wouldn't know what the folder structure within the `data/` directory looks like. It doesn't matter what you call these files; calling them `.gitkeep` is just a convention (a useful one).
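An excerpt of what the `.gitignore` might contain; the exact extensions used in this template may differ:

# ignore data, caches and figures by extension
*.h5ad
*.loom
*.pickle
*.pdf
*.png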
To keep track of progress on a larger analysis project, I would totally recommend packaging tasks into issues and solving them via pull requests. It just makes it so much easier to go back to see what happened when and to look into the approaches tried previously: which ones worked, and which ones didn't!
Paths are a nasty issue. When two people work on the same data analysis, they are likely to use different paths to import e.g. data files, and each time someone else wants to run their analysis, they need to change all of the paths. To circumvent this, I usually take two measures:
- I have a `paths.py` file at the root of the analysis repo that defines some global paths (e.g. where is the `data/` directory?)
- I make sure that the folder structure in `data/`, `figures/` and `cache/` is the same across all users by exposing it with `.gitkeep` files
The global paths defined in the `paths.py` file are imported in all my notebooks via
import sys
sys.path.insert(0, "../..")  # this depends on the notebook depth and must be adapted per notebook
from paths import DATA_DIR, CACHE_DIR, FIG_DIR
All further paths used in the notebook are always built on these basic paths.
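For illustration, a minimal `paths.py` could look like the following; the constant names match the import above, while using pathlib this way is just one reasonable choice:

from pathlib import Path

# the repository root is wherever paths.py lives
ROOT = Path(__file__).parent.resolve()

DATA_DIR = ROOT / "data"
CACHE_DIR = ROOT / "cache"
FIG_DIR = ROOT / "figures"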
When working with RNA velocity, one usually uses a tool like the velocyto command line tool to generate count matrices that separately store spliced and unspliced counts. The way I usually deal with this is to use the standard Cell Ranger count matrices for all of the basic analysis, up to the point where I get to the velocity analysis. At that point, I map the loom files (in the case of velocyto) onto my processed AnnData object using `scvelo.utils.merge`.
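A sketch of that last step, with a hypothetical loom path and assuming `adata` is the processed AnnData object from the earlier analysis:

import scvelo as scv

# read the velocyto output and merge its spliced/unspliced layers into the
# processed object, matching cells by their barcodes
ldata = scv.read("../data/velocyto/sample_1.loom")
adata = scv.utils.merge(adata, ldata)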
So what does all of this look like in action? There is a sample notebook included in this template repository that follows the guidelines above. Also, in the reproducibility repository for CellRank, we followed these guidelines.