Recommended usage
These guidelines are meant to ensure that data analysis is structured, understandable and reproducible. They concern things like folder structure, naming conventions, paths, etc. Having worked in single-cell analysis for about 3 years now, there are some issues that come up repeatedly and really drive you crazy after a while, so it's good to prevent some of them by introducing a bit of structure.
Note: some items may seem like total overkill. Wait until you work together on a data-analysis project and someone else tries to run your notebooks, or until you have to re-generate your results from a year ago. These usage principles will start to make sense then.
Some of the items described below are specific to a data-analysis workflow that is centered around scanpy, AnnData, scVelo and CellRank, which are some of the theislab packages I use frequently when doing data analysis.
I generally recommend the following structure:
- notebooks: jupyter notebooks with the actual analysis
- figures*: figures saved within the notebooks
- resources*: extra files like important papers, slides that your collaborators have shared with you, etc.
- data*: raw data for this analysis
- cache*: cached intermediate results, usually as `pickle` or `h5ad` files
- utils*: things like Python or R scripts used in the analysis, or an exported conda environment
Items marked with a * are not uploaded to GitHub and only exist on your local machine.
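Laid out on disk, a project following this structure might look roughly like the sketch below (the project name is a placeholder; starred folders are the ones whose contents stay local):

my-analysis/
├── notebooks/
├── figures/      (*)
├── resources/    (*)
├── data/         (*)
├── cache/        (*)
└── utils/        (*)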
After a while, jupyter notebooks accumulate and it gets really hard to tell what's being done in which notebook. I therefore recommend settling on a naming convention early on, something like
INITIALS_DATE_SHORT_TITLE
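As a purely hypothetical example, `ML_2021-03-15_initial_qc.ipynb` follows this convention and already tells you who wrote the notebook, when, and roughly what it does.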
To ensure that results are reproducible, I definitely recommend using anaconda or miniconda with one environment per data-analysis project. The reason for this is that if you use the same environment across multiple projects, you might e.g. install a package for another project, which updates a bunch of packages, which in turn changes the results of your first analysis project.
Further, towards the beginning of every analysis notebook, I would print out the versions of the main packages used in this analysis using a command like `scanpy.logging.print_versions()`.
Once you're done with your analysis, you can export your conda environment and save it under `utils/` so that others can set up a (similar) conda environment. Note that this will never work perfectly across platforms; if you worry about having exactly the same versions across platforms, you should be using Docker.
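As a minimal sketch, the version printout at the top of a notebook looks like this (what gets printed depends on your installation):

import scanpy as sc

# record the versions of scanpy and its main dependencies in the notebook output
sc.logging.print_versions()

The environment itself can then be exported with something like `conda env export > utils/environment.yml`; the file name is just a convention.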
The notebooks are where the actual analysis happens. I usually have a workflow where one initial notebook takes the raw count matrices (often Cell Ranger output) and combines them into one nice AnnData object, possibly with annotations like cluster labels if this is published data. In that case, I want my data to be raw but to contain annotations in `.obs` and `.var`.
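A minimal sketch of such an initial notebook, assuming Cell Ranger output and hypothetical sample names and paths:

import scanpy as sc
import anndata as ad

# read the filtered Cell Ranger matrices, one AnnData per sample
samples = ["sample_1", "sample_2"]
adatas = {s: sc.read_10x_mtx(f"../data/{s}/filtered_feature_bc_matrix/") for s in samples}

# combine into a single AnnData with raw counts; the sample of origin is recorded
# in .obs["sample"], and further annotations can be added to .obs/.var here
adata = ad.concat(adatas, label="sample", index_unique="-")
adata.write("../data/combined_raw.h5ad")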
Some steps in an analysis workflow may take very long, others may be difficult to reproduce exactly across different platforms. For this reason, it often makes sense to cache intermediate results like a UMAP embedding, cluster labels, computed velocities, etc. One possibility is to simply write and read the whole AnnData object. However, this has the disadvantage of saving many AnnData objects that contain data you didn't actually intend to cache. A way around this is to write individual fields from the AnnData object to file, e.g. to save them as pickle objects. However, this has the disadvantage that it's hard to keep track of all the different fields that a single function may write. As an example, `scanpy.tl.pca` writes to both `.obsm` and `.varm`, and it's easy to forget one of them. Another example is `scvelo.tl.recover_dynamics`, which writes a whole set of fields to the AnnData object.
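To make the bookkeeping problem concrete, manually caching just the PCA results could look like this (a sketch with a hypothetical cache path; you have to remember every field yourself):

import pickle

# scanpy.tl.pca writes to (at least) .obsm["X_pca"] and .varm["PCs"]; if you
# pickle only one of them, the cache is silently incomplete
with open("../cache/pca.pickle", "wb") as f:
    pickle.dump({"X_pca": adata.obsm["X_pca"], "PCs": adata.varm["PCs"]}, f)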
To solve these issues, Michal Klein and I developed a caching extension called scachepy. Towards the beginning of your notebook, you initialise a caching object via
import scachepy
c = scachepy.Cache(<caching_directory>)
In your analysis, when you want to compute e.g. PCA, instead of calling `scanpy.tl.pca` you simply call `c.tl.pca`. That will compute the PCA using scanpy and write the results (from all AnnData fields relevant to this computation) to disk. Next time you execute that cell, the data will be loaded from disk. If you want to force recomputation or give your cached files a different name, there are parameters for that; check out the scachepy GitHub. It's that simple!
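In a notebook, that boils down to something like the following; the keyword arguments for forcing recomputation or renaming the cache file are omitted here, see the scachepy README for their exact names:

# first run: computes PCA via scanpy and writes the relevant fields to the
# caching directory; later runs: loads those fields from disk into adata
c.tl.pca(adata)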
As I mentioned above, you shouldn't upload data or figures to GitHub. Your repository would quickly become too large and really impractical to work with. For that reason, you should include data files in the `.gitignore`. However, it does make sense to expose the folder structure used within the `data/` directory, to make sure that others working on this repository use the same folder structure and hence the same paths. For this reason, in this template repository, data, figures etc. are added to the `.gitignore` via their file extension (e.g. `.h5ad`), but the actual folders are exposed by placing empty `.gitkeep` files in them. These are necessary because git does not track empty directories, so others wouldn't know what the folder structure within the `data/` directory looks like. It doesn't matter what you call these files; calling them `.gitkeep` is just a convention (a useful one).
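An excerpt of what the `.gitignore` might contain; the exact extensions used in this template may differ:

# ignore data, caches and figures by extension
*.h5ad
*.loom
*.pickle
*.pdf
*.png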
To keep track of progress on a larger analysis project, I would totally recommend packaging tasks into issues and solving them via pull requests. It just makes it so much easier to go back to see what happened when and to look into the approaches tried previously: which ones worked, and which ones didn't!
Paths are a nasty issue. When two people work on the same data analysis, they are likely to use different paths to import e.g. data files, and each time someone else wants to run their analysis, they need to change all of the paths. To circumvent this, I usually take two measures:
- I have a `paths.py` file at the root of the analysis repo that defines some global paths (e.g. where is the `data/` directory?)
- I make sure that the folder structure in `data/`, `figures/` and `cache/` is the same across all users by exposing it with `.gitkeep` files
The global paths defined in the `paths.py` file are imported in all my notebooks via
import sys
sys.path.insert(0, "../..")  # this depends on the notebook depth and must be adapted per notebook
from paths import DATA_DIR, CACHE_DIR, FIG_DIR
All further paths used in the notebook are always built on these basic paths.
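For illustration, a minimal `paths.py` could look like the following; the constant names match the import above, while using pathlib this way is just one reasonable choice:

from pathlib import Path

# the repository root is wherever paths.py lives
ROOT = Path(__file__).parent.resolve()

DATA_DIR = ROOT / "data"
CACHE_DIR = ROOT / "cache"
FIG_DIR = ROOT / "figures"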
When working with RNA velocity, one usually uses a tool like the velocyto command line tool to generate count matrices that separately store spliced and unspliced counts. The way I usually deal with this is to use the standard Cell Ranger count matrices for all of the basic analysis, up to the point where I get to the velocity analysis. At that point, I map the loom files (in the case of velocyto) onto my processed AnnData object using `scvelo.utils.merge`.
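A sketch of that last step, with a hypothetical loom path and assuming `adata` is the processed AnnData object from the earlier analysis:

import scvelo as scv

# read the velocyto output and merge its spliced/unspliced layers into the
# processed object, matching cells by their barcodes
ldata = scv.read("../data/velocyto/sample_1.loom")
adata = scv.utils.merge(adata, ldata)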
So what does all of this look like in action? There is a sample notebook included in this template repository that follows the guidelines above. Also, in the reproducibility repository for CellRank, we followed these guidelines.