The code for running the analysis component of CSI-Microbe

ruppinlab/CSI-Microbes-analysis

CSI-Microbes-analysis

This repository contains part of the workflows for reproducing the results from the bioRxiv paper "scRNA-seq analysis of colon and esophageal tumors uncovers abundant microbial reads in myeloid cells undergoing proinflammatory transcriptional alterations" by Welles Robinson, Josh Stone, Fiorella Schischlik, Billel Gasmi, Michael Kelly, Charlie Seibert, Kimia Dadkhah, E. Michael Gertz, Joo Sang Lee, Kaiyuan Zhu, Lichun Ma, Xin Wang, S. Cenk Sahinalp, Rob Patro, Mark D.M. Leiserson, Curtis Harris, Alejandro A. Schäffer, and Eytan Ruppin. It provides the workflows that analyze microbial reads from 10x and Smart-seq2 scRNA-seq datasets to identify microbial taxa that are differentially abundant or differentially present. Before running this code, the microbial reads must be identified using the CSI-Microbes-identification repository. The code in this repository was written by Welles Robinson and Fio Schischlik and alpha-tested by Alejandro Schäffer.

Requirements

This workflow has been tested on Mac OS Mojave (10.14.6) and the Linux OS (biowulf). The minimum memory requirements are 10 GB for all steps except for figure 5A, which requires 30 GB of RAM. This workflow expects that conda has been installed. For instructions on how to install conda, see conda install documentation.
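As a quick pre-flight check, the sketch below tests whether conda is reachable from the current shell. The helper name `have_cmd` is hypothetical and not part of this repository; it only wraps the standard `command -v` lookup.

```shell
# Hypothetical helper (not part of this repository):
# succeeds if the named command is found on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd conda; then
  # Print the installed conda version.
  conda --version
else
  echo "conda not found; install it before continuing" >&2
fi
```

If the check fails right after installing conda, opening a new terminal session is often enough for the updated PATH to take effect.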

Software Installation

It should take < 30 minutes to install the software, which involves downloading the codebase and setting up the environment (not including the time needed to unzip the files, which depends on the OS). There are two ways to download the codebase. To reproduce the key results from our paper, it is recommended to download the latest version of CSI-Microbes-analysis from Zenodo, which contains the intermediate files generated using CSI-Microbes-identification. The intermediate files for a given dataset are located in the <dataset_of_interest>/raw directory. For example, the intermediate files needed to reproduce Aulicino2018 are in Aulicino2018/raw.
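To confirm that the intermediate files are in place before running a workflow, a check along these lines can be run from the top-level directory. It is a minimal sketch: the helper name `has_raw_files` is hypothetical, and it only verifies that `<dataset_of_interest>/raw` exists and is not empty, not that every expected file is present.

```shell
# Hypothetical helper (not part of this repository):
# succeeds only if <dataset>/raw exists and contains at least one entry.
has_raw_files() {
  raw_dir="$1/raw"
  [ -d "$raw_dir" ] && [ -n "$(ls -A "$raw_dir" 2>/dev/null)" ]
}

dataset="Aulicino2018"   # any dataset directory from the Zenodo archive
if has_raw_files "$dataset"; then
  echo "$dataset: intermediate files found in $dataset/raw"
else
  echo "$dataset: no intermediate files in $dataset/raw" >&2
fi
```

A failure here usually means the codebase was cloned from GitHub rather than downloaded from Zenodo, since the GitHub repository does not contain the intermediate files.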

The second way to download the codebase is to clone the GitHub repository as shown below (which does not contain the intermediate files). The below instructions assume that you have an ssh key associated with your GitHub account. If you do not, you can generate a new ssh key and associate it with your GitHub username by following these instructions.

git clone git@github.com:ruppinlab/CSI-Microbes-analysis.git

Once the codebase is downloaded, you need to create the conda environment (you need to perform this step only once unless you explicitly delete the conda environment).

cd CSI-Microbes-analysis
conda env create -f envs/CSI-Microbes-analysis.yaml

Finally, you need to activate the recently created conda environment (all of the commands assume that the conda environment CSI-Microbes-env is active).

conda activate CSI-Microbes-env
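Because every later command assumes the environment is active, it can help to verify this explicitly. The sketch below relies on the `CONDA_DEFAULT_ENV` variable that conda sets on activation; the helper name `env_is_active` is hypothetical.

```shell
# Hypothetical helper (not part of this repository):
# succeeds if the named conda environment is the currently active one.
env_is_active() {
  [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]
}

expected_env="CSI-Microbes-env"
if env_is_active "$expected_env"; then
  echo "conda environment $expected_env is active"
else
  echo "expected $expected_env but found '${CONDA_DEFAULT_ENV:-none}'" >&2
fi
```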

Software Dependencies

CSI-Microbes-analysis depends on the following software packages that are installed via the conda channels conda-forge, bioconda and defaults: dplyr (1.0.5)REF, ggforce (0.3.3)REF, ggplot2 (3.3.3)REF, ggpubr (0.4.0)REF, rpy2 (3.4.4)REF, scater (1.16.0)REF, scran (1.16.0)REF, SingleCellExperiment (1.10.1)REF, Snakemake (6.2.1)REF, and Seurat (4.0.1)REF.

Reproducing key results and figures from the paper

The reproduction of key results and figures from the paper requires intermediate files generated by CSI-Microbes-identification and available for download from Zenodo.

Reproducing results from Aulicino2018

To reproduce the results from Aulicino2018REF, you first need to be in the Aulicino2018 directory.

cd Aulicino2018

and then you can use snakemake to reproduce the key results

snakemake --cores <number of CPUs> --use-conda all
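The `<number of CPUs>` placeholder has to be filled in by hand. One way to pick a value is sketched below; the helper name `detect_cores` is hypothetical, and the fallbacks cover the Linux and macOS systems mentioned under Requirements. The snippet only prints the resulting command rather than running it.

```shell
# Hypothetical helper (not part of this repository):
# print the number of online CPUs, falling back to 1 if it cannot be determined.
detect_cores() {
  getconf _NPROCESSORS_ONLN 2>/dev/null ||
    nproc 2>/dev/null ||
    sysctl -n hw.ncpu 2>/dev/null ||
    echo 1
}

cores=$(detect_cores)
# Show the command with the placeholder filled in.
echo "snakemake --cores $cores --use-conda all"
```

On a shared machine it may be preferable to pass a smaller number than the detected total. Adding snakemake's `-n` flag performs a dry run that lists the planned jobs without executing them.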

Reproducing results from Ben-Moshe2019

To reproduce the results from Ben-Moshe2019REF, you first need to be in the Ben-Moshe2019 directory.

cd Ben-Moshe2019

and then you can use snakemake to reproduce the key results

snakemake --cores <number of CPUs> --use-conda all

Reproducing results from Pelka2021

The results from Pelka2021REF are split across two directories, one for the microbial results and one for the human results. This example shows how to reproduce the microbial results; the steps for the human results are very similar.

cd Pelka2021

and then you can use snakemake to reproduce the key results

snakemake --cores <number of CPUs> --use-conda all

Reproducing results from Robinson2023

The results from Robinson2023 are split across four directories, by microbial vs. human results and by 10x vs. plexWell. This example shows how to reproduce the microbial results from the 10x dataset; the steps for the other three directories are very similar.

cd Robinson2023-10x

and then you can use snakemake to reproduce the key results

snakemake --cores <number of CPUs> --use-conda all

Reproducing results from Zhang2021

The results from Zhang2021REF are split across two directories, one for the microbial results and one for the human results. This example shows how to reproduce the microbial results; the steps for the human results are very similar.

cd Zhang2021

and then you can use snakemake to reproduce the key results

snakemake --cores <number of CPUs> --use-conda all

References

Publications Analyzed

Aulicino, A. et al. Invasive Salmonella exploits divergent immune evasion strategies in infected and bystander dendritic cell subsets. Nat. Commun. 9, 4883 (2018).

Bossel Ben-Moshe, N. et al. Predicting bacterial infection outcomes using single cell RNA-sequencing analysis of human immune cells. Nat. Commun. 10, 3266 (2019).

Paulson, K. G. et al. Acquired cancer resistance to combination immunotherapy from transcriptional loss of class I HLA. Nat. Commun. 9, 3868 (2018).

Pelka, K. et al. Spatially organized multicellular immune hubs in human colorectal cancer. Cell (2021).

Zhang, X. et al. Dissecting esophageal squamous-cell carcinoma ecosystem by single-cell transcriptomic analysis. Nat. Commun. 12, 5291 (2021).

Software Tools

Wickham, H., François, R., Henry, L. and Müller, K (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.5. https://CRAN.R-project.org/package=dplyr

Pedersen, T.L. (2021). ggforce: Accelerating 'ggplot2'. R package version 0.3.3. https://CRAN.R-project.org/package=ggforce

Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.

Kassambara, A. (2020). ggpubr: 'ggplot2' Based Publication Ready Plots. R package version 0.4.0. https://CRAN.R-project.org/package=ggpubr.

rpy2. https://rpy2.github.io/

McCarthy DJ, Campbell KR, Lun ATL, Willis QF (2017). “Scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R.” Bioinformatics, 33, 1179-1186. doi:10.1093/bioinformatics/btw777 (URL:https://doi.org/10.1093/bioinformatics/btw777).

Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor [version 2; referees: 3 approved, 2 approved with reservations]. F1000Research 5 (2016). https://github.com/MarioniLab/scran

Lun, A. and Risso, D. (2020). SingleCellExperiment: S4 Classes for Single Cell Data. R package version 1.10.1.

Köster, J., & Rahmann, S. (2012). Snakemake-a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520–2522. https://doi.org/10.1093/bioinformatics/bts480

Hao and Hao et al. Integrated analysis of multimodal single-cell data. bioRxiv (2020) [Seurat V4]