-
Notifications
You must be signed in to change notification settings - Fork 34
10x 3' scRNA seq Pipeline (v0.1)
Welcome to the HCA Skylab 10x 3' scRNA-seq pipeline page. This page will describe our first draft of the 3' scRNA-seq pipeline.
This pipeline is designed to process scRNA-Seq data in Google cloud computing environment.
This pipeline is written in Cromwell/WDL.
One docker has been built for this pipeline. Secondary dockers have been built for support scripts
The test and demonstrative dataset was produced by 10x genomics and is publicly available on their website:"8k PBMCs from a Healthy Donor". A subset of the full dataset comprised of 10,000 reads aligning to chromosome 21 and 2000 unaligned reads was extracted with the Generate10xDemoData.wdl
Reference genomes from Ensembl, UCSC or NCBI may be used. We obtained the human genome reference hg19, downloaded from UCSC for our draft workflow, but selected only chromosome 21 for our demo workflow to decrease run time. Because 10x cellranger requires a specific set of reference files, we pre-constructed then using GenerateReferenceBundle.wdl. This allows us to reliably recreate references for testing purposes.
The genome annotation is obtained from Gencode. We downloaded the v19 gtf file.
Our v0.1 pipeline aims to run the 10x cellranger pipeline on the Google Cloud Platform, in order to create a standard pipeline against which we can benchmark. As such, the individual tasks in the first draft pipeline wdl, 10x/count/count.wdl are explicitly broken up into split, main, and join stages as they are in the cellranger count pipeline. Cellranger is coded using its own workflow language, martian. For more information, see this martian README. The scientific tasks carried out by the cellranger workflow are as follows:
- Extracts barcode information from read 1 and index read 1
- Aligns reads containing genomic information to a reference genome (using STAR)
- Tags aligned bam reads with barcode information extracted in step 1
- Marks reads that result from PCR duplication
- Quantifies molecule counts in each gene and cell
- Filters erroneous cell and molecular barcodes
- Identifies cell barcodes matched to real cells, and excludes contaminants
The cellranger count pipeline is a deterministic pipeline, meaning it should achieve the same result when run multiple time son the same data. Thus, in order to confirm that our pipeline is replicating the cellranger pipeline, we tested for bit-identity in the output files of cellranger and count.wdl, run on the same input dataset, using the same reference genome.
Comparisons between the two pipelines' filtered count matrices in .mex format were bit identical:
$md5sum cellranger/outs/fitered_gene_bc_matrices/hg19_chr21_10x_reference/matrix.mtx
deef7d1362981464823b7576a4d2de13 cellranger/outs/fitered_gene_bc_matrices/hg19_chr21_10x_reference/matrix.mtx
$md5sum count.wdl/outs/fitered_gene_bc_matrices/hg19_chr21_10x_reference/matrix.mtx
deef7d1362981464823b7576a4d2de13 count.wdl/outs/fitered_gene_bc_matrices/hg19_chr21_10x_reference/matrix.mtx
In addition, examination of the hdf5 datasets, including the row and column indices showed that these files also contained identical data:
>>> import tables
>>> import numpy as np
>>> def identical(a, b):
return np.array_equal(a, b)
>>> # load data
>>> cranger = tables.open_file('cellranger_filtered_gene_bc.h5', 'r')
>>> count = tables.open_file('count_filtered_gene_bc.h5', 'r')
>>> # test equivalence
>>> identical(cranger.root.hg19_chr21_10_reference.genes.read(), count.root.hg19_chr21_10_reference.genes.read())
True
>>> identical(cranger.root.hg19_chr21_10_reference.barcodes.read(), count.root.hg19_chr21_10_reference.barcodes.read())
True
>>> identical(cranger.root.hg19_chr21_10_reference.data.read(), count.root.hg19_chr21_10_reference.data.read())
True
>>> # negative control
>>> identical(np.array([0, 1]), np.array([1, 1]))
False