An input JSON file is a file which must include all the information needed to run this pipeline. Hence, it must include the absolute paths to all the control and experimental fastq files; paths to all the genomic data files needed for this pipeline, and it must also specify the parameters and the metadata needed for running this pipeline. If the parameters are not specified in an input JSON file, default values will be used. We provide a set of template JSON files: minimum and full. We recommend to use a minimum template instead of full one. A full template includes all parameters of the pipeline with default values defined.
IMPORTANT: ALWAYS USE ABSOLUTE PATHS.
Mandatory parameters.
-
Pipeline type
chip.pipeline_type
:tf
for TF ChIP-seq,histone
for histone ChIP-seq andcontrol
for aligning control FASTQs only. One major difference between two types is thattf
usesspp
peak caller with controls buthistone
usesmacs2
peak caller without controls. If it iscontrol
thenchip.align_only
is automatically turned on and pipeline will align FASTQs defined inchip.fastqs_repX_RY
(whereX
means control replicate ID andY
is read-end) as controls. For control mode, do not definechip.ctl_fastqs_repX_RY
. Pipeline will skip cross-correlation/JSD/GC-bias analysis for control mode.
-
Experiment title/description
chip.title
: experiment title for a final HTML report.chip.description
: experiment description for a final HTML report.
-
Read endedness
chip.paired_end
:true
if ALL replicates are paired-ended.- (Optional)
chip.paired_ends
: For samples with mixed read ends, you can define read endedness for each biological replicate (e.g.[true, false]
means paired-ended biorep-1 and single-ended biorep-2). chip.ctl_paired_end
:true
if ALL controls are paired-ended. If not defined thenchip.paired_end
will be used.- (Optional)
chip.ctl_paired_ends
: For controls with mixed read ends, you can define read endedness for each biological replicate (e.g.[true, false]
means paired-ended biorep-1 and single-ended biorep-2). If not defined thenchip.paired_ends
will be used.
-
Reference genome
-
chip.genome_tsv
: Choose one from the following genome TSVs.v3
was a standard for >=ENCODE4. It's updated tov4
for >=v2.2.0 (on 12/01/2021). See the version history of genome TSV files here.Genome URL hg38 https://storage.googleapis.com/encode-pipeline-genome-data/genome_tsv/v4/hg38.tsv
mm10 https://storage.googleapis.com/encode-pipeline-genome-data/genome_tsv/v4/mm10.tsv
For DNAnexus CLI (AWS project):
Genome DX URI hg38 dx://project-BKpvFg00VBPV975PgJ6Q03v6:pipeline-genome-data/genome_tsv/v4/hg38.dx.tsv
mm10 dx://project-BKpvFg00VBPV975PgJ6Q03v6:pipeline-genome-data/genome_tsv/v4/mm10.dx.tsv
For DNAnexus CLI (Azure project):
Genome DX URI hg38 dx://project-F6K911Q9xyfgJ36JFzv03Z5J:pipeline-genome-data/genome_tsv/v4/hg38.dx_azure.tsv
mm10 dx://project-F6K911Q9xyfgJ36JFzv03Z5J:pipeline-genome-data/genome_tsv/v4/mm10.dx_azure.tsv
For DNAnexus Web UI (AWS project): Choose one of the following TSV file on
https://platform.DNAnexus.com/projects/BKpvFg00VBPV975PgJ6Q03v6/data/pipeline-genome-data/genome_tsv/v4
.Genome File name hg38 hg38.dx.tsv
mm10 mm10.dx.tsv
For DNAnexus Web UI (Azure project): Choose one of the following TSV file on
https://platform.DNAnexus.com/projects/F6K911Q9xyfgJ36JFzv03Z5J/data/pipeline-genome-data/genome_tsv/v4
.Genome File name hg38 hg38.dx_azure.tsv
mm10 mm10.dx_azure.tsv
-
To build a new TSV file from use your own FASTA (
.fa
and.2bit
) see this. -
You can also define genome specific parameters (defined in a genome TSV file) in an input JSON file. Parameters defined in an input JSON file will override those defined in a genome TSV file. For example, you can simply replace a blacklist file while keeping all other parameters in a genome TSV file for
hg38
.{ "chip.genome_tsv" : "/somewhere/hg38.tsv", "chip.blacklist": "/new/genome/data/new_blacklist.bed.gz" }
-
-
- See this for how to define FASTQ/BAM/TAG-ALIGNs for your sample.
- ChIP-seq pipeline does not have an adapter detector/trimmer.
-
Important parameters
chip.crop_length
: Crop FASTQs with Trimmomatic (using parametersCROP
). It is 0 (disabled) by default.chip.crop_length_tol
: Read length tolerance while cropping FASTQs withchip.crop_length
. Trimmomatic'sMINLEN
will be set aschip.crop_length
-abs(chip.crop_length_tol)
. WARNING: Check your FASTQs' read length first. Reads SHORTER thanMINLEN
will be excluded while cropping, hence not included in output BAM files and all downstream analyses. Tolerance is 2 by default.chip.pval_thresh
: P-value threshold for peak-caller MACS2 (macs2 callpeak -p).chip.fdr_thresh
: FDR threshold for peak-caller SPP (run_spp.R -fdr=).chip.idr_thresh
: IDR (irreproducible discovery rate) threshold.
-
Parameters for subsampling experiment replicates and controls. These parameters are used for subsampling on filtered (nodup) BAMs. Therefore, setting these parameters will affect all downsteam analyses including peak-calling. It's 0 by default, which means no subsampling. These parameters will be applied to each experiment replicate and control.
chip.subsample_reads
: Subsample experiment replicate's reads. For PE dataset, this is not a number of read pairs but number of reads.chip.ctl_subsample_reads
: Subsample control's reads. For PE dataset, this is not a number of read pairs but number of reads.
-
Parameters for automatically choosing an appropriate control and subsampling it. Before calling peaks, there is a separate (from item 7) mechanism to subsample controls to prevent using too deep controls. Pipeline has a logic to find an approriate control for each experiment replicate. Choices are 1) control with the same index (e.g. rep2 vs. ctl2), 2) pooled control (e.g. rep2 vs. ctl1 + ctl2).
chip.always_use_pooled_ctl
: (For experiments with controls only) Always use a pooled control to compare with each replicate. If a single control is given then use it. It is enabled by default.chip.ctl_depth_ratio
: (For experiments with controls only) If ratio of depth between controls is higher than this. then always use a pooled control for all replicates. It's 1.2 by default.
IMPORTANT: Either of the following parameters are set to > 0 for automatic subsampling. Set all of them to 0 to disable automatic subsample. Such automatic control subsampling does not work with controls with mixed endedness (e.g. SE ctl-rep1 + PE ctl-rep2). Pipeline calculates
Limit1
andLimit2
from the following two parameters. And then takes a maximum of them and calls itLimit
. If a chosen control for experiment replicate (each one and pooled one) is deeper thanLimit
then that such control is subsampled toLimit
.chip.exp_ctl_depth_ratio_limit
: (For experiments with controls only)Limit1
is defined as experiment replicate's depth multiplied by this parameter. It's 5.0 by default.chip.ctl_depth_limit
: (For experiments with controls only) Hard limit on control's depth. If control is deeper than this hard limit then such control is subsampled to it.
-
- It is recommened not to change the following parameters unless you get resource-related errors for a certain task and you want to increase resources for such task.
Optional parameters.
-
Other useful parameters
chip.fraglen
: Array of Integers. Fragment length for each bio replicate. If you start from FASTQs then our pipeline automatically estimate it from cross-correlation analysis (taskxcor
) result since such analysis requires a special treamtment for FASTQs. It is possible that fragment length is not estimated correctly (or pipeline can fail due to negative fraglen) if you start from different types (BAM/TAG-ALIGN). For such case, you can manually define fragment length for each bio rep. (e.g.[200, 150]
means 200 for rep1 and 150 for rep2).
-
Flags
chip.redact_nodup_bam
: Redact filtered/nodup BAM at the end of the filtering step (taskfilter
). Raw BAM from the aligner (taskalign
) will still remain unredacted. Quality metrics on filtered BAM will be calculated before being redacted. However, all downstream analyses (e.g. peak-calling) will be done on the redacted BAM. If you start from nodup BAM then this flag will not be active. BAM will be redacted withptools
. To convert BAM into TAG-ALIGN our pipeline usesbedtools bamtobed
to interpret CIGAR string to determine read length but redacted BAM will always haveM
only in CIGAR string. For example, this flag can introduce slight noise to the analysis since original BAM's read length is number ofM
- number ofI
+ number ofD
, which can be different from redacted BAM's read length (number ofM
only).chip.align_only
: Peak calling and its downstream analyses will be disabled. Useful if you just want to align your FASTQs into filtered BAMs/TAG-ALIGNs and don't want to call peaks on them. IMPORTANT: Even thoughchip.pipeline_type
does not matter for align only mode, you still need to define it since it is a required parameter in WDL. Define it astf
for such cases.chip.true_rep_only
: Disable pseudo replicate generation and all related analyses
IMPORTANT: Our pipeline considers a replicate (
rep
) as a biological replicate. You can still define technical replicates for each bio replicate. Tech replicates will be merged together to make a single FASTQ for each bio replicate. Controls can also have technical replicates.
IMPORTANT: Our pipeline supports up to 10 bio replicates and 10 controls.
IMPORTANT: Our pipeline has cross-validation analyses (IDR/overlap) comparing every pair of all replicates. Number of tasks for such analyses will be like nC2. This number will be 45 for 10 bio replicates. It's recommended to keep number of replicates <= 4.
Pipeline can start from any of the following data types (FASTQ, BAM, NODUP_BAM and TAG-ALIGN).
-
Starting from FASTQs
-
Technical replicates for each bio-rep will be MERGED in the very early stage of the pipeline. Each read end R1 and R2 have separate arrays
chip.fastqs_repX_R1
andchip.fastqs_repX_R2
. Do not define R2 array for single-ended replicates. -
Example of 3 paired-ended biological replicates and 2 technical replicates for each bio rep. Two technical replicates
BIOREPX_TECHREP1.R1.fq.gz
andBIOREPX_TECHREP2.R1.fq.gz
for each bio replicate will be merged.{ "chip.paired_end" : true, "chip.fastqs_rep1_R1" : ["BIOREP1_TECHREP1.R1.fq.gz", "BIOREP1_TECHREP2.R1.fq.gz"], "chip.fastqs_rep1_R2" : ["BIOREP1_TECHREP1.R2.fq.gz", "BIOREP1_TECHREP2.R2.fq.gz"], "chip.fastqs_rep2_R1" : ["BIOREP2_TECHREP1.R1.fq.gz", "BIOREP2_TECHREP2.R1.fq.gz"], "chip.fastqs_rep2_R2" : ["BIOREP2_TECHREP1.R2.fq.gz", "BIOREP2_TECHREP2.R2.fq.gz"], "chip.fastqs_rep3_R1" : ["BIOREP3_TECHREP1.R1.fq.gz", "BIOREP3_TECHREP2.R1.fq.gz"], "chip.fastqs_rep3_R2" : ["BIOREP3_TECHREP1.R2.fq.gz", "BIOREP3_TECHREP2.R2.fq.gz"] }
-
-
Starting from BAMs
- Define a BAM for each replicate. Our pipeline does not determine read endedness from a BAM file. You need to explicitly define read endedness.
- Example of 3 singled-ended replicates.
{ "chip.paired_end" : false, "chip.bams" : ["rep1.bam", "rep2.bam", "rep3.bam"] }
-
Starting from filtered/deduped BAMs
- Define a filtered/deduped BAM for each replicate. Our pipeline does not determine read endedness from a BAM file. You need to explicitly define read endedness. These BAMs should not have unmapped reads or duplicates.
- Example of 2 singled-ended replicates.
{ "chip.paired_end" : false, "chip.nodup_bams" : ["rep1.nodup.bam", "rep2.nodup.bam"] }
-
Starting from TAG-ALIGN BEDs
-
Define a TAG-ALIGN for each replicate. Our pipeline does not determine read endedness from a TAG-ALIGN file. You need to explicitly define read endedness.
-
Example of 4 paired-ended replicates.
{ "chip.paired_end" : true, "chip.tas" : ["rep1.tagAlign.gz", "rep2.tagAlign.gz", "rep3.tagAlign.gz", "rep3.tagAlign.gz"] }
-
You need to define controls for TF ChIP-seq pipeline. Skip this if you want to run histone ChIP-seq pipelines. You can define controls similarly to experiment IP replicates. Just add ctl_
prefix to parameter names.
-
Control FASTQs
-
Technical replicates for each bio-rep will be MERGED in the very early stage of the pipeline. Each read end R1 and R2 have separate arrays
chip.ctl_fastqs_repX_R1
andchip.ctl_fastqs_repX_R2
. Do not define R2 array for single-ended replicates. -
Example of 3 paired-ended biological replicates and 2 technical replicates for each bio rep. Two technical replicates
BIOREPX_TECHREP1.R1.fq.gz
andBIOREPX_TECHREP2.R1.fq.gz
for each bio replicate will be merged.{ "chip.ctl_paired_end" : true, "chip.ctl_fastqs_rep1_R1" : ["BIOREP1_TECHREP1.R1.fq.gz", "BIOREP1_TECHREP2.R1.fq.gz"], "chip.ctl_fastqs_rep1_R2" : ["BIOREP1_TECHREP1.R2.fq.gz", "BIOREP1_TECHREP2.R2.fq.gz"], "chip.ctl_fastqs_rep2_R1" : ["BIOREP2_TECHREP1.R1.fq.gz", "BIOREP2_TECHREP2.R1.fq.gz"], "chip.ctl_fastqs_rep2_R2" : ["BIOREP2_TECHREP1.R2.fq.gz", "BIOREP2_TECHREP2.R2.fq.gz"], }
-
-
Control BAMs
-
Define a BAM for each replicate. Our pipeline does not determine read endedness from a BAM file. You need to explicitly define read endedness.
-
Example of 3 singled-ended replicates.
{ "chip.ctl_paired_end" : false, "chip.ctl_bams" : ["ctl1.bam", "ctl2.bam", "ctl3.bam"] }
-
-
Control BAMs
- Define a filtered/deduped BAM for each replicate. Our pipeline does not determine read endedness from a BAM file. You need to explicitly define read endedness. These BAMs should not have unmapped reads or duplicates.
- Example of 2 singled-ended replicates.
{ "chip.ctl_paired_end" : false, "chip.ctl_nodup_bams" : ["ctl1.nodup.bam", "ctl2.nodup.bam"] }
-
Control TAG-ALIGN BEDs
-
Define a TAG-ALIGN for each replicate. Our pipeline does not determine read endedness from a TAG-ALIGN file. You need to explicitly define read endedness.
-
Example of 4 paired-ended replicates.
{ "chip.ctl_paired_end" : true, "chip.ctl_tas" : ["ctl1.tagAlign.gz", "ctl2.tagAlign.gz", "ctl3.tagAlign.gz", "ctl4.tagAlign.gz"] }
-
You can also mix up different data types for individual bio replicate and control. For example, pipeline can start from FASTQs for rep1 (SE) and rep3 (PE), BAMs for rep2 (SE), NODUP_BAMs for rep4 (SE) and TAG-ALIGNs for rep5 (PE). This example has two controls (ctl1: SE BAM, ctl2: PE FASTQs).
{
"chip.paired_ends" : [false, false, true, false, true],
"chip.fastqs_rep1_R1" : ["rep1.fastq.gz"],
"chip.fastqs_rep3_R1" : ["rep3.R1.fastq.gz"],
"chip.fastqs_rep3_R2" : ["rep3.R2.fastq.gz"],
"chip.bams" : [null, "rep2.bam", null, null, null],
"chip.nodup_bams" : [null, null, null, "rep4.nodup.bam", null],
"chip.tas" : [null, null, null, null, "rep5.tagAlign.gz"],
"chip.ctl_paired_ends" : [false, true],
"chip.ctl_fastqs_rep2_R1" : ["ctl2.R1.fastq.gz"],
"chip.ctl_fastqs_rep2_R2" : ["ctl2.R2.fastq.gz"],
"chip.ctl_bams" : ["ctl1.bam", null],
}
WARNING: It is recommened not to change the following parameters unless you get resource-related errors for a certain task and you want to increase resources for such task. The following parameters are provided for users who want to run our pipeline with Caper's
local
on HPCs and 2).
Resources defined here are PER BIO REPLICATE. Therefore, total number of cores will be approximately chip.align_cpu
x NUMBER_OF_BIO_REPLICATES
because align
is a bottlenecking task of the pipeline. This total number of cores will be useful ONLY when you use a local
backend of Caper and manually qsub
or sbatch
your job. disk_factor
is used for GCP/AWS/DNAnexus only.
See this for details.