The new GATK-based pipeline for wild isolate C. elegans strains
_______ _______ _______ __ __ _______ _______
| __| _ |_ _| |/ | | | | ___|
| | | | | | | < | | ___|
|_______|___|___| |___| |__|\__| |__|____|___|
parameters                description                            Set/Default
==========                ===========                            ========================
--debug                   Use --debug to indicate debug mode     null
--output                  Release Directory                      WI-{date}
--sample_sheet            Sample sheet                           null
--bam_location            Directory of bam files                 /projects/b1059/data/{species}/WI/alignments/
--mito_name               Contig not to polarize hetero sites    MtDNA

Reference Genome
---------------
--reference_base          Location of ref genomes                /projects/b1059/data/{species}/genomes/
--species/project/build   These 4 params form --reference        {species} / {project} / {ws_build}

Variant Filters
---------------
--min_depth               Minimum variant depth                  5
--qual                    Variant QUAL score                     30
--strand_odds_ratio       SOR_strand_odds_ratio                  5
--quality_by_depth        QD_quality_by_depth                    20
--fisherstrand            FS_fisher_strand                       100
--high_missing            Max % missing genotypes                0.95
--high_heterozygosity     Max % max heterozygosity               0.10
- The latest update requires Nextflow version 24+. On Rockfish, you can access this version by loading the nf24_env conda environment prior to running the pipeline command:
module load python/anaconda
source activate /data/eande106/software/conda_envs/nf24_env
nextflow run -latest andersenlab/wi-gatk --debug
nextflow run -latest andersenlab/wi-gatk --sample_sheet=/path/sample_sheet_GATK.tsv --bam_location=/vast/eande106/workflows/alignment-nf/
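Because the pipeline requires Nextflow 24 or newer, it can be useful to confirm the version before launching a long run. The following is a minimal sketch and not part of the pipeline: the function name nf_major_ok is hypothetical, and it assumes the version text contains a dotted version such as 24.04.4, as printed by nextflow -version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when a Nextflow version string reports major version >= 24.
# Assumes the text contains a dotted version token such as "24.04.4".
nf_major_ok() {
    local version_text="$1"
    local major
    # Take the first "digits.digits" token and keep the part before the dot.
    major=$(printf '%s\n' "$version_text" | grep -oE '[0-9]+\.[0-9]+' | head -n 1 | cut -d. -f1)
    [ -n "$major" ] && [ "$major" -ge 24 ]
}

# Example use (requires nextflow on PATH):
#   nf_major_ok "$(nextflow -version)" || echo "Nextflow 24+ required" >&2
```

Taking the version text as an argument keeps the check independent of how Nextflow is installed or loaded.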
There are three configuration profiles for this pipeline.
- rockfish: Used for running on Rockfish (default).
- quest: Used for running on Quest.
- local: Used for local development.
Note
If you forget to add a -profile, the rockfish profile will be chosen as the default.
The sample sheet is automatically generated from alignment-nf in the output folder under the name sample_sheet_for_seq_sheet_ALL.tsv. The sample sheet contains 5 columns as detailed below:
strain | bam | bai | coverage | percent_mapped |
---|---|---|---|---|
AB1 | AB1.bam | AB1.bam.bai | 64 | 99.4 |
AB4 | AB4.bam | AB4.bam.bai | 52 | 99.2 |
BRC20067 | BRC20067.bam | BRC20067.bam.bai | 30 | 92.5 |
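A quick sanity check of the column layout can catch a wrong file before the pipeline starts. This is a sketch under stated assumptions: the function name check_sample_sheet_header is hypothetical, and it assumes the file is tab-separated (based on the .tsv extension) with a header row that names the five columns exactly as shown in the table above.

```shell
#!/usr/bin/env bash
# Hypothetical check: the first line of the sample sheet must list the five
# expected column names, separated by tabs (assumed from the .tsv extension).
check_sample_sheet_header() {
    local sheet="$1"
    local expected
    local header
    expected=$(printf 'strain\tbam\tbai\tcoverage\tpercent_mapped')
    # Strip a possible carriage return in case the file has Windows line endings.
    header=$(head -n 1 "$sheet" | tr -d '\r')
    if [ "$header" = "$expected" ]; then
        return 0
    fi
    echo "Unexpected header in $sheet: $header" >&2
    return 1
}

# Example use:
#   check_sample_sheet_header /path/sample_sheet_GATK.tsv
```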
Important
It is essential that you always use the pipelines and scripts to generate this sample sheet and NEVER create it manually. There are many strains, and we want to make sure the entire process can be reproduced.
Note
The sample sheet produced from alignment-nf is only for strains that you ran in the alignment pipeline most recently. If you want to combine old strains with new strains, you will have to combine two or more sample sheets. If you are running a species-wide analysis for CaeNDR, please follow the notes in the full WI protocol here
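Combining sheets with a script rather than by hand keeps the step reproducible. The sketch below is illustrative only and is not the protocol's own method: the function name combine_sample_sheets is hypothetical, the files are assumed to be tab-separated with identical header rows, and keeping the first occurrence of a repeated strain is an assumption about what is wanted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge several sample sheets into one stream on stdout.
# Assumes every input is tab-separated and starts with the same header row.
combine_sample_sheets() {
    # Print the header once (from the first file), skip the header of every
    # later file, and keep only the first row seen for each strain (column 1).
    awk -F'\t' 'FNR == 1 { if (NR == 1) print; next } !seen[$1]++' "$@"
}

# Example use (list the newest sheet first so its rows win on duplicates):
#   combine_sample_sheets new_sheet.tsv old_sheet.tsv > combined_sheet.tsv
```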
Path to directory holding all the alignment files for strains in the analysis. Defaults to /vast/eande106/data/{species}/WI/alignments/
Important
Remember to move your bam files output from alignment-nf to this location prior to running wi-gatk. In most cases, you will want to run wi-gatk on all samples, new and old combined.
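Before launching, one way to confirm that every file named in the sample sheet has actually been moved into the alignment directory is sketched here. The function name missing_alignment_files is hypothetical, and the sketch assumes a tab-separated sheet with a header row and the bam and bai file names in columns 2 and 3.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list bam/bai files named in the sample sheet that are
# not present in the alignment directory. Returns 1 if anything is missing.
missing_alignment_files() {
    local sheet="$1"
    local bam_dir="${2%/}"
    local missing
    # Columns 2 and 3 hold the bam and bai file names; skip the header row.
    missing=$(tail -n +2 "$sheet" | cut -f 2,3 | tr '\t' '\n' | while IFS= read -r name; do
        if [ -n "$name" ] && [ ! -f "$bam_dir/$name" ]; then
            printf '%s\n' "$bam_dir/$name"
        fi
    done)
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing"
        return 1
    fi
    return 0
}

# Example use:
#   missing_alignment_files /path/sample_sheet_GATK.tsv /vast/eande106/data/c_elegans/WI/alignments/
```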
default = c_elegans
Options: c_elegans, c_briggsae, or c_tropicalis
default = PRJNA13758
WormBase project ID for selected species. Choose from some examples here
default = WS283
WormBase version to use for reference genome.
A fasta reference indexed with BWA. On Rockfish, the reference is available here:
/vast/eande106/data/c_elegans/genomes/PRJNA13758/WS283/c_elegans.PRJNA13758.WS283.genome.fa.gz
Note
If running on Rockfish, instead of changing the reference parameter, opt to change the species, project, and ws_build parameters for other species like c_briggsae (the reference will then change automatically).
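To illustrate how the three parameters could determine the reference, the sketch below rebuilds the example path given above from its parts. The layout is inferred only from that single Rockfish example path; the pipeline's actual logic may differ, and the function name reference_path and its data-directory argument are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compose a reference path from a data directory,
# species, project, and WormBase build. Layout inferred from the one
# example path in this document, not taken from the pipeline source.
reference_path() {
    local data_dir="${1%/}"
    local species="$2"
    local project="$3"
    local ws_build="$4"
    printf '%s/%s/genomes/%s/%s/%s.%s.%s.genome.fa.gz\n' \
        "$data_dir" "$species" "$project" "$ws_build" \
        "$species" "$project" "$ws_build"
}

# Example use:
#   reference_path /vast/eande106/data c_elegans PRJNA13758 WS283
```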
Name of the contig on which to skip het polarization. You might need to change this for species other than c_elegans if the mitochondrial contig is named differently. Defaults to MtDNA.
A directory in which to output results. By default it will be WI-YYYYMMDD, where YYYYMMDD is today's date.
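If a wrapper script needs to know the default directory name, for example to locate results after a run, the same pattern can be reproduced with date. This is a sketch only; the function name default_output_dir is hypothetical, and it assumes the pipeline uses the local date.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reproduce the default output directory name,
# WI-YYYYMMDD, using the local date (an assumption about the pipeline).
default_output_dir() {
    printf 'WI-%s\n' "$(date +%Y%m%d)"
}

# Example use:
#   results_dir=$(default_output_dir)
#   ls "$results_dir/variation"
```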
The final output directory looks like this:
├── variation
│   ├── *.hard-filter.vcf.gz
│   ├── *.hard-filter.vcf.tbi
│   ├── *.hard-filter.stats.txt
│   ├── *.hard-filter.filter_stats.txt
│   ├── *.soft-filter.vcf.gz
│   ├── *.soft-filter.vcf.tbi
│   ├── *.soft-filter.stats.txt
│   └── *.soft-filter.filter_stats.txt
└── report
    ├── multiqc.html
    └── multiqc_data
        └── multiqc_*.json
- andersenlab/gatk4 (link): Docker image is created within this pipeline using GitHub Actions. Whenever a change is made to env/gatk4.Dockerfile or .github/workflows/build_docker.yml, GitHub Actions will create a new docker image and push it if successful.
- andersenlab/r_packages (link): Docker image is created manually; the code can be found in the dockerfile repo.
Make sure that you add the following code to your ~/.bash_profile. This line makes sure that any singularity images you download will go to a shared location on /vast/eande106 for other users to take advantage of (without them also having to download the same image).
# add singularity cache
export SINGULARITY_CACHEDIR='/vast/eande106/singularity/'
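To avoid adding the line twice when setting up several accounts or re-running a setup script, the append can be made idempotent. The following is a sketch, not a lab-provided script: the function name add_singularity_cache is hypothetical, and the profile path is passed in so the function can be tried on a scratch file first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the cache export to a profile file only if the
# exact line is not already there. Defaults to ~/.bash_profile.
add_singularity_cache() {
    local profile="${1:-$HOME/.bash_profile}"
    local line="export SINGULARITY_CACHEDIR='/vast/eande106/singularity/'"
    # -x matches whole lines, -F treats the pattern as a fixed string.
    if ! grep -qxF "$line" "$profile" 2>/dev/null; then
        printf '# add singularity cache\n%s\n' "$line" >> "$profile"
    fi
}

# Example use:
#   add_singularity_cache            # edits ~/.bash_profile
#   add_singularity_cache ./test_profile
```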