-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathVariant_call_Day4
235 lines (188 loc) · 15.8 KB
/
Variant_call_Day4
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
# Single Neucleotide Variants (SNVs) and small Insertion and Deletion (Indel) analysis
# Very general workflow for variant analysis
# 1. fastq files from your sample
# 2. Run FastQC for quality control
# 2a) if qualtify of bases (in particular at the end of reads) are bad, perform trimming.
# 2b) if there are adapter sequences present in your fastq files, remove adapter sequence.
# => OK. fastq quality is good. Lets go to step 3
# 3. Perform Alignemnt using your favorite aligner (e.g., BWA, etc.) and generate SAM or BAM files.
# 4. Sort bam file and generate an index file for your bam file.
# 5. Variant calling using VarScan (http://dkoboldt.github.io/varscan/) Please read manual (http://dkoboldt.github.io/varscan/using-varscan.html) and their original publication (http://www.ncbi.nlm.nih.gov/pubmed/22300766)
# 5a) Generate mpileup data.
# => What is (m)pileup data? Go to the following link: http://samtools.sourceforge.net/pileup.shtml
# mpileup data will be used to call SNVs and Indels using VarScan
# 5b) Call variants using VarScan with mpileup data
# 6. Now we have variant calls (e.g., a list of SNVs and Indels). We want to annotate them to see whether those variants are reported by someone eles or associated with known disease phenotype and/or see what would be functional impact.
# 6a) We will use snpEff tool (http://snpeff.sourceforge.net) and please read snpEff readme file (http://snpeff.sourceforge.net/SnpEff.html)
# 7. Interpretation. Now we have annotated variant calls and need to make sense out of it.
# 8. It is important to visualize your bam file to double check whether variants called by computational algorithms are real(?).
# Utilmately, you want to validate your variants using Sanger or etc.
Password:
mkdir day4
cd day4
cp /data/bootcamp/day4/*.fastq .
ls # make sure you have tumor.bam and normal.bam
qrsh
hostname # make sure you do run your code on computing node => compute-4-0.local
########### Part 1: Variant calling using GATK #############
##########################################################################################
# GATK Germline Variant Call #
# https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS #
##########################################################################################
# Please go to the following link if you have any questions regarding GATK command
# https://software.broadinstitute.org/gatk/gatkdocs/
gatk=/home/thwang/software/GenomeAnalysisTK.jar
picard=/home/thwang/software/picard.jar
SnpSift=/home/thwang/software/snpEff/SnpSift.jar
snpEFFanno=/home/thwang/reference
bwa=/home/thwang/software/bwa-0.7.12/bwa
reference=/home/thwang/reference/hg19.fa # bwa reference
samtools=/home/thwang/software/samtools-1.2/bin/samtools
java=/data/bootcamp/seqprg/jre1.8.0_101/bin/java
outdir="$PWD"
sample=normal_TP53
# Run QC for your fastq files!!!!!!!
# /data/bootcamp/seqprg/FastQC/fastqc ${sample}.fastq
# If QC is bad, you need to fix it before you run the rest of analysis. The following is code for trimming from Day 1.
# export PATH=/data/bootcamp/seqprg/jre1.8.0_101/bin:$PATH
# /data/bootcamp/seqprg/FastQC/fastqc NA12878_100ng.R1.1M.fq
# /data/bootcamp/seqprg/FastQC/fastqc NA12878_100ng.R2.1M.fq
# Next we will trim the sequences and run fastqc again
# /data/bootcamp/seqprg/trim_galore_zip/trim_galore --paired -q 25 --illumina --length 35 --no_report_file --path_to_cutadapt /data/bootcamp/seqprg/bin/cutadapt NA12878_100ng.R1.1M.fq NA12878_100ng.R2.1M.fq
# /data/bootcamp/seqprg/FastQC/fastqc NA12878_100ng.R1.1M_val_1.fq
# /data/bootcamp/seqprg/FastQC/fastqc NA12878_100ng.R2.1M_val_2.fq
$bwa mem -M -R "@RG\tID:group1\tSM:${sample}\tPL:illumina\tLB:lib1\tPU:unit1" $reference ${sample}.fastq > ${outdir}/${sample}.sam
$java -jar ${picard} SortSam INPUT=${outdir}/${sample}.sam OUTPUT=${outdir}/${sample}_sorted_reads.bam SORT_ORDER=coordinate
$java -jar ${picard} MarkDuplicates INPUT=${outdir}/${sample}_sorted_reads.bam OUTPUT=${outdir}/${sample}_dedup_reads.bam METRICS_FILE=metrics.txt
$samtools index ${outdir}/${sample}_dedup_reads.bam
$java -jar ${gatk} -T RealignerTargetCreator -R /home/thwang/reference/hg19.fa -I ${outdir}/${sample}_dedup_reads.bam -known /home/thwang/reference/1000G_phase1.indels.b37.vcf -known /home/thwang/reference/Mills_and_1000G_gold_standard.indels.b37.vcf -o ${outdir}/${sample}_target_intervals.list -L 17:7552323-7616951
# Note that IndelRealigner is not required for newer version of GATK
$java -jar ${gatk} -I ${outdir}/${sample}_dedup_reads.bam -R /home/thwang/reference/hg19.fa -T IndelRealigner -targetIntervals ${outdir}/${sample}_target_intervals.list -known /home/thwang/reference/1000G_phase1.indels.b37.vcf -known /home/thwang/reference/Mills_and_1000G_gold_standard.indels.b37.vcf -o ${outdir}/${sample}_realigned.bam -L 17:7552323-7616951
$java -jar ${gatk} -T BaseRecalibrator -R /home/thwang/reference/hg19.fa -I ${outdir}/${sample}_realigned.bam -knownSites /home/thwang/reference/Mills_and_1000G_gold_standard.indels.b37.vcf -knownSites /home/thwang/reference/1000G_phase1.indels.b37.vcf -knownSites /home/thwang/reference/dbsnp_138.b37.vcf -o ${outdir}/${sample}_recal.table -L 17:7552323-7616951
$java -jar ${gatk} -T PrintReads -R /home/thwang/reference/hg19.fa -I ${outdir}/${sample}_realigned.bam -BQSR ${outdir}/${sample}_recal.table -o ${outdir}/${sample}_recal.bam -L 17:7552323-7616951
# Run GATK HaplotypeCaller to call germline variants
$java -jar ${gatk} -T HaplotypeCaller -R ${reference} -I ${outdir}/${sample}_recal.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o ${outdir}/${sample}.vcf -L 17:7552323-7616951
# Annotate VCF (variants) file
snpEFF=/home/thwang/software/snpEff/snpEff.jar
$java -jar $snpEFF hg19 ${outdir}/${sample}.vcf > ${outdir}/${sample}.tmp.vcf
$java -jar ${SnpSift} annotate ${snpEFFanno}/CosmicCodingMuts.vcf ${outdir}/${sample}.tmp.vcf > ${outdir}/${sample}.tmp.COSMIC.coding.snpEFF
$java -jar ${SnpSift} annotate ${snpEFFanno}/CosmicNonCodingVariants.vcf ${outdir}/${sample}.tmp.COSMIC.coding.snpEFF > ${outdir}/${sample}.tmp.COSMIC.coding.noncoding.snpEFF
$java -jar ${SnpSift} annotate ${snpEFFanno}/All_20160408.vcf ${outdir}/${sample}.tmp.COSMIC.coding.snpEFF > ${outdir}/${sample}.COSMIC.coding.noncoding.dbsnp.snpEFF
# Download your bam and bai files and visualize with IGV tool
# scp [email protected]:/home/thwang/day4/*recal.bam* ./Download
##########################################################################################
# GATK Tumor Variant Call without matching normal #
# https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS #
##########################################################################################
gatk=/home/thwang/software/GenomeAnalysisTK.jar
picard=/home/thwang/software/picard.jar
SnpSift=/home/thwang/software/snpEff/SnpSift.jar
snpEFFanno=/home/thwang/reference
bwa=/home/thwang/software/bwa-0.7.12/bwa
reference=/home/thwang/reference/hg19.fa # bwa reference
samtools=/home/thwang/software/samtools-1.2/bin/samtools
java=/data/bootcamp/seqprg/jre1.8.0_101/bin/java
outdir="$PWD"
sample=tumor_TP53
# Run QC for your fastq files!!!!!!!
# /data/bootcamp/seqprg/FastQC/fastqc ${sample}.fastq
# If QC is bad, you need to fix it before you run the rest of analysis. The following is code for trimming from Day 1.
# export PATH=/data/bootcamp/seqprg/jre1.8.0_101/bin:$PATH
# /data/bootcamp/seqprg/FastQC/fastqc NA12878_100ng.R1.1M.fq
# /data/bootcamp/seqprg/FastQC/fastqc NA12878_100ng.R2.1M.fq
# Next we will trim the sequences and run fastqc again
# /data/bootcamp/seqprg/trim_galore_zip/trim_galore --paired -q 25 --illumina --length 35 --no_report_file --path_to_cutadapt /data/bootcamp/seqprg/bin/cutadapt NA12878_100ng.R1.1M.fq NA12878_100ng.R2.1M.fq
# /data/bootcamp/seqprg/FastQC/fastqc NA12878_100ng.R1.1M_val_1.fq
# /data/bootcamp/seqprg/FastQC/fastqc NA12878_100ng.R2.1M_val_2.fq
$bwa mem -M -R "@RG\tID:group1\tSM:${sample}\tPL:illumina\tLB:lib1\tPU:unit1" $reference ${sample}.fastq > ${outdir}/${sample}.sam
$java -jar ${picard} SortSam INPUT=${outdir}/${sample}.sam OUTPUT=${outdir}/${sample}_sorted_reads.bam SORT_ORDER=coordinate
$java -jar ${picard} MarkDuplicates INPUT=${outdir}/${sample}_sorted_reads.bam OUTPUT=${outdir}/${sample}_dedup_reads.bam METRICS_FILE=metrics.txt
$samtools index ${outdir}/${sample}_dedup_reads.bam
$java -jar ${gatk} -T RealignerTargetCreator -R /home/thwang/reference/hg19.fa -I ${outdir}/${sample}_dedup_reads.bam -known /home/thwang/reference/1000G_phase1.indels.b37.vcf -known /home/thwang/reference/Mills_and_1000G_gold_standard.indels.b37.vcf -o ${outdir}/${sample}_target_intervals.list -L 17:7552323-7616951
$java -jar ${gatk} -I ${outdir}/${sample}_dedup_reads.bam -R /home/thwang/reference/hg19.fa -T IndelRealigner -targetIntervals ${outdir}/${sample}_target_intervals.list -known /home/thwang/reference/1000G_phase1.indels.b37.vcf -known /home/thwang/reference/Mills_and_1000G_gold_standard.indels.b37.vcf -o ${outdir}/${sample}_realigned.bam -L 17:7552323-7616951
$java -jar ${gatk} -T BaseRecalibrator -R /home/thwang/reference/hg19.fa -I ${outdir}/${sample}_realigned.bam -knownSites /home/thwang/reference/Mills_and_1000G_gold_standard.indels.b37.vcf -knownSites /home/thwang/reference/1000G_phase1.indels.b37.vcf -knownSites /home/thwang/reference/dbsnp_138.b37.vcf -o ${outdir}/${sample}_recal.table -L 17:7552323-7616951
$java -jar ${gatk} -T PrintReads -R /home/thwang/reference/hg19.fa -I ${outdir}/${sample}_realigned.bam -BQSR ${outdir}/${sample}_recal.table -o ${outdir}/${sample}_recal.bam -L 17:7552323-7616951
# Run GATK HaplotypeCaller to call somatic/germline variants
$java -jar ${gatk} -T HaplotypeCaller -R ${reference} -I ${outdir}/${sample}_recal.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o ${outdir}/${sample}.vcf -L 17:7552323-7616951
# Annotate VCF (variants) file
$java -jar $snpEFF hg19 ${outdir}/${sample}.vcf > ${outdir}/${sample}.tmp.vcf
$java -jar ${SnpSift} annotate ${snpEFFanno}/CosmicCodingMuts.vcf ${outdir}/${sample}.tmp.vcf > ${outdir}/${sample}.tmp.COSMIC.coding.snpEFF
$java -jar ${SnpSift} annotate ${snpEFFanno}/CosmicNonCodingVariants.vcf ${outdir}/${sample}.tmp.COSMIC.coding.snpEFF > ${outdir}/${sample}.tmp.COSMIC.coding.noncoding.snpEFF
$java -jar ${SnpSift} annotate ${snpEFFanno}/All_20160408.vcf ${outdir}/${sample}.tmp.COSMIC.coding.snpEFF > ${outdir}/${sample}.COSMIC.coding.noncoding.dbsnp.snpEFF
# Download your bam and bai files and visualize with IGV tool
# scp [email protected]:/home/thwang/day4/*recal.bam* ./Download
# scp [email protected]:/home/thwang/day4/*vcf ./Download
########### Part 3: Variant calling using VarScan #############
#######################################
# VarScan Variant Call #
# http://dkoboldt.github.io/varscan/ #
#######################################
##########################################
# VarScan Normal Variant Call #
# http://dkoboldt.github.io/varscan/ #
##########################################
varscan=/home/thwang/software/VarScan.v2.3.9.jar
samtools=/home/thwang/software/samtools-1.2/bin/samtools
reference=/home/thwang/reference/hg19.fa
picard=/home/thwang/software/picard.jar
snpEFF=/home/thwang/software/snpEff/snpEff.jar
outdir="$PWD"
Nbam=normal_TP53_recal.bam
$samtools mpileup -f $reference $Nbam > normal.mpileup
$java -jar $varscan mpileup2snp normal.mpileup > normal.varscan.germline.snp.vcf --output-vcf
$java -jar $varscan mpileup2indel normal.mpileup > normal.varscan.germline.indel.vcf --output-vcf
$java -jar $snpEFF hg19 normal.varscan.germline.snp.vcf > normal.varscan.germline.snp.ann.vcf
$java -jar $snpEFF hg19 normal.varscan.germline.indel.vcf > normal.varscan.germline.indel.ann.vcf
#If you want to annotate with COSMIC and dbSNP, please follow steps we desribed the above
# $java -jar ${SnpSift} annotate ${snpEFFanno}/CosmicCodingMuts.vcf your_vcf_file.vcf > tmp.COSMIC.coding.snpEFF
# $java -jar ${SnpSift} annotate ${snpEFFanno}/CosmicNonCodingVariants.vcf tmp.COSMIC.coding.snpEFF > tmp.COSMIC.coding.noncoding.snpEFF
# $java -jar ${SnpSift} annotate ${snpEFFanno}/All_20160408.vcf tmp.COSMIC.coding.noncoding.snpEFF > your_vcf_file.COSMIC.coding.noncoding.dbsnp.snpEFF
# Download your bam and bai files and visualize with IGV tool
# scp [email protected]:/home/thwang/day4/*recal.bam* ./Download
# scp [email protected]:/home/thwang/day4/*vcf ./Download
#################################################################
# VarScan Tumor Variant Call without matched normal #
# http://dkoboldt.github.io/varscan/ #
#################################################################
Tbam=tumor_TP53_recal.bam
$samtools mpileup -f $reference $Tbam > tumor.mpileup
$java -jar $varscan mpileup2snp tumor.mpileup > tumor.varscan.germline.snp.vcf --output-vcf
$java -jar $varscan mpileup2indel tumor.mpileup > tumor.varscan.germline.indel.vcf --output-vcf
$java -jar $snpEFF hg19 tumor.varscan.germline.snp.vcf > tumor.varscan.germline.snp.ann.vcf
$java -jar $snpEFF hg19 tumor.varscan.germline.indel.vcf > tumor.varscan.germline.indel.ann.vcf
#If you want to annotate with COSMIC and dbSNP, please follow steps we desribed the above
# $java -jar ${SnpSift} annotate ${snpEFFanno}/CosmicCodingMuts.vcf your_vcf_file.vcf > tmp.COSMIC.coding.snpEFF
# $java -jar ${SnpSift} annotate ${snpEFFanno}/CosmicNonCodingVariants.vcf tmp.COSMIC.coding.snpEFF > tmp.COSMIC.coding.noncoding.snpEFF
# $java -jar ${SnpSift} annotate ${snpEFFanno}/All_20160408.vcf tmp.COSMIC.coding.noncoding.snpEFF > your_vcf_file.COSMIC.coding.noncoding.dbsnp.snpEFF
# Download your bam and bai files and visualize with IGV tool
# scp [email protected]:/home/thwang/day4/*recal.bam* ./Download
# scp [email protected]:/home/thwang/day4/*vcf ./Download
##########################################################################
# VarScan Somatic Variant Call using tumor and matched normal #
# http://dkoboldt.github.io/varscan/ #
##########################################################################
#calling somatic mutations using varscan, the output would be somatic.snp.vcf and somatic.indel.vcf. Runing time ~ 5min
$java -jar $varscan somatic normal.mpileup tumor.mpileup somatic --output-vcf
#annotate vcf file using snpEFF. Runing time ~ 2min
$java -jar $snpEFF hg19 somatic.indel.vcf > varscan.somatic.indel.ann.vcf
$java -jar $snpEFF hg19 somatic.snp.vcf > varscan.somatic.snp.ann.vcf
#If you want to annotate with COSMIC and dbSNP, please follow steps we desribed the above
# $java -jar ${SnpSift} annotate ${snpEFFanno}/CosmicCodingMuts.vcf your_vcf_file.vcf > tmp.COSMIC.coding.snpEFF
# $java -jar ${SnpSift} annotate ${snpEFFanno}/CosmicNonCodingVariants.vcf tmp.COSMIC.coding.snpEFF > tmp.COSMIC.coding.noncoding.snpEFF
# $java -jar ${SnpSift} annotate ${snpEFFanno}/All_20160408.vcf tmp.COSMIC.coding.noncoding.snpEFF > your_vcf_file.COSMIC.coding.noncoding.dbsnp.snpEFF
# Download your bam and bai files and visualize with IGV tool
# scp [email protected]:/home/thwang/day4/*recal.bam* ./Download
# scp [email protected]:/home/thwang/day4/*vcf ./Download
##################################################################################
## Do it yourself ################################################################
##################################################################################
#There are tumor and matched normal fastq files (paire-end) from whole genome sequencing data
#Your task is to run FASTQC-> BWA -> GATK/VarScan -> snpEff -> Download bam file and visualize it with IGV
# tumor line
# read_sample_2126_T1.R1.fastq # forward_read
# read_sample_2126_T1.R2.fastq # reverse_read
# matched normal line
# read_sample_2126_N1.R1.fastq # forward_read
# read_sample_2126_N1.R2.fastq # reverse_read
#Let's do it!