This pipeline describes the processing of raw whole genome sequencing (WGS) data from matched normal-tumor cancer samples generated by next generation sequencing. Processing whole genome data is similar to processing whole exome (WES) data, except that variants are called in both exonic and intronic regions. All steps performed in WGS are similar to those in WES, with changes only in steps 05.0 - Variant Calling - SNP and 06.0 - Variant Calling - INDEL.
- The first step in the pipeline is to create renamed links or concatenated FASTQ files
- This step checks the quality of the raw data before processing starts. The input for this step is all fastq files available in the FASTQ folder.
- This step aligns the reads to the reference genome hs37d5 using the BWA aligner and also trims the adapter sequences. It is run in the main project folder and uses all fastq files available in the FASTQ folder. The output is .bam files of aligned reads, created in the BAMS folder.
- Post-alignment quality control is done in the BAMS folder and uses the aligned reads generated in the previous step. Three modules are used to check post-alignment quality:
- A. Conpair: To check normal-tumor samples concordance
- B. Targeted Panel: To check the coverage of the targeted regions (exonic regions).
- C. Mosdepth: To get the coverage and plot proportion of bases at coverage.
The output of each module is generated in a new folder with the corresponding name inside the QC folder.
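To make the coverage check concrete, the sketch below computes the proportion of positions covered at or above each depth threshold, which is the quantity a cumulative coverage plot shows. The pipeline itself uses Mosdepth for this; the function name `proportion_at_coverage` and the toy depths are illustrative assumptions.

```python
def proportion_at_coverage(depths, thresholds):
    """Return {threshold: fraction of positions with depth >= threshold}."""
    total = len(depths)
    if total == 0:
        # No positions: report zero coverage at every threshold.
        return {t: 0.0 for t in thresholds}
    return {
        t: sum(1 for d in depths if d >= t) / total
        for t in thresholds
    }


# Toy per-base depths (hypothetical values, not real data).
depths = [0, 5, 12, 30, 31, 45, 8, 30]
result = proportion_at_coverage(depths, [1, 10, 30])
print(result)
```

With these toy depths, 7 of 8 positions have depth of at least 1, 5 of 8 have at least 10, and 4 of 8 have at least 30.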
- Variant calling of SNPs using MuTect. The input for this step is the normal and tumor bam files generated by the alignment step, and the output is vcf files created in the mutect folder.
- Filter the vcf files to keep only variants marked PASS in the FILTER column. The input is the vcf files generated in the previous step, and the output is PASSED.vcf files generated in the PASS folder.
- A false positive filter is applied to the PASSED.vcf files from the previous step to remove false positive variants. The output files are PASSED_filter.vcf, created in a new folder called Filter/filterVcf.
- Convert the VCF files to MAF files using tools such as VEP, which determines the effect of variants on genes, transcripts, and protein sequence (using SIFT), as well as regulatory regions. The input for this step is all PASSED_filter.vcf files created in the previous step, and the output is MAF files created in the MAF folder.
- This step merges the MAF files from each sample into one MAF file and creates a separate text file with the column names of the MAF file.
Download the output files (samples.maf and head.txt) to a local directory, and merge the two files into one final MuTect MAF file using an R script.
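The PASS filtering step above can be sketched as follows. This is a minimal illustration, assuming tab-separated VCF records in which the seventh column is FILTER, as the VCF format defines; the function name `filter_pass` and the toy records are hypothetical and not taken from the pipeline scripts.

```python
def filter_pass(vcf_lines):
    """Keep header lines and records whose FILTER column equals PASS."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            # Meta-information and column header lines are always kept.
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        if len(fields) > 6 and fields[6] == "PASS":
            kept.append(line)
    return kept


# Toy VCF content (hypothetical records).
vcf_lines = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tT\t.\tPASS\t.\n",
    "1\t200\t.\tG\tC\t.\tREJECT\t.\n",
]
passed = filter_pass(vcf_lines)
print("".join(passed), end="")
```

Only the record at position 100 survives, together with the two header lines.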
- Variant calling of INDELs using Strelka2. This step is run in the main project folder, and the input is the bam files in the BAMS folder generated by the alignment step. The output is vcf files created in the work/strelka2 directory.
- Filter the vcf files to keep only variants marked PASS in the FILTER column. The input is the vcf files generated in the previous step, and the output is PASSED.vcf files generated in the PASS folder.
- Convert the VCF files to MAF files using tools such as VEP, which determines the effect of variants on genes, transcripts, and protein sequence (using SIFT), as well as regulatory regions. The input for this step is all PASSED.vcf files created in the previous step, and the output is MAF files created in the MAF folder.
- This step merges the MAF files from each sample into one MAF file and creates a separate text file with the column names of the MAF file.
Download the output files (strelka2_all_samples.maf and header_strelka2_all_samples.maf) to a local directory, and merge the two files into one final Strelka2 MAF file using an R script.
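The MAF merge step can be illustrated with a short sketch. It assumes that each per-sample MAF file may begin with comment lines starting with `#`, followed by one column header line and then the data rows, and that all files share the same columns. The function name `merge_mafs` and the toy rows are hypothetical; the pipeline's own scripts may work differently.

```python
def merge_mafs(maf_texts):
    """Merge MAF contents; return (header_line, list_of_data_rows)."""
    header = None
    rows = []
    for text in maf_texts:
        # Drop blank lines and comment lines such as a version tag.
        lines = [l for l in text.splitlines() if l and not l.startswith("#")]
        if not lines:
            continue
        if header is None:
            header = lines[0]
        elif lines[0] != header:
            raise ValueError("MAF files have different column headers")
        rows.extend(lines[1:])
    return header, rows


# Toy per-sample MAF contents (hypothetical rows, three columns only).
maf_a = "#version 2.4\nHugo_Symbol\tChromosome\tStart_Position\nTP53\t17\t7577120\n"
maf_b = "#version 2.4\nHugo_Symbol\tChromosome\tStart_Position\nKRAS\t12\t25398284\n"
header, rows = merge_mafs([maf_a, maf_b])
print(header)
print("\n".join(rows))
```

Returning the header separately from the rows mirrors the two output files described above: one with the column names and one with the merged variants.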
- Further processing of the MuTect and Strelka2 MAF files is done in R: filter out SNPs and low complexity variants from the Strelka2 MAF file, combine the MuTect and Strelka2 MAF files into one final MAF file, and filter the final MAF file for the most deleterious variants.
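The combination logic can be sketched as below. The pipeline performs this step in R; this Python version only illustrates the idea, assuming each MAF row carries a `Variant_Type` column in which single nucleotide variants are labeled `SNP`. The function name `combine_calls` and the toy rows are hypothetical, and the low complexity and deleteriousness filters are not shown because the source does not specify their criteria.

```python
def combine_calls(mutect_rows, strelka_rows):
    """Drop SNP rows from the Strelka2 calls, then append them to the MuTect calls."""
    non_snp = [row for row in strelka_rows if row["Variant_Type"] != "SNP"]
    return list(mutect_rows) + non_snp


# Toy rows (hypothetical), each represented as a dict of MAF columns.
mutect_rows = [{"Hugo_Symbol": "TP53", "Variant_Type": "SNP"}]
strelka_rows = [
    {"Hugo_Symbol": "KRAS", "Variant_Type": "SNP"},
    {"Hugo_Symbol": "APC", "Variant_Type": "DEL"},
    {"Hugo_Symbol": "EGFR", "Variant_Type": "INS"},
]
final_rows = combine_calls(mutect_rows, strelka_rows)
print([row["Hugo_Symbol"] for row in final_rows])
```

SNPs are taken only from MuTect and the remaining variant types only from Strelka2, so the same SNP is not reported by both callers in the final file.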
Labname/Project
- FASTQ: Raw data (fastq files)
- QC: Quality control of fastq files
- BAMS
- mutect
- Results
- PASSED
- MAF
- Filter
- PASSED
- Results
- strelka2
- config
- final
- work