A pipeline for mapping SNPs in a VCF file to a genome that wasn't used to call the variants. It is expected that this strategy could be used for mapping VCF files to an updated genome assembly or for mapping to another closely related species' genome (to be able to directly compare nucleotide variants). Right now this only works for SNPs, but could be modified for other variants.
Table of Contents
These scripts have been tested with Python 2.7 and 3 and should work with either. The script requires the following programs and files.
Programs:
bwa (or equivalent alignment program to produce .bam files)
samtools
bcftools
python
pysam (python)
Files:
vcf file that you wish to map to a new genome
old genome file (fasta format and indexed -- samtools faidx oldGenome.fasta)
new genome file (fasta format and indexed -- samtools faidx newGenome.fasta, bwa index newGenome.fasta)
- Generate flanking sequences around the variant(s) so that they can be mapped to the new genome:
python VCF_2_Fasta_v1.1.py -vcf old.vcf.gz -fasta oldGenome.fasta -fai oldGenome.fasta.fai -flanking 100 > flankingSeqs.fasta
help (and further explanations): python VCF_2_Fasta_v1.1.py -h
To run this step in parallel, please use version 1.2 of this script, which allows the user to specify scaffold.
Here is an example of how this script can be used in parallel (files will need to be concatenated afterwards):
cat test.fasta.fai | awk '{ print $1 }' | xargs -I @@ -P 2 sh -c \
"python3 VCF_2_Fasta_v1.2.py -vcf test.vcf -fasta test.fasta -fai test.fasta.fai -chr @@ > @@.out.fasta"
-
Map fasta sequences to new genome:
bwa mem -M newGenome.fasta flankingSeqs.fasta | samtools sort -o newGenome.v.flankingSeqs.bam -
Index alignment file:
samtools index newGenome.v.flankingSeqs.bam -
Identify variant positions in new genome:
python Bam2SNPPosition.v1.1.py -file newGenome.v.flankingSeqs.bam -qual 60 -multiple 1 -minLen 200 -minPid 99 -pos 101 > old2new.positions.txt
help (and further explanations): python Bam2SNPPosition.v1.1.py -h
note: requires SNPPosBamv10 in same working directory -
Quality control (uses the flanking sequences generated from step 1 to check for exact sequence matching in the new genome, this script can allow the variant to be a mismatch; see help on output):
python CheckMappingVCF.v1.0.py -map old2new.positions.txt -fasta1 flankingSeqs.fasta -fai1 flankingSeqs.fasta.fai -fasta2 newGenome.fasta -fai2 newGenome.fasta.fai -flanking 20 -output most > old2new.positions.filtered.txt
help (and further explanations): python CheckMappingVCF.v1.0.py -h -
Generate new vcf file:
python AssignNewSNPPositions.v1.3.py -map old2new.positions.filtered.txt -vcf old.vcf.gz -fai newGenome.fasta.fai -gen newGenome.fasta > new.vcf
help (and further explanations): python AssignNewSNPPositions.v1.3.py -h -
Sort vcf file
bcftools sort new.vcf > new.sorted.vcf
Distributed under the MIT License.