Skip to content

A pipeline for mapping SNPs in a VCF file to a genome that wasn't used to call the variants. It is expected that this strategy could be used for mapping VCF files to an updated genome assembly or for mapping to another closely related species' genome (to be able to directly compare nucleotide variants). Right now this only works for SNPs, but co…

License

Notifications You must be signed in to change notification settings

KrisChristensen/MapVCF2NewGenome

Repository files navigation

MapVCF2NewGenome

A pipeline for mapping SNPs in a VCF file to a genome that wasn't used to call the variants. It is expected that this strategy could be used for mapping VCF files to an updated genome assembly or for mapping to another closely related species' genome (to be able to directly compare nucleotide variants). Right now this only works for SNPs, but could be modified for other variants.

Table of Contents
  1. Requirements
  2. Usage
  3. License

Requirements

These scripts have been tested with Python 2.7 and 3 and should work with either. The script requires the following programs and files.

Programs:

   bwa (or equivalent alignment program to produce .bam files)
   samtools
   bcftools
   python
   pysam (python)

Files:

   vcf file that you wish to map to a new genome
   old genome file (fasta format and indexed -- samtools faidx oldGenome.fasta)
   new genome file (fasta format and indexed -- samtools faidx newGenome.fasta, bwa index newGenome.fasta)

Usage

  1. Generate flanking sequences around the variant(s) so that they can be mapped to the new genome:

       python VCF_2_Fasta_v1.1.py -vcf old.vcf.gz -fasta oldGenome.fasta -fai oldGenome.fasta.fai -flanking 100 > flankingSeqs.fasta

       help (and further explanations): python VCF_2_Fasta_v1.1.py -h

     To run this step in parallel, please use version 1.2 of this script, which allows the user to specify scaffold.
     Here is an example of how this script can be used in parallel (files will need to be concatenated afterwards):

           cat test.fasta.fai | awk '{ print $1 }' | xargs -I @@ -P 2 sh -c \
           "python3 VCF_2_Fasta_v1.2.py -vcf test.vcf -fasta test.fasta -fai test.fasta.fai -chr @@ > @@.out.fasta"

  1. Map fasta sequences to new genome:

       bwa mem -M newGenome.fasta flankingSeqs.fasta | samtools sort -o newGenome.v.flankingSeqs.bam

  2. Index alignment file:

       samtools index newGenome.v.flankingSeqs.bam

  3. Identify variant positions in new genome:

       python Bam2SNPPosition.v1.1.py -file newGenome.v.flankingSeqs.bam -qual 60 -multiple 1 -minLen 200 -minPid 99 -pos 101 > old2new.positions.txt

       help (and further explanations): python Bam2SNPPosition.v1.1.py -h
       note: requires SNPPosBamv10 in same working directory

  4. Quality control (uses the flanking sequences generated from step 1 to check for exact sequence matching in the new genome, this script can allow the variant to be a mismatch; see help on output):

       python CheckMappingVCF.v1.0.py -map old2new.positions.txt -fasta1 flankingSeqs.fasta -fai1 flankingSeqs.fasta.fai -fasta2 newGenome.fasta -fai2 newGenome.fasta.fai -flanking 20 -output most > old2new.positions.filtered.txt

       help (and further explanations): python CheckMappingVCF.v1.0.py -h

  5. Generate new vcf file:

       python AssignNewSNPPositions.v1.3.py -map old2new.positions.filtered.txt -vcf old.vcf.gz -fai newGenome.fasta.fai -gen newGenome.fasta > new.vcf

       help (and further explanations): python AssignNewSNPPositions.v1.3.py -h

  6. Sort vcf file

       bcftools sort new.vcf > new.sorted.vcf

License

Distributed under the MIT License.

About

A pipeline for mapping SNPs in a VCF file to a genome that wasn't used to call the variants. It is expected that this strategy could be used for mapping VCF files to an updated genome assembly or for mapping to another closely related species' genome (to be able to directly compare nucleotide variants). Right now this only works for SNPs, but co…

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages