imputePipe

A pipeline to impute human SNP data to 1000 genomes efficiently by parallelising across a compute cluster.

Summary

Provided here is a collection of bash and R scripts that knit together a two-stage imputation process. The pipeline performs the following steps:

  • Alignment of target data to reference data
  • Haplotyping
  • Imputing
  • Converting to best-guess genotypes
  • Filtering steps

Runtime is fast thanks to some great software that is freely available (see below). I typically have 400 or so cores available, and for 1000 individuals, say, the pipeline will complete in a few hours. Time complexity scales linearly.

Requirements

  • A compute cluster using SGE
  • Your genotype data (22 autosomes), QC'd, in binary plink format, referred to as the target data set
  • The downloads (including a reference data set) listed in the instructions below
  • R, awk, bash, git, a text editor, etc
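A quick preflight check can catch a missing tool or an incomplete fileset before any jobs are queued. The sketch below is not part of the pipeline; the function names and the example prefix data/target/mydata are hypothetical.

```shell
# Print every command from the argument list that is not on the PATH.
# Returns non-zero if at least one is missing.
missing_commands() {
    local status=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            status=1
        fi
    done
    return $status
}

# A binary plink fileset is three files sharing one prefix: .bed, .bim, .fam.
has_plink_fileset() {
    local prefix=$1 ext
    for ext in bed bim fam; do
        [ -f "$prefix.$ext" ] || return 1
    done
    return 0
}
```

For example, `missing_commands R awk bash git qsub` lists anything that still needs installing, and `has_plink_fileset data/target/mydata` succeeds only when all three files are present.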

Outputs

  • Haplotyped target and imputed data in impute2 format
  • Dosage imputed data in impute2 format
  • 'Best-guess' imputed data in binary plink format
  • 'Best-guess' imputed data, filtered for MAF and HWE in binary plink format
  • Optional SNP2HLA imputation

Credits

Imputation is a big, slow, ugly, long-winded, hand-wavey, unpleasant process. In setting up this pipeline I have used plenty of scripts, programmes, and data found in various corners of the internet, and these have made the whole task much, much easier. Most of these resources have been used without permission from the original authors. If any of the authors are angry about this then let me know and I will take it down!

Here is a list of resources that I have used:

The pipeline was developed by Gibran Hemani under the Complex Trait Genomics Group at the University of Queensland (Diamantina Institute and Queensland Brain Institute). Valuable help was provided by members of Peter Visscher's and Naomi Wray's group, and Paul Leo from Matt Brown's lab.

Instructions

1. Gather all the data and scripts required.

  1. First Clone this repository

    git clone https://github.com/CNSGenomics/impute-pipe.git
    
  2. Then copy your raw data in binary plink format to data/target

  3. Download the strand alignment files for your data's chip from Will Rayner's page and unzip.

  4. Download the chain file for the SNP chip's build from UCSC (most likely you will need HG18 to HG19 which is included in this repo)

  5. Download and unarchive the reference data from the impute2 website, e.g.

    wget http://mathgen.stats.ox.ac.uk/impute/ALL_1000G_phase1integrated_v3_impute.tgz
    tar xzvf ALL_1000G_phase1integrated_v3_impute.tgz
    

2. Customise the parameter.sh file

This file has all the options required for the imputation process. It should be fairly self-explanatory: just change the file names and options listed in the section marked "To be edited by user".
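Since every later script depends on these settings, it can be worth checking that nothing was left blank after editing. This is a minimal sketch rather than anything shipped with the pipeline; `require_vars` is a hypothetical helper, and the variable names in the usage line are made up for illustration, so substitute the ones that appear in your copy of the file.

```shell
# Report each variable named in the argument list that is unset or empty
# in the current shell. Returns non-zero if any are missing.
require_vars() {
    local _rv_status=0 _rv_name _rv_value
    for _rv_name in "$@"; do
        # Indirect lookup of the variable whose name is in _rv_name.
        eval "_rv_value=\${$_rv_name:-}"
        if [ -z "$_rv_value" ]; then
            printf 'unset: %s\n' "$_rv_name" >&2
            _rv_status=1
        fi
    done
    return $_rv_status
}
```

After sourcing the parameter file, a call such as `require_vars rawdata refdir` (hypothetical names) fails loudly if either setting is empty.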


3. Perform pre-imputation quality control checks

In order to determine whether imputation will result in high-quality data, it is important to perform a few quality control checks before proceeding. The QC process is performed by executing

./check_strand.sh

This script will generate some allele frequency plots and strand information stats that will help inform whether or not to proceed with imputation with the data as is. In particular, the output generated in data/qc/ consists of:

  1. Allele frequency plots of the reference dataset against the target dataset, before and after strand alignment (see section 4 below for information on strand alignment).
  2. An allele frequency information file for each chromosome, indicating the major allele and its frequency in the target dataset, before and after strand alignment, as well as in the reference dataset.
  3. A strand summary file for the whole genome, indicating the percent concordance between major alleles in the target and reference datasets, before and after strand alignment, as well as the number of ambiguous 'palindromic' SNPs (i.e. those SNPs with alleles either AT or CG).
  4. A plaintext list of palindromic SNPs, for exclusion using PLINK.
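To make the palindromic SNP list in item 4 concrete, here is one way such a list could be derived from a plink .bim file with awk. This is a sketch under my own assumptions, not necessarily how check_strand.sh does it; the function name `palindromic_snps` is hypothetical.

```shell
# Print the IDs of palindromic (strand-ambiguous) SNPs in a plink .bim file.
# .bim columns: chromosome, SNP ID, genetic distance, position, allele 1, allele 2.
palindromic_snps() {
    awk '{
        pair = toupper($5 $6)
        if (pair == "AT" || pair == "TA" || pair == "CG" || pair == "GC")
            print $2
    }' "$1"
}
```

The resulting list of IDs is in the form that plink's --exclude option expects.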

The main objective of these QC checks is to determine issues that may arise in the following strand alignment step. For example: data that has previously been aligned to the reference will not require any allelic flipping, and using the wrong strand file to align data will result in poor concordance with the reference dataset. Both of these issues can be identified by viewing the allele frequency plots, and examining the strand summary file.

If the output meets your expectations, proceed to execute

./filter_pre.sh

to create a filtered dataset for imputation (filtered on identified palindromic SNPs, and missingness per individual, missingness per marker, Hardy-Weinberg equilibrium and minor allele frequency, as set in parameters.sh).
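For orientation, the filters named above map onto standard plink options roughly as follows. This is only a sketch: the function prints a command instead of running it, the function name is hypothetical, and the thresholds are placeholder values, not the defaults the pipeline uses.

```shell
# Print (rather than run) a plink command applying the pre-imputation filters.
# Arguments: input prefix, palindromic SNP list, output prefix.
filter_pre_command() {
    local input=$1 exclude=$2 out=$3
    # Placeholder thresholds (assumptions): per-individual missingness,
    # per-marker missingness, HWE p-value, minor allele frequency.
    local mind=0.05 geno=0.05 hwe=1e-6 maf=0.01
    printf 'plink --bfile %s --exclude %s --mind %s --geno %s --hwe %s --maf %s --make-bed --out %s\n' \
        "$input" "$exclude" "$mind" "$geno" "$hwe" "$maf" "$out"
}
```

Printing the command first makes it easy to eyeball the thresholds before anything is overwritten.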


4. Align the target to the reference data

This is a two step process.

  1. First, convert all alleles to be on the forward strand:

    ./strand_align.sh
    
  2. Second, convert the map to hg19, update SNP names and positions, remove SNPs that are not present in the reference data, and split the data into separate chromosomes. By running

    qsub ref_align.sh
    

the script will be submitted to SGE to execute on all chromosomes in parallel. Alternatively you can run

    ./ref_align.sh <chr>

and the script will only work on chromosome <chr>.
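A script can support both modes by taking the chromosome from its first argument when one is given, and from the SGE array task index otherwise. The sketch below shows that pattern; I am assuming an array job (e.g. submitted with `qsub -t 1-22`), which may not be exactly how ref_align.sh is written, and `resolve_chr` is a hypothetical name.

```shell
# Pick the chromosome to process: an explicit argument wins, otherwise fall
# back to SGE_TASK_ID, which SGE sets for each task of an array job.
resolve_chr() {
    local chr=${1:-${SGE_TASK_ID:-}}
    case $chr in
        ''|*[!0-9]*)
            # Also catches the literal "undefined" that SGE uses outside array jobs.
            echo "no chromosome given and no usable SGE_TASK_ID" >&2
            return 1
            ;;
    esac
    if [ "$chr" -lt 1 ] || [ "$chr" -gt 22 ]; then
        echo "chromosome must be between 1 and 22" >&2
        return 1
    fi
    printf '%s\n' "$chr"
}
```

Inside a script this would be used as `chr=$(resolve_chr "$@") || exit 1`.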

Output

The output from this will be binary plink files for each chromosome located in the data/target directory.


5. Perform haplotyping

This uses Amy Williams' excellent haplotyping programme hapi-ur. We perform the haplotyping three times on each chromosome:

qsub hap.sh

and then vote on the most common outcome at each position to make a final haplotype:

qsub imp.sh

This also creates a new SGE submit script for each chromosome, in which the chromosome is split into 5Mb chunks with 250kb overlapping regions (these options can be amended in the parameter.sh file).
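To illustrate the chunking arithmetic, the sketch below splits a chromosome into fixed-size core intervals and pads each one by the overlap on both sides, clipped to the chromosome ends. How imp.sh actually lays out its chunks is not shown here, so treat the padding rule and the function name `chunk_intervals` as assumptions.

```shell
# Print one line per chunk: index, core start, core end, padded start, padded end.
# Arguments: chromosome length, chunk size, overlap (all in base pairs).
chunk_intervals() {
    local length=$1 size=$2 overlap=$3
    local i=1 start=1 end pstart pend
    while [ "$start" -le "$length" ]; do
        end=$((start + size - 1))
        if [ "$end" -gt "$length" ]; then end=$length; fi
        pstart=$((start - overlap))
        if [ "$pstart" -lt 1 ]; then pstart=1; fi
        pend=$((end + overlap))
        if [ "$pend" -gt "$length" ]; then pend=$length; fi
        printf '%s %s %s %s %s\n' "$i" "$start" "$end" "$pstart" "$pend"
        start=$((end + 1))
        i=$((i + 1))
    done
}
```

For a 12Mb region, `chunk_intervals 12000000 5000000 250000` gives three chunks, the last one shorter than the rest. impute2 itself accepts a core interval (-int) and a flanking buffer (-buffer), so in practice only the core boundaries may need to be passed on.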

Either script can be run on a single chromosome on the front end by using

./hap.sh <chr>
./imp.sh <chr>

which might be useful for testing to see if it is working etc.

Output

The output from this will be three haplotype file sets for each chromosome, as well as a final, democratically elected (!) file set in impute2 format, located in the data/haplotypes directory.
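The voting idea can be shown on a toy example. The sketch below takes three files that each hold one allele (0 or 1) per line for the same haplotype, and keeps the allele seen in at least two of them. It deliberately ignores a real complication: independent phasing runs may list the two haplotypes of an individual in a different order, so they need to be put in a consistent orientation before voting. How imp.sh deals with that is not described here, and `vote_three` is a hypothetical name.

```shell
# Per-position majority vote over three phasing runs.
# Arguments: three files with one 0/1 allele per line, in the same SNP order.
vote_three() {
    paste "$1" "$2" "$3" | awk '{ print (($1 + $2 + $3 >= 2) ? 1 : 0) }'
}
```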


6. Imputation

Most likely the lengthiest and most memory demanding stage. By running

./submit_imp.sh

the scripts spawned for each chromosome in the last step will be submitted to SGE.

With large sample sizes e.g. >10k individuals, my cluster will occasionally kill a particular chunk. Should this happen it is safe to run the submit script in its entirety again at the end - it will not overwrite anything that is already completed, and only those chunks that are incomplete will continue running.
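One common way to get this kind of safe re-run behaviour is a completion marker per chunk: the marker is written only when the command succeeds, and a chunk whose marker exists is skipped. The sketch below shows the pattern; it is an assumption about the general technique, not a copy of what the submit script does, and `run_once` and the marker naming are hypothetical.

```shell
# Run a command unless its completion marker already exists.
# Arguments: marker file, then the command and its arguments.
# The marker is only created if the command exits successfully, so a chunk
# that was killed part-way through is picked up again on the next run.
run_once() {
    local marker=$1
    shift
    if [ -e "$marker" ]; then
        echo "skip: $marker present"
        return 0
    fi
    "$@" && touch "$marker"
}
```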

This script will perform the imputation using impute2, and then convert the dosage output to best-guess genotypes in binary plink format.
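To show what "best-guess" means, the sketch below reads impute2 genotype probabilities from standard input and, for each individual, calls the genotype with the highest probability, or a missing call if none reaches a threshold. The 0.9 default is a placeholder of mine, the output layout is simplified for illustration, and the pipeline may well use a different tool for this conversion.

```shell
# Convert impute2 genotype probabilities (stdin) into best-guess calls.
# Input columns: SNP ID, rs ID, position, allele A, allele B, then three
# probabilities (AA, AB, BB) per individual.
# Output: rs ID followed by one call per individual; "00" marks a call
# whose best probability is below the threshold.
# Argument: threshold (assumed default 0.9).
best_guess() {
    awk -v thr="${1:-0.9}" '{
        line = $2
        for (i = 6; i + 2 <= NF; i += 3) {
            call = "00"; best = 0
            if ($i >= thr && $i > best)         { call = $4 $4; best = $i }
            if ($(i+1) >= thr && $(i+1) > best) { call = $4 $5; best = $(i+1) }
            if ($(i+2) >= thr && $(i+2) > best) { call = $5 $5; best = $(i+2) }
            line = line " " call
        }
        print line
    }'
}
```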

Again, to test that it is working you can simply run the submit script in the front end for a particular chunk of the chromosome, e.g.

cd data/imputed/chr22
./submit_imp.sh 4

will run the 4th 5Mb chunk of chromosome 22.

Output

The outputs from this script will be imputed dosages, haplotypes and best-guess genotypes in chromosomes broken into 5Mb chunks. These will be located in data/imputed.


7. Stitching the imputation chunks into whole chromosomes

This will stitch together the 5Mb chunks for each chromosome:

qsub stitch_plink.sh

Again, a single chromosome can be run on the front end:

./stitch_plink.sh <chr>
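For the best-guess genotypes, stitching amounts to merging the per-chunk plink filesets in chunk order. One way to prepare that is a merge list for plink's --merge-list option, as sketched below. The chunk file naming (chunk_1.bed, chunk_2.bed, ...) and the function name are hypothetical, `sort -V` assumes a sort with version ordering (e.g. GNU coreutils), and stitch_plink.sh may do this differently.

```shell
# Print a plink merge list (bed, bim and fam path per line) for every
# fileset in a directory, ordered so that chunk_2 comes before chunk_10.
make_merge_list() {
    local dir=$1 f
    for f in "$dir"/*.bed; do
        [ -e "$f" ] || continue
        printf '%s\n' "${f%.bed}"
    done | sort -V | awk '{ print $0 ".bed", $0 ".bim", $0 ".fam" }'
}
```

The output can be redirected to a file and handed to plink with --merge-list.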

Output

Imputed data for entire chromosomes in:

  • Dosages (impute2 format)
  • Haplotypes (impute2 format)
  • Best-guess genotypes (binary plink format)
  • impute2 info files

8. Optional - HLA Imputation

This will run HLA imputation on the genotype data for chromosome 6

qsub hla.sh

The data can be found in

data/imputation/hla/

9. Filtering

The final stage is to filter on MAF and HWE. The thresholds can be amended in the parameter.sh file.

qsub filter.sh

or

./filter.sh <chr>

Output

Best-guess genotypes (in binary plink format) for each chromosome.

Disclaimer

This pipeline works for me. I use it regularly, and I thought it was a good idea to share it given that I am using so much stuff that has been shared by others.

I have never tried it on another cluster, and I imagine that some of the parameters will have to be customised for different cluster setups.

It is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
