A pipeline to impute human SNP data to 1000 genomes efficiently by parallelising across a compute cluster.
Provided here is simply a collection of scripts in `bash` and `R` that knit together a two-stage imputation process:

- Stage one uses `hapi-ur` to haplotype the target data (Williams AL, Patterson N, Glessner J, Hakonarson H, and Reich D. Phasing of Many Thousands of Genotyped Samples. American Journal of Human Genetics 2012;91(2):238-251).
- Stage two uses `impute2` to impute to the 1000 genomes reference (Howie BN, Donnelly P, and Marchini J. A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies. PLoS Genetics 2009;5(6):e1000529).
It performs the following processes:
- Alignment of target data to reference data
- Haplotyping
- Imputing
- Converting to best-guess genotypes
- Filtering steps
Runtime is fast thanks to some great software that is freely available (see below). With around 400 cores available, a dataset of e.g. 1000 individuals completes in a few hours, and runtime scales roughly linearly with sample size.
To run the pipeline you will need:

- A compute cluster using `SGE`
- Your genotype data (22 autosomes), QC'd, in binary `plink` format, referred to as the target dataset
- The downloads (including a reference dataset) listed in the instructions below
- `R`, `awk`, `bash`, `git`, a text editor, etc.
The pipeline produces:

- Haplotyped target and imputed data in `impute2` format
- Dosage imputed data in `impute2` format
- 'Best-guess' imputed data in binary `plink` format
- 'Best-guess' imputed data, filtered for MAF and HWE, in binary `plink` format
- Optional SNP2HLA imputation
Imputation is a big, slow, ugly, long-winded, hand-wavey, unpleasant process. In setting up this pipeline I have used plenty of scripts, programmes, and data found in various corners of the internet, and these have made the whole task much, much easier. Most of these resources have been used without permission from the original authors. If any of the authors are angry about this then let me know and I will take it down!
Here is a list of resources that I have used:
- `hapi-ur`, developed by Amy Williams
- `impute2`, developed by Bryan Howie
- `plink`, developed by Shaun Purcell
- Parts of the `GermLine` software, developed by Itsik Pe'er
- Some of the `BEAGLE` utilities, written by Brian and Sharon Browning
- `liftOver`, developed at UCSC
- Reference data hosted by the developers of `impute2`
- Strand alignment data files, produced and hosted by Will Rayner
- Strand alignment script developed by Neil Robertson
- The `plyr` library in `R`, written by Hadley Wickham
- Help and guidelines from the `MaCH` and `minimac` Imputation Cookbook, developed by Goncalo Abecasis
The pipeline was developed by Gibran Hemani under the Complex Trait Genomics Group at the University of Queensland (Diamantina Institute and Queensland Brain Institute). Valuable help was provided by members of Peter Visscher's and Naomi Wray's groups, and by Paul Leo from Matt Brown's lab.
- First, clone this repository:

```
git clone https://github.com/CNSGenomics/impute-pipe.git
```

- Then `cp` your raw data in binary `plink` format to `data/target`
- Download the strand alignment files for your data's chip from Will Rayner's page and unzip.
- Download the chain file for the SNP chip's build from UCSC (most likely you will need HG18 to HG19, which is included in this repo).
- Download and unarchive the reference data from the impute2 website, e.g.

```
wget http://mathgen.stats.ox.ac.uk/impute/ALL_1000G_phase1integrated_v3_impute.tgz
tar xzvf ALL_1000G_phase1integrated_v3_impute.tgz
```
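Put together, a complete setup might look something like this (the strand file name and data prefix below are illustrative only; use the files that match your own chip and genome build):

```bash
git clone https://github.com/CNSGenomics/impute-pipe.git
cd impute-pipe

# Copy your QC'd target data (binary plink format) into place
cp /path/to/mystudy.bed /path/to/mystudy.bim /path/to/mystudy.fam data/target/

# Strand alignment files for your chip (file name illustrative)
wget https://www.well.ox.ac.uk/~wrayner/strand/HumanOmniExpress-12v1_A-b37-strand.zip
unzip HumanOmniExpress-12v1_A-b37-strand.zip

# 1000 genomes reference panel from the impute2 site
wget http://mathgen.stats.ox.ac.uk/impute/ALL_1000G_phase1integrated_v3_impute.tgz
tar xzvf ALL_1000G_phase1integrated_v3_impute.tgz
```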
The `parameters.sh` file has all the options required for the imputation process. It should be fairly self-explanatory; just change the file names and options that are listed in the section marked `To be edited by user`.
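For illustration, the user-editable block contains entries along these lines (the variable names and values here are hypothetical; the authoritative list is in `parameters.sh` itself):

```bash
# Hypothetical illustration only -- the real variable names may differ
bfile="mystudy"        # prefix of the binary plink files in data/target
chunk_size=5000000     # chromosomes are imputed in 5Mb chunks
chunk_overlap=250000   # with 250kb overlap between adjacent chunks
maf=0.01               # minor allele frequency threshold for filtering
hwe=1e-6               # Hardy-Weinberg p-value threshold for filtering
```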
In order to determine whether imputation will result in high-quality data, it is important to perform a few quality control checks before proceeding. The QC process is performed by executing:

```
./check_strand.sh
```

This script will generate some allele frequency plots, and strand information stats that will help inform whether or not to proceed with imputation with the data as is. In particular, the output generated in `data/qc/` consists of:
- Allele frequency plots of the reference dataset against the target dataset, before and after strand alignment (see section 3 below for information on strand alignment).
- An allele frequency information file for each chromosome, indicating the major allele and its frequency in the target dataset, before and after strand alignment, as well as in the reference dataset.
- A strand summary file for the whole genome, indicating the percent concordance between major alleles in the target and reference datasets, before and after strand alignment, as well as the number of ambiguous 'palindromic' SNPs (i.e. those SNPs with alleles either `AT` or `CG`).
- A plaintext list of palindromic SNPs, for exclusion using PLINK.
The main objective of these QC checks is to identify issues that may arise in the following strand alignment step. For example: data that have previously been aligned to the reference will not require any allelic flipping, and using the wrong strand file to align the data will result in poor concordance with the reference dataset. Both of these issues can be identified by viewing the allele frequency plots and examining the strand summary file.
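As a quick cross-check of the QC output, you can count the potentially ambiguous palindromic SNPs directly from the target `.bim` file (the file prefix here is illustrative; columns 5 and 6 of a plink `.bim` file hold the two alleles):

```bash
# Count SNPs whose alleles form an A/T or C/G pair
awk '($5$6=="AT" || $5$6=="TA" || $5$6=="CG" || $5$6=="GC")' data/target/mystudy.bim | wc -l
```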
If the output meets your expectations, proceed to execute:

```
./filter_pre.sh
```

This creates a filtered dataset for imputation (filtered on the identified palindromic SNPs, missingness per individual, missingness per marker, Hardy-Weinberg equilibrium, and minor allele frequency, as set in `parameters.sh`).
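Under the hood, this amounts to plink filtering along the following lines (a minimal sketch with illustrative thresholds and file names; the real values come from `parameters.sh`):

```bash
# Exclude palindromic SNPs and filter on per-individual missingness (--mind),
# per-marker missingness (--geno), HWE and MAF; thresholds illustrative only
plink --bfile data/target/mystudy \
      --exclude data/qc/palindromic_snps.txt \
      --mind 0.05 --geno 0.05 --hwe 1e-6 --maf 0.01 \
      --make-bed --out data/target/mystudy_filtered
```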
This is a two-step process.

- First, convert all alleles to be on the forward strand:

```
./strand_align.sh
```

- Second, convert the map to hg19, update SNP names and positions, remove SNPs that are not present in the reference data, and split the data into separate chromosomes. By running

```
qsub ref_align.sh
```

the script will be submitted to `SGE` to execute on all chromosomes in parallel. Alternatively you can run

```
./ref_align.sh <chr>
```

and the script will only work on chromosome `<chr>`.
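For example, on a machine without an SGE queue you could run the reference alignment for all autosomes serially in the front end:

```bash
# Process each autosome in turn rather than in parallel via SGE
for chr in $(seq 1 22); do
    ./ref_align.sh $chr
done
```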
The output from this will be binary `plink` files for each chromosome, located in the `data/target` directory.
This uses Amy Williams' excellent haplotyping programme `hapi-ur`. We perform the haplotyping three times on each chromosome:

```
qsub hap.sh
```

and then vote on the most common outcome at each position to make a final haplotype:

```
qsub imp.sh
```

This also creates a new `SGE` submit script for each chromosome, where each chromosome has been split into 5Mb chunks with 250kb overlapping regions (these options can be amended in the `parameters.sh` file).
Both scripts can also be run on a specified chromosome in the front end:

```
./hap.sh <chr>
./imp.sh <chr>
```

which might be useful for testing that everything is working.
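For example, a quick sanity check on the smallest autosome before committing the whole genome to the queue:

```bash
# Haplotype and vote on chromosome 22 only, in the front end
./hap.sh 22
./imp.sh 22
```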
The output from this will be three haplotype file sets for each chromosome, as well as a final, democratically elected (!) file set in `impute2` format, located in the `data/haplotypes` directory.
This is most likely the lengthiest and most memory-demanding stage. By running

```
./submit_imp.sh
```

the scripts spawned for each chromosome in the last step will be submitted to `SGE`.
With large sample sizes (e.g. >10k individuals) my cluster will occasionally kill a particular chunk. Should this happen, it is safe to run the submit script in its entirety again at the end - it will not overwrite anything that is already completed, and only those chunks that are incomplete will continue running.
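A rough way to spot unfinished chunks before resubmitting (the output file pattern here is an assumption; check the actual names in your `data/imputed/chrN` directories):

```bash
# Count completed chunk outputs per chromosome (file pattern assumed)
for chr in $(seq 1 22); do
    echo -n "chr${chr}: "
    ls data/imputed/chr${chr}/*.impute2 2>/dev/null | wc -l
done
```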
This script will perform the imputation using `impute2`, and then convert the dosage output to best-guess genotypes in binary `plink` format.
Again, to test that it is working you can simply run the submit script in the front end for a particular chunk of a chromosome, e.g.

```
cd data/imputed/chr22
./submit_imp.sh 4
```

will run the 4th 5Mb chunk of chromosome 22.
The outputs from this script will be imputed dosages, haplotypes, and best-guess genotypes for each chromosome, broken into 5Mb chunks. These will be located in `data/imputed`.
This will stitch together the 5Mb chunks for each chromosome:

```
qsub stitch_plink.sh
```

Again, a single chromosome can be executed in the front end by running:

```
./stitch_plink.sh <chr>
```
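If you want a single genome-wide fileset rather than per-chromosome files, the stitched best-guess genotypes can be merged with plink (the file prefixes here are assumptions; adjust them to the names actually produced in `data/imputed`):

```bash
# Build a merge list of chromosomes 2-22 (bed/bim/fam triplets per line)
for chr in $(seq 2 22); do
    p=data/imputed/chr${chr}/chr${chr}
    echo "${p}.bed ${p}.bim ${p}.fam"
done > merge_list.txt

# Merge them onto chromosome 1 to produce one genome-wide fileset
plink --bfile data/imputed/chr1/chr1 --merge-list merge_list.txt \
      --make-bed --out data/imputed/allchr
```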
This will run HLA imputation on the genotype data for chromosome 6:

```
qsub hla.sh
```

The data can be found in `data/imputation/hla/`.
Imputed data for entire chromosomes in:

- Dosages (`impute2` format)
- Haplotypes (`impute2` format)
- Best-guess genotypes (binary `plink` format)
- `impute2` info files
The final stage is to filter on MAF and HWE. The thresholds can be amended in the `parameters.sh` file. Run:

```
qsub filter.sh
```

or

```
./filter.sh <chr>
```
Best-guess genotypes (in binary `plink` format) for each chromosome.
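Beyond MAF and HWE, you may also want to filter on imputation quality using the `impute2` info files; for example (column positions assumed from the standard `impute2` `_info` header, where info is column 5 - verify against your own files):

```bash
# Keep rsIDs of SNPs with info score >= 0.8 (adjust columns if your header differs)
awk 'NR > 1 && $5 >= 0.8 {print $2}' chr22.impute2_info > chr22_goodsnps.txt
```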
This pipeline works for me. I use it regularly, and I thought it was a good idea to share it given that I am using so much stuff that has been shared by others.
I have never tried it on another cluster, and I imagine that some of the parameters will have to be customised for different cluster setups.
It is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.