6. Trans ethnic polygenic risk prediction with PolyPred

PolyPred leverages fine-mapping to improve trans-ethnic polygenic risk scores, by estimating causal rather than tagging SNP effect sizes. PolyPred consists of four stages:

Genome-wide fine-mapping using PolyFun (which also estimates causal effect sizes)
Estimating tagging SNP effect sizes using another method (e.g. BOLT-LMM or SBayesR).
Linearly combining the effect sizes of PolyFun and the other method, using a small training set from the target cohort
Computing PRS in the target population using the combined SNP effect sizes.

We now explain each step in detail.

1. Genome-wide fine-mapping using PolyFun

This step performs fine-mapping in each locus in the genome, using PolyFun. Please read the PolyFun documentation for details. Here is a brief end-to-end example that you can try out:

mkdir -p output

#created munged sumstats
python munge_polyfun_sumstats.py \
    --sumstats polypred_example/bolt.sumstats.txt.gz \
    --out output/bolt.sumstats.munged.parquet \
    --n 337491

#create fine-mapping jobs
python create_finemapper_jobs.py \
    --sumstats output/bolt.sumstats.munged.parquet \
    --n 337491 \
    --method susie \
    --non-funct \
    --max-num-causal 2 \
    --out-prefix output/polyfun_output \
    --jobs-file output/jobs.txt 
    
#run each and every fine-mapping job (could take several hours or more)
bash output/jobs.txt 

#aggregate all of the results
python aggregate_finemapper_results.py \
    --out-prefix output/polyfun_output \
    --sumstats output/bolt.sumstats.munged.parquet \
    --out output/polyfun_output.agg.txt.gz \
    --adjust-beta-freq

The first step creates munged sumstats; the second step creates fine-mapping job commands (one per locus); The third step runs each and every fine-mapping job; and the last step aggregates the results. The third step could take a long time. Users with access to a computer cluster can parallelize this step by invoking each step (each line in output/jobs.txt ) as a separate job. Please read the PolyFun documentation for further details.

Please note that we specified the flags --max-num-causal 2 and -non-funct in step 2. The first flag tells PolyFun to assume only two causal SNPs per locus, and the second tells PolyFun to perform non-functionally-informed fine-mapping. In both cases, we specified these flags to speed up the analysis. In real data analysis, we recommend omitting these flags (or setting --max-num-causal 5 or a larger number) to maximize fine-mapping power. The flag --adjust-beta-freq adjusts the posterior effect size estimates to be on a per-causal-allele scale rather than a per-standardized-genotype scale. Also, please note that A1 is considered the effect allele throughout the PolyFun code.

2. Estimating tagging SNP effect sizes using another method

This step is method-specific. Please read the documentation of the method you'd like to use for details. For completeness, we provide here the command line arguments required for estimating tagging SNP effect sizes using BOLT-LMM:

bolt_v2.3.4 \
--bed=[plink_prefix].{1:22}.bed \
--bim=[plink_prefix].{1:22}.bed \
--fam=[fam_file] \
--remove=[file with individuals to remove] \
--covarFile=[covariates file] \
--covarCol=[discrete covariate name 1] --covarCol=[discrete covariate name 2] --covarCol=... \
--qcovarCol=[quantitative covariate name 1] --qcovarCol=[quantitative covariate name 2] --qcovarCol ... \
--phenoFile=[pheotype file] \
--phenoCol=[phenotype column] \
--LDscoresFile=[path to BOLT-LMM]/tables/LDSCORE.1000G_EUR.tab.gz \
--LDscoresMatchBp \
--lmmForceNonInf \
--numThreads=6 \
--verboseStats \
--covarMaxLevels=30 \
--maxModelSnps=1500000 \
--statsFile=[output summary statistics file] \
--predBetasFile=[output tagging effect sizes file]

Please note that you can find all of the BOLT-LMM arguments by typing bolt_v2.3.4 --helpFull

3. Linearly combining the effect sizes of PolyFun and the other method

This step finds the best linear combination of SNP effect sizes of several methods (e.g. PolyFun and BOLT-LMM) using non-negative least squares. Here is a use example:

python polypred.py \
    --estimate-mixweights \
    --betas polypred_example/bolt.betas.gz,output/polyfun_output.agg.txt.gz \
    --pheno polypred_example/1000G.pheno.txt \
    --output-prefix output/polypred \
    --plink-exe ~/plink/plink \
    polypred_example/1000G.subset.*.bed

notes and comments:

the argument --betas accepts a comma-separated list of files with SNP effect sizes.
The argument --pheno is a file of phenotypes of training individuals in the target population
The argument --plink-exe is a plink 1.9 executable file
The last argument is a list of plink files that will be analyzed. Typically there is one file per chromosome, but the code doesn't make this assumption. The example files here are a subset of European individuals in the 1000 genomes cohort.
The code can also work with plink 2 .bpfiles (i.e., files with .pgen extension instead of .bed extension, which can store imputed SNP data). In this case please specify the flag --plink2-exe instead of --plink-exe
The script prints the mixing weights of each method into the file output/polypred.mixweights
polypred_example/bolt.betas.gz contains a subset of the BOLT-LMM SNP effect sizes for height, estimated using 337K British-ancestry individuals in the UK Biobank.
1000G.pheno.txt uses synthetic data only and has nothing to do with actual 1000 genomes phenotypes.

4. Computing a weighted PRS in the target population using the best linear combination

This step simply computes a PRS for a target population. Here is a use example:

python polypred.py \
    --predict \
    --betas polypred_example/bolt.betas.gz,output/polyfun_output.agg.txt.gz \
    --mixweights-prefix output/polypred \
    --output-prefix output/polypred.predictions \
    --plink-exe ~/plink/plink \
    polypred_example/1000G.subset.*.bed

Here, --mixweights-prefix must be exactly the same as the value of --output-prefix in the previous command (when we ran polypred.py with --estimate-mixweights).

The script will create two output files:

output/polypred.predictions.prs: This will include the computed PRS
output/polypred.predictions.prs_jk: This will include multiple PRSs for every individual, each corresponding to a different block in a genomic block jackknife scheme (using partitions of 200 blocks spanning the same number of SNPs by default)

Provide feedback

Saved searches