2. Using and creating functional annotations
You can either download existing functional annotation files, or create your own:
We provide functional annotations for ~19 million UK Biobank imputed SNPs with MAF>0.1%, based on the baseline-LF 2.2.UKB annotations (WARNING: this is a large download, requiring 30GB). This is a broad set of coding, conserved, regulatory and LD-related annotations, based on Gazal et al. 2018 Nat Genet and described in Supplementary Table 1 of Weissbrod et al. 2020 Nat Genet.
You can easily create your own annotations. For each chromosome you need to create two or three files:
- An annotations file containing columns for CHR, BP, SNP, A1, A2 and arbitrary other columns representing your annotations. These files can be either .parquet or .gz files (we recommend using .parquet files).
- A file with extension .l2.M containing a single whitespace-delimited line with the sum of each annotation column.
- (optional) A file with extension .l2.M_5_50 that is similar to the .l2.M file but considers only common SNPs (with MAF between 5% and 50%). By default PolyFun will not use these files, but you can use them when running S-LDSC to estimate enrichment of common SNP heritability.
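As a minimal sketch of the file layout described above, the following Python snippet builds a toy annotations table and derives the line that would go into the matching .l2.M and .l2.M_5_50 files. The SNP values, the annotation name `my_annot`, the `maf` values, the `my_annotations` file prefix and the helper `m_line` are all made up for illustration, and treating a MAF of exactly 5% as common is an assumption.

```python
import pandas as pd

# Columns that describe the SNP itself; everything else is an annotation.
META_COLUMNS = ["CHR", "BP", "SNP", "A1", "A2"]

def m_line(df, keep=None):
    """Return one whitespace-delimited line with the sum of each annotation column.

    keep: optional boolean mask selecting the SNPs to include
          (e.g. common SNPs for the .l2.M_5_50 file).
    """
    annot = df.drop(columns=META_COLUMNS)
    if keep is not None:
        annot = annot[keep]
    return " ".join(str(v) for v in annot.sum(axis=0))

# Toy annotations table (hypothetical SNPs and annotations).
df = pd.DataFrame({
    "CHR": [22, 22, 22, 22],
    "BP": [100, 200, 300, 400],
    "SNP": ["rs1", "rs2", "rs3", "rs4"],
    "A1": ["G", "A", "G", "T"],
    "A2": ["A", "G", "A", "C"],
    "my_annot": [1, 0, 1, 0],
    "base": [1, 1, 1, 1],
})

# Hypothetical minor allele frequencies, one per SNP (not part of the annotations file).
maf = pd.Series([0.20, 0.01, 0.03, 0.40])

print(m_line(df))                    # 2 4  (contents of the .l2.M file)
print(m_line(df, keep=maf >= 0.05))  # 1 2  (contents of the .l2.M_5_50 file)

# To write the files (to_parquet needs a parquet engine such as pyarrow):
# df.to_parquet("my_annotations.22.annot.parquet")
# with open("my_annotations.22.l2.M", "w") as f:
#     f.write(m_line(df) + "\n")
```

The sums are computed from the same table that is written to disk, which keeps the annotations file and its .l2.M file consistent.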
To see an example annotations file, type the following commands from within python:
import pandas as pd
df = pd.read_parquet('example_data/annotations.22.annot.parquet')
print(df.head())
The output should be:
SNP CHR BP A1 A2 ... Conserved_LindbladToh_common Conserved_LindbladToh_lowfreq Repressed_Hoffman_common Repressed_Hoffman_lowfreq base
0 rs139069276 22 16866502 G A ... 0 0 1 0 1
1 rs34747326 22 16870173 A G ... 0 0 0 1 1
2 rs4010550 22 16900134 G A ... 0 0 0 0 1
3 rs5994099 22 16905044 G A ... 0 0 0 1 1
4 rs59750689 22 16936369 T C ... 0 0 1 0 1
The corresponding .l2.M file is example_data/annotations.22.l2.M. You can view it from the shell by typing:
cat example_data/annotations.22.l2.M
The output is:
25 34 17 34 305 401 2511
Hence, the sum of the annotation Coding_UCSC_common is 25, the sum of the annotation Coding_UCSC_lowfreq is 34, etc.
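To label every value in an .l2.M file rather than reading it off by position, one possible approach is to pair the values with the annotation column names, as sketched below. The helper `label_sums` is hypothetical (it is not part of PolyFun), and the sketch assumes that the values in the .l2.M file appear in the same order as the annotation columns of the annotations file.

```python
# Columns that describe the SNP itself; everything else is an annotation.
META_COLUMNS = {"CHR", "BP", "SNP", "A1", "A2"}

def label_sums(columns, m_text):
    """Pair annotation column names with the values found in a .l2.M file.

    columns: column names of the annotations file, in file order
    m_text:  the text content of the .l2.M file (one whitespace-delimited line)
    """
    names = [c for c in columns if c not in META_COLUMNS]
    values = [float(v) for v in m_text.split()]
    if len(names) != len(values):
        raise ValueError(
            f"{len(names)} annotation columns but {len(values)} values in the .l2.M file"
        )
    return dict(zip(names, values))

# Toy demonstration with made-up column names and sums.
demo = label_sums(["SNP", "CHR", "BP", "A1", "A2", "annot_x", "base"], "3 10")
print(demo)  # {'annot_x': 3.0, 'base': 10.0}

# With the example data (requires pandas and a parquet engine):
# import pandas as pd
# df = pd.read_parquet('example_data/annotations.22.annot.parquet')
# with open('example_data/annotations.22.l2.M') as f:
#     print(label_sums(df.columns, f.read()))
```

The length check catches the case where an annotation column was added or removed without regenerating the .l2.M file.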
After creating annotation files, you should compute LD-scores for these annotations (one file for each chromosome). We provide two options to do this:
1. Using a reference panel of sequenced individuals (e.g. UK10K)
2. Using pre-computed UK Biobank LD matrices that we provide.
Option 1 is generally preferable if you have a large reference panel of >3,000 sequenced individuals from the target population (e.g. UK10K). Otherwise, we recommend using option 2 (with the caveat that it's based on imputed rather than sequenced genotypes). We caution that the 1000 Genomes resource is generally too small for PolyFun purposes when restricted to a specific population (e.g. Europeans).
You can create LD-scores with a reference panel by using the script compute_ldscores.py. Here is a usage example for chromosome 1:
mkdir -p output
python compute_ldscores.py \
--bfile example_data/reference.1 \
--annot example_data/annotations.1.annot.parquet \
--out output/ldscores_example.parquet
Here, --bfile is the prefix of a plink .bed file of a reference panel with chromosome 1 SNPs, --annot is the name of an annotations file, and --out is the name of an output file. The script also accepts a --keep <keep file> parameter to use a subset of individuals for faster computation. This script accepts annotations in either .parquet or .gz format (parquet is much faster). Please note that you can also use S-LDSC to compute LD-scores. However, S-LDSC requires Python 2 and does not use the columns A1, A2 in the LD-score and annotation files.
You can also use the same script to create weight files (by omitting the --annot flag).
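Because LD-scores are needed for each chromosome, the command above has to be repeated once per chromosome. A minimal sketch of one way to do that from Python follows. The path templates are assumptions modeled on the example file names (the example data may not include every chromosome), the chromosome range 1 to 22 and the per-chromosome output names are assumptions, and `build_command` and `run_all` are hypothetical helpers. Only the --bfile, --annot and --out flags come from the text above.

```python
import subprocess
import sys

def build_command(chrom, bfile_tmpl, annot_tmpl, out_tmpl):
    """Build the compute_ldscores.py command line for one chromosome."""
    return [
        sys.executable, "compute_ldscores.py",
        "--bfile", bfile_tmpl.format(chrom=chrom),
        "--annot", annot_tmpl.format(chrom=chrom),
        "--out", out_tmpl.format(chrom=chrom),
    ]

def run_all(chromosomes, dry_run=True, **templates):
    """Print (dry_run=True) or execute the command for every chromosome."""
    commands = [build_command(c, **templates) for c in chromosomes]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first chromosome that fails.
            subprocess.run(cmd, check=True)
    return commands

# Hypothetical path templates; adjust them to your own files.
commands = run_all(
    range(1, 23),
    dry_run=True,
    bfile_tmpl="example_data/reference.{chrom}",
    annot_tmpl="example_data/annotations.{chrom}.annot.parquet",
    out_tmpl="output/ldscores_example.{chrom}.parquet",
)
```

Running with `dry_run=True` first makes it easy to confirm the generated paths before committing to a long computation.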
You can create LD-scores with pre-computed UK Biobank LD matrices by using the script compute_ldscores_from_ld.py. Here is a usage example for chromosome 1:
mkdir -p output
python compute_ldscores_from_ld.py \
--annot example_data/annotations.1.annot.parquet \
--ukb \
--out output/ldscores_example2.parquet
Here, --annot is an annotations file, and --out is the name of an output file. The script automatically downloads pre-computed UK Biobank LD matrices as required. It also accepts a --ld-dir <LD directory> parameter that causes it to store downloaded LD matrices in the specified directory (otherwise it stores them in a temporary directory that may be deleted after the script finishes). This script accepts annotations in either .parquet or .gz format (parquet is much faster).
You can create LD-scores with your own pre-computed LD matrices (in .bcor format) by using the script compute_ldscores_from_ld.py. Here is a usage example for chromosome 1:
mkdir -p output
python compute_ldscores_from_ld.py \
--annot example_data/annotations.1.annot.parquet \
--out output/ldscores_example3.parquet \
--n 10000 \
bcor_files/*.bcor
Here, --n is the size of the sample used to compute LD in the .bcor files, and the last argument specifies a list of .bcor files to use. These should be in BCOR 1.1 format, the native format of LDstore 2.0, and they should ideally cover all of chromosome 1.
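For intuition about what an annotation-stratified LD-score is, the sketch below computes LD-scores from a small dense LD matrix with NumPy, using the standard S-LDSC definition: the LD-score of SNP j for annotation c is the sum over SNPs k of r^2(j,k) times the value of annotation c at SNP k. The optional correction r^2 - (1 - r^2)/(n - 2) is the sample-size adjustment commonly used in LD score regression, which is one reason a sample size is needed alongside an LD matrix. Whether the PolyFun scripts apply exactly this formula is an assumption, the sketch ignores any LD window, and the function name and toy values are made up for illustration.

```python
import numpy as np

def ld_scores(R, annotations, n=None):
    """Annotation-stratified LD-scores from a dense LD (correlation) matrix.

    R:           (m, m) matrix of pairwise SNP correlations r
    annotations: (m, c) matrix of annotation values, one column per annotation
    n:           optional LD sample size; if given, apply the adjustment
                 r^2 - (1 - r^2) / (n - 2) to each squared correlation
    """
    R = np.asarray(R, dtype=float)
    A = np.asarray(annotations, dtype=float)
    r2 = R ** 2
    if n is not None:
        r2 = r2 - (1.0 - r2) / (n - 2)
    # Row j, column c: sum over k of r2[j, k] * A[k, c]
    return r2 @ A

# Toy example: 3 SNPs, 2 annotations (hypothetical values).
R = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.5],
              [0.0, 0.5, 1.0]])
A = np.array([[1, 1],
              [0, 1],
              [1, 0]])

scores = ld_scores(R, A)
print(scores)  # rows: [1.0, 1.25], [0.5, 1.25], [1.0, 0.25]
```

Passing `n` lowers the contribution of weakly correlated SNP pairs, whose sample r^2 is biased upward in small reference panels.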