
2. Using and creating functional annotations

Omer Weissbrod edited this page May 25, 2023 · 15 revisions

You can either download existing functional annotation files, or create your own:

Downloading existing functional annotation files

We provide functional annotations for ~19 million UK Biobank imputed SNPs with MAF>0.1%, based on the baseline-LF 2.2.UKB annotations (WARNING: this is a large download, requiring 30GB). This is a broad set of coding, conserved, regulatory and LD-related annotations, based on Gazal et al. 2018 Nat Genet and described in Supplementary Table 1 of Weissbrod et al. 2020 Nat Genet.



Creating your own annotations

You can easily create your own annotations. For each chromosome you need to create two or three files:

  1. An annotations file containing columns for CHR, BP, SNP, A1, A2 and arbitrary other columns representing your annotations. These files can be either .parquet or .gz files (we recommend using .parquet files).
  2. A file with extension .l2.M containing a single whitespace-delimited line with the column sum of each annotation.
  3. (optional) A file with extension .l2.M_5_50 that is similar to the .l2.M file but considers only common SNPs (with MAF between 5% and 50%). By default PolyFun will not use these files, but you can use them when running S-LDSC to estimate enrichment of common SNP heritability.
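
As a minimal sketch of how the .l2.M line can be derived from an annotations table, assuming every column other than CHR, BP, SNP, A1 and A2 is an annotation: the helper name `m_line`, the toy SNPs, and the `toy_maf` values are all hypothetical and only illustrate the column-sum logic described above.

```python
import pandas as pd

# Identifier columns named in the text; every other column is treated as an annotation.
ID_COLS = ["CHR", "BP", "SNP", "A1", "A2"]


def m_line(df, keep=None):
    """Return the single whitespace-delimited line of per-annotation sums.

    keep: optional boolean mask selecting which SNPs to count
          (e.g. common SNPs only, for the .l2.M_5_50 file).
    """
    annot_cols = [c for c in df.columns if c not in ID_COLS]
    rows = df if keep is None else df[keep]
    sums = rows[annot_cols].sum(axis=0)
    return " ".join(str(v) for v in sums.tolist())


# Toy data for illustration only (not real SNPs or annotations).
toy = pd.DataFrame({
    "CHR": [22, 22, 22],
    "BP": [100, 200, 300],
    "SNP": ["rs_a", "rs_b", "rs_c"],
    "A1": ["G", "A", "T"],
    "A2": ["A", "G", "C"],
    "my_annotation": [1, 0, 1],
    "base": [1, 1, 1],
})
# Hypothetical MAF values; the annotations file itself has no MAF column.
toy_maf = pd.Series([0.20, 0.01, 0.30])

print(m_line(toy))                        # 2 3
print(m_line(toy, keep=toy_maf >= 0.05))  # 2 2
```

To produce the actual files, you could write the returned string (plus a newline) to a file with the .l2.M extension and save the table with `df.to_parquet(...)`, which requires a parquet engine such as pyarrow.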

To see an example annotations file, run the following commands in Python:
import pandas as pd
df = pd.read_parquet('example_data/annotations.22.annot.parquet')
print(df.head())

The output should be:

           SNP  CHR        BP A1 A2  ...  Conserved_LindbladToh_common  Conserved_LindbladToh_lowfreq  Repressed_Hoffman_common  Repressed_Hoffman_lowfreq  base
0  rs139069276   22  16866502  G  A  ...                             0                              0                         1                          0     1
1   rs34747326   22  16870173  A  G  ...                             0                              0                         0                          1     1
2    rs4010550   22  16900134  G  A  ...                             0                              0                         0                          0     1
3    rs5994099   22  16905044  G  A  ...                             0                              0                         0                          1     1
4   rs59750689   22  16936369  T  C  ...                             0                              0                         1                          0     1

The corresponding .l2.M file is example_data/annotations.22.l2.M. You can see it from the shell by typing:

cat example_data/annotations.22.l2.M

The output is:

25 34 17 34 305 401 2511

Hence, the sum of the annotation Coding_UCSC_common is 25, the sum of the annotation Coding_UCSC_lowfreq is 34, etc.
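
The mapping between the values in the .l2.M line and the annotation columns is positional. The sketch below makes that pairing explicit and raises an error when the counts disagree. The helper name `label_m_values` is hypothetical; the two annotation names and values in the demonstration are the ones quoted above.

```python
def label_m_values(annotation_names, m_text):
    """Pair each value from an .l2.M line with its annotation column name."""
    values = [float(v) for v in m_text.split()]
    if len(values) != len(annotation_names):
        raise ValueError(
            f"{len(values)} values in .l2.M but "
            f"{len(annotation_names)} annotation columns"
        )
    return dict(zip(annotation_names, values))


# The two names and two values below are the ones quoted in the text.
print(label_m_values(["Coding_UCSC_common", "Coding_UCSC_lowfreq"], "25 34"))
# {'Coding_UCSC_common': 25.0, 'Coding_UCSC_lowfreq': 34.0}
```

For a real file, `annotation_names` would be the annotation columns of the DataFrame (all columns except CHR, BP, SNP, A1 and A2, in order) and `m_text` the contents of the matching .l2.M file.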



Computing LD-scores for annotations

After creating annotation files, you should compute LD-scores for these annotations (one file for each chromosome). We provide two options to do this:

  1. Using a reference panel of sequenced individuals (e.g. UK10K)
  2. Using pre-computed UK Biobank LD matrices that we provide.

Option 1 is generally preferable if you have a large reference panel of >3,000 sequenced individuals from the target population (e.g. UK10K). Otherwise, we recommend using option 2 (with the caveat that it's based on imputed rather than sequenced genotypes). We caution that the 1000 Genomes resource is generally too small for PolyFun purposes when restricted to a specific population (e.g. Europeans).

Computing LD-scores with a reference panel

You can create LD-scores with a reference panel by using the script compute_ldscores.py. Here is a usage example for chromosome 1:

mkdir -p output

python compute_ldscores.py \
  --bfile example_data/reference.1 \
  --annot example_data/annotations.1.annot.parquet \
  --out output/ldscores_example.parquet

Here, --bfile is the prefix of the PLINK files (.bed/.bim/.fam) of a reference panel with chromosome 1 SNPs, --annot is the name of an annotations file, and --out is the name of an output file. The script also accepts a --keep <keep file> parameter that restricts the computation to a subset of individuals, which speeds it up. The script accepts annotations in either .parquet or .gz format (parquet is much faster). Note that you can also use S-LDSC to compute LD-scores. However, S-LDSC requires Python 2 and does not use the A1 and A2 columns in the LD-score and annotation files.

You can also use the same script to create weight files, by omitting the --annot flag.
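
Because LD-scores are needed for each chromosome, it can help to generate the command once per chromosome. The following is a minimal sketch, not part of PolyFun: the helper names `ldscore_command` and `run_all` and the per-chromosome file name templates are assumptions extrapolated from the chromosome 1 example. Only the --bfile, --annot and --out flags described above are used, and passing `annot=False` drops --annot to produce weight files.

```python
import subprocess


def ldscore_command(chrom, annot=True,
                    bfile_tmpl="example_data/reference.{chrom}",
                    annot_tmpl="example_data/annotations.{chrom}.annot.parquet",
                    out_tmpl="output/ldscores.{chrom}.parquet"):
    """Build the compute_ldscores.py argument list for one chromosome.

    The file name templates are hypothetical; adjust them to your own layout.
    """
    cmd = ["python", "compute_ldscores.py",
           "--bfile", bfile_tmpl.format(chrom=chrom)]
    if annot:
        # Omitting --annot makes the script produce weight files instead.
        cmd += ["--annot", annot_tmpl.format(chrom=chrom)]
    cmd += ["--out", out_tmpl.format(chrom=chrom)]
    return cmd


def run_all(chroms=range(1, 23), dry_run=True, **kwargs):
    """Print (dry run) or execute the command for each chromosome."""
    for chrom in chroms:
        cmd = ldscore_command(chrom, **kwargs)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stops at the first failure


run_all(chroms=[1, 2])  # dry run: prints the two commands without executing them
```

When creating weight files, pass a different `out_tmpl` so the weights do not overwrite the annotation LD-scores.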

Computing LD-scores with pre-computed UK Biobank LD matrices

You can create LD-scores with pre-computed UK Biobank LD matrices by using the script compute_ldscores_from_ld.py. Here is a usage example for chromosome 1:

mkdir -p output

python compute_ldscores_from_ld.py \
  --annot example_data/annotations.1.annot.parquet \
  --ukb \
  --out output/ldscores_example2.parquet

Here, --annot is an annotations file, and --out is the name of an output file. The script will automatically download pre-computed UK Biobank LD matrices as required. The script also accepts a --ld-dir <LD directory> parameter that causes it to store downloaded LD matrices in the specified directory (otherwise it will store them in a temporary directory that may get deleted after it completes running). This script accepts annotations in either .parquet or .gz format (parquet is much faster).

Computing LD-scores with your own pre-computed LD matrices

You can create LD-scores with your own pre-computed LD matrices (in .bcor format) by using the script compute_ldscores_from_ld.py. Here is a usage example for chromosome 1:

mkdir -p output

python compute_ldscores_from_ld.py \
  --annot example_data/annotations.1.annot.parquet \
  --out output/ldscores_example3.parquet \
  --n 10000 \
  bcor_files/*.bcor

Here, --n is the size of the sample used to compute LD in the .bcor files, and the last argument specifies a list of .bcor files to be used (these should be in BCOR 1.1 format, the native format of LDstore 2.0, and they should ideally cover all of chromosome 1).
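
The two ways of calling compute_ldscores_from_ld.py differ only in how the LD matrices are supplied. The sketch below, under stated assumptions, builds either command from Python. The helper name `ldscore_from_ld_command` and the `ld_cache` directory are hypothetical, and the sketch assumes --ld-dir is only relevant together with --ukb, as in the section above. One practical difference from the shell example is that the `bcor_files/*.bcor` wildcard is not expanded automatically when a command is run without a shell, so the sketch expands it with `glob`.

```python
import glob


def ldscore_from_ld_command(annot, out, ukb=False, ld_dir=None,
                            n=None, bcor_pattern=None):
    """Build the compute_ldscores_from_ld.py argument list.

    ukb=True   -> use the pre-computed UK Biobank LD matrices (optionally --ld-dir).
    otherwise  -> use your own .bcor files, which requires n and bcor_pattern.
    """
    cmd = ["python", "compute_ldscores_from_ld.py",
           "--annot", annot, "--out", out]
    if ukb:
        cmd.append("--ukb")
        if ld_dir is not None:
            # Keep downloaded LD matrices instead of using a temporary directory.
            cmd += ["--ld-dir", ld_dir]
    else:
        if n is None or bcor_pattern is None:
            raise ValueError("own LD matrices need both n and bcor_pattern")
        # Expand the wildcard ourselves; no shell is involved here.
        bcor_files = sorted(glob.glob(bcor_pattern))
        if not bcor_files:
            raise FileNotFoundError(f"no files match {bcor_pattern!r}")
        cmd += ["--n", str(n)] + bcor_files
    return cmd


print(" ".join(ldscore_from_ld_command(
    annot="example_data/annotations.1.annot.parquet",
    out="output/ldscores_example2.parquet",
    ukb=True,
    ld_dir="ld_cache",  # hypothetical directory name
)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to execute the script.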