A repository to implement UniFrac, but on functional profiles of metagenomic data.
To install the package, run the following command:
conda env update -f environment.yml
conda activate fununifrac
All required resources are shared by Onedrive currently. Please check here. Detailed processes of generating the files are described at the "Preprocessing" section.
The original UniFrac uses a phylogenetic tree, allowing one to measure the difference in composition between two
samples. For FunUniFrac, we use a KEGG Orthology (KO) tree and functional profiles instead of an OTU table as
input samples. The only obstacle is that the KO tree does not naturally come with branch lengths. To overcome
this, we use pairwise AAI (average amino acid identity) as a proxy of the pairwise distances between leaf nodes
and assign the rest of the branch lengths by solving the linear system as demonstrated below.
- edge list: The KEGG hierarchy where all the ancestors of the given brite_id.
- pairwise distances: (KO1, KO2), edge j is on the shortest path from ' 'KO1 to KO2'
wget -O "edge_ko00001.txt" "https://pennstateoffice365-my.sharepoint.com/:t:/g/personal/akp6031_psu_edu/ETUrru_StwJFray9WJ8q4lAB9OwzSwF2Zhze_OvXv6HP4A?download=1"
# 1.59GB
wget -O "KOs_sketched_scaled_10_compare_5" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/ESd-RVfvWbZKpoky2lE6K8EBNktiaw8elnSbxZGZ9mbzHQ?download=1"
wget -O "KOs_sketched_scaled_10_compare_5.labels.txt" "https://pennstateoffice365-my.sharepoint.com/:t:/g/personal/akp6031_psu_edu/EW39Mh7xqsdKsVAIxM5FWVEBnRMkR4bWs5oLvJlbR1y8fw?download=1"
python ./fununifrac/compute_edges.py -e edge_ko00001.txt -d KOs_sketched_scaled_10_compare_5 -o ../fununifrac_out -b ko00001 -n 50 -f 10 -r 100 --distance
- fununifrac_edge_lengths_{datetime}.csv: The edge lengths added to the original edge list
This is a sub-script called inside of compute_edges.py. In order to infer the edge lengths, we need to know which edges are traversed between all pairs of KOs. For example, if we have a binary tree with 2 leaves:
1
L1 / \ L2
2 3
and edge lengths L1
and L2
respectively, then the distance between the two leaves is L1 + L2
, resulting in a
matrix equation of the form:
[1, 1] [L1;L2] = [d(a,b)]
where d(a,b)
is the distance between the two leaves a
and b
derived in the previous section.
The coefficient matrix of all edges traversed between all pairs of KOs is obtained by running:
The same input is used to generate the intermediate matrix.
python ./fununifrac/create_edge_matrix.py -e edge_ko00001.txt -d KOs_sketched_scaled_10_compare_5 -o ../fununifrac_out/large -b ko00001
The resulting matrix will have a name such as: fununifrac_A.npz
and will be a
massive (233M rows, ~20K columns) sparse matrix called A
below.
- fununifrac_A.npz: The matrix describing the least sequare problem for computing edge lengths.
- fununifrac_A_basis.txt: The column information.
We will employ a regularized, randomized nonnegative least squares approach to infer the edge lengths.
In practice, we've observed that the matrix A
is rank deficient, yet over-determined (more rows than columns).
Thus there is no unique solution to Ax=y
where y
is the vector of pairwise distances between all KOs. We can,
however, find a unique "shortest length" solution by adding a regularization term to the objective function:
min ||Ax - y||_2^2 + lambda * ||x||_1^2
s.t. x>=0
where lambda
is a regularization parameter. This can be cast as a NNLS problem via:
min ||A'x - y'||_2^2
s.t. x>=0
where
A' = [A; sqrt(lambda)*ones(1, n)]
y' = [y; 0]
We can then solve this problem via a randomized approach, selecting enough rows of A'
to ensure that the rank is
large enough. After a bunch of iterates, the average is then taken.
The output will be an edge list with an additional column specifying the length of that edge. Eg. kegg_ko_edge_df_br:ko00001.txt
This is the core function of the functional unifrac package. It takes a directory containing functional profiles of the sample, in the format of sourmash gather files, as well as an edge list file (in this case, for instance, the output of Script1 in the preprocessing steps above) describing the underlying the KEGG tree. compute_fununifrac.py computes the pairwise functional UniFrac values among the sourmash gather files provided by the user.
- A directory containing functional profiles with the option to specify the file pattern to match to. The default
file pattern is
*_gather.csv
. - An edge list file describing the underlying KO tree
- A path specifying where the output directory
python ./fununifrac/compute_fununifrac.py -e {edge} -fd . -o {output} --diffab --force -b ko00001 -a median_abund --L2
- Pairwise functional UniFrac distances matrix saved in .npy format
- Basis of the above matrix in .npy format
We provide a minimum example to demonstrate the usage of this function below.
wget -O "edge_ko00001_lengths.txt" "https://pennstateoffice365-my.sharepoint.com/:t:/g/personal/akp6031_psu_edu/EVbB-ieWK7xDqux7Y7u_4lEBpMi-jCyA7oDkq3RhtrueXQ?download=1"
wget -O "f1077_ihmp_IBD_MSM5LLGF_P_k_5_gather.csv" "https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/akp6031_psu_edu/EeSp1SOl-aJPsJOtcDylQc8B_cSWK_pxKiuPumW5CBJevQ?download=1"
wget -O "f2103_ihmp_IBD_HSM7J4Q3_P_k_5_gather.csv" "https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/akp6031_psu_edu/EVlFAOKUEDhCslA5KBk8s2QBOcUdhrlkhDJXsFRap72cKA?download=1"
wget -O "f3158_ihmp_IBD_PSM6XBW1_P_k_5_gather.csv" "https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/akp6031_psu_edu/EWnhFoHwindPp1q5JxmYbS4BQ9p2OArBGGxQwa0XCP9cSg?download=1"
python ./fununifrac/compute_fununifrac.py -e ./edge_ko00001_lengths.txt -d . -o fununifrac_results --diffab
- fununifrac_results/fununifrac_out_20230420-150954.main.basis.npy
- fununifrac_results/fununifrac_out_20230420-150954.main.npy
- fununifrac_results/ununifrac_out_20230420-150954.diffab.nodes.npy
- fununifrac_results/fununifrac_out_20230420-150954.diffab.npz
The underlying graph structure on which the earth mover's distance is based on is obtained from the KEGG database.
The KEGG Orthology (KO) hierarchy tree was donwloaded using
KEGG API. The KEGG BRITE Database
is a collection of hierarchical classification system in which ko00001 is the KO
hierarchy. Therefore, we utilized the get operation of KEGG API to
access the KO hierarchy. For each KO leaf node in this hierarchy tree, we get its
corresponding lineage. For instance, the lineage of KO K03382
is "09100 Metabolism; 09111 Xenobiotics biodegradation
and metabolism; 00791 Atrazine degradation; K03382 atzB; hydroxydechloroatrazine ethylaminohydrolase". With this
lineage information, we save the KO hierarchical tree into a file kegg_ko00001_edges.txt
with an edge list format.
Currently, to keep things simple, we are just using subtrees of the KEGG database.
As such, you need to pick a parent BRITE node to start with. Later, we will extract the tree
of all the children of this parent node. For example, you can choose the BRITE: ko00001 which
will correspond to "KEGG Orthology (KO)". In the following, call the resulting tree the "KO tree".
In the end, you should have a file like kegg_ko_edge_df_br:ko00001.txt
.
For a more detailed description of the process, see extraction repo. In particular: get_ko_hierarchy.py
wget -O "kegg_ko00001_edges.txt" "https://pennstateoffice365-my.sharepoint.com/:t:/g/personal/akp6031_psu_edu/EWWFfRZc4o5OltW7YZFH8yYBCQB8HW8IW_zLveO7eAeMPQ?download=1"
python ./fununifrac/reproducibility/generate_ko_hierarchy.py --outdir ${OUTDIR}
From the KEGG database, we downloaded amino acid sequences for all the genes in each KO and used Sourmash to build FracMinHash sketch for each KO. These sketches can be used as a reference to efficiently generate the functional profiles for metagenomic samples, which will be explained in the Section 3. Sample sketches below. To use the pre-generated KO sketches directly, simply download using the command below. Alternatively, the user can download the raw data and perform sourmash sketch using the parameters of their own choice. For usage of sourmash, refer to the Appendix section.
Below is command by which we generated the KO sketches above, where $INPUT is the raw KO data downloaded from KEGG.
sourmash sketch fromfile -p protein,k=5,k=7,k=11,abund,scaled=${scaleFactor} -o ${OUTDIR}/KOs_sketched_scaled_${scaleFactor}.sig.zip $INPUT
Note: this process will take forever. It is advised to just download the precomputed ones.
wget -O "KOs_sketched_scaled_10.sig.zip" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/ESayLPk49QhIhmPExvgNyrIBHhSyk4hzkdg5dPNHMGL5Mg?download=1"
With the KOs sketched, we can also compute the pairwise distances among the KOs, which correspond to the leaf nodes of the underlying tree obtained from KEGG. The pairwise distances among the leaf nodes of the KO tree is computed based on AAI (average amino acid identity) using Sourmash. Here we use k=5.
sourmash compare --protein --no-dna -o ${DATADIR}/compare_k_${kSize} --ani -k ${kSize} -p 50 ${DATADIR}/KOs_sketched_scaled_10.sig.zip
The output of the above is a pairwise distance matrix for all the KOs, along with a text file telling you which
KO corresponds to which row/column in the matrix. Eg. of such files are: KOs_sketched_scaled_10_compare_5
and KOs_sketched_scaled_10_compare_5.labels.txt
, which can be downloaded below.
wget -O "KOs_sketched_scaled_10_compare_5" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/ESsA10lqAXtOrv0Z_mVVedABTBdgHLAXzeE3N8bZ6q9f7Q?download=1"
wget -O "KOs_sketched_scaled_10_compare_5.labels.txt" "https://pennstateoffice365-my.sharepoint.com/:t:/g/personal/akp6031_psu_edu/Ea4iI9gCtl5Nue0bVABSqGQBIR1cv79l3fUk2bHSv6pxMA?download=1"
This procedure uses Sourmash sketch to convert the raw metagenomic samples in DNA sequences into protein sequences and creates a sketch for each sample. We provide some sample files to illustrate this process.
wget -O "f1077_ihmp_IBD_MSM5LLGF_P.fastq" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/EeFqDGl6lZZIgGFPme4GkBMBK20P_y0qrjqDWDuWnZ6ImA?download=1"
wget -O "f2103_ihmp_IBD_HSM7J4Q3_P.fastq" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/EaRRHDpkIK5Nid9pFf9BRAsB8Y-KT8lA4Zp1cdYv1eGJ1Q?download=1"
wget -O "f3158_ihmp_IBD_PSM6XBW1_P.fastq" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/ER7xvoZtBSRKlw7YoYJrfZ8BiLkxdqa-43suUolSUS86Ng?download=1"
sourmash sketch translate -f -p scaled=1000,k=5,abund f1077_ihmp_IBD_MSM5LLGF_P.fastq.gz -o f1077_ihmp_IBD_MSM5LLGF_P.fastq.gz.sig
sourmash sketch translate -f -p scaled=1000,k=5,abund f2103_ihmp_IBD_HSM7J4Q3_P.fastq.gz -o f2103_ihmp_IBD_HSM7J4Q3_P.fastq.gz.sig
sourmash sketch translate -f -p scaled=1000,k=5,abund f3158_ihmp_IBD_PSM6XBW1_P.fastq.gz -o f3158_ihmp_IBD_PSM6XBW1_P.fastq.gz.sig
You can also directly download the output using the following command.
# outputs
wget -O "f1077_ihmp_IBD_MSM5LLGF_P.fastq.gz.sig" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/Ebzry6ZzMrpAocLia1gb8zgBzSThqyvc7YZN_lcJ8Fkofg?download=1"
wget -O "f2103_ihmp_IBD_HSM7J4Q3_P.fastq.sig" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/EdSy4EMYhJVOmorMSHyOBCMBu5HshZMur5BKG49D7DgSAA?download=1"
wget -O "f3158_ihmp_IBD_PSM6XBW1_P.fastq.sig" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/EQsLhGlmu09KuY6QDm6ymLMBi9PgIRADjL0ZLQCHSFEmwg?download=1"
Using the output of the previous process, we then produce the functional profile for each sample which will then be used as the input for functional UniFrac computation. The input data for this process include a sample sketch, which is the output of step 3, as well as the sketched KO, the output of step 1.
sourmash gather --protein -k 5 --estimate-ani-ci --threshold-bp 500 f3158_ihmp_IBD_PSM6XBW1_P.fastq.gz.sig KOs_sketched_scaled_10.sig.zip -o f3158_ihmp_IBD_PSM6XBW1_P_k_5_gather.csv
sourmash gather --protein -k 5 --estimate-ani-ci --threshold-bp 500 f2103_ihmp_IBD_HSM7J4Q3_P.fastq.sig KOs_sketched_scaled_10.sig.zip -o f2103_ihmp_IBD_HSM7J4Q3_P_k_5_gather.csv
sourmash gather --protein -k 5 --estimate-ani-ci --threshold-bp 500 f1077_ihmp_IBD_MSM5LLGF_P.fastq.gz.sig KOs_sketched_scaled_10.sig.zip -o f1077_ihmp_IBD_MSM5LLGF_P_k_5_gather.csv
The output of the above will be three sourmash gather files which you can also download directly as follows.
wget -O "f1077_ihmp_IBD_MSM5LLGF_P_k_5_gather.csv" "https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/akp6031_psu_edu/EYe7vV9JDBhIgY0v0otSz8IBzJ9sp-LXFZQkM2qV7n5d0Q?download=1"
wget -O "f2103_ihmp_IBD_HSM7J4Q3_P_k_5_gather.csv" "https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/akp6031_psu_edu/Ee-QM-tEoQhPsM-2sKOv3y4BdpHZWIJW_PcItrx7JPTQyQ?download=1"
wget -O "f3158_ihmp_IBD_PSM6XBW1_P_k_5_gather.csv" "https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/akp6031_psu_edu/EVN4JF7jkVhGmdHoCKoSTY4Byz3ybRFAsVtBDSv2VM2wEQ?download=1"
Installation directions and usage of sourmash can be found at https://sourmash.readthedocs.io/en/latest/.