
Protein function prediction with GO #36

Open

sfluegel05 opened this issue Jun 27, 2024 · 18 comments · Fixed by #39 or #57 · May be fixed by #64
@sfluegel05 commented Jun 27, 2024

Until now, we have only used our framework for ChEBI, but in principle, it should also be applicable to other datasets and prediction tasks. One such task is the prediction of protein functions as specified by the Gene Ontology, in combination with protein data from UniProtKB. For orientation, we can use the DeepGO paper, which proposes a solution for this exact task. The goal is to apply our model to the GO / UniProtKB datasets and compare the results to those of DeepGO.

Tasks

  1. Minimal dataset implementation: Build a dataset class that extracts proteins and labels from UniProtKB / GO and processes them into a dataset that can be used to train Electra (one-hot encoding of trigrams of amino acids)
  2. Model training and evaluation: Evaluate using the same metrics as DeepGO to compare the models
  3. Additional features and finetuning: Pretraining with additional unlabeled protein data, trained input embeddings, hyperparameter tuning
@aditya0by0

Protein Preprocessing Statistics

These are the statistics for the proteins ignored during preprocessing due to either invalid amino acids or sequence lengths greater than 1002, following the guidelines outlined in the paper:

  • Number of proteins with invalid amino acids: 2,672 (0.47% of the dataset)
  • Number of proteins with sequence length greater than 1002: 19,004 (3.32% of the dataset)
  • Number of proteins with both invalid amino acids and length greater than 1002: 154
  • Total number of ignored proteins (either condition): 21,522 (3.76% of the dataset)
  • Original dataset size: 571,864 proteins

The number of ignored proteins is negligible compared to the size of the whole dataset.

I have attached the CSV file which lists the IDs of the ignored proteins for reference.
proteins_with_issues.csv

@sfluegel05

Two options for shortening input sequence lengths:

  • a parameter for the maximum length (default: 1002)
  • trigrams / n-grams combining several amino acids into one token (see the sketch below)
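
As an illustration of the second option, here is a minimal sketch of n-gram tokenization of an amino-acid sequence; the function name and the stride parameter are assumptions for illustration, not existing code in the repository. With stride=1 it yields overlapping trigrams (as used by DeepGO); with stride=n the token sequence shrinks by a factor of n, which is what would shorten the input.

def ngrams(sequence: str, n: int = 3, stride: int = 1) -> list[str]:
    # stride=1: overlapping n-grams (DeepGO-style trigrams)
    # stride=n: non-overlapping n-grams, shortening the input by a factor of n
    return [sequence[i : i + n] for i in range(0, len(sequence) - n + 1, stride)]

# ngrams("MKTAYIA")            -> ['MKT', 'KTA', 'TAY', 'AYI', 'YIA']
# ngrams("MKTAYIA", stride=3)  -> ['MKT', 'AYI']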

@aditya0by0 linked a pull request Oct 1, 2024 that will close this issue
@sfluegel05

From #39 (comment):

If 1002 is set as the maximum input sequence length, the updated behavior will truncate any protein sequences longer than 1002 amino acids, selecting only the first 1002. This may result in a partial representation of the protein, as the entire sequence may not be captured.
In contrast, the approach described in the DeepGO paper excludes any protein sequences that exceed the specified length threshold, skipping them entirely rather than truncating.

Excluding long sequences instead of truncating them seems to be the better option. We should implement that.

@aditya0by0

Excluding long sequences instead of truncating them seems to be the better option. We should implement that.

I performed a quick analysis by truncating all the protein sequences to different maximum lengths and then checked for duplicates based on the truncated sequences. The results show that as we reduce the maximum sequence length, the number of duplicates increases significantly.

Here are the results for different truncation lengths:

import pandas as pd
df = pd.DataFrame(pd.read_pickle("data/GO_UniProt/GO250_BP/processed/data.pkl"))
df['first_1002'] = df['sequence'].str[:1002]
df['first_700'] = df['sequence'].str[:700]
df['first_500'] = df['sequence'].str[:500]
df['first_200'] = df['sequence'].str[:200]

# Checking for duplicates in the original sequences
df.groupby('sequence').filter(lambda x: len(x) > 1).shape
# (476, 1053)

# Checking for duplicates with truncated sequences
df.groupby('first_1002').filter(lambda x: len(x) > 1).shape
# (480, 1053)

df.groupby('first_700').filter(lambda x: len(x) > 1).shape
# (503, 1053)

df.groupby('first_500').filter(lambda x: len(x) > 1).shape
# (545, 1053)

df.groupby('first_200').filter(lambda x: len(x) > 1).shape
# (1011, 1053)

@sfluegel05 commented Oct 17, 2024

I still get models that don't train at all, even when reducing the input data length. Two things we should do that could (possibly) help with the GO task:

  • find GO classes that should be easy to predict based on the protein structure. In ChEBI, equivalent classes would be chemical entity (everything) or organic molecular entity (everything with carbon). Training only on these classes should be easier
  • add pretraining to the pipeline: For chemicals, we use Electra pretraining on PubChem data before applying the model to ChEBI. Similarly, we could use protein data for pretraining here
    • find a data source - does the SwissProt dataset have some proteins that are not annotated with GO labels?

@sfluegel05

I finally found the problem: The labels were incomplete. The dataset only included the direct labels (as assigned by SwissProt), but ignored the transitive labels (all the superclasses of the direct labels). However, when deciding whether to include a class (based on having at least N samples), the transitive labels were used. This resulted in

  1. A majority of labels missing for many classes
  2. Classes being included that have less than N (and often 0) samples

That explains why the model didn't learn anything. I fixed this in 6511086 and now it is learning better. (Also, starting with an easier task helps - the CC subset gets to a micro-F1 of 40% even with missing labels)
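
For clarity, this is the kind of propagation that was missing, as a minimal sketch; it assumes the GO hierarchy is available as a networkx DiGraph with edges pointing from subclass to superclass, and is illustrative rather than the actual fix in 6511086.

import networkx as nx

def transitive_labels(direct_labels: set[str], go: nx.DiGraph) -> set[str]:
    # Extend the direct annotations with every superclass reachable
    # in the ontology (the transitive closure of the labels).
    labels = set(direct_labels)
    for go_id in direct_labels:
        labels |= nx.descendants(go, go_id)  # all reachable superclasses
    return labels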

@aditya0by0 commented Oct 23, 2024


Updated Protein Preprocessing Statistics

These are the updated statistics for the proteins that were ignored during preprocessing due to invalid amino acids, sequence lengths greater than 1002, or no valid associated GO IDs:

Note: A valid GO ID is one that has one of the following evidence codes, as per the paper: "EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC".

Ignored Proteins Stats:

  • Original raw dataset size: 571,864 proteins
  • Number of proteins with invalid amino acids: 2,672 (0.47% of the dataset)
  • Number of proteins with sequence length greater than 1002: 19,004 (3.32% of the dataset)
  • Number of proteins NOT associated with any valid GO ID: 496,157 (86.76% of the dataset)
    • Number of proteins with at least one valid associated GO ID: 75,707 (13.24% of the dataset)
  • Total number of proteins ignored for invalid amino acids or sequence length greater than 1002: 21,522 (3.76% of the dataset)
  • Total number of proteins ignored for any reason (invalid amino acids, length greater than 1002, or no valid associated GO IDs): 503,935 (88.12% of the dataset)
  • Total number of proteins included in the final Swiss dataset: 67,929 (11.88% of the dataset)

    The data.pkl file contains 67,884 instances. The difference of 45 instances might be due to valid GO IDs associated with proteins in the Swiss dataset that are not present in the GO dataset. This discrepancy likely results in all labels being set to False, leading to the removal of those protein instances.

Percentages (with respect to number of proteins associated with valid GO IDs):

  • Number of proteins with at least one valid associated GO ID: 75,707
  • Percentage of proteins with invalid amino acids: 0.27%
  • Percentage of proteins with sequence length greater than 1002: 10.05%
  • Percentage of proteins with sequence length greater than 1002 or invalid amino acids: 10.27%

Invalid/No Association

  • Number of proteins associated with only invalid GO IDs: 475,420 (83.14% of the dataset)
  • Number of proteins NOT associated with any GO ID: 20,737 (3.63% of the dataset)

protein_stats_script.txt

@aditya0by0 commented Oct 23, 2024

add pretraining to the pipeline: For chemicals, we use Electra pretraining on PubChem data before applying the model to ChEBI. Similarly, we could use protein data for pretraining here

  • find a data source - does the SwissProt dataset have some proteins that are not annotated with GO labels?

As per the statistics, we have 20,737 proteins that are not associated or annotated with any GO labels. These proteins can be used for pretraining in our model pipeline.

List of protein IDs (swiss_ids) that are not annotated with any GO label: no_go_id_proteins.csv

@sfluegel05

That is good news. I would use the proteins with invalid experimental codes as well. Since we are not using them in finetuning, we might as well use them for pretraining. That would give us ~500,000 proteins, more than enough for pretraining.

@aditya0by0 commented Oct 25, 2024

Pretraining Protein Dataset Statistics

  • Total number of proteins with no association to any valid GO class: 496,157
  • Proteins with no association to any valid GO class and no invalid amino acids: 493,688
  • Proteins with no association to any valid GO class and length ≤ 1002: 484,760
  • Proteins with no association to any valid GO class, no invalid amino acids, and length ≤ 1002: 482,413

Current Filtering Approach for Pretraining

Currently, the pretraining setup filters proteins based on:

  • No association with any valid GO class
  • No invalid amino acids

This includes 493,688 proteins.

Question for Additional Filtering

Should we add an additional filter to include only proteins with a sequence length of ≤ 1002 (with the maximum length as a hyperparameter, default 1002), similar to the approach we use in training? This would reduce the dataset to 482,413 proteins for a maximum length of 1002. A sketch of the combined filter follows below.
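
For concreteness, a hedged sketch of the proposed combined filter; the dataframe and its column names (sequence, has_valid_go) are assumptions for illustration, not the actual pretraining code.

import pandas as pd

INVALID_ACIDS = set("UOBZJX*")  # invalid amino acids per DeepGO (2018)

def pretraining_filter(df: pd.DataFrame, max_len: int = 1002) -> pd.DataFrame:
    # Keep proteins with no valid GO association, no invalid amino
    # acids, and (optionally) a bounded sequence length.
    no_valid_go = ~df["has_valid_go"]
    valid_acids = ~df["sequence"].map(lambda s: any(a in INVALID_ACIDS for a in s))
    short_enough = df["sequence"].str.len() <= max_len
    return df[no_valid_go & valid_acids & short_enough]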

Also, please review the pretraining code.

@sfluegel05

The next steps here are:

  • Pretraining: add a filter for sequence length as a hyperparameter
  • merge the feature branch into the dev branch
  • on a new branch: metrics for evaluation (I talked to Martin about the Fmax score: although it has some methodological issues, we should include it in our evaluation for a comparison with DeepGO; a sketch of the metric follows below)
  • DeepGO-SE (paper): use these results as a baseline, integrate their data into our pipeline (there is a link to the dataset on their GitHub page)
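
Since Fmax comes up above: a minimal sketch of the CAFA-style protein-centric Fmax under its common definition (precision averaged over proteins with at least one predicted term at threshold t, recall averaged over all proteins). The function and its inputs are illustrative; DeepGO's actual evaluation code should be checked before comparing numbers.

import numpy as np

def fmax(y_true: np.ndarray, y_scores: np.ndarray) -> float:
    # y_true, y_scores: (n_proteins, n_go_terms); y_true is binary.
    best = 0.0
    for t in np.linspace(0.0, 1.0, 101):
        pred = y_scores >= t
        covered = pred.sum(axis=1) > 0  # proteins with >= 1 predicted term
        if not covered.any():
            continue
        tp = (pred & (y_true > 0)).sum(axis=1)
        precision = (tp[covered] / pred[covered].sum(axis=1)).mean()
        recall = (tp / np.maximum(y_true.sum(axis=1), 1)).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best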

@aditya0by0 commented Nov 5, 2024

  • DeepGO-SE (paper): We should consider using these results as a baseline, incorporating their dataset into our pipeline (dataset link available on their GitHub page).

While working on this, I observed some key differences between the original DeepGO paper:

Maxat Kulmanov, Mohammed Asif Khan, Robert Hoehndorf, DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier, Bioinformatics, Volume 34, Issue 4, February 2018, Pages 660–668 (https://doi.org/10.1093/bioinformatics/btx624)

and the latest DeepGO-SE paper:

Kulmanov, M., Guzmán-Vega, F.J., Duek Roggli, P. et al. Protein function prediction as approximate semantic entailment. Nat Mach Intell 6, 220–228 (2024). https://doi.org/10.1038/s42256-024-00795-w

  1. Experimental Codes:

    • DeepGO (2018):

      EXP_CODES = set(['EXP', 'IDA', 'IPI', 'IMP', 'IGI', 'IEP', 'TAS', 'IC'])

      Referenced in their GitHub repository.

    • DeepGO-SE (2024):

      EXP_CODES = set([
          'EXP', 'IDA', 'IPI', 'IMP', 'IGI', 'IEP', 'TAS', 'IC',
          'HTP', 'HDA', 'HMP', 'HGI', 'HEP'
      ])

      Updated code in the DeepGO2 repository.

  2. Amino Acid Considerations:

    • DeepGO (2018): Valid amino acids:

      AALETTER = [
          'A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I',
          'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'
      ]

      Reference

      Invalid amino acids:

      INVALID_ACIDS = set(['U', 'O', 'B', 'Z', 'J', 'X', '*'])

      Reference

    • DeepGO-SE (2024): Valid amino acids, with X added:

      AALETTER = [
          'A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I',
          'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V', 'X'
      ]

      Reference

      Invalid amino acids:

      INVALID_ACIDS = set(['U', 'O', 'B', 'Z', 'J', '*'])

      Reference

Given these changes, should we update the experimental codes and amino acid definitions in our implementation to align with the latest paper?

@aditya0by0

Also, while reviewing the latest (DeepGO-SE) paper, I noticed the authors use a specific approach to dataset splitting. Here’s the relevant excerpt from the paper:

“We mainly aim to predict functions of novel proteins that have a low sequence similarity to existing proteins in the dataset. Therefore, we decided to split our dataset based on any similarity hit with a maximum e-value score of 0.001. We computed pairwise similarity using Diamond (v.2.0.9) and grouped the sequences that have some similarity and split these groups into training, validation and testing sets. Supplementary Table D3 summarizes the datasets for each subontology. We train and evaluate a separate model for each subontology.”
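
To make the idea concrete, here is a hedged sketch of such a similarity-based split; it assumes the pairwise hits (pairs of protein IDs with e-value ≤ 0.001) have already been computed with Diamond, and the group-assignment heuristic is an illustration, not DeepGO-SE's actual script.

import random
import networkx as nx

def group_split(protein_ids, hits, frac=(0.8, 0.1, 0.1), seed=0):
    # Build a graph where an edge is a Diamond similarity hit, group
    # sequences via connected components, and assign whole groups to
    # splits so that similar proteins never cross the train/test border.
    g = nx.Graph()
    g.add_nodes_from(protein_ids)
    g.add_edges_from(hits)
    groups = [sorted(c) for c in nx.connected_components(g)]
    random.Random(seed).shuffle(groups)
    train, val, test = [], [], []
    total = len(protein_ids)
    for grp in groups:
        if len(train) < frac[0] * total:
            train += grp
        elif len(val) < frac[1] * total:
            val += grp
        else:
            test += grp
    return train, val, test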

@sfluegel05

For the experimental codes and valid amino acids, I would say yes, we should update that. I don't have the expert knowledge to assess whether the H- ("high throughput") experimental codes are more or less reliable than the others, but if that is what DeepGO-SE uses, it should be fine.

And for the splits: The method sounds interesting and might influence results (testing on random data might give better results than testing on data that is "different" from the training data). Therefore, we should use the same splits they used when training with their data. In the best case, the splits are included in the dataset they provide.

@aditya0by0

Here is the link to access data from all relevant DeepGO papers:

DeepGO Data

@aditya0by0 commented Nov 19, 2024

There’s a significant implementation difference between the two models regarding how they incorporate Protein-Protein Interaction (PPI) networks:

  1. DeepGO (2018):
  • PPI is used as an explicit component of the model, and embeddings learned from the PPI graph are combined with sequence-based features to improve predictions.

Quote from paper:

We combined the knowledge graph embeddings for the nodes with the output of the max-pooling layer of length 832 as a combined feature vector.

  2. DeepGO-SE (2024):
  • Interactions between proteins guide the construction of training datasets or provide additional label information. These interactions help improve the mapping between protein sequence embeddings and the GO terms by implicitly encoding co-functionality.

Quote from paper:

To combine PPIs with individual features of proteins, we use graph attention networks (GAT) and embed the protein $p$ in the ELEmbeddings space using the formula:

$$ f_\eta(p) = \text{GATConv}(\text{MLPBlock}(x), g) $$

Where:

  • $x$ is an input feature vector for $p$,
  • $g$ is the PPI graph,
  • MLPBlock is described in Equation (6),
  • GATConv is a GAT layer.

https://github.com/bio-ontology-research-group/deepgo2/blob/main/train_gat.py#L119
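
As a rough illustration of that formula, here is a minimal sketch using PyTorch Geometric; the module name, layer sizes, and MLP structure are assumptions, not DeepGO-SE's actual architecture (see their train_gat.py for that).

import torch.nn as nn
from torch_geometric.nn import GATConv

class PPIProteinEncoder(nn.Module):
    # f_eta(p) = GATConv(MLPBlock(x), g): refine per-protein features x
    # with attention over the PPI graph g.
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.LayerNorm(hidden_dim),
        )
        self.gat = GATConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        # x: (n_proteins, in_dim) features; edge_index: PPI edges, shape (2, n_edges)
        return self.gat(self.mlp(x), edge_index)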

Question:
Please let me know which approach we intend to use in our implementation, or in what way we want to use PPI in our model.

@aditya0by0

DeepGO2 handling of invalid amino acids using the "X" amino acid notation: #64 (comment)
