
Protein function prediction with GO #36

Open

sfluegel05 opened this issue Jun 27, 2024 · 18 comments · Fixed by #39 or #57 · May be fixed by #64
@sfluegel05 commented Jun 27, 2024

Until now, we have only used our framework for ChEBI, but in principle, it should also be applicable to other datasets and prediction tasks. One such task is the prediction of protein functions as specified by the Gene Ontology, in combination with protein data from UniProtKB. For orientation, we can use the DeepGO paper, which proposes a solution for this exact task. The goal is to apply our model to the GO / UniProtKB datasets and compare the results to those of DeepGO.

Tasks

  1. Minimal dataset implementation: Build a dataset class that extracts proteins and labels from UniProtKB / GO and processes them into a dataset that can be used to train Electra (one-hot encoding of trigrams of amino acids)
  2. Model training and evaluation: Evaluate using the same metrics as DeepGO to compare the models
  3. Additional features and finetuning: Pretraining with additional unlabeled protein data, trained input embeddings, hyperparameter tuning
@aditya0by0

Protein Preprocessing Statistics

These are the statistics for the proteins ignored during preprocessing due to either invalid amino acids or sequence lengths greater than 1002, following the guidelines outlined in the paper:

  • Number of proteins with invalid amino acids: 2,672 (0.47% of the dataset)
  • Number of proteins with sequence length greater than 1002: 19,004 (3.32% of the dataset)
  • Number of proteins with both invalid amino acids and length greater than 1002: 154
  • Total number of ignored proteins (either condition): 21,522 (3.76% of the dataset)
  • Original dataset size: 571,864 proteins

The number of ignored proteins is negligible compared to the size of the whole dataset.

I have attached the CSV file which lists the IDs of the ignored proteins for reference.
proteins_with_issues.csv

@sfluegel05

Two options for shortening input sequence lengths:

  • a parameter for the maximum length (default: 1002)
  • trigrams / n-grams combining several amino acids into one token (see the sketch below)
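
As an illustration of the second option, here is a minimal sketch of n-gram tokenization of an amino-acid sequence; the function name and the stride parameter are assumptions for illustration, not existing code in the repository. With stride=1 it yields overlapping trigrams (as used by DeepGO); with stride=n the token sequence shrinks by a factor of n, which is what would shorten the input.

def ngrams(sequence: str, n: int = 3, stride: int = 1) -> list[str]:
    # stride=1: overlapping n-grams (DeepGO-style trigrams)
    # stride=n: non-overlapping n-grams, shortening the input by a factor of n
    return [sequence[i : i + n] for i in range(0, len(sequence) - n + 1, stride)]

# ngrams("MKTAYIA")            -> ['MKT', 'KTA', 'TAY', 'AYI', 'YIA']
# ngrams("MKTAYIA", stride=3)  -> ['MKT', 'AYI']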

@aditya0by0 linked a pull request Oct 1, 2024 that will close this issue
@sfluegel05

From #39 (comment):

If 1002 is set as the maximum input sequence length, the updated behavior will truncate any protein sequences longer than 1002 amino acids, selecting only the first 1002. This may result in a partial representation of the protein, as the entire sequence may not be captured.
In contrast, the approach described in the DeepGO paper excludes any protein sequences that exceed the specified length threshold, skipping them entirely rather than truncating.

Excluding long sequences instead of truncating them seems to be the better option. We should implement that.

@aditya0by0

Excluding long sequences instead of truncating them seems to be the better option. We should implement that.

I performed a quick analysis by truncating all the protein sequences to different maximum lengths and then checked for duplicates based on the truncated sequences. The results show that as we reduce the maximum sequence length, the number of duplicates increases significantly.

Here are the results for different truncation lengths:

import pandas as pd
df = pd.DataFrame(pd.read_pickle("data/GO_UniProt/GO250_BP/processed/data.pkl"))
df['first_1002'] = df['sequence'].str[:1002]
df['first_700'] = df['sequence'].str[:700]
df['first_500'] = df['sequence'].str[:500]
df['first_200'] = df['sequence'].str[:200]

# Checking for duplicates in the original sequences
df.groupby('sequence').filter(lambda x: len(x) > 1).shape
# (476, 1053)

# Checking for duplicates with truncated sequences
df.groupby('first_1002').filter(lambda x: len(x) > 1).shape
# (480, 1053)

df.groupby('first_700').filter(lambda x: len(x) > 1).shape
# (503, 1053)

df.groupby('first_500').filter(lambda x: len(x) > 1).shape
# (545, 1053)

df.groupby('first_200').filter(lambda x: len(x) > 1).shape
# (1011, 1053)

@sfluegel05 commented Oct 17, 2024

I still get models that don't train at all, even when reducing the input data length. Two things we should do that could (possibly) help with the GO task:

  • find GO classes that should be easy to predict based on the protein structure. In ChEBI, equivalent classes would be chemical entity (everything) or organic molecular entity (everything with carbon). Training only on these classes should be easier
  • add pretraining to the pipeline: For chemicals, we use Electra pretraining on PubChem data before applying the model to ChEBI. Similarly, we could use protein data for pretraining here
    • find a data source - does the SwissProt dataset have some proteins that are not annotated with GO labels?

@sfluegel05

I finally found the problem: The labels were incomplete. The dataset only included the direct labels (as assigned by SwissProt), but ignored the transitive labels (all the superclasses of the direct labels). However, when deciding whether to include a class (based on having at least N samples), the transitive labels were used. This resulted in

  1. A majority of labels missing for many classes
  2. Classes being included that have less than N (and often 0) samples

That explains why the model didn't learn anything. I fixed this in 6511086 and now it is learning better. (Also, starting with an easier task helps - the CC subset gets to a micro-F1 of 40% even with missing labels)
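
For clarity, this is the kind of propagation that was missing, as a minimal sketch; it assumes the GO hierarchy is available as a networkx DiGraph with edges pointing from subclass to superclass, and is illustrative rather than the actual fix in 6511086.

import networkx as nx

def transitive_labels(direct_labels: set[str], go: nx.DiGraph) -> set[str]:
    # Extend the direct annotations with every superclass reachable
    # in the ontology (the transitive closure of the labels).
    labels = set(direct_labels)
    for go_id in direct_labels:
        labels |= nx.descendants(go, go_id)  # all reachable superclasses
    return labels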

@aditya0by0 commented Oct 23, 2024


Updated Protein Preprocessing Statistics

These are the updated statistics for the proteins that were ignored during preprocessing due to invalid amino acids, sequence lengths greater than 1002, or no valid associated GO IDs:

Note: A valid GO ID is one that has one of the following evidence codes, as per the paper: "EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC".

Ignored Proteins Stats:

  • Original raw dataset size: 571,864 proteins
  • Number of proteins with invalid amino acids: 2,672 (0.47% of the dataset)
  • Number of proteins with sequence length greater than 1002: 19,004 (3.32% of the dataset)
  • Number of proteins NOT associated with any valid GO ID: 496,157 (86.76% of the dataset)
    • Number of proteins with at least one valid associated GO ID: 75,707 (13.24% of the dataset)
  • Total number of proteins ignored for invalid amino acids or sequence length greater than 1002: 21,522 (3.76% of the dataset)
  • Total number of proteins ignored for any reason (invalid amino acids, length greater than 1002, or no valid associated GO IDs): 503,935 (88.12% of the dataset)
  • Total number of proteins included in the final Swiss dataset: 67,929 (11.88% of the dataset)

    The data.pkl file contains 67,884 instances. The difference of 45 instances might be due to valid GO IDs associated with proteins in the Swiss dataset that are not present in the GO dataset. This discrepancy likely results in all labels being set to False, leading to the removal of those protein instances.

Percentages (with respect to number of proteins associated with valid GO IDs):

  • Number of proteins with at least one valid associated GO ID: 75,707
  • Percentage of proteins with invalid amino acids: 0.27%
  • Percentage of proteins with sequence length greater than 1002: 10.05%
  • Percentage of proteins with sequence length greater than 1002 or invalid amino acids: 10.27%

Invalid/No Association

  • Number of proteins associated with only invalid GO IDs: 475,420 (83.14% of the dataset)
  • Number of proteins NOT associated with any GO ID: 20,737 (3.63% of the dataset)

protein_stats_script.txt

@aditya0by0 commented Oct 23, 2024

add pretraining to the pipeline: For chemicals, we use Electra pretraining on PubChem data before applying the model to ChEBI. Similarly, we could use protein data for pretraining here

  • find a data source - does the SwissProt dataset have some proteins that are not annotated with GO labels?

As per the statistics, we have 20,737 proteins that are not associated or annotated with any GO labels. These proteins can be used for pretraining in our model pipeline.

List of protein IDs (swiss_ids) that are not annotated with any GO label: no_go_id_proteins.csv

@sfluegel05

That is good news. I would use the proteins with invalid experimental codes as well. Since we are not using them in finetuning, we might as well use them for pretraining. That would give us ~500,000 proteins, more than enough for pretraining.

@aditya0by0 commented Oct 25, 2024

Pretraining Protein Dataset Statistics

  • Total number of proteins with no association to any valid GO class: 496,157
  • Proteins with no association to any valid GO class and no invalid amino acids: 493,688
  • Proteins with no association to any valid GO class and length ≤ 1002: 484,760
  • Proteins with no association to any valid GO class, no invalid amino acids, and length ≤ 1002: 482,413

Current Filtering Approach for Pretraining

Currently, the pretraining setup filters proteins based on:

  • No association with any valid GO class
  • No invalid amino acids

This includes 493,688 proteins.

Question for Additional Filtering

Should we add an additional filter to include only proteins with a sequence length of ≤ 1002 (with the maximum length as a hyperparameter, default 1002), similar to the approach we use in training? This would reduce the dataset to 482,413 proteins for a maximum length of 1002. A sketch of the combined filter follows below.
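
For concreteness, a hedged sketch of the proposed combined filter; the dataframe and its column names (sequence, has_valid_go) are assumptions for illustration, not the actual pretraining code.

import pandas as pd

INVALID_ACIDS = set("UOBZJX*")  # invalid amino acids per DeepGO (2018)

def pretraining_filter(df: pd.DataFrame, max_len: int = 1002) -> pd.DataFrame:
    # Keep proteins with no valid GO association, no invalid amino
    # acids, and (optionally) a bounded sequence length.
    no_valid_go = ~df["has_valid_go"]
    valid_acids = ~df["sequence"].map(lambda s: any(a in INVALID_ACIDS for a in s))
    short_enough = df["sequence"].str.len() <= max_len
    return df[no_valid_go & valid_acids & short_enough]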

Also, please review the pretraining code.

@sfluegel05

The next steps here are:

  • Pretraining: add a filter for sequence length as a hyperparameter
  • merge the feature branch into the dev branch
  • on a new branch: metrics for evaluation (I talked to Martin about the Fmax score: although it has some methodological issues, we should include it in our evaluation for a comparison with DeepGO; a sketch of the metric follows below)
  • DeepGO-SE (paper): use these results as a baseline, integrate their data into our pipeline (there is a link to the dataset on their GitHub page)
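
Since Fmax comes up above: a minimal sketch of the CAFA-style protein-centric Fmax under its common definition (precision averaged over proteins with at least one predicted term at threshold t, recall averaged over all proteins). The function and its inputs are illustrative; DeepGO's actual evaluation code should be checked before comparing numbers.

import numpy as np

def fmax(y_true: np.ndarray, y_scores: np.ndarray) -> float:
    # y_true, y_scores: (n_proteins, n_go_terms); y_true is binary.
    best = 0.0
    for t in np.linspace(0.0, 1.0, 101):
        pred = y_scores >= t
        covered = pred.sum(axis=1) > 0  # proteins with >= 1 predicted term
        if not covered.any():
            continue
        tp = (pred & (y_true > 0)).sum(axis=1)
        precision = (tp[covered] / pred[covered].sum(axis=1)).mean()
        recall = (tp / np.maximum(y_true.sum(axis=1), 1)).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best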

@aditya0by0 commented Nov 5, 2024

  • DeepGO-SE (paper): We should consider using these results as a baseline, incorporating their dataset into our pipeline (dataset link available on their GitHub page).

While working on this, I observed some key differences between the original DeepGO paper:

Maxat Kulmanov, Mohammed Asif Khan, Robert Hoehndorf, DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier, Bioinformatics, Volume 34, Issue 4, February 2018, Pages 660–668 (https://doi.org/10.1093/bioinformatics/btx624)

and the latest DeepGO-SE paper:

Kulmanov, M., Guzmán-Vega, F.J., Duek Roggli, P. et al. Protein function prediction as approximate semantic entailment. Nat Mach Intell 6, 220–228 (2024). https://doi.org/10.1038/s42256-024-00795-w

  1. Experimental Codes:

    • DeepGO (2018):

      EXP_CODES = set(['EXP', 'IDA', 'IPI', 'IMP', 'IGI', 'IEP', 'TAS', 'IC'])

      Referenced in their GitHub repository.

    • DeepGO-SE (2024):

      EXP_CODES = set([
          'EXP', 'IDA', 'IPI', 'IMP', 'IGI', 'IEP', 'TAS', 'IC',
          'HTP', 'HDA', 'HMP', 'HGI', 'HEP'
      ])

      Updated code in the DeepGO2 repository.

  2. Amino Acid Considerations:

    • DeepGO (2018): Valid amino acids:

      AALETTER = [
          'A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I',
          'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'
      ]

      Reference

      Invalid amino acids:

      INVALID_ACIDS = set(['U', 'O', 'B', 'Z', 'J', 'X', '*'])

      Reference

    • DeepGO-SE (2024): Valid amino acids, with X added:

      AALETTER = [
          'A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I',
          'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V', 'X'
      ]

      Reference

      Invalid amino acids:

      INVALID_ACIDS = set(['U', 'O', 'B', 'Z', 'J', '*'])

      Reference

Given these changes, should we update the experimental codes and amino acid definitions in our implementation to align with the latest paper?

@aditya0by0

Also, while reviewing the latest (DeepGO-SE) paper, I noticed the authors use a specific approach to dataset splitting. Here’s the relevant excerpt from the paper:

“We mainly aim to predict functions of novel proteins that have a low sequence similarity to existing proteins in the dataset. Therefore, we decided to split our dataset based on any similarity hit with a maximum e-value score of 0.001. We computed pairwise similarity using Diamond (v.2.0.9) and grouped the sequences that have some similarity and split these groups into training, validation and testing sets. Supplementary Table D3 summarizes the datasets for each subontology. We train and evaluate a separate model for each subontology.”
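
To make the idea concrete, here is a hedged sketch of such a similarity-based split; it assumes the pairwise hits (pairs of protein IDs with e-value ≤ 0.001) have already been computed with Diamond, and the group-assignment heuristic is an illustration, not DeepGO-SE's actual script.

import random
import networkx as nx

def group_split(protein_ids, hits, frac=(0.8, 0.1, 0.1), seed=0):
    # Build a graph where an edge is a Diamond similarity hit, group
    # sequences via connected components, and assign whole groups to
    # splits so that similar proteins never cross the train/test border.
    g = nx.Graph()
    g.add_nodes_from(protein_ids)
    g.add_edges_from(hits)
    groups = [sorted(c) for c in nx.connected_components(g)]
    random.Random(seed).shuffle(groups)
    train, val, test = [], [], []
    total = len(protein_ids)
    for grp in groups:
        if len(train) < frac[0] * total:
            train += grp
        elif len(val) < frac[1] * total:
            val += grp
        else:
            test += grp
    return train, val, test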

@sfluegel05

For the experimental codes and valid amino acids, I would say yes, we should update that. I don't have the expert knowledge to assess whether the H- ("high throughput") experimental codes are more or less reliable than the others, but if that is what DeepGO-SE uses, it should be fine.

And for the splits: The method sounds interesting and might influence results (testing on random data might give better results than testing on data that is "different" from the training data). Therefore, we should use the same splits they used when training with their data. In the best case, the splits are included in the dataset they provide.

@aditya0by0

Here is the link to access data from all relevant DeepGO papers:

DeepGO Data

@aditya0by0 commented Nov 19, 2024

There’s a significant implementation difference between the two models regarding how they incorporate Protein-Protein Interaction (PPI) networks:

  1. DeepGO (2018):
  • PPI is used as an explicit component of the model, and embeddings learned from the PPI graph are combined with sequence-based features to improve predictions.

Quote from paper:

We combined the knowledge graph embeddings for the nodes with the output of the max-pooling layer of length 832 as a combined feature vector.

  2. DeepGO-SE (2024):
  • Interactions between proteins guide the construction of training datasets or provide additional label information. These interactions help improve the mapping between protein sequence embeddings and the GO terms by implicitly encoding co-functionality.

Quote from paper:

To combine PPIs with individual features of proteins, we use graph attention networks (GAT) and embed the protein $p$ in the ELEmbeddings space using the formula:

$$ f_\eta(p) = \text{GATConv}(\text{MLPBlock}(x), g) $$

Where:

  • $x$ is an input feature vector for $p$,
  • $g$ is the PPI graph,
  • MLPBlock is described in Equation (6),
  • GATConv is a GAT layer.

https://github.com/bio-ontology-research-group/deepgo2/blob/main/train_gat.py#L119
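
As a rough illustration of that formula, here is a minimal sketch using PyTorch Geometric; the module name, layer sizes, and MLP structure are assumptions, not DeepGO-SE's actual architecture (see their train_gat.py for that).

import torch.nn as nn
from torch_geometric.nn import GATConv

class PPIProteinEncoder(nn.Module):
    # f_eta(p) = GATConv(MLPBlock(x), g): refine per-protein features x
    # with attention over the PPI graph g.
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.LayerNorm(hidden_dim),
        )
        self.gat = GATConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        # x: (n_proteins, in_dim) features; edge_index: PPI edges, shape (2, n_edges)
        return self.gat(self.mlp(x), edge_index)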

Question:
Please let me know which approach we intend to use in our implementation, or in what way we want to use PPI in our model.

@aditya0by0

DeepGO2 handling of invalid amino acids using the "X" amino acid notation: #64 (comment)
