Protein function prediction with GO - Part 3 #64

aditya0by0 · 2024-11-04T12:07:11Z

PR for the Issue Protein function prediction with GO #36

Note: The above issue will be implemented in 3 PRs:

Changes to be done in this PR

evaluation: Evaluate using the same metrics as DeepGO for comparing the models

on a new branch: metrics for evaluation (I talked to Martin about the Fmax score: Although it has some methodological issues, we should include it in our evaluation to do a comparison with DeepGO)

DeepGO-SE (paper): use these results as a baseline, integrate their data into our pipeline (there is a link to the dataset on their GitHub page)

- migration from deep go format to chebai->go_uniprot format

- #36 (comment)

- +migration structure changes
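For reference, the Fmax score mentioned above can be sketched as follows. This is a minimal illustration of the protein-centric definition commonly used in CAFA-style evaluations, not the evaluation script from this PR. The function name `fmax_score` and the default threshold sweep are assumptions made for the example.

```python
import numpy as np


def fmax_score(y_true, y_score, thresholds=None):
    """Protein-centric Fmax: the best F1 over a sweep of decision thresholds.

    y_true:  (n_proteins, n_classes) binary label matrix
    y_score: (n_proteins, n_classes) predicted scores in [0, 1]
    Returns (best_f, best_threshold); best_threshold is None if no
    threshold produced any prediction.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        # Assumed sweep for illustration; the PR may use a different grid.
        thresholds = np.arange(0.01, 1.0, 0.01)

    best_f, best_t = 0.0, None
    for t in thresholds:
        y_pred = y_score >= t
        tp = (y_pred & y_true).sum(axis=1)
        n_pred = y_pred.sum(axis=1)
        n_true = y_true.sum(axis=1)

        has_pred = n_pred > 0
        has_true = n_true > 0
        if not has_pred.any() or not has_true.any():
            continue

        # Precision is averaged only over proteins with at least one prediction.
        precision = (tp[has_pred] / n_pred[has_pred]).mean()
        # Recall is averaged over all proteins that have at least one true label.
        recall = (tp[has_true] / n_true[has_true]).mean()
        if precision + recall == 0:
            continue

        f = 2 * precision * recall / (precision + recall)
        if f > best_f:
            best_f, best_t = f, float(t)
    return best_f, best_t
```

Because the threshold is chosen on the same data the score is reported on, Fmax tends to be optimistic, which is one of the methodological issues alluded to above.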

aditya0by0 · 2024-11-13T22:45:41Z

I have made the suggested changes for migration. Please check.

Config for DeepGO1:

class_path: chebai.preprocessing.datasets.go_uniprot.DeepGO1MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1002
  reader_kwargs: {n_gram: 3}

Config for DeepGO2:

class_path: chebai.preprocessing.datasets.go_uniprot.DeepGO2MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1000
  reader_kwargs: {n_gram: 3}

sfluegel05 · 2024-12-03T15:41:39Z

I ran the migration script for DeepGO2 and tried training a model with the data. I noticed three issues:

- The max_sequence_length parameter seems to have no effect (I set it to 1,000, but the processed dataset contains feature entries with a length of more than 4,000).
- The labels are empty lists in the processed data (although a classes_deep_go2.txt file does get generated).
- The dataset contains "invalid" amino acids (O and U for pyrrolysine and selenocysteine). As you pointed out, DeepGO does not include them as valid, but as far as I can tell, that is not enforced. Either way, when processing the migrated data, chebai currently raises an error and stops as soon as it reaches an O or U in the dataset. This is not very useful. We should either raise a warning and skip the protein or include O and U as valid (I would suggest doing both, i.e., adding O and U as safe and raising a warning for other letters that are definitely invalid).

@aditya0by0 Can you have a look at that?

Edit: I used the following commands:
Migration:

python chebai/preprocessing/migration/deep_go_migrate_deep_go_2_data.py migrate --data_dir=data/deepgo2-train-data --go_branch=mf

Run:

python -m fit --trainer=configs/training/default_trainer.yml --trainer.min_epochs=10 --trainer.max_epochs=10 --model=configs/model/electra.yml --model.train_metrics=configs/metrics/micro-macro-f1.yml --model.test_metrics=configs/metrics/micro-macro-f1.yml --model.val_metrics=configs/metrics/micro-macro-f1.yml --data=configs/data/deepGO2.yml --trainer.logger=configs/training/csv_logger.yml --model.out_dim=898 --model.criterion=configs/loss/bce.yml --data.init_args.batch_size=10 --data.init_args.num_workers=10 --model.pass_loss_kwargs=false --trainer.logger.init_args.name=DeepGO2_MF --data.init_args.splits_file_path=data/GO_UniProt/GO_MF_1000/processed/splits_deep_go2.csv
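The first two issues listed above can be checked independently of chebai with a few lines of Python. The sketch below is only an illustration: the helper names `truncate_sequence` and `find_problem_rows` and the `(sequence, labels)` row layout are hypothetical and not part of the chebai API. It also assumes that over-long sequences should be truncated rather than dropped, as the later commit message "migration fix : truncate seq and save data with labels" suggests.

```python
def truncate_sequence(sequence, max_sequence_length):
    """Cut a protein sequence down to at most max_sequence_length residues."""
    return sequence[:max_sequence_length]


def find_problem_rows(rows, max_sequence_length):
    """Return (too_long, unlabeled) index lists for (sequence, labels) rows.

    too_long:  rows whose sequence exceeds max_sequence_length
    unlabeled: rows whose label list is empty
    """
    too_long = [
        i for i, (seq, _labels) in enumerate(rows)
        if len(seq) > max_sequence_length
    ]
    unlabeled = [
        i for i, (_seq, labels) in enumerate(rows)
        if len(labels) == 0
    ]
    return too_long, unlabeled


# Toy rows standing in for the processed dataset (hypothetical data).
rows = [
    ("MKT" * 2000, [True, False]),  # 6000 residues: too long
    ("MKTAYIAK", []),               # empty label list
    ("MKTAYIAK", [False, True]),    # fine
]
too_long, unlabeled = find_problem_rows(rows, max_sequence_length=1000)
fixed = [(truncate_sequence(seq, 1000), labels) for seq, labels in rows]
```

Running such a check right after migration would surface both problems before any training run is started.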

aditya0by0 · 2024-12-04T11:30:47Z

> DeepGO does not include them as valid, but as far as I can tell, that is not enforced.

> We should either raise a warning and skip the protein or include O and U as valid (I would suggest to do both, i.e., adding O and U as safe and raising a warning for other letters that are definitely invalid).

I agree that invalid amino acids like "O" and "U" are not explicitly enforced as invalid by DeepGO2, but they are also not explicitly treated as valid in their data pipeline.

To elaborate, DeepGO2 includes the amino acid notation "X" in its set of valid amino acids, where "X" represents any/unknown amino acid (as per [Wikipedia]).

In their pipeline, when invalid amino acids such as "U", "O", "B", "Z", "J", or "*" are encountered in the protein sequences, they are effectively mapped to "X". This is evident in their implementation, where the index of "X" is used for any amino acid that doesn't belong to the valid set.

You can see this behavior in their code here:
[DeepGO2 Amino Acid relevant code].

So, while "O" and "U" are not explicitly handled, the use of "X" as a catch-all ensures that any invalid amino acids are safely represented.

If we want to follow the approach mentioned above, we would need to replace every invalid amino acid in the sequence with "X" as part of a pre-processing step before tokenization. This is to avoid inconsistencies in the n-gram tokenization process.

Please let me know how we want to proceed with it.

sfluegel05 · 2024-12-04T12:41:15Z

Ok, so if DeepGO2 replaces every not explicitly valid amino acid with X, then we should do the same. I think that is the easiest solution.
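The replacement step agreed on above could look roughly like this. It is a minimal sketch, not the code that was committed: the names `VALID_AMINO_ACIDS`, `sanitize_sequence` and `to_ngrams` are made up for the example, the valid set is assumed to be the 20 standard amino acids plus "X", and the n-grams are assumed to be overlapping.

```python
# Assumption: the valid alphabet is the 20 standard amino acids plus "X".
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY") | {"X"}


def sanitize_sequence(sequence, valid=VALID_AMINO_ACIDS, unknown="X"):
    """Replace every character outside the valid alphabet with `unknown`."""
    return "".join(aa if aa in valid else unknown for aa in sequence)


def to_ngrams(sequence, n=3):
    """Split a sequence into overlapping n-grams (empty list if too short)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


# U, O, B, *, Z and J all collapse to X before tokenization.
cleaned = sanitize_sequence("MKUOB*ZJA")
tokens = to_ngrams(cleaned, n=3)
```

Sanitizing before tokenization means the n-gram vocabulary only needs entries built from the valid alphabet, so no trigram containing "U" or "O" can appear at training time.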

script to evaluate go predictions

bdba442

aditya0by0 self-assigned this Nov 4, 2024

aditya0by0 mentioned this pull request Nov 4, 2024

Protein function prediction with GO #36

Open

aditya0by0 linked an issue Nov 4, 2024 that may be closed by this pull request

Protein function prediction with GO #36

Open

aditya0by0 added 12 commits November 4, 2024 15:22

Merge branch 'dev' into protein_prediction

264bd94

add fmax to evaluation script

6c0fce1

Merge branch 'dev' into protein_prediction

154e827

add base code for deep_go data migration

58ae92d

- migration from deep go format to chebai->go_uniprot format

varry fmax threshold as per paper

78a38de

go_uniprot: add sequence len to docstring

3a4e007

update experiment evidence codes as per DeepGo SE

227a014

- #36 (comment)

Merge branch 'dev' into protein_prediction

33436e8

consIder X as a valid amino acid as per DeepGO-SE

c6d60cd

- #36 (comment)

deepgo se mirgration : add class to migrate

ca5461f

Merge branch 'dev' into protein_prediction

af54954

migration: rectify errors

dfb9430

aditya0by0 requested a review from sfluegel05 November 7, 2024 10:15

aditya0by0 added 9 commits November 7, 2024 13:25

protein trigram containing tokenS with X

085b13b

- #36 (comment)

protein token unigram contain X

3e0bae0

- #36 (comment)

add migration for deepgo1 - 2018 paper

99b5af1

deepgo1: create non-exclusive val set as a placeholder

a15d492

deepgo1: further split train set into train and val for

e0a8524

- +migration structure changes

migration script update

093be28

add classes to use migrated deepgo data

14db9d6

deepgo: minor code change

8922d4d

modify prints to display actual file name

796356c

aditya0by0 added 3 commits November 17, 2024 23:42

create sub dir for deego dataset and move rel files

3c11a69

update imports as per new deepGO dir

2b571c5

update import dir for pretrain test

f75e30b

migration fix : truncate seq and save data with labels

1b8b270

Delete protein_protein_interactions.py

bcda11c
