Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protein function prediction with GO - Part 3 #64

Draft
wants to merge 27 commits into
base: dev
Choose a base branch
from

Conversation

aditya0by0
Copy link
Collaborator

@aditya0by0 aditya0by0 commented Nov 4, 2024

Note: The above issue will be implemented in 3 PRs:

Changes to be done in this PR

evaluation: Evaluate using the same metrics as DeepGO for comparing the models

From comment #36 (comment)

  • on a new branch: metrics for evaluation (I talked to Martin about the Fmax score: Although it has some methodological issues, we should include it in our evaluation to do a comparison with DeepGO)
  • DeepGO-SE (paper): use these results as a baseline, integrate their data into our pipeline (there is a link to the dataset on their github page

@aditya0by0 aditya0by0 self-assigned this Nov 4, 2024
@aditya0by0 aditya0by0 linked an issue Nov 4, 2024 that may be closed by this pull request
@aditya0by0
Copy link
Collaborator Author

I have made the suggested changes for migration. Please check.

Config for DeepGO1:

class_path: chebai.preprocessing.datasets.go_uniprot.DeepGO1MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1002
  reader_kwargs: {n_gram: 3}

Config for DeepGO2:

class_path: chebai.preprocessing.datasets.go_uniprot.DeepGO2MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1000
  reader_kwargs: {n_gram: 3}

@sfluegel05
Copy link
Collaborator

sfluegel05 commented Dec 3, 2024

I ran the migration script for DeepGO2 and tried training a model with the data. I noticed three issues:

  • The max_sequence_length parameter seems to have no effect (I set it to 1,000, but the processed dataset contains feature-entries that have a length of more than 4,000)
  • The labels are empty lists in the processed data (although a classes_deep_go2.txt file does get generated)
  • The dataset contains "invalid" amino acids (O and U for pyrrolysine and selenocysteine) - as you pointed out, DeepGO does not include them as valid, but as far as I can tell, that is not enforced. Either way, currently, when processing the migrated data, chebai raises an error and stops when reaching either an O or U in the dataset. This is not very useful. We should either raise a warning and skip the protein or include O and U as valid (I would suggest to do both, i.e., adding O and U as safe and raising a warning for other letters that are definitely invalid).

@aditya0by0 Can you have a look at that?

Edit: I used the following commands:
Migration:

python chebai/preprocessing/migration/deep_go_migrate_deep_go_2_data.py migrate --data_dir=data/deepgo2-train-data --go_branch=mf

Run:

python -m fit --trainer=configs/training/default_trainer.yml --trainer.min_epochs=10 --trainer.max_epochs=10 --model=configs/model/electra.yml --model.train_metrics=configs/metrics/micro-macro-f1.yml --model.test_metrics=configs/metrics/micro-macro-f1.yml --model.val_metrics=configs/metrics/micro-macro-f1.yml --data=configs/data/deepGO2.yml --trainer.logger=configs/training/csv_logger.yml --model.out_dim=898 --model.criterion=configs/loss/bce.yml --data.init_args.batch_size=10 --data.init_args.num_workers=10 --model.pass_loss_kwargs=false --trainer.logger.init_args.name=DeepGO2_MF --data.init_args.splits_file_path=data/GO_UniProt/GO_MF_1000/processed/splits_deep_go2.csv

@aditya0by0
Copy link
Collaborator Author

aditya0by0 commented Dec 4, 2024

  • DeepGO does not include them as valid, but as far as I can tell, that is not enforced.
  • We should either raise a warning and skip the protein or include O and U as valid (I would suggest to do both, i.e., adding O and U as safe and raising a warning for other letters that are definitely invalid).

I agree that invalid amino acids like "O" and "U" are not explicitly enforced as invalid by DeepGO2, but they are also not explicitly treated as valid in their data pipeline.

To elaborate, DeepGO2 includes the amino acid notation "X" in its set of valid amino acids, where "X" represents any/unknown amino acids (as per [Wikipedia].

In their pipeline, when invalid amino acids such as "U", "O", "B", "Z", "J", or "*" are encountered in the protein sequences, they are effectively mapped to "X". This is evident in their implementation, where the index of "X" is used for any amino acid that doesn't belong to the valid set.

You can see this behavior in their code here:
[DeepGO2 Amino Acid relevant code].

So, while "O" and "U" are not explicitly handled, the use of "X" as a catch-all ensures that any invalid amino acids are safely represented.

If we want to follow the approach mentioned above, we would need to replace every invalid amino acid in the sequence with "X" as part of a pre-processing step before tokenization. This to avoid inconsistencies in the n-gram tokenization process.

Please let me know how we want to proceed with it.

@sfluegel05
Copy link
Collaborator

Ok, so if DeepGO2 replaces every not explicitly valid amino acid with X, then we should do the same. I think that is the easiest solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Protein function prediction with GO
2 participants