
Protein function prediction with GO #39

Merged
merged 32 commits into from
Sep 24, 2024

Conversation

@aditya0by0 (Collaborator) commented Jul 21, 2024

Tasks

  • Minimal dataset implementation: Build a dataset class that extracts proteins and labels from UniProtKB / GO and processes them into a dataset that can be used to train Electra
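
As a rough illustration of this task (not the actual chebai implementation), the sketch below pulls protein sequences and their GO annotations out of a SwissProt flat file with Biopython; the file path and the downstream Electra-ready format are assumptions.

    from Bio import SwissProt

    def iter_protein_samples(swissprot_path: str):
        """Yield (accession, amino-acid sequence, GO term IDs) for annotated proteins."""
        with open(swissprot_path) as handle:
            for record in SwissProt.parse(handle):
                # cross_references holds tuples such as ("GO", "GO:0005524", ...)
                go_labels = [xref[1] for xref in record.cross_references if xref[0] == "GO"]
                if go_labels:  # proteins without GO annotations carry no labels
                    yield record.accessions[0], record.sequence, go_labels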

@aditya0by0 aditya0by0 changed the title basic data processing for go-uniprot dataset Protein function prediction with GO Jul 28, 2024
@aditya0by0 aditya0by0 self-assigned this Jul 28, 2024
@aditya0by0 aditya0by0 requested a review from sfluegel05 July 31, 2024 21:57
@sfluegel05 (Collaborator)

Things we talked about today:

  • Currently, the dataset is using GO classes as samples. Instead, GO classes should only be labels and SwissProt proteins should only be samples (there is no 1-to-1 correspondence between the two).
    • You can do a sanity check for the data by comparing the number of samples / labels to the DeepGO paper
  • Users should be able to select a GO branch (biological processes (BP), molecular functions (MF) and cellular components (CC)); see the sketch after this list
  • Since some of the functionality is not specific to this dataset (e.g. creating the data splits) and is also used in ChEBI, it should be outsourced into an intermediate class inherited by both
  • The generated tokens.txt is incomplete
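
Regarding the branch-selection point, here is a minimal sketch of filtering GO terms by their OBO namespace, assuming obonet is used for parsing (the repository may use a different OBO parser):

    import obonet

    GO_BRANCHES = {
        "BP": "biological_process",
        "MF": "molecular_function",
        "CC": "cellular_component",
    }

    def go_terms_in_branch(obo_path: str, branch: str) -> set:
        graph = obonet.read_obo(obo_path)  # each GO term becomes a node carrying its OBO fields
        namespace = GO_BRANCHES[branch]
        return {term for term, data in graph.nodes(data=True) if data.get("namespace") == namespace}

    # e.g. go_terms_in_branch("go-basic.obo", "MF") would return only molecular_function terms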

@aditya0by0 (Collaborator, Author) commented Aug 2, 2024

RuntimeError: Error while merging hparams: the keys ['class_path', 'init_args'] are present in both the LightningModule's and LightningDataModule's hparams but have different values.


I wanted to mention an issue I encountered recently: the RuntimeError above, which indicates that ['class_path', 'init_args'] are present in both the LightningModule's and the LightningDataModule's hyperparameters but with differing values.

It seems this error was related to the recent versions of the pytorch-lightning and lightning packages. I found that downgrading both packages from version 2.3.2 to 2.1.2 resolved the issue.

It might be helpful to review the compatibility of the latest versions with our current configuration at a later date to prevent similar issues in the future.

torchmd/torchmd-net#205
Lightning-AI/pytorch-lightning#9492
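
Not part of the project tooling, but a quick way to confirm (after the downgrade described above) that both distributions report the version that avoided the hparams clash:

    import importlib.metadata as metadata

    for package in ("lightning", "pytorch-lightning"):
        print(package, metadata.version(package))  # expected: 2.1.2 after the downgrade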

@sfluegel05 (Collaborator)

Great that you were able to solve this issue. However, I still don't understand where exactly this is coming from. I can't reproduce it with either lightning version 2.3.2 or 2.1.2. Also, class_path and init_args are not hyperparameters I would expect to be present in the LightningModule or LightningDataModule; the parser should have resolved those into classes.

In the _log_hyperparams call that raises the RuntimeError, I have the following hyperparameters:

lightning_hparams
"config":                {'vocab_size': 1400, 'max_position_embeddings': 1800, 'num_attention_heads': 8, 'num_hidden_layers': 6, 'type_vocab_size': 1, 'hidden_size': 256}
"load_prefix":           generator.
"optimizer_kwargs":      {'lr': 0.001}
"out_dim":               1511
"pass_loss_kwargs":      False
"pretrained_checkpoint": electra_pretrained.ckpt
datamodule_hparams
"balance_after_filter": None
"base_dir":             None
"batch_size":           10
"chebi_version":        200
"data_limit":           None
"fold_index":           None
"inner_k_folds":        -1
"label_filter":         None
"num_workers":          10
"prediction_kind":      test
"reader_kwargs":        None
"seed":                 42
"splits_file_path":     None
"train_split":          0.85

- logic to select go data branch based on given input
- update class hierarchy and raw data logic
- combines the swiss data with GO data
- ambiguous_amino_acids
- sequence_length
- experimental_evidence_codes
PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling frame.insert many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
  data_df[self.select_classes(g, data_df=data_df)] = False
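
One way to avoid this warning (a sketch with made-up column names, not the repository code) is to build all new label columns at once and concatenate them in a single call instead of inserting them one by one:

    import pandas as pd

    data_df = pd.DataFrame({"swissprot_id": ["P12345", "Q67890"]})  # toy frame
    new_label_columns = ["GO:0008150", "GO:0003674", "GO:0005575"]  # stand-in for self.select_classes(...)

    label_block = pd.DataFrame(False, index=data_df.index, columns=new_label_columns)
    data_df = pd.concat([data_df, label_block], axis=1)  # one concatenation instead of many inserts
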
@aditya0by0 aditya0by0 linked an issue Aug 18, 2024 that may be closed by this pull request
@aditya0by0 (Collaborator, Author) commented Aug 28, 2024

Protein Preprocessing Statistics

These are the statistics for the proteins that were ignored during preprocessing due to either non-valid amino acids or sequence lengths greater than 1002, as per the guidelines outlined in the paper:

  • Number of proteins with non-valid amino acids: 2,672 (0.47% of the dataset)
  • Number of proteins with sequence length greater than 1002: 19,004 (3.32% of the dataset)
  • Number of proteins with both non-valid amino acids and length greater than 1002: 154
  • Total number of ignored proteins (either condition): 21,522 (3.76% of the dataset)
  • Original dataset size: 571,864 proteins

The number of ignored proteins is insignificant compared to the size of the whole dataset.

I have attached the CSV file which lists the IDs (and their relevant details) of the ignored proteins for reference.

proteins_with_issues.csv
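
For illustration, the two exclusion criteria above could be expressed as follows (a minimal sketch, not the repository code; the toy proteins are made up):

    VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues
    MAX_SEQUENCE_LENGTH = 1002

    def is_usable_protein(sequence: str) -> bool:
        return set(sequence) <= VALID_AMINO_ACIDS and len(sequence) <= MAX_SEQUENCE_LENGTH

    proteins = {"P1": "MKTAYIAKQR", "P2": "MKXA" * 300}  # P2 contains 'X' and is 1200 residues long
    kept = {pid: seq for pid, seq in proteins.items() if is_usable_protein(seq)}
    print(sorted(kept))  # ['P1']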

@aditya0by0 (Collaborator, Author) commented Aug 28, 2024

Also, I have updated the Wiki for GOUniProt data folder structure, as suggested. Please review whenever possible.

@aditya0by0 aditya0by0 mentioned this pull request Aug 29, 2024
31 tasks
@aditya0by0 (Collaborator, Author)

Shortening Input Sequence Lengths and Handling n-grams #36 (comment)

  1. Input Sequence Length (Commit: 62a3f45):

    • Added a parameter for maximum input sequence length (default: 1002).
    • Removed the restriction that only proteins with sequences shorter than 1002 are considered, since the maximum sequence length is now an input parameter. The new dataloader selects the first 'n' features based on the specified sequence length (here, each feature corresponds to the index of a token).
  2. Trigrams / n-grams (Commit: 108d9ca):

    • A new data.pt file is created for each n-gram.

    • Handling n-grams and Sequence Length: If we use the dataloader to truncate the sequence based on the maximum sequence length, then when using trigrams the sequence length will refer to the number of trigrams, not individual amino acid letters. The dataloader loads the data.pt file, which has the sequence numerically encoded in the features key based on each token's index position in tokens.txt. Is this the intended behavior?

    • Question: Do we need separate tokens.txt files for each n-gram, or can we have a single common file for all n-grams? For trigrams, there are at most 8,000 unique tokens (20 valid amino acids, so 20^3 = 8,000 possible trigrams).

    • Vocabulary Issue: The current vocabulary size is 1,400. Since trigrams require handling up to 8,000 unique tokens, an increase in vocab_size is necessary. However, using 8,000 tokens causes an error due to a mismatch with the pre-trained electra_pretrained.ckpt model, which was trained with a vocab size of 1,400.

    Error:

     Sanity Checking: |          | 0/? [00:00<?, ?it/s]
     Loading splits from data/GO_UniProt/GO250_BP/processed/splits.csv...
     G:\anaconda3\envs\env_chebai\lib\site-packages\torch\utils\data\dataloader.py:558: UserWarning: This DataLoader will create 9 worker processes in total. Our suggested max number of worker in current system is 8 (`cpuset` is not taken into account), which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
       warnings.warn(_create_warning_msg(
     Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
       File "G:\anaconda3\envs\env_chebai\lib\runpy.py", line 196, in _run_module_as_main
         return _run_code(code, main_globals, None,
       File "G:\anaconda3\envs\env_chebai\lib\runpy.py", line 86, in _run_code
         exec(code, run_globals)
       File "G:\github-aditya0by0\python-chebai\chebai\__main__.py", line 10, in <module>
         cli()
       File "G:\github-aditya0by0\python-chebai\chebai\cli.py", line 75, in cli
         r = ChebaiCLI(
       File "G:\github-aditya0by0\python-chebai\chebai\cli.py", line 31, in __init__
         super().__init__(trainer_class=CustomTrainer, *args, **kwargs)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\cli.py", line 386, in __init__
         self._run_subcommand(self.subcommand)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\cli.py", line 677, in _run_subcommand
         fn(**fn_kwargs)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\trainer\trainer.py", line 544, in fit
         call._call_and_handle_interrupt(
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\trainer\call.py", line 44, in _call_and_handle_interrupt
         return trainer_fn(*args, **kwargs)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\trainer\trainer.py", line 580, in _fit_impl
         self._run(model, ckpt_path=ckpt_path)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\trainer\trainer.py", line 989, in _run
         results = self._run_stage()
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\trainer\trainer.py", line 1033, in _run_stage
         self._run_sanity_check()
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\trainer\trainer.py", line 1062, in _run_sanity_check
         val_loop.run()
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\loops\utilities.py", line 182, in _decorator
         return loop_run(self, *args, **kwargs)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\loops\evaluation_loop.py", line 134, in run
         self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\loops\evaluation_loop.py", line 391, in _evaluation_step
         output = call._call_strategy_hook(trainer, hook_name, *step_args)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\trainer\call.py", line 309, in _call_strategy_hook
         output = fn(*args, **kwargs)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\strategies\strategy.py", line 403, in validation_step
         return self.lightning_module.validation_step(*args, **kwargs)
       File "G:\github-aditya0by0\python-chebai\chebai\models\base.py", line 169, in validation_step
         return self._execute(
       File "G:\github-aditya0by0\python-chebai\chebai\models\base.py", line 234, in _execute
         model_output = self(data, **data.get("model_kwargs", dict()))
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
         return self._call_impl(*args, **kwargs)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
         return forward_call(*args, **kwargs)
       File "G:\github-aditya0by0\python-chebai\chebai\models\electra.py", line 326, in forward
         inp = self.electra.embeddings.forward(data["features"].int())
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\transformers\models\electra\modeling_electra.py", line 193, in forward
         inputs_embeds = self.word_embeddings(input_ids)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
         return self._call_impl(*args, **kwargs)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
         return forward_call(*args, **kwargs)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\torch\nn\modules\sparse.py", line 163, in forward
         return F.embedding(
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\torch\nn\functional.py", line 2264, in embedding
         return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
     IndexError: index out of range in self
    

    It seems that this error occurs when trying to embed tokens whose indices exceed the 1,400-entry vocabulary of the pre-trained model.
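
    A small reproduction of that IndexError, assuming only the shapes described above (an embedding table with vocab_size=1400 rows and trigram indices running up to ~8000):

        import torch
        import torch.nn as nn

        embedding = nn.Embedding(num_embeddings=1400, embedding_dim=256)
        embedding(torch.tensor([5, 100, 1399]))   # fine: every index is below 1400
        embedding(torch.tensor([5, 100, 7999]))   # IndexError: index out of range in self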

@sfluegel05 sfluegel05 marked this pull request as ready for review September 24, 2024 07:38
@sfluegel05 (Collaborator)

Thanks for implementing this.

  • For the sequence length: I would expect the maximum sequence length to refer to the number of amino acids. That way, the same proteins are included in the dataset for a given sequence length, no matter the encoding (see the sketch below).
  • Separate tokens.txt files for each n-gram: Definitely, since they have different sets of tokens (tokens always have length n for each n-gram). This should happen automatically if you change the name property of the reader.
  • Vocabulary size: That is easy to fix: Simply don't use a pretrained model. Since the pretraining has been done on SMILES, it makes no sense to use that model for protein sequences. (Maybe we will do pretraining for protein sequences in the future, then we will have to pretrain a model with vocab_size=8000)

I will merge this so we can use the classes for other PRs. Please open a new PR for this branch if you have new changes.
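
A sketch of the first point (not the repository implementation): truncate at the amino-acid level before building overlapping n-grams, so that the maximum sequence length always counts residues regardless of the encoding:

    def encode_ngrams(sequence: str, n: int = 3, max_sequence_length: int = 1002) -> list:
        sequence = sequence[:max_sequence_length]           # truncate on residues first
        return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

    print(encode_ngrams("MKTAYIAKQR", n=3, max_sequence_length=6))
    # ['MKT', 'KTA', 'TAY', 'AYI']  (four trigrams from the first six residues)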

@sfluegel05 sfluegel05 merged commit a95415b into dev Sep 24, 2024
2 checks passed
@aditya0by0 (Collaborator, Author)

If 1002 is set as the maximum input sequence length, the updated behavior will truncate any protein sequences longer than 1002 amino acids, selecting only the first 1002. This may result in a partial representation of the protein, as the entire sequence may not be captured.
In contrast, the approach described in the DeepGO paper excludes any protein sequences that exceed the specified length threshold, skipping them entirely rather than truncating.

Shortening Input Sequence Lengths and Handling n-grams #36 (comment)

  1. Input Sequence Length (Commit: 62a3f45):

    • Added a parameter for maximum input sequence length (default: 1002).
    • Removed the restriction that only proteins with sequences shorter than 1002 are considered, since the maximum sequence length is now an input parameter. The new dataloader selects the first 'n' features based on the specified sequence length (here, each feature corresponds to the index of a token).

Thanks for implementing this.

  • For the sequence length: I would expect the maximum sequence length to refer to the number of amino acids. That way, the same proteins are included in the dataset for a given sequence length, no matter the encoding.

@aditya0by0 aditya0by0 linked an issue Oct 1, 2024 that may be closed by this pull request
schnamo pushed a commit to schnamo/python-chebai that referenced this pull request Dec 11, 2024