
Protein function prediction with GO #39

Merged
merged 32 commits into from
Sep 24, 2024

Conversation

@aditya0by0 (Collaborator) commented Jul 21, 2024

Tasks

  • Minimal dataset implementation: Build a dataset class that extracts proteins and labels from UniProtKB / GO and processes them into a dataset that can be used to train Electra
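
As a rough illustration of this task (not the actual chebai implementation), the sketch below pulls protein sequences and their GO annotations out of a SwissProt flat file with Biopython; the file path and the downstream Electra-ready format are assumptions.

    from Bio import SwissProt

    def iter_protein_samples(swissprot_path: str):
        """Yield (accession, amino-acid sequence, GO term IDs) for annotated proteins."""
        with open(swissprot_path) as handle:
            for record in SwissProt.parse(handle):
                # cross_references holds tuples such as ("GO", "GO:0005524", ...)
                go_labels = [xref[1] for xref in record.cross_references if xref[0] == "GO"]
                if go_labels:  # proteins without GO annotations carry no labels
                    yield record.accessions[0], record.sequence, go_labels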

@aditya0by0 aditya0by0 changed the title basic data processing for go-uniprot dataset Protein function prediction with GO Jul 28, 2024
@aditya0by0 aditya0by0 self-assigned this Jul 28, 2024
@aditya0by0 aditya0by0 requested a review from sfluegel05 July 31, 2024 21:57
@sfluegel05 (Collaborator)

Things we talked about today:

  • Currently, the dataset is using GO classes as samples. Instead, GO classes should only be labels and SwissProt proteins should only be samples (there is no 1-to-1 correspondence between the two).
    • You can do a sanity check for the data by comparing the number of samples / labels to the DeepGO paper
  • Users should be able to select a GO branch (biological processes (BP), molecular functions (MF) and cellular components (CC)); see the sketch after this list
  • Since some of the functionality is not specific to this dataset (e.g. creating the data splits) and is also used in ChEBI, it should be outsourced into an intermediate class inherited by both
  • The generated tokens.txt is incomplete
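
Regarding the branch-selection point, here is a minimal sketch of filtering GO terms by their OBO namespace, assuming obonet is used for parsing (the repository may use a different OBO parser):

    import obonet

    GO_BRANCHES = {
        "BP": "biological_process",
        "MF": "molecular_function",
        "CC": "cellular_component",
    }

    def go_terms_in_branch(obo_path: str, branch: str) -> set:
        graph = obonet.read_obo(obo_path)  # each GO term becomes a node carrying its OBO fields
        namespace = GO_BRANCHES[branch]
        return {term for term, data in graph.nodes(data=True) if data.get("namespace") == namespace}

    # e.g. go_terms_in_branch("go-basic.obo", "MF") would return only molecular_function terms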

@aditya0by0 (Collaborator, Author) commented Aug 2, 2024

RuntimeError: Error while merging hparams: the keys ['class_path', 'init_args'] are present in both the LightningModule's and LightningDataModule's hparams but have different values.


I wanted to mention an issue I encountered recently: the RuntimeError above, which indicates that ['class_path', 'init_args'] are present in both the LightningModule's and the LightningDataModule's hyperparameters but with differing values.

It seems this error was related to the recent versions of the pytorch-lightning and lightning packages. I found that downgrading both packages from version 2.3.2 to 2.1.2 resolved the issue.

It might be helpful to review the compatibility of the latest versions with our current configuration at a later date to prevent similar issues in the future.

torchmd/torchmd-net#205
Lightning-AI/pytorch-lightning#9492
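
Not part of the project tooling, but a quick way to confirm (after the downgrade described above) that both distributions report the version that avoided the hparams clash:

    import importlib.metadata as metadata

    for package in ("lightning", "pytorch-lightning"):
        print(package, metadata.version(package))  # expected: 2.1.2 after the downgrade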

@sfluegel05 (Collaborator)

Great that you were able to solve this issue. However, I still don't understand where exactly this is coming from. I can't reproduce it with either lightning version 2.3.2 or 2.1.2. Also, class_path and init_args are not hyperparameters I would expect to be present in the LightningModule or LightningDataModule; the parser should have resolved those into classes.

In the _log_hyperparams call that raises the RuntimeError, I have the following hyperparameters:

lightning_hparams
"config":                {'vocab_size': 1400, 'max_position_embeddings': 1800, 'num_attention_heads': 8, 'num_hidden_layers': 6, 'type_vocab_size': 1, 'hidden_size': 256}
"load_prefix":           generator.
"optimizer_kwargs":      {'lr': 0.001}
"out_dim":               1511
"pass_loss_kwargs":      False
"pretrained_checkpoint": electra_pretrained.ckpt
datamodule_hparams
"balance_after_filter": None
"base_dir":             None
"batch_size":           10
"chebi_version":        200
"data_limit":           None
"fold_index":           None
"inner_k_folds":        -1
"label_filter":         None
"num_workers":          10
"prediction_kind":      test
"reader_kwargs":        None
"seed":                 42
"splits_file_path":     None
"train_split":          0.85

- logic to select go data branch based on given input
- update class hierarchy and raw data logic
- combines the swiss data with GO data
- ambiguous_amino_acids
- sequence_length
- experimental_evidence_codes
PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling frame.insert many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
  data_df[self.select_classes(g, data_df=data_df)] = False
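
One way to avoid this warning (a sketch with made-up column names, not the repository code) is to build all new label columns at once and concatenate them in a single call instead of inserting them one by one:

    import pandas as pd

    data_df = pd.DataFrame({"swissprot_id": ["P12345", "Q67890"]})  # toy frame
    new_label_columns = ["GO:0008150", "GO:0003674", "GO:0005575"]  # stand-in for self.select_classes(...)

    label_block = pd.DataFrame(False, index=data_df.index, columns=new_label_columns)
    data_df = pd.concat([data_df, label_block], axis=1)  # one concatenation instead of many inserts
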
@aditya0by0 aditya0by0 linked an issue Aug 18, 2024 that may be closed by this pull request
@aditya0by0 (Collaborator, Author) commented Aug 28, 2024

Protein Preprocessing Statistics

These are the statistics for the proteins that were ignored during preprocessing due to either non-valid amino acids or sequence lengths greater than 1002, as per the guidelines outlined in the paper:

  • Number of proteins with non-valid amino acids: 2,672 (0.47% of the dataset)
  • Number of proteins with sequence length greater than 1002: 19,004 (3.32% of the dataset)
  • Number of proteins with both non-valid amino acids and length greater than 1002: 154
  • Total number of ignored proteins (either condition): 21,522 (3.76% of the dataset)
  • Original dataset size: 571,864 proteins

The number of ignored proteins is insignificant compared to the size of the whole dataset.

I have attached the CSV file which lists the IDs (and their relevant details) of the ignored proteins for reference.

proteins_with_issues.csv
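
For illustration, the two exclusion criteria above could be expressed as follows (a minimal sketch, not the repository code; the toy proteins are made up):

    VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues
    MAX_SEQUENCE_LENGTH = 1002

    def is_usable_protein(sequence: str) -> bool:
        return set(sequence) <= VALID_AMINO_ACIDS and len(sequence) <= MAX_SEQUENCE_LENGTH

    proteins = {"P1": "MKTAYIAKQR", "P2": "MKXA" * 300}  # P2 contains 'X' and is 1200 residues long
    kept = {pid: seq for pid, seq in proteins.items() if is_usable_protein(seq)}
    print(sorted(kept))  # ['P1']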

@aditya0by0 (Collaborator, Author) commented Aug 28, 2024

Also, I have updated the Wiki for GOUniProt data folder structure, as suggested. Please review whenever possible.

@aditya0by0 aditya0by0 mentioned this pull request Aug 29, 2024
31 tasks
@aditya0by0 (Collaborator, Author)

Shortening Input Sequence Lengths and Handling n-grams #36 (comment)

  1. Input Sequence Length (Commit: 62a3f45):

    • Added a parameter for maximum input sequence length (default: 1002).
    • Removed the restriction that only proteins with sequences shorter than 1002 are considered, since the maximum sequence length is now an input parameter. The new dataloader selects the first 'n' features based on the specified sequence length (here, each feature corresponds to the index of a token).
  2. Trigrams / n-grams (Commit: 108d9ca):

    • A new data.pt file is created for each n-gram.

    • Handling n-grams and Sequence Length: If we use the dataloader to truncate the sequence based on the maximum sequence length, then when using trigrams the sequence length will refer to the number of trigrams, not individual amino acid letters. The dataloader loads the data.pt file, which has the sequence numerically encoded in the features key based on each token's index position in tokens.txt. Is this the intended behavior?

    • Question: Do we need separate tokens.txt files for each n-gram, or can we have a single common file for all n-grams? For trigrams, there are at most 8,000 unique tokens (20 valid amino acids, so 20^3 = 8,000 possible trigrams).

    • Vocabulary Issue: The current vocabulary size is 1,400. Since trigrams require handling up to 8,000 unique tokens, an increase in vocab_size is necessary. However, using 8,000 tokens causes an error due to a mismatch with the pre-trained electra_pretrained.ckpt model, which was trained with a vocab size of 1,400.

    Error:

     Sanity Checking: |          | 0/? [00:00<?, ?it/s]
     Loading splits from data/GO_UniProt/GO250_BP/processed/splits.csv...
     G:\anaconda3\envs\env_chebai\lib\site-packages\torch\utils\data\dataloader.py:558: UserWarning: This DataLoader will create 9 worker processes in total. Our suggested max number of worker in current system is 8 (`cpuset` is not taken into account), which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
       warnings.warn(_create_warning_msg(
     Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
       File "G:\anaconda3\envs\env_chebai\lib\runpy.py", line 196, in _run_module_as_main
         return _run_code(code, main_globals, None,
       File "G:\anaconda3\envs\env_chebai\lib\runpy.py", line 86, in _run_code
         exec(code, run_globals)
       File "G:\github-aditya0by0\python-chebai\chebai\__main__.py", line 10, in <module>
         cli()
       File "G:\github-aditya0by0\python-chebai\chebai\cli.py", line 75, in cli
         r = ChebaiCLI(
       File "G:\github-aditya0by0\python-chebai\chebai\cli.py", line 31, in __init__
         super().__init__(trainer_class=CustomTrainer, *args, **kwargs)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\cli.py", line 386, in __init__
         self._run_subcommand(self.subcommand)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\cli.py", line 677, in _run_subcommand
         fn(**fn_kwargs)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\trainer\trainer.py", line 544, in fit
         call._call_and_handle_interrupt(
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\trainer\call.py", line 44, in _call_and_handle_interrupt
         return trainer_fn(*args, **kwargs)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\trainer\trainer.py", line 580, in _fit_impl
         self._run(model, ckpt_path=ckpt_path)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\trainer\trainer.py", line 989, in _run
         results = self._run_stage()
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\trainer\trainer.py", line 1033, in _run_stage
         self._run_sanity_check()
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\trainer\trainer.py", line 1062, in _run_sanity_check
         val_loop.run()
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\loops\utilities.py", line 182, in _decorator
         return loop_run(self, *args, **kwargs)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\loops\evaluation_loop.py", line 134, in run
         self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\loops\evaluation_loop.py", line 391, in _evaluation_step
         output = call._call_strategy_hook(trainer, hook_name, *step_args)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\trainer\call.py", line 309, in _call_strategy_hook
         output = fn(*args, **kwargs)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\lightning\pytorch\strategies\strategy.py", line 403, in validation_step
         return self.lightning_module.validation_step(*args, **kwargs)
       File "G:\github-aditya0by0\python-chebai\chebai\models\base.py", line 169, in validation_step
         return self._execute(
       File "G:\github-aditya0by0\python-chebai\chebai\models\base.py", line 234, in _execute
         model_output = self(data, **data.get("model_kwargs", dict()))
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
         return self._call_impl(*args, **kwargs)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
         return forward_call(*args, **kwargs)
       File "G:\github-aditya0by0\python-chebai\chebai\models\electra.py", line 326, in forward
         inp = self.electra.embeddings.forward(data["features"].int())
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\transformers\models\electra\modeling_electra.py", line 193, in forward
         inputs_embeds = self.word_embeddings(input_ids)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
         return self._call_impl(*args, **kwargs)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
         return forward_call(*args, **kwargs)
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\torch\nn\modules\sparse.py", line 163, in forward
         return F.embedding(
       File "G:\anaconda3\envs\env_chebai\lib\site-packages\torch\nn\functional.py", line 2264, in embedding
         return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
     IndexError: index out of range in self
    

    It seems that this error occurs when trying to embed tokens whose indices exceed the 1,400-entry vocabulary of the pre-trained model.
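
    A small reproduction of that IndexError, assuming only the shapes described above (an embedding table with vocab_size=1400 rows and trigram indices running up to ~8000):

        import torch
        import torch.nn as nn

        embedding = nn.Embedding(num_embeddings=1400, embedding_dim=256)
        embedding(torch.tensor([5, 100, 1399]))   # fine: every index is below 1400
        embedding(torch.tensor([5, 100, 7999]))   # IndexError: index out of range in self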

@sfluegel05 sfluegel05 marked this pull request as ready for review September 24, 2024 07:38
@sfluegel05 (Collaborator)

Thanks for implementing this.

  • For the sequence length: I would expect the maximum sequence length to refer to the number of amino acids. That way, the same proteins are included in the dataset for a given sequence length, no matter the encoding (see the sketch below).
  • Separate tokens.txt files for each n-gram: Definitely, since they have different sets of tokens (tokens always have length n for each n-gram). This should happen automatically if you change the name property of the reader.
  • Vocabulary size: That is easy to fix: Simply don't use a pretrained model. Since the pretraining has been done on SMILES, it makes no sense to use that model for protein sequences. (Maybe we will do pretraining for protein sequences in the future, then we will have to pretrain a model with vocab_size=8000)

I will merge this so we can use the classes for other PRs. Please open a new PR for this branch if you have new changes.
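
A sketch of the first point (not the repository implementation): truncate at the amino-acid level before building overlapping n-grams, so that the maximum sequence length always counts residues regardless of the encoding:

    def encode_ngrams(sequence: str, n: int = 3, max_sequence_length: int = 1002) -> list:
        sequence = sequence[:max_sequence_length]           # truncate on residues first
        return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

    print(encode_ngrams("MKTAYIAKQR", n=3, max_sequence_length=6))
    # ['MKT', 'KTA', 'TAY', 'AYI']  (four trigrams from the first six residues)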

@sfluegel05 sfluegel05 merged commit a95415b into dev Sep 24, 2024
2 checks passed
@aditya0by0 (Collaborator, Author)

If 1002 is set as the maximum input sequence length, the updated behavior will truncate any protein sequences longer than 1002 amino acids, selecting only the first 1002. This may result in a partial representation of the protein, as the entire sequence may not be captured.
In contrast, the approach described in the DeepGO paper excludes any protein sequences that exceed the specified length threshold, skipping them entirely rather than truncating.

Shortening Input Sequence Lengths and Handling n-grams #36 (comment)

  1. Input Sequence Length (Commit: 62a3f45):

    • Added a parameter for maximum input sequence length (default: 1002).
    • Removed the restriction that only proteins with sequences shorter than 1002 are considered, since the maximum sequence length is now an input parameter. The new dataloader selects the first 'n' features based on the specified sequence length (here, each feature corresponds to the index of a token).

Thanks for implementing this.

  • For the sequence length: I would expect the maximum sequence length to refer to the number of amino acids. That way, the same proteins are included in the dataset for a given sequence length, no matter the encoding.

@aditya0by0 aditya0by0 linked an issue Oct 1, 2024 that may be closed by this pull request
schnamo pushed a commit to schnamo/python-chebai that referenced this pull request Dec 11, 2024