Protein function prediction with GO - Part 2 #57

Merged: 17 commits merged into dev on Oct 30, 2024


aditya0by0
Collaborator

@aditya0by0 aditya0by0 commented Oct 1, 2024

Note: The above issue will be implemented in 2 PRs:

Tasks

  • Minimal dataset implementation: Build a dataset class that extracts proteins and labels from UniProtKB / GO and processes them into a dataset that can be used to train Electra

@aditya0by0 aditya0by0 self-assigned this Oct 1, 2024
@aditya0by0
Collaborator Author

Changes to implement from: #39 (comment)

Thanks for implementing this.

  • For the sequence length: I would expect the maximum sequence length to refer to the number of amino acids. That way, the same proteins are included in the dataset for a given sequence length, no matter the encoding.
  • Separate tokens.txt files for each n-gram: Definitely, since they have different sets of tokens (tokens always have length n for each n-gram). This should happen automatically if you change the name property of the reader.
  • Vocabulary size: That is easy to fix: simply don't use a pretrained model. Since the pretraining was done on SMILES, it makes no sense to use that model for protein sequences. (If we do pretraining for protein sequences in the future, we will have to pretrain a model with vocab_size=8000.)
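
A minimal sketch of the first two points, not the actual reader code: the length filter is applied to the raw amino-acid sequence before tokenization, so the same proteins are kept for every n. The function names, the sliding-window (stride 1) tokenization, and the restriction to the 20 standard amino acids are all assumptions made for illustration. Under those assumptions, 3-grams give 20^3 = 8000 possible tokens, which is one possible reading of the vocab_size=8000 mentioned above.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes). Real UniProtKB sequences
# can also contain rarer letters such as X, U, B or Z; they are ignored here.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def filter_by_length(sequences, max_len):
    """Keep sequences with at most max_len amino acids (counted before tokenization)."""
    return [seq for seq in sequences if len(seq) <= max_len]


def ngram_tokens(sequence, n):
    """Split a sequence into overlapping n-grams (assumed sliding window, stride 1)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def ngram_vocabulary(n):
    """Enumerate every possible n-gram over the standard amino-acid alphabet."""
    return ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]


sequences = ["MKTAYIAKQR", "MKV", "ACDEFGHIKLMNPQRSTVWY"]

# Filter once on amino-acid count, then tokenize with different n.
kept = filter_by_length(sequences, max_len=10)
for n in (1, 3):
    tokenized = [ngram_tokens(seq, n) for seq in kept]
    print(n, [len(tokens) for tokens in tokenized], len(ngram_vocabulary(n)))
```

Because the token sets for n=1 and n=3 share no entries, each n-gram setting needs its own tokens.txt, as noted in the second point.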

I will merge this so we can use the classes for other PRs. Please open a new PR for this branch if you have new changes.

@aditya0by0 aditya0by0 linked an issue Oct 1, 2024 that may be closed by this pull request
@aditya0by0 aditya0by0 mentioned this pull request Oct 1, 2024
31 tasks
@aditya0by0 aditya0by0 linked an issue Oct 1, 2024 that may be closed by this pull request
@aditya0by0
Collaborator Author

Config:

class_path: chebai.preprocessing.datasets.go_uniprot.GOUniProtOver250
init_args:
  go_branch: "BP"
  reader_kwargs: {n_gram: 3}
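
The class_path / init_args layout above matches the convention used by jsonargparse-based CLIs such as LightningCLI; whether this project loads the file in exactly this way is an assumption. As a rough sketch of what such a loader does, the hypothetical helper below imports the class named by class_path and calls it with init_args as keyword arguments. A standard-library class stands in for the dataset so the example runs without chebai installed.

```python
import importlib


def instantiate(config):
    """Import config['class_path'] and call it with config['init_args'] as keyword arguments."""
    module_name, _, class_name = config["class_path"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**config.get("init_args", {}))


# Stand-in config (hypothetical) with the same shape as the one above.
demo_config = {
    "class_path": "fractions.Fraction",
    "init_args": {"numerator": 3, "denominator": 4},
}
obj = instantiate(demo_config)
print(obj)
```

Applied to the config above, this would amount to calling GOUniProtOver250(go_branch="BP", reader_kwargs={"n_gram": 3}), assuming the constructor accepts those keyword arguments.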

@aditya0by0 aditya0by0 requested a review from sfluegel05 October 1, 2024 21:10
@aditya0by0
Collaborator Author

I have completed the changes suggested in our last meeting. Please review.

This was referenced Oct 26, 2024
@aditya0by0 aditya0by0 marked this pull request as ready for review October 30, 2024 18:26
@aditya0by0
Collaborator Author

The next steps here are:

  • Pretraining: add a filter for sequence length as a hyperparameter
  • Merge the feature branch into the dev branch

Merging this branch, as suggested in comment #36 (comment)

A new PR from the same branch will be created for the remaining changes.

@aditya0by0 aditya0by0 merged commit 20764f7 into dev Oct 30, 2024
2 checks passed
@aditya0by0 aditya0by0 linked an issue Nov 4, 2024 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

Protein function prediction with GO
2 participants