Protein function prediction with GO - Part 2 #57

aditya0by0 · 2024-10-01T09:05:28Z

PR for the Issue Protein function prediction with GO #36

Note: The above issue will be implemented in 2 PRs:

Tasks

Minimal dataset implementation: Build a dataset class that extracts proteins and labels from UniProtKB / GO and processes them into a dataset that can be used to train Electra

aditya0by0 · 2024-10-01T09:07:02Z

Changes to implement from: #39 (comment)

Thanks for implementing this.

For the sequence length: I would expect the maximum sequence length to refer to the number of amino acids. That way, the same proteins are included in the dataset for a given sequence length, no matter the encoding.
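A minimal sketch of this point, assuming overlapping n-grams with stride 1. The helper names `ngram_tokens` and `filter_by_length` are hypothetical and not part of chebai. Filtering on the raw amino-acid count before tokenizing keeps the selected proteins identical for every n-gram size:

```python
from typing import Dict, List


def ngram_tokens(sequence: str, n: int) -> List[str]:
    """Split a sequence into overlapping n-grams (assumed sliding window, stride 1)."""
    return [sequence[i : i + n] for i in range(len(sequence) - n + 1)]


def filter_by_length(proteins: Dict[str, str], max_len: int) -> Dict[str, str]:
    """Keep proteins whose amino-acid count (not token count) is at most max_len."""
    return {pid: seq for pid, seq in proteins.items() if len(seq) <= max_len}


# Toy sequences for illustration only.
proteins = {"P1": "MKTAYIAK", "P2": "MKT", "P3": "MKTAYIAKQRQISFVK"}

# The length filter runs once, on amino acids, before any encoding is applied.
kept = filter_by_length(proteins, max_len=8)
tokens_1 = {pid: ngram_tokens(seq, 1) for pid, seq in kept.items()}
tokens_3 = {pid: ngram_tokens(seq, 3) for pid, seq in kept.items()}
```

Because the filter never looks at the tokens, `tokens_1` and `tokens_3` cover the same protein IDs even though a 3-gram encoding yields fewer tokens per sequence.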

Separate tokens.txt files for each n-gram: Definitely, since they have different sets of tokens (tokens always have length n for each n-gram). This should happen automatically if you change the name property of the reader.
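A rough sketch of how a name property could drive this, under stated assumptions: `SketchProteinReader`, the name format, and the `<name>/tokens.txt` layout are all hypothetical stand-ins, not the actual chebai reader.

```python
from pathlib import PurePosixPath
from typing import Optional


class SketchProteinReader:
    """Hypothetical stand-in for a protein data reader; not the chebai class."""

    def __init__(self, n_gram: Optional[int] = None) -> None:
        self.n_gram = n_gram

    @property
    def name(self) -> str:
        # Encode the n-gram size in the name so each setting gets its own directory.
        if self.n_gram is None:
            return "protein_token"
        return f"protein_token_{self.n_gram}_gram"

    @property
    def token_path(self) -> PurePosixPath:
        # Assumed layout: <name>/tokens.txt, so the path follows from the name.
        return PurePosixPath(self.name) / "tokens.txt"
```

With this layout, `SketchProteinReader(n_gram=3).token_path` and `SketchProteinReader(n_gram=2).token_path` point to different files without any extra bookkeeping.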

Vocabulary size: That is easy to fix: simply don't use a pretrained model. Since the pretraining was done on SMILES, it makes no sense to use that model for protein sequences. (If we do pretraining for protein sequences in the future, we will have to pretrain a model with vocab_size=8000.)
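The figure of 8000 is consistent with 3-grams over the 20 standard amino acids (20^3 = 8000). A short check, assuming the vocabulary contains only those 20 letters and no special or ambiguous-residue tokens (the name `ngram_vocab_size` is illustrative):

```python
from itertools import product

# Assumption: only the 20 standard amino acids appear in the vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_vocab_size(n: int) -> int:
    """Number of distinct n-grams over the assumed amino-acid alphabet."""
    return len(AMINO_ACIDS) ** n


# Enumerate all 3-grams explicitly to confirm the closed-form count.
three_grams = {"".join(chars) for chars in product(AMINO_ACIDS, repeat=3)}
```

Any special tokens (padding, mask, and so on) would come on top of this count, so the model's actual `vocab_size` may need to be somewhat larger.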

I will merge this so we can use the classes for other PRs. Please open a new PR for this branch if you have new changes.

aditya0by0 · 2024-10-01T20:34:38Z

Config:

class_path: chebai.preprocessing.datasets.go_uniprot.GOUniProtOver250
init_args:
  go_branch: "BP"
  reader_kwargs: {n_gram: 3}

aditya0by0 · 2024-10-12T11:08:41Z

I have completed the changes suggested in our last meeting. Please review.

aditya0by0 · 2024-10-30T18:27:08Z

The next steps here are:

Pretraining: add filter for sequence length as hyperparameter

merge the feature branch into the dev branch

Merging this branch, as suggested in comment #36 (comment)

A new PR from the same branch will be created for the rest of the changes.

Separate tokens.txt files for each n-gram

1a32757

aditya0by0 self-assigned this Oct 1, 2024

aditya0by0 mentioned this pull request Oct 1, 2024

Protein function prediction with GO #36

Open

aditya0by0 linked an issue Oct 1, 2024 that may be closed by this pull request

Protein function prediction with GO #36

Open

aditya0by0 mentioned this pull request Oct 1, 2024

PreProcessing unit tests #48

Merged

31 tasks

aditya0by0 removed a link to an issue Oct 1, 2024

Protein function prediction with GO #36

Open

aditya0by0 linked an issue Oct 1, 2024 that may be closed by this pull request

Protein function prediction with GO #36

Open

aditya0by0 added 3 commits October 1, 2024 15:31

Merge branch 'dev' into protein_prediction

b123e61

for ngram, truncate sequence to adhere to max no of AA

4b39bbb

3-gram token.txt

d7e8097

aditya0by0 requested a review from sfluegel05 October 1, 2024 21:10

aditya0by0 added 3 commits October 11, 2024 12:55

Merge branch 'dev' into protein_prediction

25177b3

ignore proteins exceeding max len in preprocessing

710d703

fix to access max seq len in name prop

383b210

fix: add all (including transitive) go-labels to data instead of only direct ones

6511086

aditya0by0 commented Oct 19, 2024

chebai/preprocessing/datasets/go_uniprot.py

sfluegel and others added 8 commits October 21, 2024 13:39

fix: dont count labels twice

f3ec947

make evidence code and invalid AA as global constants

e4a9e6c

protein pretrain data - rough implementation

a27e415

Final class + fixes

fa7b37b

new data reader for protein pretraining data

2c446dc

pretrain: add docstrings and typehints

ad4fc95

Revert "new data reader for protein pretraining data"

d8e2efb

This reverts commit 2c446dc.

pretrain: set labels to None instead of using new reader

fc50c31

This was referenced Oct 26, 2024

Check Tokens Consistency #63

Merged

Tutorial: Data Exploration #46

Merged

Update protein_pretraining.py

66dd504

aditya0by0 removed a link to an issue Oct 30, 2024

Protein function prediction with GO #36

Open

aditya0by0 marked this pull request as ready for review October 30, 2024 18:26

aditya0by0 merged commit 20764f7 into dev Oct 30, 2024
2 checks passed

aditya0by0 mentioned this pull request Nov 4, 2024

Protein function prediction with GO - Part 3 #64

Draft

aditya0by0 linked an issue Nov 4, 2024 that may be closed by this pull request

Protein function prediction with GO #36

Open
