Protein function prediction with GO #39
Things we talked about today:
I wanted to mention an issue I encountered recently with the It seems this error was related to the recent versions of the It might be helpful to review the compatibility of the latest versions with our current configuration at a later date to prevent similar issues in the future.
Great that you were able to solve this issue. However, I still don't understand exactly where it comes from. I can't reproduce it with either lightning version 2.3.2 or 2.1.2. Also, in the
- logic to select go data branch based on given input
- update class hierarchy and raw data logic
- combines the swiss data with GO data
- 20 natural amino acid notation tokens as per below wiki - https://en.wikipedia.org/wiki/Protein_primary_structure
- ambiguous_amino_acids
- sequence_length
- experimental_evidence_codes
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
  data_df[self.select_classes(g, data_df=data_df)] = False
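One way to avoid this warning is to build all label columns in a separate frame and attach them with a single `pd.concat(axis=1)`, as the warning itself suggests. The following is only a minimal sketch: the function name `add_label_columns`, the column names, and the toy labels are hypothetical and not taken from this repository.

```python
import pandas as pd


def add_label_columns(data_df, labels_per_row, classes):
    """Attach one boolean column per class using a single concat.

    Hypothetical helper: `labels_per_row` holds, for each row of `data_df`,
    the set of class IDs that apply to that row.
    """
    # Build every label column up front instead of inserting them one by one.
    label_df = pd.DataFrame(
        {c: [c in labels for labels in labels_per_row] for c in classes},
        index=data_df.index,
    )
    # A single concat keeps the resulting frame de-fragmented.
    return pd.concat([data_df, label_df], axis=1)


# Toy data; the column names are assumptions for illustration only.
df = pd.DataFrame({"swiss_id": ["P1", "P2"], "sequence": ["MKT", "ACD"]})
out = add_label_columns(df, [{"GO:1"}, {"GO:2", "GO:3"}], ["GO:1", "GO:2", "GO:3"])
```

Whether this fits the existing `select_classes` logic depends on how the class columns are computed there, so treat it as a starting point rather than a drop-in fix.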
Protein Preprocessing Statistics
These are the statistics for the proteins that were ignored during preprocessing due to either non-valid amino acids or sequence lengths greater than 1002, as per the guidelines outlined in the paper:
The number of ignored proteins is negligible compared to the whole dataset. I have attached the CSV file which lists the IDs (and their relevant details) of the ignored proteins for reference.
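The filter described above can be sketched roughly as follows. This assumes the valid alphabet is the 20 standard one-letter amino acid codes and that sequences are plain strings. The names `is_valid_protein` and `split_proteins` are hypothetical and do not refer to the actual preprocessing code.

```python
# The 20 natural amino acids in one-letter notation.
NATURAL_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
MAX_SEQUENCE_LENGTH = 1002


def is_valid_protein(sequence, max_length=MAX_SEQUENCE_LENGTH):
    """True if the sequence is short enough and uses only natural amino acids."""
    return len(sequence) <= max_length and set(sequence) <= NATURAL_AMINO_ACIDS


def split_proteins(proteins):
    """Split a {protein_id: sequence} mapping into kept and ignored entries."""
    kept, ignored = {}, {}
    for protein_id, sequence in proteins.items():
        target = kept if is_valid_protein(sequence) else ignored
        target[protein_id] = sequence
    return kept, ignored


# Toy example: "X" is an ambiguous code, and the last sequence is too long.
kept, ignored = split_proteins({"P1": "MKT", "P2": "MKTX", "P3": "A" * 1003})
```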
Also, I have updated the Wiki for the GOUniProt data folder structure, as suggested. Please review whenever possible.
Shortening Input Sequence Lengths and Handling n-grams #36 (comment)
Thanks for implementing this.
I will merge this so we can use the classes for other PRs. Please open a new PR for this branch if you have new changes.
If the maximum input sequence length is set to 1002, the updated behavior truncates any protein sequence longer than 1002 amino acids and keeps only the first 1002. For those proteins, the model sees only a partial representation, since the rest of the sequence is discarded.
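As a rough illustration of this truncation behavior (not the actual implementation), a sequence can be cut to its first `max_length` residues with a slice. The function name `truncate_sequence` and the returned flag are assumptions made for this sketch.

```python
def truncate_sequence(sequence, max_length=1002):
    """Return the first `max_length` residues and whether anything was cut."""
    truncated = sequence[:max_length]
    return truncated, len(sequence) > max_length


# A sequence of 1500 residues keeps only its first 1002.
long_seq, was_cut = truncate_sequence("M" + "A" * 1499)
# A short sequence is returned unchanged.
short_seq, short_cut = truncate_sequence("MKT")
```

Returning the flag makes it easy to count how many proteins are only partially represented after truncation.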
PR for the Issue Protein function prediction with GO #36
Tasks