PreProcessing unit tests #48
A Test for
|
This is intended behaviour. In some training examples, we combine labelled and unlabelled data with loss functions that allow partially unlabelled data (e.g. fuzzy loss). To compute the usual metrics (F1, MSE, etc.), one needs to filter out the predictions for unlabelled data and compute the metrics only on labelled data. The indices of the labelled data points are stored in the 'non_null_labels' field and used by our implementations of Electra and MixedLoss.
Therefore, the shape of |
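A minimal sketch of the filtering described above, assuming the 'non_null_labels' field holds the row indices of the labelled samples. The helper names `filter_labelled` and `mean_squared_error` and the plain-list data are hypothetical illustrations, not code from the repository.

```python
def filter_labelled(predictions, non_null_labels):
    """Keep only the predictions whose row index is listed in non_null_labels."""
    return [predictions[i] for i in non_null_labels]


def mean_squared_error(predictions, labels):
    """Plain MSE over already-aligned prediction/label pairs."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)


# Four samples in the batch, but only samples 0 and 2 carry labels (assumed data).
predictions = [0.9, 0.2, 0.4, 0.7]
labels = [1.0, 0.0]        # one entry per labelled sample
non_null_labels = [0, 2]   # indices of the labelled samples

filtered = filter_labelled(predictions, non_null_labels)
mse = mean_squared_error(filtered, labels)
```

In this sketch the filtered predictions have one entry per labelled sample, so their length can be smaller than the batch size.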
Test Case Failing for
|
As discussed, here are some additional test cases (I also added them at the top):
|
To ensure the token order in the "real" Alternatively, we could verify the token order before and after any token insertion to ensure order consistency without the need for a duplicate file. However, this approach would be vulnerable to manual or direct changes in the Please let me know if you have any suggestions or alternative approaches to this method. |
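The alternative mentioned above (checking token order before and after an insertion) could look roughly like the following sketch. It assumes new tokens are only ever appended to the end of the token list; the function name `order_preserved` and the sample tokens are hypothetical.

```python
def order_preserved(before, after):
    """Return True if every token known before keeps its position afterwards.

    Assumption: new tokens are appended, so `before` must be a prefix of `after`.
    """
    return len(after) >= len(before) and after[: len(before)] == before


tokens_before = ["C", "O", "N"]
tokens_after_append = ["C", "O", "N", "Cl"]    # new token appended: order kept
tokens_after_reorder = ["O", "C", "N", "Cl"]   # existing tokens moved: order broken
```

Such a check only compares two snapshots taken within one test run, so it would not detect manual edits made to the file between runs.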
@sfluegel05, could you please share your suggestions on the respective comment?
|
I have added the test for protein pretraining. Now all the unit tests are working. Please review and merge. |
Thanks for finishing this. I removed the link to the unit test issue since we still have the toxicity-related unit tests which are not included in this PR.
Do you think it would be appropriate to include the unit tests related to Tox21MolNet in the same pull request or issue that addresses its rectification, specifically PR #56?
|
I agree. I added a note for that in #56 |
Issue Preprocessing unit tests #45
Dependency :
Unit Testing Checklist
reader.py
- to_data() with sample input values.
- _read_data() with sample SMILES strings.
- _read_data() with sample input values.
- _read_data() with sample SELFIES strings.
- _read_data() with sample protein sequences.

collate.py
- __call__() with sample data.
- __call__() with sample data.
- process_label_rows() with sample data.

datasets/base.py
- _filter_labels() with sample input values.
- get_test_split() with sample data.
- get_train_val_splits_given_test() with sample data.

datasets/chebi.py
- _extract_class_hierarchy() with mock data.
- _graph_to_raw_dataset() with mock data.
- _load_dict() with mock data.
- _setup_pruned_test_set() with mock data.
- select_classes() with sample data.
- extract_class_hierarchy() with mock data.

term_callback
- term_callback() with sample data.

datasets/go_uniprot.py
- _extract_class_hierarchy() with mock data.
- term_callback() with sample data.
- _graph_to_raw_dataset() with mock data.
- _get_swiss_to_go_mapping() with mock data.
- _load_dict() with mock data.
- select_classes() with sample data.

datasets/tox21.py
- setup_processed() with mock data.
- _load_data_from_file() using mock file operations.
- _load_dict() with mock data.

datasets/protein_pretraining.py
- _parse_protein_data_for_pretraining() with mock data.

Note: Tests for Tox21MolNet will be added later in a separate PR/branch after issue #53 is completed.
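As an illustration of what "using mock file operations" can mean in a unit test, here is a minimal sketch built on the standard library's `unittest.mock.mock_open`. The loader `load_rows`, the file name `dummy.csv`, and the sample rows are hypothetical stand-ins and do not reflect the repository's actual functions or data.

```python
import unittest
from unittest.mock import mock_open, patch


def load_rows(path):
    """Hypothetical loader: read a comma-separated file into a list of dicts."""
    with open(path, "r") as handle:
        lines = handle.read().splitlines()
    header = lines[0].split(",")
    return [dict(zip(header, line.split(","))) for line in lines[1:]]


class TestLoadRows(unittest.TestCase):
    def test_reads_rows_without_touching_disk(self):
        # The fake file content replaces any real file on disk.
        fake_csv = "smiles,label\nCCO,1\nCCN,0\n"
        with patch("builtins.open", mock_open(read_data=fake_csv)) as mocked:
            rows = load_rows("dummy.csv")
        mocked.assert_called_once_with("dummy.csv", "r")
        self.assertEqual(
            rows,
            [{"smiles": "CCO", "label": "1"}, {"smiles": "CCN", "label": "0"}],
        )
```

A test case like this can be run with `python -m unittest`; because `open` is patched, it does not depend on any dataset file being present.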