
PreProcessing unit tests #48

Merged · 58 commits into dev · Nov 4, 2024
Conversation

@aditya0by0 (Collaborator) commented Aug 29, 2024

Dependency:

Unit Testing Checklist (a sketch of the general test pattern follows this checklist)

reader.py

  • DataReader:
    • Write unit tests for to_data() with sample input values.
  • ChemDataReader:
    • Write unit tests for _read_data() with sample SMILES strings.
  • DeepChemDataReader:
    • Write unit tests for _read_data() with sample input values.
  • SelfiesReader:
    • Write unit tests for _read_data() with sample SELFIES strings.
  • ProteinDataReader:
    • Write unit tests for _read_data() with sample protein sequences.

collate.py

  • DefaultCollator:
    • Write unit tests for __call__() with sample data.
  • RaggedCollator:
    • Write unit tests for __call__() with sample data.
    • Write unit tests for process_label_rows() with sample data.

datasets/base.py

  • XYBaseDataModule:
    • Write unit tests for _filter_labels() with sample input values.
  • DynamicDataset:
    • Write unit tests for get_test_split() with sample data.
    • Write unit tests for get_train_val_splits_given_test() with sample data.
    • Write a unit test that checks whether the generated splits are stratified.

datasets/chebi.py

  • _ChEBIDataExtractor:
    • Write unit tests for _extract_class_hierarchy() with mock data.
    • Write unit tests for _graph_to_raw_dataset() with mock data.
    • Write unit tests for _load_dict() with mock data.
    • Write unit tests for _setup_pruned_test_set() with mock data.
  • ChEBIOverX:
    • Write unit tests for select_classes() with sample data.
  • ChEBIOverXPartial:
    • Cover the one-label scenario from PR Refactor ChEBIOverXPartial, Add 1-label stratified splits #54.
  • term_callback:
    • Write unit tests for term_callback() with sample data.

datasets/go_uniprot.py

  • _GOUniprotDataExtractor:
  • setup is failing (due to recent changes in the GO class)
    • Write unit tests for _extract_class_hierarchy() with mock data.
    • Write unit tests for term_callback() with sample data.
    • Write unit tests for _graph_to_raw_dataset() with mock data.
    • Write unit tests for _get_swiss_to_go_mapping() with mock data.
    • Write unit tests for _load_dict() with mock data.
  • _GoUniProtOverX:
    • Write unit tests for select_classes() with sample data.

datasets/tox21.py

  • Tox21Challenge:
    • Write unit tests for setup_processed() with mock data.
    • Write unit tests for _load_data_from_file() using mock file operations.
    • Write unit tests for _load_dict() with mock data.

datasets/protein_pretraining.py

  • _ProteinPretrainingData:
    • Write unit tests for _parse_protein_data_for_pretraining() with mock data.

Note: Tests for Tox21MolNet will be added later in a separate PR/branch, after issue #53 is completed.

Tox21MolNet:

  • [ ] Write unit tests for setup_processed() with mock data.
    • [ ] Check that the output format is correct: the collator expects a dict with features, labels, and ident keys, and the features must be convertible to a tensor.
  • [ ] Write unit tests for _load_data_from_file() using mock file operations.
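
To illustrate the general pattern these tests follow, here is a minimal sketch (the import path and the exact shape of the collator's output are assumptions based on this discussion, not confirmed API):

import unittest

from chebai.preprocessing.collate import RaggedCollator  # assumed import path

class TestRaggedCollator(unittest.TestCase):
    def test_call_with_sample_data(self):
        collator = RaggedCollator()
        # Each entry is a dict with features, labels, ident keys (see the
        # Tox21MolNet note above); features must be convertible to a tensor.
        data = [
            {"features": [1, 2, 3], "labels": [True, False], "ident": "id1"},
            {"features": [4, 5], "labels": [False, True], "ident": "id2"},
        ]
        result = collator(data)
        # Features and labels should stay aligned after collation.
        self.assertEqual(len(result.x), len(result.y))

if __name__ == "__main__":
    unittest.main()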

aditya0by0 self-assigned this on Aug 29, 2024
aditya0by0 linked an issue on Aug 29, 2024 that may be closed by this pull request
@aditya0by0 (Collaborator, Author) commented Aug 31, 2024

A test for RaggedCollator is failing!

Issue Description

There is a potential misalignment issue in the RaggedCollator class when processing data where some labels are None. Currently, the code correctly omits None labels from the y list but does not simultaneously remove the corresponding features from the x list. This causes a misalignment between features and labels, leading to incorrect training or evaluation outcomes.

Failing Test Case

tests/unit/collators/testRaggedCollator.test_call_with_missing_entire_labels

Currently, this test fails because the feature corresponding to the None label is not omitted, causing a misalignment in the result.x and result.y.

Please let me know if this test case is relevant and correctly aligned with the purpose of the RaggedCollator class. Additionally, confirm if the expected results in the test case are appropriate and consistent with the class's intended functionality.

Potential Solution

To fix the issue, the features (x) should also be filtered based on the non_null_labels index, ensuring that x and y remain aligned.

Here's the corrected portion of the code:

# Indices of entries whose label is present.
non_null_labels = [i for i, r in enumerate(y) if r is not None]
y = self.process_label_rows(
    tuple(ye for i, ye in enumerate(y) if i in non_null_labels)
)
# Filter x the same way, so features and labels stay aligned.
x = [xe for i, xe in enumerate(x) if i in non_null_labels]
loss_kwargs["non_null_labels"] = non_null_labels

This ensures that both x and y contain only the valid (non-None) entries and remain properly aligned. For example, with y = [y0, None, y2], non_null_labels becomes [0, 2] and x is reduced to [x0, x2].

@MGlauer (Collaborator) commented Sep 2, 2024

There is a potential misalignment issue in the RaggedCollator class when processing data where some labels are None. Currently, the code correctly omits None labels from the y list but does not simultaneously remove the corresponding features from the x list. This causes a misalignment between features and labels, leading to incorrect training or evaluation outcomes.

This is intended behaviour. In some training setups, we use a mixture of labelled and unlabelled data in combination with loss functions that allow for partially unlabelled data (e.g. fuzzy loss). In order to compute the usual metrics (F1, MSE, etc.), one needs to filter out the predictions for unlabelled data and compute the metrics only on labelled data. The indices of these data points are stored in the non_null_labels field and used by our implementations of Electra and MixedLoss.
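
For illustration, a minimal sketch of that filtering step (tensor names and shapes are hypothetical; the actual Electra/MixedLoss implementations differ):

import torch

# Hypothetical setup: predictions cover ALL inputs (labelled and unlabelled),
# labels cover only the labelled subset.
predictions = torch.randn(5, 3)        # model output for 5 inputs
labels = torch.randint(0, 2, (3, 3))   # labels exist for only 3 of them
non_null_labels = [0, 2, 4]            # indices of the labelled inputs

# Restrict predictions to labelled data before computing F1/MSE-style metrics.
labelled_predictions = predictions[non_null_labels]
assert labelled_predictions.shape[0] == labels.shape[0]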

@MGlauer (Collaborator) commented Sep 2, 2024

Therefore, the shape of y should only align with x modulo non_null_labels.

@aditya0by0 (Collaborator, Author) commented

Test case failing for term_callback

A test case for term_callback is failing because it does not correctly ignore/skip obsolete ChEBI terms. As a result, the test cases for _extract_class_hierarchy and _graph_to_raw_dataset also fail, because they consume the output of term_callback.

Current Behavior:

  • Right now, this failure does not affect the current pre-processing pipeline with real data, because obsolete ChEBI terms typically do not have SMILES strings.
  • The _graph_to_raw_dataset method filters out data instances
    • without SMILES strings, and
    • without relationships to other instances:

    # Drop rows without a SMILES string, then rows without any label/relationship.
    data = data[~data["SMILES"].isnull()]
    data = data[data.iloc[:, self._LABELS_START_IDX:].any(axis=1)]

    So, even though obsolete terms are not specifically filtered out, their lack of SMILES strings ensures they are excluded from the dataset.

Potential Future Issue:

  • In future versions of ChEBI, if any obsolete terms do have SMILES strings and maintain relationships with non-obsolete terms, it could become a problem.
  • Since the current filtering is based solely on non-null SMILES strings and relationships to other terms, there’s no explicit logic to filter obsolete terms.

Example of a Problematic Obsolete Term:

[Term]
id: CHEBI:77533
name: Compound G
is_a: CHEBI:99999
property_value: http://purl.obolibrary.org/obo/chebi/smiles "C1=C1Br" xsd:string
is_obsolete: true

If terms like this exist in future releases, the current approach could lead to errors because obsolete terms with SMILES strings might slip through the filters.

Proposed Solution:
We can update the term_callback logic to explicitly ignore obsolete terms by checking for the is_obsolete clause:

if isinstance(clause, fastobo.term.IsObsoleteClause):
    if clause.obsolete:
        # The term frame contains an "is_obsolete: true" clause: skip this term.
        return False

This solution would ensure that obsolete terms are skipped before they are processed, preventing potential future issues with the dataset.
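
For context, a minimal sketch of how such a check could sit in a clause loop (the file path is hypothetical, and the real term_callback does more than this):

import fastobo

def is_obsolete(term_frame) -> bool:
    # True if the term frame carries an "is_obsolete: true" clause.
    for clause in term_frame:
        if isinstance(clause, fastobo.term.IsObsoleteClause) and clause.obsolete:
            return True
    return False

doc = fastobo.load("chebi.obo")  # hypothetical path to a ChEBI OBO release
kept_terms = [
    frame
    for frame in doc
    if isinstance(frame, fastobo.term.TermFrame) and not is_obsolete(frame)
]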

@sfluegel05 (Collaborator) commented

As discussed, here are some additional test cases (I also added them at the top):

  • Readers: Should also check that the "real" token order (as defined by tokens.txt) stays consistent.
  • ChEBIOverXPartial: Should cover the one-label scenario from PR Refactor ChEBIOverXPartial, Add 1-label stratified splits #54.
  • DynamicDataset: Check whether the generated data splits are stratified.
  • setup_processed tests: Should also check that the output has a structure the collator can read (e.g., features should be convertible to a tensor; see the sketch below) -> expected to fail before Refactor Tox21MolNet #56 is resolved.
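
A minimal sketch of such a structure check (the helper name is hypothetical; the required keys come from the checklist above):

import torch

def assert_collator_compatible(record: dict) -> None:
    # The collator expects a dict with features, labels, ident keys,
    # and the features must be convertible to a tensor.
    for key in ("features", "labels", "ident"):
        assert key in record, f"missing key: {key}"
    torch.tensor(record["features"])  # raises if features are not tensor-able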

@aditya0by0 (Collaborator, Author) commented Oct 5, 2024

  • Readers: Should also check if the "real" token order (as defined by tokens.txt) stays consistent

To ensure that the token order in the "real" tokens.txt file remains consistent, we can maintain a duplicate tokens.txt file in the test directory. This duplicate serves as the reference for validating the order of tokens in the actual tokens.txt: during testing, we compare the contents of the real file against this reference to check consistency in both content and order.

Alternatively, we could verify the token order before and after any token insertion to ensure order consistency without the need for a duplicate file. However, this approach would be vulnerable to manual or direct changes in the tokens.txt file, which may not be detected.
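
A minimal sketch of the first approach (both paths are hypothetical; assuming new tokens are only ever appended, the reference should be a prefix of the real file):

from pathlib import Path

def test_token_order_unchanged():
    # Compare the real token file against a frozen reference copy, line by line.
    real = Path("chebai/preprocessing/bin/tokens.txt").read_text().splitlines()
    reference = Path("tests/unit/mock_data/tokens.txt").read_text().splitlines()
    assert real[: len(reference)] == reference, "token order or content changed"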

Please let me know if you have any suggestions or alternative approaches to this method.

aditya0by0 added a commit that referenced this pull request Oct 5, 2024
@aditya0by0 (Collaborator, Author) commented

@sfluegel05, could you please provide your suggestions/input on the comment above regarding the tokens.txt order check?

sfluegel05 marked this pull request as ready for review on October 30, 2024
@aditya0by0 (Collaborator, Author) commented

I have added the test for protein pretraining. All the unit tests now pass. Please review and merge.

sfluegel05 removed a link to an issue on Nov 4, 2024
@sfluegel05 (Collaborator) left a comment

Thanks for finishing this. I removed the link to the unit test issue since we still have the toxicity-related unit tests which are not included in this PR.

sfluegel05 merged commit 716432c into dev on Nov 4, 2024 · 2 checks passed
@aditya0by0 (Collaborator, Author) commented

> Thanks for finishing this. I removed the link to the unit test issue since we still have the toxicity-related unit tests which are not included in this PR.

Do you think it would be appropriate to include the unit tests related to Tox21MolNet in the same pull request or issue that addresses its rectification, specifically PR #56?

@sfluegel05 (Collaborator) commented

I agree. I have added a note for that in #56.
