
PreProcessing unit tests #48

Merged · 58 commits into dev · Nov 4, 2024
Conversation

@aditya0by0 (Collaborator) commented Aug 29, 2024

Dependency:

Unit Testing Checklist (a sketch of the general test pattern follows this checklist)

reader.py

  • DataReader:
    • Write unit tests for to_data() with sample input values.
  • ChemDataReader:
    • Write unit tests for _read_data() with sample SMILES strings.
  • DeepChemDataReader:
    • Write unit tests for _read_data() with sample input values.
  • SelfiesReader:
    • Write unit tests for _read_data() with sample SELFIES strings.
  • ProteinDataReader:
    • Write unit tests for _read_data() with sample protein sequences.

collate.py

  • DefaultCollator:
    • Write unit tests for __call__() with sample data.
  • RaggedCollator:
    • Write unit tests for __call__() with sample data.
    • Write unit tests for process_label_rows() with sample data.

datasets/base.py

  • XYBaseDataModule:
    • Write unit tests for _filter_labels() with sample input values.
  • DynamicDataset:
    • Write unit tests for get_test_split() with sample data.
    • Write unit tests for get_train_val_splits_given_test() with sample data.
    • Write a unit test that checks whether the generated splits are stratified.

datasets/chebi.py

  • _ChEBIDataExtractor:
    • Write unit tests for _extract_class_hierarchy() with mock data.
    • Write unit tests for _graph_to_raw_dataset() with mock data.
    • Write unit tests for _load_dict() with mock data.
    • Write unit tests for _setup_pruned_test_set() with mock data.
  • ChEBIOverX:
    • Write unit tests for select_classes() with sample data.
  • ChEBIOverXPartial:
    • Cover the one-label scenario from PR Refactor ChEBIOverXPartial, Add 1-label stratified splits #54.
  • term_callback:
    • Write unit tests for term_callback() with sample data.

datasets/go_uniprot.py

  • _GOUniprotDataExtractor:
  • setup is failing (due to recent changes in the GO class)
    • Write unit tests for _extract_class_hierarchy() with mock data.
    • Write unit tests for term_callback() with sample data.
    • Write unit tests for _graph_to_raw_dataset() with mock data.
    • Write unit tests for _get_swiss_to_go_mapping() with mock data.
    • Write unit tests for _load_dict() with mock data.
  • _GoUniProtOverX:
    • Write unit tests for select_classes() with sample data.

datasets/tox21.py

  • Tox21Challenge:
    • Write unit tests for setup_processed() with mock data.
    • Write unit tests for _load_data_from_file() using mock file operations.
    • Write unit tests for _load_dict() with mock data.

datasets/protein_pretraining.py

  • _ProteinPretrainingData:
    • Write unit tests for _parse_protein_data_for_pretraining() with mock data.

Note: Tests for Tox21MolNet will be added later in a separate PR/branch, after issue #53 is completed.

Tox21MolNet:

  • [ ] Write unit tests for setup_processed() with mock data.
    • [ ] Check that the output format is correct: the collator expects a dict with features, labels, and ident keys, and the features must be convertible to a tensor.
  • [ ] Write unit tests for _load_data_from_file() using mock file operations.
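
To illustrate the general pattern these tests follow, here is a minimal sketch (the import path and the exact shape of the collator's output are assumptions based on this discussion, not confirmed API):

import unittest

from chebai.preprocessing.collate import RaggedCollator  # assumed import path

class TestRaggedCollator(unittest.TestCase):
    def test_call_with_sample_data(self):
        collator = RaggedCollator()
        # Each entry is a dict with features, labels, ident keys (see the
        # Tox21MolNet note above); features must be convertible to a tensor.
        data = [
            {"features": [1, 2, 3], "labels": [True, False], "ident": "id1"},
            {"features": [4, 5], "labels": [False, True], "ident": "id2"},
        ]
        result = collator(data)
        # Features and labels should stay aligned after collation.
        self.assertEqual(len(result.x), len(result.y))

if __name__ == "__main__":
    unittest.main()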

aditya0by0 self-assigned this on Aug 29, 2024
aditya0by0 linked an issue on Aug 29, 2024 that may be closed by this pull request
@aditya0by0 (Collaborator, Author) commented Aug 31, 2024

A test for RaggedCollator is failing!

Issue Description

There is a potential misalignment issue in the RaggedCollator class when processing data where some labels are None. Currently, the code correctly omits None labels from the y list but does not simultaneously remove the corresponding features from the x list. This causes a misalignment between features and labels, leading to incorrect training or evaluation outcomes.

Failing Test Case

tests/unit/collators/testRaggedCollator.test_call_with_missing_entire_labels

Currently, this test fails because the feature corresponding to the None label is not omitted, causing a misalignment in the result.x and result.y.

Please let me know if this test case is relevant and correctly aligned with the purpose of the RaggedCollator class. Additionally, confirm if the expected results in the test case are appropriate and consistent with the class's intended functionality.

Potential Solution

To fix the issue, the features (x) should also be filtered based on the non_null_labels index, ensuring that x and y remain aligned.

Here's the corrected portion of the code:

# Indices of entries whose label is present.
non_null_labels = [i for i, r in enumerate(y) if r is not None]
y = self.process_label_rows(
    tuple(ye for i, ye in enumerate(y) if i in non_null_labels)
)
# Filter x the same way, so features and labels stay aligned.
x = [xe for i, xe in enumerate(x) if i in non_null_labels]
loss_kwargs["non_null_labels"] = non_null_labels

This ensures that both x and y contain only the valid (non-None) entries and remain properly aligned. For example, with y = [y0, None, y2], non_null_labels becomes [0, 2] and x is reduced to [x0, x2].

@MGlauer (Collaborator) commented Sep 2, 2024

There is a potential misalignment issue in the RaggedCollator class when processing data where some labels are None. Currently, the code correctly omits None labels from the y list but does not simultaneously remove the corresponding features from the x list. This causes a misalignment between features and labels, leading to incorrect training or evaluation outcomes.

This is intended behaviour. In some training setups, we use a mixture of labelled and unlabelled data in combination with loss functions that allow for partially unlabelled data (e.g. fuzzy loss). In order to compute the usual metrics (F1, MSE, etc.), one needs to filter out the predictions for unlabelled data and compute the metrics only on labelled data. The indices of these data points are stored in the non_null_labels field and used by our implementations of Electra and MixedLoss.
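
For illustration, a minimal sketch of that filtering step (tensor names and shapes are hypothetical; the actual Electra/MixedLoss implementations differ):

import torch

# Hypothetical setup: predictions cover ALL inputs (labelled and unlabelled),
# labels cover only the labelled subset.
predictions = torch.randn(5, 3)        # model output for 5 inputs
labels = torch.randint(0, 2, (3, 3))   # labels exist for only 3 of them
non_null_labels = [0, 2, 4]            # indices of the labelled inputs

# Restrict predictions to labelled data before computing F1/MSE-style metrics.
labelled_predictions = predictions[non_null_labels]
assert labelled_predictions.shape[0] == labels.shape[0]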

@MGlauer (Collaborator) commented Sep 2, 2024

Therefore, the shape of y should only align with x modulo non_null_labels.

@aditya0by0 (Collaborator, Author) commented

Test case failing for term_callback

A test case for term_callback is failing because it does not correctly ignore/skip obsolete ChEBI terms. As a result, the test cases for _extract_class_hierarchy and _graph_to_raw_dataset also fail, because they consume the output of term_callback.

Current Behavior:

  • Right now, this failure does not affect the current pre-processing pipeline with real data, because obsolete ChEBI terms typically do not have SMILES strings.
  • The _graph_to_raw_dataset method filters out data instances
    • without SMILES strings, and
    • without relationships to other instances:

    # Drop rows without a SMILES string, then rows without any label/relationship.
    data = data[~data["SMILES"].isnull()]
    data = data[data.iloc[:, self._LABELS_START_IDX:].any(axis=1)]

    So, even though obsolete terms are not specifically filtered out, their lack of SMILES strings ensures they are excluded from the dataset.

Potential Future Issue:

  • In future versions of ChEBI, if any obsolete terms do have SMILES strings and maintain relationships with non-obsolete terms, it could become a problem.
  • Since the current filtering is based solely on non-null SMILES strings and relationships to other terms, there’s no explicit logic to filter obsolete terms.

Example of a Problematic Obsolete Term:

[Term]
id: CHEBI:77533
name: Compound G
is_a: CHEBI:99999
property_value: http://purl.obolibrary.org/obo/chebi/smiles "C1=C1Br" xsd:string
is_obsolete: true

If terms like this exist in future releases, the current approach could lead to errors because obsolete terms with SMILES strings might slip through the filters.

Proposed Solution:
We can update the term_callback logic to explicitly ignore obsolete terms by checking for the is_obsolete clause:

if isinstance(clause, fastobo.term.IsObsoleteClause):
    if clause.obsolete:
        # The term frame contains an "is_obsolete: true" clause: skip this term.
        return False

This solution would ensure that obsolete terms are skipped before they are processed, preventing potential future issues with the dataset.
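
For context, a minimal sketch of how such a check could sit in a clause loop (the file path is hypothetical, and the real term_callback does more than this):

import fastobo

def is_obsolete(term_frame) -> bool:
    # True if the term frame carries an "is_obsolete: true" clause.
    for clause in term_frame:
        if isinstance(clause, fastobo.term.IsObsoleteClause) and clause.obsolete:
            return True
    return False

doc = fastobo.load("chebi.obo")  # hypothetical path to a ChEBI OBO release
kept_terms = [
    frame
    for frame in doc
    if isinstance(frame, fastobo.term.TermFrame) and not is_obsolete(frame)
]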

@sfluegel05 (Collaborator) commented

As discussed, here are some additional test cases (I also added them at the top):

  • Readers: Should also check that the "real" token order (as defined by tokens.txt) stays consistent.
  • ChEBIOverXPartial: Should cover the one-label scenario from PR Refactor ChEBIOverXPartial, Add 1-label stratified splits #54.
  • DynamicDataset: Check whether the generated data splits are stratified.
  • setup_processed tests: Should also check that the output has a structure the collator can read (e.g., features should be convertible to a tensor; see the sketch below) -> expected to fail before Refactor Tox21MolNet #56 is resolved.
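
A minimal sketch of such a structure check (the helper name is hypothetical; the required keys come from the checklist above):

import torch

def assert_collator_compatible(record: dict) -> None:
    # The collator expects a dict with features, labels, ident keys,
    # and the features must be convertible to a tensor.
    for key in ("features", "labels", "ident"):
        assert key in record, f"missing key: {key}"
    torch.tensor(record["features"])  # raises if features are not tensor-able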

@aditya0by0 (Collaborator, Author) commented Oct 5, 2024

  • Readers: Should also check if the "real" token order (as defined by tokens.txt) stays consistent

To ensure that the token order in the "real" tokens.txt file remains consistent, we can maintain a duplicate tokens.txt file in the test directory. This duplicate serves as the reference for validating the order of tokens in the actual tokens.txt: during testing, we compare the contents of the real file against this reference to check consistency in both content and order.

Alternatively, we could verify the token order before and after any token insertion to ensure order consistency without the need for a duplicate file. However, this approach would be vulnerable to manual or direct changes in the tokens.txt file, which may not be detected.
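
A minimal sketch of the first approach (both paths are hypothetical; assuming new tokens are only ever appended, the reference should be a prefix of the real file):

from pathlib import Path

def test_token_order_unchanged():
    # Compare the real token file against a frozen reference copy, line by line.
    real = Path("chebai/preprocessing/bin/tokens.txt").read_text().splitlines()
    reference = Path("tests/unit/mock_data/tokens.txt").read_text().splitlines()
    assert real[: len(reference)] == reference, "token order or content changed"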

Please let me know if you have any suggestions or alternative approaches to this method.

aditya0by0 added a commit that referenced this pull request Oct 5, 2024
@aditya0by0 (Collaborator, Author) commented

@sfluegel05, could you please provide your suggestions/input on the comment above regarding the tokens.txt order check?

sfluegel05 marked this pull request as ready for review on October 30, 2024
@aditya0by0 (Collaborator, Author) commented

I have added the test for protein pretraining. All the unit tests now pass. Please review and merge.

sfluegel05 removed a link to an issue on Nov 4, 2024
@sfluegel05 (Collaborator) left a comment

Thanks for finishing this. I removed the link to the unit test issue since we still have the toxicity-related unit tests which are not included in this PR.

sfluegel05 merged commit 716432c into dev on Nov 4, 2024 · 2 checks passed
@aditya0by0 (Collaborator, Author) commented

> Thanks for finishing this. I removed the link to the unit test issue since we still have the toxicity-related unit tests which are not included in this PR.

Do you think it would be appropriate to include the unit tests related to Tox21MolNet in the same pull request or issue that addresses its rectification, specifically PR #56?

@sfluegel05 (Collaborator) commented

I agree. I have added a note for that in #56.
