Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Interaction and Better Testing #71

Closed
wants to merge 32 commits into from
Closed

Conversation

shenoynikhil
Copy link
Collaborator

@shenoynikhil shenoynikhil commented Mar 26, 2024

Fixes #63

💯 Shoutout for already existing code from Danny which made working on this super easy 💯

Dataset Preprocessing

  • Updated config-factory for L7, X40, Splinter and MetCalf and performed preliminary checks.
  • Changed L7 reading from yaml file code and the read_raw_entries for both L7 and X40
  • Changed Splinter None setting to -1 (@mcneela please help here if you know any other place you did this)
  • MetCalf Processing steps were missing, so added a function that extracts raw files and then calls read_raw_entries()

Simplified the Interaction Dataset class by using

  • pkl_data_keys functionality: extended it in BaseInteractionDataset to enforce n_atoms_first is present in the data_dict
  • Removed x[x==None] = -1 done in save_preprocess by doing the changes in read_raw_entries. For now I noticed only Splinter doing this.
  • Add dummy interaction dataset for testing

Testing

  • Add tests for DummyInteractionDataset

Feel free to drop suggestions @prtos @mcneela @FNTwin

Checklist:

  • Was this PR discussed in a issue? It is recommended to first discuss a new feature into a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the new introduced feature(s) (if appropriate).
  • Update the API documentation is a new function is added or an existing one is deleted.

@shenoynikhil shenoynikhil changed the base branch from main to develop March 26, 2024 19:58
@shenoynikhil shenoynikhil changed the title Better Testing Refactor Interaction and Better Testing Mar 26, 2024
openqdc/datasets/base.py Show resolved Hide resolved
openqdc/datasets/interaction/X40.py Show resolved Hide resolved
openqdc/datasets/interaction/base.py Outdated Show resolved Hide resolved
openqdc/datasets/interaction/base.py Show resolved Hide resolved
openqdc/datasets/interaction/base.py Show resolved Hide resolved
openqdc/datasets/interaction/dummy.py Show resolved Hide resolved
openqdc/datasets/interaction/splinter.py Outdated Show resolved Hide resolved
tests/test_interaction.py Outdated Show resolved Hide resolved
openqdc/datasets/interaction/base.py Show resolved Hide resolved
@shenoynikhil shenoynikhil requested a review from mcneela March 29, 2024 01:33
@shenoynikhil shenoynikhil changed the base branch from develop to release April 3, 2024 17:45
openqdc/datasets/base.py Outdated Show resolved Hide resolved
openqdc/datasets/base.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@FNTwin FNTwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the PR, make stuff more clean and add a way to fetch the raw files.

Beside some small comments, I think that we should refactor the DES classes. Currently we have 4 files for DES:
DES370K, DES5M(DES370K), DES66 , DES66x8

I would move DES5M and DES370K classes in a single DES file.

I would do the same with DES66 and DES66x8 into a DES66 file and make one of them inherit the other one to have a more complex type that we could use in case.

More importantly. I think there is an error in the DES5M read_raw_entries(self) method.
Currently the class is a child of DES370K:

 def read_raw_entries(self) -> List[Dict]:
       return DES5M._read_raw_entries()

This seems wrong as it seems to me that it is recurrently calling itself. It should be:

 def read_raw_entries(self) -> List[Dict]:
       return super()._read_raw_entries()

or

 def read_raw_entries(self) -> List[Dict]:
       return DES370K._read_raw_entries()

tests/test_dummy.py Outdated Show resolved Hide resolved
openqdc/datasets/interaction/dummy.py Outdated Show resolved Hide resolved
openqdc/datasets/interaction/dummy.py Outdated Show resolved Hide resolved
openqdc/datasets/interaction/base.py Show resolved Hide resolved
openqdc/datasets/interaction/base.py Outdated Show resolved Hide resolved
openqdc/datasets/interaction/base.py Outdated Show resolved Hide resolved
openqdc/datasets/interaction/base.py Outdated Show resolved Hide resolved
openqdc/datasets/interaction/base.py Show resolved Hide resolved
openqdc/datasets/interaction/X40.py Outdated Show resolved Hide resolved
@FNTwin
Copy link
Collaborator

FNTwin commented Apr 6, 2024

I don't have thoroughly tested it currently. I'll try to run/debug the classes better sunday or monday.
I still think that most of the read_raw_entries() functions are way too long and complex (read as: doing too many things). I'm not a fun of SOLID principles but ... 😂

Copy link
Collaborator

@mcneela mcneela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple minor comments, feel free to address those, but otherwise looks good I think!

openqdc/datasets/interaction/X40.py Outdated Show resolved Hide resolved
openqdc/datasets/interaction/base.py Outdated Show resolved Hide resolved
openqdc/datasets/interaction/splinter.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@FNTwin FNTwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interaction dataset module is getting cleaner and I quite like it. I know there is some issues on naming, restraining the data keys to property methods but I think overall this kind of inheritance will help us along the way as the preprocess methods should be hardly touched at all and if we need to touch them, it means that we did something wrong at the beginning.

Nonetheless, there are some issues on changing the keys (+ some name kerfuffles) that requires reproprecessing the datasets and repushing them to the cloud. Currently because of that I m unable to give you a more in depth review as my testing can be quite limited.

Side note:

  1. We should stop calling __name__ directly and provide a property name that sanitize the name to avoid problems
  2. We really need more solid tests

As soon as we fix the main issues, I can do a round of testing for a final approval if you think it is needed. This should be a good time to actually clean the push methods as we currently are forced to always push to cloud and it is a not intended behaviour

@@ -61,7 +35,7 @@ def __getitem__(self, idx: int):
)
name = self.__smiles_converter__(self.data["name"][idx])
subset = self.data["subset"][idx]
n_atoms_first = self.data["n_atoms_first"][idx]
n_atoms_ptr = self.data["n_atoms_ptr"][idx]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the datasets are currently using the n_atom_first keyword that is processed in the read_raw_entries . We need to recompute and push the data with this new keywork

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need to recompute and push, before merging these changes.

Copy link
Collaborator

@FNTwin FNTwin Apr 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently trying to load any interaction dataset will get you an error due to the:
if not self.is_preprocessed() failing due to the naming.

In the bucket they were written L7 and X40 (upper case). We should always have the sanitize name on lower case. As we need to postprocess it again to have the new keys. It will fix itself

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, need to push new changes.

"name": str,
"subset": str,
"n_atoms": np.int32,
"n_atoms_ptr": np.int32,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the n_atoms_ptr will make the read_preprocess fail for the interaction datasets because we currently don't have that key inside (see my other comments).

More in depth, Line 393: assert all([key in all_pkl_keys for key in self.pkl_data_keys])

will give you :

['name', 'subset', 'n_atoms', 'n_atoms_ptr']
[True, True, True, False]

@@ -78,68 +52,18 @@ def __getitem__(self, idx: int):
name=name,
subset=subset,
forces=forces,
n_atoms_first=n_atoms_first,
n_atoms_ptr=n_atoms_ptr,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as before

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to change n_atom_first to n_atom_ptr and preprocess it again + push

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to change n_atom_first to n_atom_ptr and preprocess it again + push

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to change n_atom_first to n_atom_ptr and preprocess it again + push

openqdc/datasets/interaction/splinter.py Outdated Show resolved Hide resolved
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to change n_atom_first to n_atom_ptr and preprocess it again + push

@FNTwin
Copy link
Collaborator

FNTwin commented Apr 13, 2024

This should be a good time to actually clean the push methods as we currently are forced to always push to cloud and it is a not intended behaviour

Adressed this right now in the qol branch with also a new cli method to easily preprocess and optionally push datasets

https://github.com/OpenDrugDiscovery/openQDC/tree/qol

@FNTwin
Copy link
Collaborator

FNTwin commented Apr 18, 2024

As I preprocessed and merged this branch in QoL, I m going to close it

@FNTwin FNTwin closed this Apr 18, 2024
@FNTwin FNTwin deleted the testing branch July 10, 2024 20:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

More Tests
4 participants