-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implemented outlier removal #36
Conversation
@S-Thaler @prtos @shenoynikhil please also take a look when you get a chance! |
Thanks for the implementation Danny! 2 more general remarks:
Really hard issue overall... Properly cleaning the data would require a lot of manual labour and still might miss corrupt samples. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think before merging we should discuss a little bit about the values we are using as a default and test it on all the datasets.
How does this statistical analysis impact datasets like GDML where we have strange distributions ? (If i remember right GDML is an Ab initio MD of 10 separate molecules so I would expect (need to recheck) it to have separate distribustions....even the mean in that dataset would probably be a problem)
src/openqdc/datasets/base.py
Outdated
def _remove_outliers( | ||
self, | ||
formation_E: np.array, | ||
mean_or_median: str = "median", | ||
num_stds: float = 3.0, | ||
) -> np.array: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently there is no way control the mean_or_median, num_stds parameters used in the computation.
I would change mean_or_median
name to mode
or something similar. Even a boolean I think would be better in this case.
How did we decide the num_stds and how does it impact across the different datasets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let's change it to mode
and then the user can provide a string key to a dictionary that selects the relevant numpy function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be ok with it, we can also just leave the current if
and just renaming the variable otherwise
src/openqdc/datasets/base.py
Outdated
formation_E = formation_E[~mask] # TODO: Christian, your formation E values are different than the ones I calculated yesterday, not sure why? | ||
for key in self.data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure either. We should check together. Are we sure we are confronting the same units?
src/openqdc/datasets/base.py
Outdated
if self.remove_outliers: | ||
E = self._remove_outliers(np.squeeze(E.T)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorporate this if inside the remove_outliers function.
Yes, we should probably take the formation energy per atom as metric to have less impact on larger systems. Probably it wouldn't change too much but it could be an improvement for dataset like GEOM where we have a conformation with a long tail due big samples with lots of conformations.
We should grab few samples from the outliers and inspect them. |
Agree, I think we should try formation energy per atom as well. We can do some analysis to select the best defaults, but I don't think it's necessarily a blocker to merging as users can change the |
It seems to me that as it is right now the user cannot change the parameters without changing the code. Can you double check? |
as
You are right, I forgot to add it to the |
@mcneela Can you dump the indices of the removed outliers for some datasets (Spice, QMugs, GDML, TMQM)? |
I updated the code to make the requested changes. After further thought, I think we should make outlier removal |
I agree, by default we shouldn't remove outliers |
@FNTwin There are something like 30k indices removed for SPICE, so we should only enable this for the datasets with known outliers such as |
Added function to perform outlier removal based on formation energies...
A couple issues remain to be resolved:
data['atomic_inputs']
array. Please let me know the best way to do this.