implemented outlier removal #36

mcneela · 2024-02-21T15:40:02Z

Added function to perform outlier removal based on formation energies...

A couple issues remain to be resolved:

I don't yet have a way to map my mask to the correct indices in the data['atomic_inputs'] array. Please let me know the best way to do this.
The outliers and formation energies calculated here differ from the ones I calculated after the fact yesterday. We should examine why this is the case.
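
One possible way to map a per-molecule mask onto the per-atom array, as a rough sketch rather than the PR's implementation: assuming the number of atoms per molecule is available (the `n_atoms` array below is hypothetical; in practice it might be derived from per-molecule start/end offsets), `np.repeat` expands the mask to one flag per atom row. The helper names are illustrative only.

```python
import numpy as np


def expand_mask_to_atoms(mol_mask: np.ndarray, n_atoms: np.ndarray) -> np.ndarray:
    """Repeat each per-molecule flag once per atom of that molecule."""
    return np.repeat(mol_mask, n_atoms)


def rebuild_index_ranges(n_atoms_kept: np.ndarray) -> np.ndarray:
    """Recompute [start, end) atom offsets for the molecules that remain."""
    ends = np.cumsum(n_atoms_kept)
    starts = ends - n_atoms_kept
    return np.stack([starts, ends], axis=1)


# Toy data: three molecules with 2, 3 and 1 atoms; the second is an outlier.
n_atoms = np.array([2, 3, 1])
mol_mask = np.array([False, True, False])  # True marks an outlier molecule
atomic_inputs = np.arange(6 * 5, dtype=float).reshape(6, 5)  # one row per atom

atom_mask = expand_mask_to_atoms(mol_mask, n_atoms)
kept_atoms = atomic_inputs[~atom_mask]
kept_ranges = rebuild_index_ranges(n_atoms[~mol_mask])
```

After filtering, the per-molecule offsets no longer match the per-atom array, which is why the sketch also recomputes them from the surviving atom counts.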

mcneela · 2024-02-21T19:01:45Z

@S-Thaler @prtos @shenoynikhil please also take a look when you get a chance!

S-Thaler · 2024-02-21T19:31:11Z

Thanks for the implementation Danny! 2 more general remarks:

Should we maybe take the formation energy per atom as the relevant metric? Because the formation energy is size-extensive, we would probably bias ourselves toward removing especially large molecules.
A ±3 sigma deviation seems a bit narrow, since in a well-maintained dataset we ideally wouldn't want to delete samples, yet there are likely correct samples outside the 3 sigma confidence interval.
At the same time, there might be corrupt samples that fall within the confidence interval by chance.

Really hard issue overall... Properly cleaning the data would require a lot of manual labour and still might miss corrupt samples.
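
To make the per-atom suggestion concrete, here is a minimal sketch, not the code in this PR: it divides each formation energy by the molecule's atom count before applying a center ± `num_stds` standard deviation cut. The function name, the `use_median` flag, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def per_atom_outlier_mask(formation_e, n_atoms, num_stds=3.0, use_median=True):
    """Return True where the per-atom formation energy lies outside center ± num_stds * std."""
    e_per_atom = np.asarray(formation_e, dtype=float) / np.asarray(n_atoms, dtype=float)
    center = np.median(e_per_atom) if use_median else np.mean(e_per_atom)
    return np.abs(e_per_atom - center) > num_stds * np.std(e_per_atom)


# Synthetic example: energies scale with molecule size, plus one corrupted entry.
rng = np.random.default_rng(0)
n_atoms = rng.integers(5, 50, size=1000)
formation_e = n_atoms * rng.normal(-0.2, 0.01, size=1000)
formation_e[0] = n_atoms[0] * 5.0  # deliberately unphysical sample

mask = per_atom_outlier_mask(formation_e, n_atoms)
```

Because the energies are normalized by size first, large molecules are not flagged merely for being large; only the corrupted sample stands out here.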

FNTwin

I think before merging we should discuss a little bit about the values we are using as a default and test it on all the datasets.

How does this statistical analysis impact datasets like GDML, where we have strange distributions? (If I remember right, GDML is ab initio MD of 10 separate molecules, so I would expect (need to recheck) it to have separate distributions... even the mean in that dataset would probably be a problem.)

FNTwin · 2024-02-21T20:41:00Z

src/openqdc/datasets/base.py

+    def _remove_outliers(
+            self, 
+            formation_E: np.array, 
+            mean_or_median: str = "median", 
+            num_stds: float = 3.0,
+        ) -> np.array:


Currently there is no way to control the mean_or_median and num_stds parameters used in the computation.

I would rename mean_or_median to mode or something similar. I think even a boolean would be better in this case.

How did we decide the num_stds and how does it impact across the different datasets?

OK, let's change it to mode and then the user can provide a string key to a dictionary that selects the relevant numpy function?

I would be OK with that. Otherwise, we can also keep the current if statement and just rename the variable.
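
The dictionary-based `mode` idea discussed in this thread could look roughly like the following. This is a sketch under assumptions: `CENTER_FUNCS` and `center_of` are hypothetical names, and only `np.mean` and `np.median` are wired in.

```python
import numpy as np

# Hypothetical mapping from a user-facing `mode` key to a NumPy reducer.
CENTER_FUNCS = {"mean": np.mean, "median": np.median}


def center_of(values, mode="median"):
    """Look up the reducer for `mode` and apply it, failing loudly on unknown keys."""
    try:
        func = CENTER_FUNCS[mode]
    except KeyError:
        raise ValueError(f"mode must be one of {sorted(CENTER_FUNCS)}, got {mode!r}") from None
    return func(values)
```

Compared with an if/else on a string, the lookup keeps the valid options in one place and makes adding another statistic a one-line change.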

FNTwin · 2024-02-21T20:42:45Z

src/openqdc/datasets/base.py

+        formation_E = formation_E[~mask] # TODO: Christian, your formation E values are different than the ones I calculated yesterday, not sure why?
+        for key in self.data:


Not sure either. We should check together. Are we sure we are comparing the same units?

FNTwin · 2024-02-21T20:43:33Z

src/openqdc/datasets/base.py

+        if self.remove_outliers:
+            E = self._remove_outliers(np.squeeze(E.T))
+


Incorporate this if check into the _remove_outliers function.

FNTwin · 2024-02-21T20:51:14Z

Should we maybe take the formation energy per atom as the relevant metric? Because the formation energy is size-extensive, we would probably bias ourselves toward removing especially large molecules.

Yes, we should probably take the formation energy per atom as the metric so that larger systems are less affected. It probably wouldn't change too much, but it could be an improvement for datasets like GEOM, where the distribution has a long tail due to big samples with lots of conformations.
It shouldn't take too long to grab 2-3 datasets (GEOM, Spice, QMugs), test both metrics on them, and see the difference.

A ±3 sigma deviation seems a bit narrow, since in a well-maintained dataset we ideally wouldn't want to delete samples, yet there are likely correct samples outside the 3 sigma confidence interval.
At the same time, there might be corrupt samples that fall within the confidence interval by chance.

We should grab a few samples from the outliers and inspect them.

mcneela · 2024-02-22T14:21:59Z

Thanks for the implementation Danny! 2 more general remarks:

Should we maybe take the formation energy per atom as the relevant metric? Because the formation energy is size-extensive, we would probably bias ourselves toward removing especially large molecules.

A ±3 sigma deviation seems a bit narrow, since in a well-maintained dataset we ideally wouldn't want to delete samples, yet there are likely correct samples outside the 3 sigma confidence interval.
At the same time, there might be corrupt samples that fall within the confidence interval by chance.

Really hard issue overall... Properly cleaning the data would require a lot of manual labour and still might miss corrupt samples.

Agree, I think we should try formation energy per atom as well.

We can do some analysis to select the best defaults, but I don't think it's necessarily a blocker to merging as users can change the num_stds as well as switch between mean/median according to their preference.

FNTwin · 2024-02-22T14:26:33Z

users can change the num_stds as well as switch between mean/median according to their preference

It seems to me that as it is right now the user cannot change the parameters without changing the code. Can you double check?

mcneela · 2024-02-22T15:37:51Z

users can change the num_stds as well as switch between mean/median according to their preference

It seems to me that as it is right now the user cannot change the parameters without changing the code. Can you double check?

You are right, I forgot to add it to the __init__ method. I've now pushed a change to fix that.

FNTwin · 2024-02-22T15:56:32Z

@mcneela Can you dump the indices of the removed outliers for some datasets (Spice, QMugs, GDML, TMQM)?
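
Dumping the removed indices could be done along these lines. This is only a sketch: the function name, the JSON layout, and the toy masks are assumptions, and real masks would come from the outlier computation for each dataset.

```python
import json
import os
import tempfile

import numpy as np


def dump_outlier_indices(masks_by_dataset, path):
    """Write the positions of True entries in each mask to a JSON file."""
    payload = {name: np.flatnonzero(m).tolist() for name, m in masks_by_dataset.items()}
    with open(path, "w") as fh:
        json.dump(payload, fh, indent=2)
    return payload


# Toy masks standing in for per-dataset outlier masks (True marks a removed sample).
masks = {
    "toy_a": np.array([False, True, True]),
    "toy_b": np.array([False, False]),
}
out_path = os.path.join(tempfile.mkdtemp(), "outliers.json")
payload = dump_outlier_indices(masks, out_path)
```

Storing indices rather than the filtered arrays keeps the dump small and lets reviewers pull the flagged samples from the original data for inspection.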

mcneela · 2024-02-22T16:04:16Z

I updated the code to make the requested changes. After further thought, I think we should make outlier removal False by default and only default it to True for the datasets which we know have a number of outliers such as NablaDFT and Molecule3D.

FNTwin · 2024-02-22T16:07:38Z

I updated the code to make the requested changes. After further thought, I think we should make outlier removal False by default and only default it to True for the datasets which we know have a number of outliers such as NablaDFT and Molecule3D.

I agree, by default we shouldn't remove outliers
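
Putting the review points together (parameters exposed through the constructor, the on/off check handled inside the removal method, and removal disabled by default), a stripped-down illustration might look as follows. `ToyDataset` is a hypothetical stand-in and not the actual base class in src/openqdc/datasets/base.py.

```python
import numpy as np


class ToyDataset:
    """Stand-in for a dataset class; not the actual openQDC base class."""

    def __init__(self, energies, remove_outliers=False, mode="median", num_stds=3.0):
        self.remove_outliers = remove_outliers  # off unless the caller opts in
        self.mode = mode
        self.num_stds = num_stds
        self.energies = self._remove_outliers(np.asarray(energies, dtype=float))

    def _remove_outliers(self, energies):
        # The on/off check lives here, so callers never need their own `if`.
        if not self.remove_outliers:
            return energies
        center = np.median(energies) if self.mode == "median" else np.mean(energies)
        keep = np.abs(energies - center) <= self.num_stds * np.std(energies)
        return energies[keep]


values = [1.0] * 20 + [100.0]
untouched = ToyDataset(values)
filtered = ToyDataset(values, remove_outliers=True)
lenient = ToyDataset(values, remove_outliers=True, num_stds=5.0)
```

With this layout, a dataset with known outliers can opt in by passing `remove_outliers=True`, while every other dataset keeps all of its samples.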

mcneela · 2024-02-22T16:16:05Z

@FNTwin There are something like 30k indices removed for SPICE, so we should only enable this for the datasets with known outliers such as NablaDFT and Molecule3D.

FNTwin · 2024-02-22T16:18:30Z

Also for reference, this is the formation energy plot for the GDML dataset calculated with the default seaborn histplot function.

There are something like 30k indices removed for SPICE, so we should only enable this for the datasets with known outliers such as NablaDFT and Molecule3D.

I agree but I would like to do a visual check of the outliers we are removing

mcneela · 2024-02-22T16:26:53Z

Also for reference, this is the formation energy plot for the GDML dataset calculated with the default seaborn histplot function.

There are something like 30k indices removed for SPICE, so we should only enable this for the datasets with known outliers such as NablaDFT and Molecule3D.

I agree but I would like to do a visual check of the outliers we are removing

Right, outlier removal would be quite contraindicated for this dataset

implemented outlier removal

8ed9a00

mcneela requested review from prtos, shenoynikhil and FNTwin and removed request for prtos and shenoynikhil February 21, 2024 15:40

mcneela added the enhancement and question labels Feb 21, 2024

FNTwin changed the base branch from main to develop February 21, 2024 18:07

FNTwin requested changes Feb 21, 2024

added _remove_outliers args to __init__

bc4c747

added docstring to _remove_outliers

9c1010a

update to use avg formation E in outlier removal

3d1cb52

added additional logging to _remove_outliers and skipped removal for forces

34387d4

default remove_outliers to False

b072ddd

prtos closed this Mar 13, 2024

FNTwin deleted the remove_outliers branch July 10, 2024 20:46
