implemented outlier removal #36

Closed
wants to merge 6 commits into from

Conversation

mcneela
Collaborator

@mcneela mcneela commented Feb 21, 2024

Added function to perform outlier removal based on formation energies...

A couple issues remain to be resolved:

  1. I don't yet have a way to map my mask to the correct indices in the data['atomic_inputs'] array. Please let me know the best way to do this.
  2. The outliers/formation energies calculated here differ from the ones I calculated after the fact yesterday. We should examine why this is the case.
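
One possible way to approach the first point, sketched under assumptions: if the mask has one entry per molecule and a per-molecule atom count is available, np.repeat can expand the mask to one entry per atom row. The n_atoms array, the helper name expand_mask_to_atoms, and the row layout of atomic_inputs below are all hypothetical and not taken from the PR.

```python
import numpy as np


def expand_mask_to_atoms(mol_mask: np.ndarray, n_atoms: np.ndarray) -> np.ndarray:
    """Repeat each per-molecule flag once per atom of that molecule.

    Assumes atom rows are stored contiguously, molecule after molecule.
    """
    mol_mask = np.asarray(mol_mask, dtype=bool)
    n_atoms = np.asarray(n_atoms, dtype=int)
    if mol_mask.shape != n_atoms.shape:
        raise ValueError("mol_mask and n_atoms need one entry per molecule")
    return np.repeat(mol_mask, n_atoms)


# Toy data: three molecules with 2, 3 and 1 atoms; the second is an outlier.
n_atoms = np.array([2, 3, 1])
outlier_mask = np.array([False, True, False])
atomic_inputs = np.arange(6 * 5, dtype=float).reshape(6, 5)  # 6 atom rows

atom_mask = expand_mask_to_atoms(outlier_mask, n_atoms)
kept_atomic_inputs = atomic_inputs[~atom_mask]  # drops the 3 rows of molecule 2
```

If the dataset instead stores start/end offsets per molecule, the same atom-level mask could be built by slicing with those offsets.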

@mcneela mcneela requested review from prtos, shenoynikhil and FNTwin and removed request for prtos and shenoynikhil February 21, 2024 15:40
@mcneela mcneela added enhancement New feature or request question Further information is requested labels Feb 21, 2024
@FNTwin FNTwin changed the base branch from main to develop February 21, 2024 18:07
@mcneela
Collaborator Author

mcneela commented Feb 21, 2024

@S-Thaler @prtos @shenoynikhil please also take a look when you get a chance!

@S-Thaler
Collaborator

Thanks for the implementation, Danny! Two more general remarks:

  1. Should we maybe take the formation energy per atom as the relevant metric? Due to the size extensivity of the formation energy, we would probably bias ourselves toward removing especially large molecules.
  2. A ±3 sigma deviation seems a bit narrow, maybe. In a well-maintained dataset we ideally wouldn't want to delete samples, yet there are likely correct samples outside the 3 sigma confidence interval.
    At the same time, there might be corrupt samples that fall within the confidence interval by chance.

Really hard issue overall... Properly cleaning the data would require a lot of manual labour and still might miss corrupt samples.
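
To make the size-extensivity concern concrete, here is a minimal sketch on toy numbers. The helper names outlier_mask and per_atom are hypothetical, and using the standard deviation as the spread for both the mean and the median mode is an assumption, not necessarily what the PR does.

```python
import numpy as np


def outlier_mask(values, num_stds=3.0, mode="median"):
    """Flag values farther than num_stds standard deviations from the center."""
    values = np.asarray(values, dtype=float)
    center = np.median(values) if mode == "median" else np.mean(values)
    return np.abs(values - center) > num_stds * np.std(values)


def per_atom(formation_e, n_atoms):
    """Normalize total formation energies by the number of atoms."""
    return np.asarray(formation_e, dtype=float) / np.asarray(n_atoms, dtype=float)


# Toy data: twenty 10-atom molecules and one 100-atom molecule,
# all with roughly the same formation energy per atom.
formation_e = np.array([-9.9] * 10 + [-10.1] * 10 + [-100.0])
n_atoms = np.array([10] * 20 + [100])

total_mask = outlier_mask(formation_e)                        # flags the large molecule
per_atom_mask = outlier_mask(per_atom(formation_e, n_atoms))  # flags nothing
```

On the total energy the large molecule is removed purely because of its size; after per-atom normalization it is indistinguishable from the rest.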

Collaborator

@FNTwin FNTwin left a comment

I think before merging we should discuss the values we are using as defaults and test them on all the datasets.

How does this statistical analysis impact datasets like GDML, where we have strange distributions? (If I remember correctly, GDML is ab initio MD of 10 separate molecules, so I would expect (need to recheck) it to have separate distributions. Even the mean would probably be a problem in that dataset.)
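
A small sketch of why a single global statistic can be problematic for a multimodal dataset, and one possible alternative (computing the mask per group). The group labels, helper names, and toy energies are hypothetical; whether per-molecule grouping is appropriate for GDML would still need to be checked.

```python
import numpy as np


def outlier_mask(values, num_stds=3.0):
    """Flag values farther than num_stds standard deviations from the median."""
    values = np.asarray(values, dtype=float)
    return np.abs(values - np.median(values)) > num_stds * np.std(values)


def grouped_outlier_mask(values, groups, num_stds=3.0):
    """Apply outlier_mask separately within each group label."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    mask = np.zeros(values.shape, dtype=bool)
    for label in np.unique(groups):
        selected = groups == label
        mask[selected] = outlier_mask(values[selected], num_stds)
    return mask


# Two molecules with well-separated energy distributions, plus one
# corrupted sample (-130.0) that belongs to molecule "A".
values = np.array([-99.9] * 10 + [-100.1] * 10 + [-130.0]
                  + [-199.9] * 10 + [-200.1] * 10)
groups = np.array(["A"] * 21 + ["B"] * 20)

global_mask = outlier_mask(values)                   # spread is dominated by the two modes
grouped_mask = grouped_outlier_mask(values, groups)  # isolates the corrupted sample
```

With the global statistic the corrupted sample sits between the two modes and is not flagged; within its own group it stands out clearly.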

Comment on lines 169 to 174
def _remove_outliers(
    self,
    formation_E: np.array,
    mean_or_median: str = "median",
    num_stds: float = 3.0,
) -> np.array:
Collaborator

Currently there is no way to control the mean_or_median and num_stds parameters used in the computation.

I would rename mean_or_median to mode or something similar. I think even a boolean would be better in this case.

How did we decide the num_stds and how does it impact across the different datasets?

Collaborator Author

OK, let's change it to mode and then the user can provide a string key to a dictionary that selects the relevant numpy function?
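
A minimal sketch of the dictionary-based dispatch suggested here. The names CENTER_FUNCS and compute_center are hypothetical; only np.mean and np.median are real NumPy functions.

```python
import numpy as np

# Hypothetical lookup table: string key -> NumPy reduction used as the center.
CENTER_FUNCS = {"mean": np.mean, "median": np.median}


def compute_center(values, mode="median"):
    """Return the center of values using the function selected by mode."""
    try:
        func = CENTER_FUNCS[mode]
    except KeyError:
        raise ValueError(
            f"mode must be one of {sorted(CENTER_FUNCS)}, got {mode!r}"
        ) from None
    return float(func(np.asarray(values, dtype=float)))


median_center = compute_center([1.0, 2.0, 9.0], mode="median")
mean_center = compute_center([1.0, 2.0, 9.0], mode="mean")
```

Compared with an if/else on the string, the table makes it easy to add further options later and gives a clear error for an unknown key.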

Collaborator

I would be OK with that. Otherwise, we can also just keep the current if statement and rename the variable.

Comment on lines 183 to 184
formation_E = formation_E[~mask] # TODO: Christian, your formation E values are different than the ones I calculated yesterday, not sure why?
for key in self.data:
Collaborator

Not sure either. We should check together. Are we sure we are comparing the same units?

Comment on lines 205 to 207
if self.remove_outliers:
    E = self._remove_outliers(np.squeeze(E.T))

Collaborator

Move this if check inside the _remove_outliers function.

src/openqdc/datasets/base.py
@FNTwin
Collaborator

FNTwin commented Feb 21, 2024

  1. Should we maybe take the formation energy per atom as the relevant metric? Due to the size extensivity of the formation energy, we would probably bias ourselves toward removing especially large molecules.

Yes, we should probably use the formation energy per atom as the metric so that larger systems are affected less. It probably wouldn't change much, but it could be an improvement for datasets like GEOM, where the distribution has a long tail due to large samples with many conformations.
It shouldn't take long to grab 2-3 datasets (GEOM, Spice, QMugs), test both metrics, and see the difference.

  2. A ±3 sigma deviation seems a bit narrow, maybe. In a well-maintained dataset we ideally wouldn't want to delete samples, yet there are likely correct samples outside the 3 sigma confidence interval.
    At the same time, there might be corrupt samples that fall within the confidence interval by chance.

We should grab a few samples from the outliers and inspect them.

@mcneela
Collaborator Author

mcneela commented Feb 22, 2024

Thanks for the implementation, Danny! Two more general remarks:

  1. Should we maybe take the formation energy per atom as the relevant metric? Due to the size extensivity of the formation energy, we would probably bias ourselves toward removing especially large molecules.
  2. A ±3 sigma deviation seems a bit narrow, maybe. In a well-maintained dataset we ideally wouldn't want to delete samples, yet there are likely correct samples outside the 3 sigma confidence interval.
    At the same time, there might be corrupt samples that fall within the confidence interval by chance.

Really hard issue overall... Properly cleaning the data would require a lot of manual labour and still might miss corrupt samples.

Agree, I think we should try formation energy per atom as well.

We can do some analysis to select the best defaults, but I don't think it's necessarily a blocker to merging as users can change the num_stds as well as switch between mean/median according to their preference.

@FNTwin
Collaborator

FNTwin commented Feb 22, 2024

users can change the num_stds as well as switch between mean/median according to their preference

It seems to me that as it is right now the user cannot change the parameters without changing the code. Can you double check?

@mcneela
Collaborator Author

mcneela commented Feb 22, 2024

users can change the num_stds as well as switch between mean/median according to their preference

It seems to me that as it is right now the user cannot change the parameters without changing the code. Can you double check?

You are right, I forgot to add it to the __init__ method. I've now pushed a change to fix that.

@FNTwin
Collaborator

FNTwin commented Feb 22, 2024

@mcneela Can you dump the indices of the removed outliers for some datasets (Spice, QMugs, GDML, TMQM)?
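
A minimal sketch of how the removed indices could be dumped for later inspection, assuming the outlier mask is a boolean array with one entry per sample. The helper name dump_outlier_indices and the file name are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np


def dump_outlier_indices(mask, path):
    """Save the integer positions of flagged samples to a .npy file."""
    indices = np.flatnonzero(np.asarray(mask, dtype=bool))
    np.save(path, indices)
    return indices


# Toy mask: samples 1 and 3 were flagged as outliers.
mask = np.array([False, True, False, True])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = Path(tmp_dir) / "toy_outlier_indices.npy"
    removed = dump_outlier_indices(mask, out_file)
    reloaded = np.load(out_file)  # what a reviewer would load for inspection
```

The saved indices can then be used to pull the corresponding structures and inspect them visually.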

@mcneela
Collaborator Author

mcneela commented Feb 22, 2024

I updated the code to make the requested changes. After further thought, I think we should make outlier removal False by default and only default it to True for the datasets which we know have a number of outliers such as NablaDFT and Molecule3D.
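
The two decisions in this thread (parameters exposed through __init__, and removal disabled by default except for specific datasets) could be sketched roughly as follows. ToyDataset, ToyNoisyDataset, and the default_remove_outliers attribute are hypothetical and only illustrate the pattern; they are not the classes in the PR.

```python
import numpy as np


class ToyDataset:
    # Class-level default; subclasses with known outliers can override it.
    default_remove_outliers = False

    def __init__(self, energies, remove_outliers=None, mode="median", num_stds=3.0):
        if remove_outliers is None:
            remove_outliers = self.default_remove_outliers
        self.remove_outliers = remove_outliers
        self.mode = mode
        self.num_stds = num_stds
        self.energies = self._remove_outliers(np.asarray(energies, dtype=float))

    def _remove_outliers(self, energies):
        # The on/off check lives inside the method, so callers need no extra if.
        if not self.remove_outliers:
            return energies
        center = np.median(energies) if self.mode == "median" else np.mean(energies)
        keep = np.abs(energies - center) <= self.num_stds * np.std(energies)
        return energies[keep]


class ToyNoisyDataset(ToyDataset):
    """Stand-in for a dataset that is known to contain outliers."""

    default_remove_outliers = True


energies = [1.0] * 20 + [100.0]
plain = ToyDataset(energies)                                # keeps all 21 samples
cleaned = ToyNoisyDataset(energies)                         # drops the 100.0 sample
opted_out = ToyNoisyDataset(energies, remove_outliers=False)
```

Passing remove_outliers explicitly always wins over the class default, so users can still opt in or out per instance.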

@FNTwin
Collaborator

FNTwin commented Feb 22, 2024

I updated the code to make the requested changes. After further thought, I think we should make outlier removal False by default and only default it to True for the datasets which we know have a number of outliers such as NablaDFT and Molecule3D.

I agree, by default we shouldn't remove outliers

@mcneela
Collaborator Author

mcneela commented Feb 22, 2024

@FNTwin There are something like 30k indices removed for SPICE, so we should only enable this for the datasets with known outliers such as NablaDFT and Molecule3D.

@FNTwin
Collaborator

FNTwin commented Feb 22, 2024

Also for reference, this is the formation energy plot for the GDML dataset, generated with the default seaborn histplot function:
[image: histogram of formation energies for the GDML dataset]

There are something like 30k indices removed for SPICE, so we should only enable this for the datasets with known outliers such as NablaDFT and Molecule3D.

I agree but I would like to do a visual check of the outliers we are removing

@mcneela
Collaborator Author

mcneela commented Feb 22, 2024

Also for reference, this is the formation energy plot for the GDML dataset, generated with the default seaborn histplot function: [image]

There are something like 30k indices removed for SPICE, so we should only enable this for the datasets with known outliers such as NablaDFT and Molecule3D.

I agree but I would like to do a visual check of the outliers we are removing

Right, outlier removal would be quite contraindicated for this dataset

@prtos prtos closed this Mar 13, 2024
@FNTwin FNTwin deleted the remove_outliers branch July 10, 2024 20:46