Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implemented outlier removal #36

Closed
wants to merge 6 commits into from
Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 29 additions & 1 deletion src/openqdc/datasets/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -103,9 +103,11 @@ def __init__(
distance_unit: Optional[str] = None,
overwrite_local_cache: bool = False,
cache_dir: Optional[str] = None,
remove_outliers: bool = True,
) -> None:
set_cache_dir(cache_dir)
self.data = None
self.remove_outliers = remove_outliers
if not self.is_preprocessed():
raise DatasetNotAvailableError(self.__name__)
else:
Expand Down Expand Up @@ -164,6 +166,27 @@ def _precompute_statistics(self, overwrite_local_cache: bool = False):
def _compute_average_nb_atoms(self):
self.__average_nb_atoms__ = np.mean(self.data["n_atoms"])

def _remove_outliers(
self,
mcneela marked this conversation as resolved.
Show resolved Hide resolved
formation_E: np.array,
mean_or_median: str = "median",
num_stds: float = 3.0,
) -> np.array:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently there is no way control the mean_or_median, num_stds parameters used in the computation.

I would change mean_or_median name to mode or something similar. Even a boolean I think would be better in this case.

How did we decide the num_stds and how does it impact across the different datasets?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's change it to mode and then the user can provide a string key to a dictionary that selects the relevant numpy function?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be ok with it, we can also just leave the current if and just renaming the variable otherwise

assert(
mean_or_median in ["mean", "median"],
f"{mean_or_median} is not a valid option, should be one of ['mean', 'median']"
)
logger.info(f"Removing outliers outside {mean_or_median} +/- {num_stds} stds")
fn = np.median if mean_or_median == "median" else np.mean
mid = fn(formation_E)
mask = np.logical_or(formation_E < mid - num_stds * formation_E.std(), formation_E > mid + num_stds * formation_E.std())
formation_E = formation_E[~mask] # TODO: Christian, your formation E values are different than the ones I calculated yesterday, not sure why?
for key in self.data:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure either. We should check together. Are we sure we are confronting the same units?

# TODO: We need a way to map the mask to the 'atomic_inputs' array
if key != "atomic_inputs":
self.data[key] = self.data[key][~mask,...]
return np.expand_dims(formation_E, axis=0).T

def _precompute_E(self):
splits_idx = self.data["position_idx_range"][:, 1]
s = np.array(self.data["atomic_inputs"][:, :2], dtype=int)
Expand All @@ -176,7 +199,12 @@ def _precompute_E(self):
c = np.cumsum(np.append([0], matrix))[splits_idx]
c[1:] = c[1:] - c[:-1]
E.append(converted_energy_data[:, i] - c)
E = np.array(E).T
E = np.array(E).T # this is the formation energy

# remove outliers if requested in __init__
if self.remove_outliers:
E = self._remove_outliers(np.squeeze(E.T))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorporate this if inside the remove_outliers function.

inter_E_mean = np.nanmean(E / self.data["n_atoms"][:, None], axis=0)
inter_E_std = np.nanstd(E / self.data["n_atoms"][:, None], axis=0)
formation_E_mean = np.nanmean(E, axis=0)
Expand Down
Loading