Statistics abstraction pattern #74

FNTwin · 2024-04-02T15:23:16Z

Checklist:

Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
Update the API documentation if a new function is added or an existing one is deleted.

Abstract the regression for linear atom energies
Clean code in base
Add mixin class to extend properties in base class
Abstract energy statistics to state pattern + interfaces
Automatic saving and loading of the right files for statistics
Add formation energy and per atom formation energy to the getitem object return
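
As a rough illustration of the first item, the sketch below assumes that "regression for linear atom energies" means fitting one reference energy per element by least squares, so that a molecule's total energy is approximated by the sum of its atoms' fitted energies. The data, the element ordering, and the variable names are made up for the example and are not taken from openqdc.

```python
import numpy as np

# Toy data: one row per molecule, one column per element (H, C, O),
# each entry is how many atoms of that element the molecule contains.
counts = np.array(
    [
        [2, 0, 1],  # H2O
        [4, 1, 0],  # CH4
        [0, 1, 2],  # CO2
        [2, 1, 1],  # CH2O
    ],
    dtype=float,
)

# Made-up per-element energies used only to generate consistent totals.
true_e0 = np.array([-0.5, -37.8, -75.0])
energies = counts @ true_e0

# Least-squares fit of per-element energies from counts and total energies.
fitted_e0, *_ = np.linalg.lstsq(counts, energies, rcond=None)

# What is left after removing the fitted per-atom contributions.
residual = energies - counts @ fitted_e0

print(fitted_e0)
```

With noise-free toy data the fit recovers the generating values and the residual is numerically zero; on real data the residual is the part of the energy not explained by composition alone.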

openqdc/datasets/energies.py

shenoynikhil · 2024-04-02T15:39:27Z

openqdc/datasets/properties.py

+from openqdc.utils.exceptions import StatisticsNotAvailableError
+
+
+class DatasetPropertyMixIn:


What is the benefit of this class?

Less stuff in base

I think with statistics manager and descriptor calculation, the base is already reduced in size, I'm unsure if we should create a new class just for these 4-5 straightforward methods.

I see your point, but this way we get finer granularity over which type of property we add, and the base class stays a bit cleaner.
We can also add new properties by updating a file outside base.py, so further PRs will be easier to resolve because that file is not a critical Python file.

Okay. I would still be more likely to add methods to the base class than this one. But feel free to close this thread if you think this is valuable.

I like the class but agree with @shenoynikhil . It is not needed but it makes it easier to navigate the codebase

I don't have a strong opinion on keeping the class. I just used it to remove a bit of clutter from the base class. If @shenoynikhil or @prtos want to have the properties in the main class, I'm fine with removing the mixin and reimplementing them in the base class.
I see it being useful because it lets us further separate the properties of the base class for the potential energy datasets from those of the interaction datasets, and lets us implement new properties without touching the base.py file.
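
A minimal sketch of the mixin idea discussed in this thread. All names here (`PropertyMixInSketch`, `BaseDatasetSketch`, `n_atoms`, `average_n_atoms`) are hypothetical, and the local `StatisticsNotAvailableError` is only a stand-in for the exception imported in the diff; this is not the actual openqdc implementation.

```python
class StatisticsNotAvailableError(Exception):
    """Stand-in for the exception imported in properties.py (assumed behaviour)."""


class PropertyMixInSketch:
    """Hypothetical mixin: holds read-only properties so the base class stays small."""

    @property
    def n_atoms(self):
        # Relies on the host class providing `self.data`.
        return [len(z) for z in self.data["atomic_numbers"]]

    @property
    def average_n_atoms(self):
        counts = self.n_atoms
        if not counts:
            raise StatisticsNotAvailableError("no molecules loaded")
        return sum(counts) / len(counts)


class BaseDatasetSketch(PropertyMixInSketch):
    """Hypothetical base class: only owns the data, properties come from the mixin."""

    def __init__(self, data):
        self.data = data


ds = BaseDatasetSketch({"atomic_numbers": [[1, 1, 8], [6, 1, 1, 1, 1]]})
print(ds.average_n_atoms)  # 4.0
```

The trade-off raised in the review still applies: the mixin keeps property code in its own file, at the cost of one more class to follow when reading the base class.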

openqdc/datasets/statistics.py

shenoynikhil

Can you add docstrings, usage instructions for StatisticsManager, Descriptors, etc?

In some instances you could use better variable names; for example, deps in StatisticsManager is a bit confusing.

Also, tests please.

FNTwin · 2024-04-02T17:51:32Z

Adding tests and then I'm probably done

openqdc/datasets/base.py

openqdc/datasets/energies.py

prtos · 2024-04-03T14:54:27Z

openqdc/datasets/energies.py

+    """
+
+    def _post_init(self):
+        self._e0_matrixs = [IsolatedAtomEnergyFactory.get_matrix(en_method) for en_method in self.data.energy_methods]


needs to be updated now

This will be done while solving the merge issues

prtos · 2024-04-03T18:54:15Z

openqdc/datasets/properties.py

+from openqdc.utils.exceptions import StatisticsNotAvailableError
+
+
+class DatasetPropertyMixIn:


I like the class but agree with @shenoynikhil . It is not needed but it makes it easier to navigate the codebase

prtos · 2024-04-03T21:05:12Z

openqdc/datasets/statistics.py

+        force_mean = np.nanmean(converted_force_data, axis=0)
+        force_std = np.nanstd(converted_force_data, axis=0)
+        force_rms = np.sqrt(np.nanmean(converted_force_data**2, axis=0))
+        return ForceStatistics(


I am confused by this, specifically the component part.

For our multitask losses we need some information about the RMS of the forces in the dataset along the x, y, z components of the force vectors.
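
To make the "component" part concrete, here is a small sketch that applies the same three NumPy reductions as the diff to a toy array. It assumes the force data has shape `(n_atoms, 3)`; the array name `forces` and its values are illustrative only.

```python
import numpy as np

# Toy forces for 4 atoms, shape (n_atoms, 3); NaN marks a missing entry.
forces = np.array(
    [
        [1.0, 0.0, -2.0],
        [-1.0, 2.0, 2.0],
        [1.0, np.nan, -2.0],
        [-1.0, -2.0, 2.0],
    ]
)

# axis=0 reduces over atoms, leaving one value per x, y, z component.
# The nan* variants ignore missing entries instead of propagating NaN.
force_mean = np.nanmean(forces, axis=0)
force_std = np.nanstd(forces, axis=0)
force_rms = np.sqrt(np.nanmean(forces**2, axis=0))

print(force_rms.shape)  # (3,)
```

Each statistic is therefore a length-3 vector, one entry per Cartesian component, rather than a single scalar over all force values.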

FNTwin added 22 commits March 22, 2024 12:59

WIP

9a11358

WIP

9745b4c

Moved function to utils.io

8835b8f

Additional dependencies

9a2e1d8

Removed Mixin

6dd9be6

WIP Preprocess + property

78c9f15

Refactored Descriptor class

55a6692

Removed not used funcs

5b6ec7e

Added ACSF

736c12a

Import fix

8503b98

Fixes

3b4823e

Many-body Tensor Representation

356adb1

Included dscribe in main deps, removed torch and jax

20caa4f

wip

38675f3

MTBR incorrect signature Fix (Thanks Danny)

4dd2af8

State Manager and Chain of Management

cb30c16

WIP

571b706

Separated Statistics and Atom Energies

e6d51b7

WIP

7a20193

Parallelized function, better calculate specs, docstrings

cef2b35

descriptor tests

dbdd985

Regression + Statistics abstraction

c7ab437