Interaction Datasets #40

Merged
merged 51 commits into from
Mar 14, 2024
Changes from 38 commits
Commits
bd3fcf9
started splitting datasets into 'interaction' and 'potential'
mcneela Mar 1, 2024
a800ea5
add num_unique_molecules property
mcneela Mar 1, 2024
9d6fca6
added logging
mcneela Mar 1, 2024
794e63f
started base interaction dataset
mcneela Mar 1, 2024
0db4765
add interaction __init__ file and revise potential __init__ file
mcneela Mar 4, 2024
6e5a002
add des370k interaction to config_factory.py
mcneela Mar 4, 2024
8e1e003
have BaseInteractionDataset inherit BaseDataset
mcneela Mar 4, 2024
d68bae6
implemented read_raw_entries for DES370K
mcneela Mar 4, 2024
5e94d67
finished implementation of DES370K interaction
mcneela Mar 4, 2024
3c9508b
finished implementation of DES370K interaction
mcneela Mar 4, 2024
768fb2e
update BaseDataset import path
mcneela Mar 4, 2024
8aeadd8
added Metcalf dataset
mcneela Mar 5, 2024
9cf6034
updated DES370K based on Prudencio's comments
mcneela Mar 5, 2024
ce2c53b
Merge branch 'interaction' into metcalf
mcneela Mar 5, 2024
6206665
added const molecule_groups lookup for DES370K dataset
mcneela Mar 5, 2024
5cb57d9
updated subsets for DES370K
mcneela Mar 5, 2024
e18b710
added download url for des5m_interaction
mcneela Mar 5, 2024
54cadbf
updated README with new datasets
mcneela Mar 5, 2024
7f83eb5
Merge branch 'metcalf' into interaction
mcneela Mar 5, 2024
a922ef7
Added DES5M dataset
mcneela Mar 5, 2024
2146058
added des_s66 dataset
mcneela Mar 6, 2024
4d9a4ba
added DESS66x8 dataset
mcneela Mar 6, 2024
c2229e3
small update to __init__ file
mcneela Mar 6, 2024
9349454
added L7 dataset
mcneela Mar 6, 2024
c3bdc64
added X40 dataset
mcneela Mar 6, 2024
23c0739
add new datasets to __init__.py
mcneela Mar 6, 2024
74f87a6
added splinter dataset
mcneela Mar 7, 2024
f046ea9
fixed a couple splinter things
mcneela Mar 7, 2024
3c84ee9
update default data shapes for interaction datasets
mcneela Mar 7, 2024
04c81ae
updated test_dummy.py with new import structure
mcneela Mar 7, 2024
11e2858
fix test_import.py
mcneela Mar 7, 2024
78f0423
code cleanup for the linter
mcneela Mar 8, 2024
bd58fdf
fix ani import
mcneela Mar 8, 2024
5dfcf55
Merge branch 'refactoring' into interaction
mcneela Mar 8, 2024
4bc3a49
fix base dataset import
mcneela Mar 8, 2024
b046eea
black formatting
mcneela Mar 8, 2024
fe54044
ran precommit
mcneela Mar 8, 2024
ef2528c
removed DES from datasets/__init__.py
mcneela Mar 8, 2024
c0ef5b1
removed DES from datasets/__init__.py
mcneela Mar 8, 2024
ad55296
fix X40 energy methods
mcneela Mar 8, 2024
0a51e7c
added interaction dataset docstrings
mcneela Mar 8, 2024
b6c3a6a
update readme with all interaction datasets
mcneela Mar 8, 2024
07f70b8
update metcalf __energy_methods__
mcneela Mar 8, 2024
1443450
refactored des370k and des5m
mcneela Mar 8, 2024
802b70b
update base interaction dataset to add n_atoms_first property
mcneela Mar 8, 2024
e969b54
update L7 and X40 to use python base yaml package
mcneela Mar 12, 2024
5725fed
modify interaction/base.py to save keys other than force/energy in pr…
mcneela Mar 13, 2024
6c6b286
fix base dataset issue
mcneela Mar 13, 2024
46c5ebe
fix circular imports
mcneela Mar 13, 2024
d5ec053
merge origin/develop into interaction
mcneela Mar 13, 2024
cb9987c
removed print statements
mcneela Mar 14, 2024
10 changes: 9 additions & 1 deletion README.md
@@ -53,7 +53,7 @@ openqdc download Spice QMugs
6. QM Level of Theory
-->

-We provide support for the following publicly available QM Datasets.
+We provide support for the following publicly available QM Potential Energy Datasets.

# Potential Energy

@@ -82,3 +82,11 @@ We provide support for the following publicly available QM Datasets.
| --- | --- | --- | --- | --- | --- | --- | --- |
| [DES370K](https://www.nature.com/articles/s41597-021-00833-x) | 3,700 | 370,000 | 100 | No | 20 | CCSD(T) | Yes |
| [DES5M](https://www.nature.com/articles/s41597-021-00833-x) | 3,700 | 5,000,000 | 1351 | No | 20 | SNS-MP2 | Yes |

We also provide support for the following publicly available QM Noncovalent Interaction Energy Datasets.

| Dataset |
| --- |
| [DES370K](https://www.nature.com/articles/s41597-021-00833-x) |
| [DES5M](https://www.nature.com/articles/s41597-021-00833-x) |
| [Metcalf](https://pubs.aip.org/aip/jcp/article/152/7/074103/1059677/Approaches-for-machine-learning-intermolecular) |
1 change: 0 additions & 1 deletion openqdc/datasets/__init__.py
@@ -24,7 +24,6 @@
"ani1ccx": ANI1CCX,
"ani1x": ANI1X,
"comp6": COMP6,
-"des": DES,
"gdml": GDML,
"geom": GEOM,
"iso17": ISO17,
70 changes: 70 additions & 0 deletions openqdc/datasets/interaction/L7.py
@@ -0,0 +1,70 @@
import os
from typing import Dict, List

import numpy as np
from loguru import logger
from ruamel.yaml import YAML

from openqdc.datasets.interaction import BaseInteractionDataset
from openqdc.utils.molecule import atom_table


class L7(BaseInteractionDataset):
    __name__ = "L7"
    __energy_unit__ = "hartree"
    __distance_unit__ = "ang"
    __forces_unit__ = "hartree/ang"
    __energy_methods__ = [
        "CSD(T) | QCISD(T)",
        "DLPNO-CCSD(T)",
        "MP2/CBS",
        "MP2C/CBS",
        "fixed",
        "DLPNO-CCSD(T0)",
        "LNO-CCSD(T)",
        "FN-DMC",
    ]

    energy_target_names = []

    def read_raw_entries(self) -> List[Dict]:
        yaml_fpath = os.path.join(self.root, "l7.yaml")
        logger.info(f"Reading L7 interaction data from {self.root}")
        yaml = YAML()
        data = []
        # Use a context manager so the YAML file handle is always closed.
        with open(yaml_fpath, "r") as yaml_file:
            data_dict = yaml.load(yaml_file)
        charge0 = int(data_dict["description"]["global_setup"]["molecule_a"]["charge"])
        charge1 = int(data_dict["description"]["global_setup"]["molecule_b"]["charge"])

        for idx, item in enumerate(data_dict["items"]):
            energies = []
            name = np.array([item["shortname"]])
            fname = item["geometry"].split(":")[1]
            energies.append(item["reference_value"])
            with open(os.path.join(self.root, f"{fname}.xyz"), "r") as xyz_file:
                lines = list(map(lambda x: x.strip().split(), xyz_file.readlines()))
            lines.pop(1)  # drop the XYZ comment line
            n_atoms = np.array([int(lines[0][0])], dtype=np.int32)
            # The selection is a range such as "1-N"; N is the size of molecule A.
            n_atoms_first = np.array([int(item["setup"]["molecule_a"]["selection"].split("-")[1])], dtype=np.int32)
            subset = np.array([item["group"]])
            energies += [float(val[idx]) for val in list(data_dict["alternative_reference"].values())]
            energies = np.array([energies], dtype=np.float32)
            pos = np.array(lines[1:])[:, 1:].astype(np.float32)
            elems = np.array(lines[1:])[:, 0]
            atomic_nums = np.expand_dims(np.array([atom_table.GetAtomicNumber(x) for x in elems]), axis=1)
            natoms0 = n_atoms_first[0]
            natoms1 = n_atoms[0] - natoms0
            # Every atom carries the total charge of the molecule it belongs to.
            charges = np.expand_dims(np.array([charge0] * natoms0 + [charge1] * natoms1), axis=1)
            atomic_inputs = np.concatenate((atomic_nums, charges, pos), axis=-1, dtype=np.float32)

            item = dict(
                energies=energies,
                subset=subset,
                n_atoms=n_atoms,
                n_atoms_first=n_atoms_first,
                atomic_inputs=atomic_inputs,
                name=name,
            )
            data.append(item)
        return data
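
For reviewers who want to check the XYZ handling in isolation, here is a minimal sketch of the parsing step in `read_raw_entries`. It is not the PR's code: `ATOMIC_NUMBERS` is a hypothetical stand-in for `atom_table.GetAtomicNumber`, `parse_xyz` is an illustrative helper name, the water geometry is made up for the example, and the charge column is left out for brevity.

```python
import numpy as np

# Hypothetical stand-in for openqdc's atom_table; covers only the elements used below.
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8}


def parse_xyz(text: str) -> np.ndarray:
    """Return an (n_atoms, 4) float32 array: atomic number, x, y, z."""
    lines = [line.split() for line in text.strip().splitlines()]
    n_atoms = int(lines[0][0])
    # Line 0 is the atom count and line 1 is the comment; atoms start at line 2.
    atom_lines = lines[2 : 2 + n_atoms]
    numbers = np.array([[ATOMIC_NUMBERS[sym]] for sym, *_ in atom_lines], dtype=np.float32)
    coords = np.array([[float(v) for v in xyz] for _, *xyz in atom_lines], dtype=np.float32)
    return np.concatenate((numbers, coords), axis=-1)


# Illustrative geometry, not taken from the L7 dataset.
WATER_XYZ = """3
water monomer (illustrative geometry)
O 0.000 0.000 0.117
H 0.000 0.757 -0.471
H 0.000 -0.757 -0.471
"""

water = parse_xyz(WATER_XYZ)
print(water.shape)
```

Slicing with `2 : 2 + n_atoms` rather than `lines[1:]` after a `pop` means trailing blank lines in the file cannot produce a ragged array, which may be worth considering for the dataset readers as well.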
67 changes: 67 additions & 0 deletions openqdc/datasets/interaction/X40.py
@@ -0,0 +1,67 @@
import os
from typing import Dict, List

import numpy as np
from loguru import logger
from ruamel.yaml import YAML

from openqdc.datasets.interaction import BaseInteractionDataset
from openqdc.utils.molecule import atom_table


class X40(BaseInteractionDataset):
    __name__ = "X40"
    __energy_unit__ = "hartree"
    __distance_unit__ = "ang"
    __forces_unit__ = "hartree/ang"
    __energy_methods__ = [
        "default",
        "MP2/CBS",
        "dCCSD(T)/haDZ",
        "dCCSD(T)/haTZ",
        "MP2.5/CBS(aDZ)",
    ]

    energy_target_names = []

    def read_raw_entries(self) -> List[Dict]:
        yaml_fpath = os.path.join(self.root, "x40.yaml")
        logger.info(f"Reading X40 interaction data from {self.root}")
        yaml = YAML()
        data = []
        # Use a context manager so the YAML file handle is always closed.
        with open(yaml_fpath, "r") as yaml_file:
            data_dict = yaml.load(yaml_file)
        charge0 = int(data_dict["description"]["global_setup"]["molecule_a"]["charge"])
        charge1 = int(data_dict["description"]["global_setup"]["molecule_b"]["charge"])

        for idx, item in enumerate(data_dict["items"]):
            energies = []
            name = np.array([item["shortname"]])
            energies.append(float(item["reference_value"]))
            with open(os.path.join(self.root, f"{item['shortname']}.xyz"), "r") as xyz_file:
                lines = list(map(lambda x: x.strip().split(), xyz_file.readlines()))
            # The first token of the XYZ comment line holds molecule A's atom range.
            setup = lines.pop(1)
            n_atoms = np.array([int(lines[0][0])], dtype=np.int32)
            n_atoms_first = setup[0].split("-")[1]
            n_atoms_first = np.array([int(n_atoms_first)], dtype=np.int32)
            subset = np.array([item["group"]])
            energies += [float(val[idx]) for val in list(data_dict["alternative_reference"].values())]
            energies = np.array([energies], dtype=np.float32)
            pos = np.array(lines[1:])[:, 1:].astype(np.float32)
            elems = np.array(lines[1:])[:, 0]
            atomic_nums = np.expand_dims(np.array([atom_table.GetAtomicNumber(x) for x in elems]), axis=1)
            natoms0 = n_atoms_first[0]
            natoms1 = n_atoms[0] - natoms0
            # Every atom carries the total charge of the molecule it belongs to.
            charges = np.expand_dims(np.array([charge0] * natoms0 + [charge1] * natoms1), axis=1)
            atomic_inputs = np.concatenate((atomic_nums, charges, pos), axis=-1, dtype=np.float32)

            item = dict(
                energies=energies,
                subset=subset,
                n_atoms=n_atoms,
                n_atoms_first=n_atoms_first,
                atomic_inputs=atomic_inputs,
                name=name,
            )
            data.append(item)
        return data
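
Both readers split the dimer into two molecules with `.split("-")[1]` and then tag each atom with a charge. The sketch below isolates that logic. The helper names `first_monomer_size` and `per_atom_charges` are hypothetical, and the assumption that the range always starts at atom 1 is inferred from how the PR indexes the string, not confirmed by the dataset files.

```python
from typing import List


def first_monomer_size(selection: str) -> int:
    """Parse a 1-based inclusive range such as "1-3" and return its upper bound."""
    start, end = selection.split("-")
    # Assumption: molecule A always occupies the leading atoms of the XYZ file.
    if int(start) != 1:
        raise ValueError(f"expected a selection starting at atom 1, got {selection!r}")
    return int(end)


def per_atom_charges(n_atoms: int, n_atoms_first: int, charge_a: int, charge_b: int) -> List[int]:
    """Repeat each molecule's total charge once per atom, molecule A first."""
    if not 0 <= n_atoms_first <= n_atoms:
        raise ValueError("n_atoms_first must lie between 0 and n_atoms")
    return [charge_a] * n_atoms_first + [charge_b] * (n_atoms - n_atoms_first)


size_a = first_monomer_size("1-3")
charges = per_atom_charges(n_atoms=5, n_atoms_first=size_a, charge_a=0, charge_b=1)
print(size_a, charges)
```

Note that, as in the PR, the value stored per atom is the molecule's total charge rather than a partial charge; the explicit range check simply makes a malformed selection fail loudly instead of silently mislabeling atoms.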
70 changes: 68 additions & 2 deletions openqdc/datasets/interaction/__init__.py
@@ -1,3 +1,69 @@
-from .des import DES
import importlib
import os
from typing import TYPE_CHECKING  # noqa F401

-AVAILABLE_DATASETS = {"des": DES}
# The lazy import logic below comes from openff-toolkit:
# https://github.com/openforcefield/openff-toolkit/blob/b52879569a0344878c40248ceb3bd0f90348076a/openff/toolkit/__init__.py#L44

# Dictionary of objects to lazily import; maps the object's name to its module path

_lazy_imports_obj = {
    "BaseInteractionDataset": "openqdc.datasets.interaction.base",
    "DES370K": "openqdc.datasets.interaction.des370k",
    "DES5M": "openqdc.datasets.interaction.des5m",
    "Metcalf": "openqdc.datasets.interaction.metcalf",
    "DESS66": "openqdc.datasets.interaction.dess66",
    "DESS66x8": "openqdc.datasets.interaction.dess66x8",
    "L7": "openqdc.datasets.interaction.L7",
    "X40": "openqdc.datasets.interaction.X40",
    "Splinter": "openqdc.datasets.interaction.splinter",
}

_lazy_imports_mod = {}


def __getattr__(name):
    """Lazily import objects from _lazy_imports_obj or _lazy_imports_mod

    Note that this method is only called by Python if the name cannot be found
    in the current module."""
    obj_mod = _lazy_imports_obj.get(name)
    if obj_mod is not None:
        mod = importlib.import_module(obj_mod)
        return mod.__dict__[name]

    lazy_mod = _lazy_imports_mod.get(name)
    if lazy_mod is not None:
        return importlib.import_module(lazy_mod)

    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")


def __dir__():
    """Add _lazy_imports_obj and _lazy_imports_mod to dir(<module>)"""
    keys = (*globals().keys(), *_lazy_imports_obj.keys(), *_lazy_imports_mod.keys())
    return sorted(keys)


if TYPE_CHECKING or os.environ.get("OPENQDC_DISABLE_LAZY_LOADING", "0") == "1":
    from .base import BaseInteractionDataset
    from .des5m import DES5M
    from .des370k import DES370K
    from .dess66 import DESS66
    from .dess66x8 import DESS66x8
    from .L7 import L7
    from .metcalf import Metcalf
    from .splinter import Splinter
    from .X40 import X40

__all__ = [
    "BaseInteractionDataset",
    "DES370K",
    "DES5M",
    "Metcalf",
    "DESS66",
    "DESS66x8",
    "L7",
    "X40",
    "Splinter",
]
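
The module-level `__getattr__` hook above relies on PEP 562. A self-contained sketch of the same mechanism follows, using an in-memory module so it can run outside the package. The names `lazy_demo`, `_lazy_objects`, and `_lazy_getattr` are hypothetical, `json` and `math` stand in for the dataset modules, and the caching step is an addition that the PR's version does not perform.

```python
import importlib
import types

# Maps an attribute name to the module that defines it. The json and math
# entries are stand-ins for the dataset classes listed in the PR.
_lazy_objects = {"dumps": "json", "sqrt": "math"}

lazy = types.ModuleType("lazy_demo")


def _lazy_getattr(name: str):
    module_path = _lazy_objects.get(name)
    if module_path is None:
        raise AttributeError(f"module 'lazy_demo' has no attribute {name!r}")
    value = getattr(importlib.import_module(module_path), name)
    # Cache on the module so later lookups bypass __getattr__ entirely
    # (an addition for this sketch; the PR re-resolves on every access).
    setattr(lazy, name, value)
    return value


# PEP 562: a callable named __getattr__ in a module's namespace handles
# attribute lookups that would otherwise fail.
lazy.__getattr__ = _lazy_getattr

print(lazy.sqrt(16.0))
```

Because the hook runs only when normal lookup fails, names imported eagerly under `TYPE_CHECKING` or `OPENQDC_DISABLE_LAZY_LOADING=1` never reach it.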
48 changes: 48 additions & 0 deletions openqdc/datasets/interaction/base.py
@@ -0,0 +1,48 @@
from typing import Dict, List, Optional

import numpy as np

from openqdc.datasets.base import BaseDataset
from openqdc.utils.constants import NB_ATOMIC_FEATURES


class BaseInteractionDataset(BaseDataset):
    def __init__(
        self,
        energy_unit: Optional[str] = None,
        distance_unit: Optional[str] = None,
        overwrite_local_cache: bool = False,
        cache_dir: Optional[str] = None,
    ) -> None:
        super().__init__(
            energy_unit=energy_unit,
            distance_unit=distance_unit,
            overwrite_local_cache=overwrite_local_cache,
            cache_dir=cache_dir,
        )

    def collate_list(self, list_entries: List[Dict]):
        # concatenate entries
        print(list_entries[0])
        res = {
            key: np.concatenate([r[key] for r in list_entries if r is not None], axis=0)
            for key in list_entries[0]
            if not isinstance(list_entries[0][key], dict)
        }

        csum = np.cumsum(res.get("n_atoms"))
        print(csum)
        x = np.zeros((csum.shape[0], 2), dtype=np.int32)
        # Review comment (Collaborator): Remove the print
        x[1:, 0], x[:, 1] = csum[:-1], csum
        res["position_idx_range"] = x

        return res

    @property
    def data_shapes(self):
        return {
            "atomic_inputs": (-1, NB_ATOMIC_FEATURES),
            "position_idx_range": (-1, 2),
            "energies": (-1, len(self.__energy_methods__)),
            "forces": (-1, 3, len(self.force_target_names)),
        }
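
The `position_idx_range` bookkeeping in `collate_list` is easy to misread in its tuple-assignment form, so here is a minimal sketch of the same cumulative-sum computation on its own. The function name `position_idx_range` and the sample atom counts are illustrative choices, not part of the PR.

```python
import numpy as np


def position_idx_range(n_atoms: np.ndarray) -> np.ndarray:
    """Return an (n_entries, 2) array of [start, stop) row offsets."""
    csum = np.cumsum(n_atoms)
    ranges = np.zeros((csum.shape[0], 2), dtype=np.int32)
    ranges[1:, 0] = csum[:-1]  # each entry starts where the previous one stopped
    ranges[:, 1] = csum  # exclusive stop offset of each entry
    return ranges


n_atoms = np.array([3, 2, 4], dtype=np.int32)
ranges = position_idx_range(n_atoms)
print(ranges.tolist())  # → [[0, 3], [3, 5], [5, 9]]
```

Row `i` of the result can then slice the concatenated `atomic_inputs` array as `atomic_inputs[start:stop]` to recover the atoms of entry `i`.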
93 changes: 0 additions & 93 deletions openqdc/datasets/interaction/des.py

This file was deleted.
