Interaction Datasets #40

Merged
merged 51 commits into from
Mar 14, 2024
Changes from 31 commits
bd3fcf9
started splitting datasets into 'interaction' and 'potential'
mcneela Mar 1, 2024
a800ea5
add num_unique_molecules property
mcneela Mar 1, 2024
9d6fca6
added logging
mcneela Mar 1, 2024
794e63f
started base interaction dataset
mcneela Mar 1, 2024
0db4765
add interaction __init__ file and revise potential __init__ file
mcneela Mar 4, 2024
6e5a002
add des370k interaction to config_factory.py
mcneela Mar 4, 2024
8e1e003
have BaseInteractionDataset inherit BaseDataset
mcneela Mar 4, 2024
d68bae6
implemented read_raw_entries for DES370K
mcneela Mar 4, 2024
5e94d67
finished implementation of DES370K interaction
mcneela Mar 4, 2024
3c9508b
finished implementation of DES370K interaction
mcneela Mar 4, 2024
768fb2e
update BaseDataset import path
mcneela Mar 4, 2024
8aeadd8
added Metcalf dataset
mcneela Mar 5, 2024
9cf6034
updated DES370K based on Prudencio's comments
mcneela Mar 5, 2024
ce2c53b
Merge branch 'interaction' into metcalf
mcneela Mar 5, 2024
6206665
added const molecule_groups lookup for DES370K dataset
mcneela Mar 5, 2024
5cb57d9
updated subsets for DES370K
mcneela Mar 5, 2024
e18b710
added download url for des5m_interaction
mcneela Mar 5, 2024
54cadbf
updated README with new datasets
mcneela Mar 5, 2024
7f83eb5
Merge branch 'metcalf' into interaction
mcneela Mar 5, 2024
a922ef7
Added DES5M dataset
mcneela Mar 5, 2024
2146058
added des_s66 dataset
mcneela Mar 6, 2024
4d9a4ba
added DESS66x8 dataset
mcneela Mar 6, 2024
c2229e3
small update to __init__ file
mcneela Mar 6, 2024
9349454
added L7 dataset
mcneela Mar 6, 2024
c3bdc64
added X40 dataset
mcneela Mar 6, 2024
23c0739
add new datasets to __init__.py
mcneela Mar 6, 2024
74f87a6
added splinter dataset
mcneela Mar 7, 2024
f046ea9
fixed a couple splinter things
mcneela Mar 7, 2024
3c84ee9
update default data shapes for interaction datasets
mcneela Mar 7, 2024
04c81ae
updated test_dummy.py with new import structure
mcneela Mar 7, 2024
11e2858
fix test_import.py
mcneela Mar 7, 2024
78f0423
code cleanup for the linter
mcneela Mar 8, 2024
bd58fdf
fix ani import
mcneela Mar 8, 2024
5dfcf55
Merge branch 'refactoring' into interaction
mcneela Mar 8, 2024
4bc3a49
fix base dataset import
mcneela Mar 8, 2024
b046eea
black formatting
mcneela Mar 8, 2024
fe54044
ran precommit
mcneela Mar 8, 2024
ef2528c
removed DES from datasets/__init__.py
mcneela Mar 8, 2024
c0ef5b1
removed DES from datasets/__init__.py
mcneela Mar 8, 2024
ad55296
fix X40 energy methods
mcneela Mar 8, 2024
0a51e7c
added interaction dataset docstrings
mcneela Mar 8, 2024
b6c3a6a
update readme with all interaction datasets
mcneela Mar 8, 2024
07f70b8
update metcalf __energy_methods__
mcneela Mar 8, 2024
1443450
refactored des370k and des5m
mcneela Mar 8, 2024
802b70b
update base interaction dataset to add n_atoms_first property
mcneela Mar 8, 2024
e969b54
update L7 and X40 to use python base yaml package
mcneela Mar 12, 2024
5725fed
modify interaction/base.py to save keys other than force/energy in pr…
mcneela Mar 13, 2024
6c6b286
fix base dataset issue
mcneela Mar 13, 2024
46c5ebe
fix circular imports
mcneela Mar 13, 2024
d5ec053
merge origin/develop into interaction
mcneela Mar 13, 2024
cb9987c
removed print statements
mcneela Mar 14, 2024
10 changes: 9 additions & 1 deletion README.md
@@ -30,7 +30,7 @@ pytest
6. QM Level of Theory
-->

We provide support for the following publicly available QM Datasets.
We provide support for the following publicly available QM Potential Energy Datasets.

| Dataset | # Molecules | # Conformers | Average Conformers per Molecule | Force Labels | Atom Types | QM Level of Theory | Off-Equilibrium Conformations|
| --- | --- | --- | --- | --- | --- | --- | --- |
@@ -46,3 +46,11 @@ We provide support for the following publicly available QM Datasets.
| [OrbNet Denali](https://arxiv.org/abs/2107.00299) | 212,905 | 2,300,000 | 11 | No | 16 | GFN1-xTB | Yes |
| [SN2RXN](https://pubs.acs.org/doi/10.1021/acs.jctc.9b00181) | 39 | 452709 | 11,600 | Yes | 6 | DSD-BLYP-D3(BJ)/def2-TZVP | |
| [QM7X](https://www.nature.com/articles/s41597-021-00812-2) | 6,950 | 4,195,237 | 603 | Yes | 7 | PBE0+MBD | Yes |

We also provide support for the following publicly available QM Noncovalent Interaction Energy Datasets.

| Dataset |
| --- |
| [DES370K](https://www.nature.com/articles/s41597-021-00833-x) |
| [DES5M](https://www.nature.com/articles/s41597-021-00833-x) |
| [Metcalf](https://pubs.aip.org/aip/jcp/article/152/7/074103/1059677/Approaches-for-machine-learning-intermolecular) |
Contributor

Please add the other datasets here as well

Collaborator Author

They've now been added! We still need to calculate the statistics for each dataset.

73 changes: 73 additions & 0 deletions src/openqdc/datasets/interaction/L7.py
@@ -0,0 +1,73 @@
import os
import numpy as np
import pandas as pd

from typing import Dict, List

from tqdm import tqdm
from rdkit import Chem
from ruamel.yaml import YAML
from loguru import logger
from openqdc.datasets.interaction import BaseInteractionDataset
from openqdc.utils.molecule import atom_table, molecule_groups


class L7(BaseInteractionDataset):
    __name__ = "L7"
    __energy_unit__ = "hartree"
    __distance_unit__ = "ang"
    __forces_unit__ = "hartree/ang"
    __energy_methods__ = [
        "CSD(T) | QCISD(T)",
Contributor

Why "CSD(T) | QCISD(T)"? Pick the one most used in the other datasets

Contributor

Ignore this if the comment above makes sense.

Collaborator Author

I annotated the energy this way because the paper says the energies were computed using CSD(T) or QCISD(T), but the dataset provides no labels showing which dimer pairs were computed with CSD(T) and which with QCISD(T).

        "DLPNO-CCSD(T)",
        "MP2/CBS",
        "MP2C/CBS",
        "fixed",
        "DLPNO-CCSD(T0)",
        "LNO-CCSD(T)",
        "FN-DMC",
    ]

    energy_target_names = []

    def read_raw_entries(self) -> List[Dict]:
        yaml_fpath = os.path.join(self.root, "l7.yaml")
        logger.info(f"Reading L7 interaction data from {self.root}")
        yaml_file = open(yaml_fpath, "r")
        yaml = YAML()
        data = []
        data_dict = yaml.load(yaml_file)
        charge0 = int(data_dict["description"]["global_setup"]["molecule_a"]["charge"])
        charge1 = int(data_dict["description"]["global_setup"]["molecule_b"]["charge"])

        for idx, item in enumerate(data_dict["items"]):
            energies = []
            name = np.array([item["shortname"]])
            fname = item["geometry"].split(":")[1]
            energies.append(item["reference_value"])
            xyz_file = open(os.path.join(self.root, f"{fname}.xyz"), "r")
            lines = list(map(lambda x: x.strip().split(), xyz_file.readlines()))
            lines.pop(1)
            n_atoms = np.array([int(lines[0][0])], dtype=np.int32)
            n_atoms_first = np.array([int(item["setup"]["molecule_a"]["selection"].split("-")[1])], dtype=np.int32)
            subset = np.array([item["group"]])
            energies += [float(val[idx]) for val in list(data_dict["alternative_reference"].values())]
            energies = np.array([energies], dtype=np.float32)
            pos = np.array(lines[1:])[:, 1:].astype(np.float32)
            elems = np.array(lines[1:])[:, 0]
            atomic_nums = np.expand_dims(np.array([atom_table.GetAtomicNumber(x) for x in elems]), axis=1)
            natoms0 = n_atoms_first[0]
            natoms1 = n_atoms[0] - natoms0
            charges = np.expand_dims(np.array([charge0] * natoms0 + [charge1] * natoms1), axis=1)
            atomic_inputs = np.concatenate((atomic_nums, charges, pos), axis=-1, dtype=np.float32)

            item = dict(
                energies=energies,
                subset=subset,
                n_atoms=n_atoms,
                n_atoms_first=n_atoms_first,
                atomic_inputs=atomic_inputs,
                name=name,
            )
            data.append(item)
        return data
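
For readers following `read_raw_entries`, here is a minimal sketch of how one dimer XYZ file can be turned into the `(n_atoms, 5)` `atomic_inputs` array (atomic number, charge, x, y, z). It is an illustration under assumptions, not the PR's code: `parse_dimer_xyz` and `ATOMIC_NUMBERS` are hypothetical names, the small dictionary stands in for rdkit's periodic table, and the toy geometry is made up. As in the diff, each atom receives the total charge of the monomer it belongs to.

```python
import numpy as np

# Hypothetical lookup standing in for rdkit's periodic table (atom_table in the PR).
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8}


def parse_dimer_xyz(xyz_text, n_atoms_first, charge0=0, charge1=0):
    """Return an (n_atoms, 5) float32 array: atomic number, charge, x, y, z."""
    lines = [ln.split() for ln in xyz_text.strip().splitlines()]
    n_atoms = int(lines[0][0])
    atom_lines = lines[2:2 + n_atoms]  # skip the atom-count line and the comment line
    elems = [ln[0] for ln in atom_lines]
    pos = np.array([ln[1:4] for ln in atom_lines], dtype=np.float32)
    atomic_nums = np.array([[ATOMIC_NUMBERS[e]] for e in elems], dtype=np.float32)
    # Every atom of a monomer carries that monomer's charge, mirroring the diff.
    charges = np.array(
        [[charge0]] * n_atoms_first + [[charge1]] * (n_atoms - n_atoms_first),
        dtype=np.float32,
    )
    return np.concatenate((atomic_nums, charges, pos), axis=-1)


# Made-up geometry: the first two atoms form monomer A, the last atom monomer B.
xyz = """3
toy dimer for illustration only
H 0.0 0.0 0.0
H 0.0 0.0 0.74
O 0.0 0.0 3.0
"""

atomic_inputs = parse_dimer_xyz(xyz, n_atoms_first=2)
print(atomic_inputs.shape)
```

Slicing `lines[2:2 + n_atoms]` rather than popping the comment line keeps the parser tolerant of trailing blank lines, which is a design choice of this sketch rather than something the PR does.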
70 changes: 70 additions & 0 deletions src/openqdc/datasets/interaction/X40.py
@@ -0,0 +1,70 @@
import os
import numpy as np
import pandas as pd

from typing import Dict, List

from tqdm import tqdm
from rdkit import Chem
from ruamel.yaml import YAML
from loguru import logger
from openqdc.datasets.interaction import BaseInteractionDataset
from openqdc.utils.molecule import atom_table, molecule_groups


class X40(BaseInteractionDataset):
    __name__ = "X40"
    __energy_unit__ = "hartree"
    __distance_unit__ = "ang"
    __forces_unit__ = "hartree/ang"
    __energy_methods__ = [
        "default",
        "MP2/CBS",
        "dCCSD(T)/haDZ",
        "dCCSD(T)/haTZ",
        "MP2.5/CBS(aDZ)",
    ]

    energy_target_names = []

    def read_raw_entries(self) -> List[Dict]:
        yaml_fpath = os.path.join(self.root, "x40.yaml")
        logger.info(f"Reading X40 interaction data from {self.root}")
        yaml_file = open(yaml_fpath, "r")
        yaml = YAML()
        data = []
        data_dict = yaml.load(yaml_file)
        charge0 = int(data_dict["description"]["global_setup"]["molecule_a"]["charge"])
        charge1 = int(data_dict["description"]["global_setup"]["molecule_b"]["charge"])

        for idx, item in enumerate(data_dict["items"]):
            energies = []
            name = np.array([item["shortname"]])
            energies.append(float(item["reference_value"]))
            xyz_file = open(os.path.join(self.root, f"{item['shortname']}.xyz"), "r")
            lines = list(map(lambda x: x.strip().split(), xyz_file.readlines()))
            setup = lines.pop(1)
            n_atoms = np.array([int(lines[0][0])], dtype=np.int32)
            n_atoms_first = setup[0].split("-")[1]
            n_atoms_first = np.array([int(n_atoms_first)], dtype=np.int32)
            subset = np.array([item["group"]])
            energies += [float(val[idx]) for val in list(data_dict["alternative_reference"].values())]
            energies = np.array([energies], dtype=np.float32)
            pos = np.array(lines[1:])[:, 1:].astype(np.float32)
            elems = np.array(lines[1:])[:, 0]
            atomic_nums = np.expand_dims(np.array([atom_table.GetAtomicNumber(x) for x in elems]), axis=1)
            natoms0 = n_atoms_first[0]
            natoms1 = n_atoms[0] - natoms0
            charges = np.expand_dims(np.array([charge0] * natoms0 + [charge1] * natoms1), axis=1)
            atomic_inputs = np.concatenate((atomic_nums, charges, pos), axis=-1, dtype=np.float32)

            item = dict(
                energies=energies,
                subset=subset,
                n_atoms=n_atoms,
                n_atoms_first=n_atoms_first,
                atomic_inputs=atomic_inputs,
                name=name,
            )
            data.append(item)
        return data
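
Both L7 and X40 derive the size of the first monomer from a selection token by taking the part after the hyphen. The sketch below isolates that step and adds a sanity check. It assumes the token is a plain `"1-N"` atom range, which the diff implies but does not state; `split_dimer` is a hypothetical helper and is not part of openqdc.

```python
import numpy as np


def split_dimer(selection, n_atoms, charge0, charge1):
    """Derive the first-monomer size and a per-atom charge column from a '1-N' range.

    Assumes `selection` looks like "1-N" (1-based, inclusive); this format is an
    assumption made for illustration.
    """
    first, last = (int(tok) for tok in selection.split("-"))
    if first != 1 or not 1 <= last <= n_atoms:
        raise ValueError(f"unexpected selection {selection!r} for {n_atoms} atoms")
    n_atoms_first = last
    charges = np.array([charge0] * n_atoms_first + [charge1] * (n_atoms - n_atoms_first))
    return n_atoms_first, charges[:, None]  # column vector, shape (n_atoms, 1)


n_first, charges = split_dimer("1-3", n_atoms=5, charge0=0, charge1=1)
print(n_first, charges.ravel().tolist())
```

Validating the range up front turns a malformed token into a clear error instead of a silently wrong monomer split.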
69 changes: 69 additions & 0 deletions src/openqdc/datasets/interaction/__init__.py
@@ -0,0 +1,69 @@
import importlib
import os
from typing import TYPE_CHECKING # noqa F401

# The below lazy import logic is coming from openff-toolkit:
# https://github.com/openforcefield/openff-toolkit/blob/b52879569a0344878c40248ceb3bd0f90348076a/openff/toolkit/__init__.py#L44

# Dictionary of objects to lazily import; maps the object's name to its module path

_lazy_imports_obj = {
    "BaseInteractionDataset": "openqdc.datasets.interaction.base",
    "DES370K": "openqdc.datasets.interaction.des370k",
    "DES5M": "openqdc.datasets.interaction.des5m",
    "Metcalf": "openqdc.datasets.interaction.metcalf",
    "DESS66": "openqdc.datasets.interaction.dess66",
    "DESS66x8": "openqdc.datasets.interaction.dess66x8",
    "L7": "openqdc.datasets.interaction.L7",
    "X40": "openqdc.datasets.interaction.X40",
    "Splinter": "openqdc.datasets.interaction.splinter",
}

_lazy_imports_mod = {}


def __getattr__(name):
    """Lazily import objects from _lazy_imports_obj or _lazy_imports_mod

    Note that this method is only called by Python if the name cannot be found
    in the current module."""
    obj_mod = _lazy_imports_obj.get(name)
    if obj_mod is not None:
        mod = importlib.import_module(obj_mod)
        return mod.__dict__[name]

    lazy_mod = _lazy_imports_mod.get(name)
    if lazy_mod is not None:
        return importlib.import_module(lazy_mod)

    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")


def __dir__():
    """Add _lazy_imports_obj and _lazy_imports_mod to dir(<module>)"""
    keys = (*globals().keys(), *_lazy_imports_obj.keys(), *_lazy_imports_mod.keys())
    return sorted(keys)


if TYPE_CHECKING or os.environ.get("OPENQDC_DISABLE_LAZY_LOADING", "0") == "1":
    from .base import BaseInteractionDataset
    from .des370k import DES370K
    from .des5m import DES5M
    from .metcalf import Metcalf
    from .dess66 import DESS66
    from .dess66x8 import DESS66x8
    from .L7 import L7
    from .X40 import X40
    from .splinter import Splinter

    __all__ = [
        "BaseInteractionDataset",
        "DES370K",
        "DES5M",
        "Metcalf",
        "DESS66",
        "DESS66x8",
        "L7",
        "X40",
        "Splinter",
    ]
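
The lazy-import pattern in this `__init__.py` relies on module-level `__getattr__` (PEP 562). The following self-contained sketch shows the mechanism on a throwaway module object. The `demo` module and its name-to-module mapping are hypothetical and use only the standard library; the caching step via `setattr` is an addition of this sketch and is not something the PR does.

```python
import importlib
import types

# Hypothetical mapping for illustration: attribute name -> module that defines it.
_lazy_imports_obj = {"sqrt": "math", "dumps": "json"}

demo = types.ModuleType("demo")


def _lazy_getattr(name):
    """Import the owning module on first access and hand back the requested object."""
    module_path = _lazy_imports_obj.get(name)
    if module_path is None:
        raise AttributeError(f"module 'demo' has no attribute {name!r}")
    value = getattr(importlib.import_module(module_path), name)
    setattr(demo, name, value)  # cache so later lookups bypass __getattr__
    return value


# PEP 562: a callable named __getattr__ in a module's namespace is invoked
# only when normal attribute lookup on that module fails.
demo.__getattr__ = _lazy_getattr

print("sqrt" in vars(demo))  # nothing has been resolved yet
print(demo.sqrt(9.0))        # first access triggers the lazy lookup
print("sqrt" in vars(demo))  # now cached in the module namespace
```

In the package itself the same hook means that `from openqdc.datasets.interaction import L7` imports only `L7.py` (and its dependencies) rather than every dataset module.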
57 changes: 57 additions & 0 deletions src/openqdc/datasets/interaction/base.py
@@ -0,0 +1,57 @@
from typing import Dict, List, Optional, Union
from openqdc.utils.io import (
    copy_exists,
    dict_to_atoms,
    get_local_cache,
    load_hdf5_file,
    load_pkl,
    pull_locally,
    push_remote,
    set_cache_dir,
)
from openqdc.datasets.potential.base import BaseDataset
from openqdc.utils.constants import (
    NB_ATOMIC_FEATURES
)

from loguru import logger

import numpy as np

class BaseInteractionDataset(BaseDataset):
Contributor

The write and read functions for preprocessed data must be changed here, no? New keys have been added, so the base class must adapt those functions, no?

Contributor

We must also change the logic of a few other functions to avoid the normalization of interaction energies, no @FNTwin?

Collaborator Author

Yes, I still need to update the preprocessing functions to add the new keys. I'm not familiar with the normalization of the energies, Cristian will probably be able to help with that.

    def __init__(
        self,
        energy_unit: Optional[str] = None,
        distance_unit: Optional[str] = None,
        overwrite_local_cache: bool = False,
        cache_dir: Optional[str] = None,
    ) -> None:
        super().__init__(
            energy_unit=energy_unit,
            distance_unit=distance_unit,
            overwrite_local_cache=overwrite_local_cache,
            cache_dir=cache_dir
        )

    def collate_list(self, list_entries: List[Dict]):
        # concatenate entries
        print(list_entries[0])
        res = {key: np.concatenate([r[key] for r in list_entries if r is not None], axis=0) \
               for key in list_entries[0] if not isinstance(list_entries[0][key], dict)}

        csum = np.cumsum(res.get("n_atoms"))
        print(csum)
        x = np.zeros((csum.shape[0], 2), dtype=np.int32)
        x[1:, 0], x[:, 1] = csum[:-1], csum
        res["position_idx_range"] = x

        return res

    @property
    def data_shapes(self):
        return {
            "atomic_inputs": (-1, NB_ATOMIC_FEATURES),
            "position_idx_range": (-1, 2),
            "energies": (-1, len(self.__energy_methods__)),
            "forces": (-1, 3, len(self.force_target_names)),
        }
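
To make the `position_idx_range` bookkeeping in `collate_list` concrete, here is a small standalone sketch of the same cumulative-sum trick on two toy entries. The `collate` function and the toy arrays are illustrative stand-ins, not the class method: the sketch omits the `None` and dict filtering done in the diff.

```python
import numpy as np


def collate(entries):
    """Concatenate per-entry arrays and record each entry's [start, end) atom range."""
    res = {key: np.concatenate([e[key] for e in entries], axis=0) for key in entries[0]}
    csum = np.cumsum(res["n_atoms"])
    idx = np.zeros((csum.shape[0], 2), dtype=np.int32)
    # Entry i starts where entry i-1 ended; the first entry starts at 0.
    idx[1:, 0], idx[:, 1] = csum[:-1], csum
    res["position_idx_range"] = idx
    return res


# Toy entries: a 3-atom system followed by a 2-atom system.
entries = [
    {"n_atoms": np.array([3], dtype=np.int32), "atomic_inputs": np.zeros((3, 5), dtype=np.float32)},
    {"n_atoms": np.array([2], dtype=np.int32), "atomic_inputs": np.ones((2, 5), dtype=np.float32)},
]
out = collate(entries)
print(out["position_idx_range"].tolist())  # [[0, 3], [3, 5]]

# Recover the atoms of the second entry from the flat array.
start, end = out["position_idx_range"][1]
second = out["atomic_inputs"][start:end]
```

Storing half-open `[start, end)` ranges lets each system be sliced out of the flat `atomic_inputs` array without padding.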