Merge pull request #39 from OpenDrugDiscovery/docstring_enhancements
Updated Readme, Docstrings datasets and small additions notebook
FNTwin authored Mar 5, 2024
2 parents cedf344 + d695cde commit 9b70b84
Showing 20 changed files with 206 additions and 53 deletions.
23 changes: 18 additions & 5 deletions README.md
@@ -52,17 +52,30 @@ openqdc download --datasets Spice QMugs

We provide support for the following publicly available QM Datasets.

# Potential Energy

| Dataset | # Molecules | # Conformers | Average Conformers per Molecule | Force Labels | Atom Types | QM Level of Theory | Off-Equilibrium Conformations|
| --- | --- | --- | --- | --- | --- | --- | --- |
| [ANI](https://pubs.rsc.org/en/content/articlelanding/2017/SC/C6SC05720A) | 57,462 | 20,000,000 | 348 | No | 4 | ωB97x:6-31G(d) | Yes |
| [GEOM](https://www.nature.com/articles/s41597-022-01288-4) | 450,000 | 37,000,000 | 82 | No | 18 | GFN2-xTB | No |
| [Molecule3D](https://arxiv.org/abs/2110.01717) | 3,899,647 | 3,899,647 | 1 | No | 5 | B3LYP/6-31G* | No |
| [NablaDFT](https://pubs.rsc.org/en/content/articlelanding/2022/CP/D2CP03966D) | 1,000,000 | 5,000,000 | 5 | No | 6 | ωB97X-D/def2-SVP | |
| [OrbNet Denali](https://arxiv.org/abs/2107.00299) | 212,905 | 2,300,000 | 11 | No | 16 | GFN1-xTB | Yes |
| [PCQM_PM6](https://pubs.acs.org/doi/abs/10.1021/acs.jcim.0c00740) | | | 1 | No | | PM6 | No |
| [PCQM_B3LYP](https://arxiv.org/abs/2305.18454) | 85,938,443 | 85,938,443 | 1 | No | | B3LYP/6-31G* | No |
| [QMugs](https://www.nature.com/articles/s41597-022-01390-7) | 665,000 | 2,000,000 | 3 | No | 10 | GFN2-xTB, ωB97X-D/def2-SVP | No |
| [QM7X](https://www.nature.com/articles/s41597-021-00812-2) | 6,950 | 4,195,237 | 603 | Yes | 7 | PBE0+MBD | Yes |
| [SN2RXN](https://pubs.acs.org/doi/10.1021/acs.jctc.9b00181) | 39 | 452,709 | 11,600 | Yes | 6 | DSD-BLYP-D3(BJ)/def2-TZVP | |
| [SolvatedPeptides](https://doi.org/10.1021/acs.jctc.9b00181) | | 2,731,180 | | Yes | | revPBE-D3(BJ)/def2-TZVP | |
| [Spice](https://arxiv.org/abs/2209.10702) | 19,238 | 1,132,808 | 59 | Yes | 15 | ωB97M-D3(BJ)/def2-TZVPPD | Yes |
| [ANI](https://pubs.rsc.org/en/content/articlelanding/2017/SC/C6SC05720A) | 57,462 | 20,000,000 | 348 | No | 4 | ωB97x:6-31G(d) | Yes |
| [tmQM](https://pubs.acs.org/doi/10.1021/acs.jcim.0c01041) | 86,665 | | | No | | TPSSh-D3BJ/def2-SVP | |
| [tmQM](https://pubs.acs.org/doi/10.1021/acs.jcim.0c01041) | 86,665 | 86,665 | 1 | No | | TPSSh-D3BJ/def2-SVP | |
| [Transition1X](https://www.nature.com/articles/s41597-022-01870-w) | | 9,654,813 | | Yes | | ωB97x/6-31G(d) | Yes |
| [WaterClusters](https://doi.org/10.1063/1.5128378) | 1 | 4,464,740 | | No | 2 | TTM2.1-F | Yes |


# Interaction Energy

| Dataset | # Molecules | # Conformers | Average Conformers per Molecule | Force Labels | Atom Types | QM Level of Theory | Off-Equilibrium Conformations|
| --- | --- | --- | --- | --- | --- | --- | --- |
| [DES370K](https://www.nature.com/articles/s41597-021-00833-x) | 3,700 | 370,000 | 100 | No | 20 | CCSD(T) | Yes |
| [DES5M](https://www.nature.com/articles/s41597-021-00833-x) | 3,700 | 5,000,000 | 1351 | No | 20 | SNS-MP2 | Yes |
| [OrbNet Denali](https://arxiv.org/abs/2107.00299) | 212,905 | 2,300,000 | 11 | No | 16 | GFN1-xTB | Yes |
| [SN2RXN](https://pubs.acs.org/doi/10.1021/acs.jctc.9b00181) | 39 | 452709 | 11,600 | Yes | 6 | DSD-BLYP-D3(BJ)/def2-TZVP | |
| [QM7X](https://www.nature.com/articles/s41597-021-00812-2) | 6,950 | 4,195,237 | 603 | Yes | 7 | PBE0+MBD | Yes |
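
The "Average Conformers per Molecule" column can be recomputed from the two count columns. The following is a minimal sketch using three rows copied from the tables above; the helper name `avg_conformers` is illustrative and not part of openqdc. The README values appear to be rounded inconsistently (some down, some to nearest), so only approximate agreement should be expected.

```python
# Illustrative only: recompute "Average Conformers per Molecule" from the
# "# Molecules" and "# Conformers" columns of the README tables.
rows = {
    "ANI": (57_462, 20_000_000),
    "GEOM": (450_000, 37_000_000),
    "DES370K": (3_700, 370_000),
}


def avg_conformers(n_molecules: int, n_conformers: int) -> float:
    """Mean number of conformers per unique molecule."""
    return n_conformers / n_molecules


for name, (n_mol, n_conf) in rows.items():
    print(f"{name}: {avg_conformers(n_mol, n_conf):.1f}")
```

Rows with an empty count column (for example PCQM_PM6) cannot be checked this way.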
26 changes: 16 additions & 10 deletions docs/tutorials/usage.ipynb
@@ -4,12 +4,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# OpenQDC Hands On tutorial\n",
"# OpenQDC Hands-on Tutorial\n",
"\n",
"## Instantiate and GO!\n",
"\n",
"If you don't have the dataset downloaded it will be downloaded automatically and cached. You just instantiate the class and you are ready to go.\n",
"Change of units are done automatically on loading based on the units in the dataset."
"If you don't have the dataset downloaded, it will be downloaded automatically and cached. You just instantiate the class and you are ready to go.\n",
"Change of units is done automatically upon loading based on the units of the dataset.\n",
"\n",
"Supported energy units: [\"kcal/mol\", \"kj/mol\", \"hartree\", \"ev\"]\n",
"\n",
"Supported distance units: [\"ang\", \"nm\", \"bohr\"]"
]
},
{
@@ -53,7 +57,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Items from the dataset object class are obtained through the get method.\n",
"### Items from the dataset object class are obtained through the \"get\" method.\n",
"\n",
"The dictionary of the item contains different important keys:\n",
"- 'positions' : numpy array of the 3d atomic positions (n x 3)\n",
@@ -63,7 +67,7 @@
"- 'energies': potential energy of the molecule (n_level_of_theories)\n",
"- 'name': name or SMILES (if present) of the molecule\n",
"- 'subset': subset of the dataset the molecule belongs to\n",
"- 'forces': if presentes the forces on the atoms (n x 3 x n_level_of_theories_forces)"
"- 'forces': if present, the forces on the atoms (n x 3 x n_level_of_theories_forces)"
]
},
{
@@ -257,9 +261,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Alternatevely we can also retrieve the data from the dataset object class as ase.Atoms using the get_ase_atoms\n",
"\n",
"The dictionary of the item contains different important keys:"
"### Alternatively, we can also retrieve the data from the dataset object class as ase.Atoms using the get_ase_atoms method."
]
},
{
@@ -444,7 +446,7 @@
"source": [
"### Iterators \n",
"\n",
"The method as_iter(atoms=False) returns an iterator over the dataset. If atoms is True the iterator returns the data as ase.Atoms objects. Otherwise it returns the dictionary of the item."
"The method as_iter(atoms=False) returns an iterator over the dataset. If atoms is True, the iterator returns the data as ase.Atoms objects. Otherwise, it returns the dictionary of the item."
]
},
{
@@ -651,7 +653,11 @@
"source": [
"### Isolated atoms energies [e0s]\n",
"\n",
"The isolated atoms energies are automatically used inside the datasets for the correct level of theory but you can also use them directly by accessing the IsolatedAtomEnergyFactor class."
"The potential energy of the system can be decomposed into the sum of isolated atom energies and the formation energy.\n",
"\n",
"$U(A_1, A_2, ...) = \\sum_{i=1}^N e_0(A_i) + e(A_1, A_2, ...)$\n",
"\n",
"The isolated atoms energies are automatically used inside the datasets for the correct level of theory, but you can also use them directly by accessing the IsolatedAtomEnergyFactor class."
]
},
{
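
The decomposition described in the notebook cell above, U = sum of e0(A_i) plus the formation energy e, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the isolated atom energies and the total energy are made-up placeholder numbers, and `formation_energy` is a hypothetical helper, not the API of the IsolatedAtomEnergyFactor class mentioned in the notebook.

```python
from typing import Dict, Sequence

# Hypothetical isolated atom energies e0 in hartree. Placeholder values for
# illustration only, NOT taken from openqdc or from any level of theory.
E0_HARTREE: Dict[str, float] = {"H": -0.5, "C": -37.8, "O": -75.0}


def formation_energy(total_energy: float, atoms: Sequence[str], e0: Dict[str, float]) -> float:
    """Return e = U - sum_i e0(A_i) for a system made of `atoms`."""
    return total_energy - sum(e0[a] for a in atoms)


water = ["O", "H", "H"]
u_total = -76.4  # made-up total energy U in hartree
e_form = formation_energy(u_total, water, E0_HARTREE)
print(f"formation energy: {e_form:.3f} hartree")
```

Subtracting the per-atom terms removes the large composition-dependent offset, which is presumably why the datasets apply the level-of-theory-specific e0 values automatically.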
2 changes: 1 addition & 1 deletion openqdc/datasets/ani.py
@@ -104,7 +104,7 @@ def __smiles_converter__(self, x):

class ANI1X(ANI1):
"""
The ANI-1X dataset consists of ANI-1 molecules + some molecules added using active learning which leads to
The ANI-1X dataset consists of ANI-1 molecules + some molecules added using active learning, which leads to
a total of 5,496,771 conformers with 63,865 unique molecules.
Usage
20 changes: 18 additions & 2 deletions openqdc/datasets/base.py
@@ -1,3 +1,5 @@
"""The BaseDataset defining shared functionality between all datasets."""

import os
import pickle as pkl
from copy import deepcopy
@@ -40,7 +42,7 @@
from openqdc.utils.units import get_conversion


def extract_entry(
def _extract_entry(
df: pd.DataFrame,
i: int,
subset: str,
@@ -73,11 +75,12 @@ def extract_entry(
def read_qc_archive_h5(
raw_path: str, subset: str, energy_target_names: List[str], force_target_names: List[str]
) -> List[Dict[str, np.ndarray]]:
"""Extracts data from the HDF5 archive file."""
data = load_hdf5_file(raw_path)
data_t = {k2: data[k1][k2][:] for k1 in data.keys() for k2 in data[k1].keys()}

n = len(data_t["molecule_id"])
samples = [extract_entry(data_t, i, subset, energy_target_names, force_target_names) for i in tqdm(range(n))]
samples = [_extract_entry(data_t, i, subset, energy_target_names, force_target_names) for i in tqdm(range(n))]
return samples


@@ -108,6 +111,19 @@ def __init__(
overwrite_local_cache: bool = False,
cache_dir: Optional[str] = None,
) -> None:
"""
Parameters
----------
energy_unit
Energy unit to convert dataset to. Supported units: ["kcal/mol", "kj/mol", "hartree", "ev"]
distance_unit
Distance unit to convert dataset to. Supported units: ["ang", "nm", "bohr"]
overwrite_local_cache
Whether to overwrite the locally cached dataset.
cache_dir
Cache directory location. Defaults to "~/.cache/openqdc"
"""
set_cache_dir(cache_dir)
self.data = None
if not self.is_preprocessed():
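
The `energy_unit` and `distance_unit` parameters documented above accept the unit strings listed in the docstring. The following standalone sketch shows what such a conversion amounts to. It is an assumption-laden illustration and not the implementation of `openqdc.utils.units.get_conversion`: the helper names and dictionaries are hypothetical, and the factors are rounded physical constants.

```python
# Conversion factors relative to one reference unit per quantity.
# Rounded physical constants; names below are illustrative, not openqdc API.
ENERGY_PER_HARTREE = {
    "hartree": 1.0,
    "ev": 27.211386,
    "kcal/mol": 627.509474,
    "kj/mol": 2625.499639,
}
ANGSTROM_PER_UNIT = {"ang": 1.0, "nm": 10.0, "bohr": 0.529177}


def convert_energy(value: float, src: str, dst: str) -> float:
    """Convert an energy by going through hartree as the reference unit."""
    return value / ENERGY_PER_HARTREE[src.lower()] * ENERGY_PER_HARTREE[dst.lower()]


def convert_distance(value: float, src: str, dst: str) -> float:
    """Convert a distance by going through angstrom as the reference unit."""
    return value * ANGSTROM_PER_UNIT[src.lower()] / ANGSTROM_PER_UNIT[dst.lower()]


print(convert_energy(1.0, "hartree", "kcal/mol"))
print(convert_distance(1.0, "nm", "ang"))
```

Routing every conversion through a single reference unit keeps the table linear in the number of units instead of needing one factor per pair.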
4 changes: 2 additions & 2 deletions openqdc/datasets/iso_17.py
@@ -6,8 +6,8 @@
class ISO17(BaseDataset):
"""
ISO17 dataset consists of the largest set of isomers from the QM9 dataset that consists of a fixed
composition of atoms (C7O2H10) arranged in different chemically valid structures. It consists of consist
of 129 molecules each containing 5,000 conformational geometries, energies and forces with a resolution
composition of atoms (C7O2H10) arranged in different chemically valid structures. It consists
of 129 molecules, each containing 5,000 conformational geometries, energies and forces with a resolution
of 1 femtosecond in the molecular dynamics trajectories. The simulations were carried out using the
Perdew-Burke-Ernzerhof (PBE) functional and the Tkatchenko-Scheffler (TS) van der Waals correction method.
2 changes: 1 addition & 1 deletion openqdc/datasets/molecule3d.py
@@ -67,7 +67,7 @@ def _read_sdf(sdf_path: str, properties_path: str) -> List[Dict[str, np.ndarray]
class Molecule3D(BaseDataset):
"""
Molecule3D dataset consists of 3,899,647 molecules with ground state geometries and energies
calculated at B3LYP/6-31G* level of theory. The molecules are extracted from the
calculated at the B3LYP/6-31G* level of theory. The molecules are extracted from the
PubChem database and cleaned by removing invalid molecule files.
Usage:
19 changes: 19 additions & 0 deletions openqdc/datasets/qm7x.py
@@ -33,6 +33,25 @@ def read_mol(mol_h5, mol_name, energy_target_names, force_target_names):


class QM7X(BaseDataset):
"""
QM7X is a collection of almost 4.2 million conformers from 6,950 unique molecules. It contains DFT
energy and force labels at the PBE0+MBD level of theory. It consists of structures for molecules with
up to seven heavy (C, N, O, S, Cl) atoms from the GDB13 database. For each molecule, (meta-)stable
equilibrium structures including constitutional/structural isomers and stereoisomers are
searched using density-functional tight binding (DFTB). Then, for each (meta-)stable structure, 100
off-equilibrium structures are obtained and labeled with PBE0+MBD.
Usage:
```python
from openqdc.datasets import QM7X
dataset = QM7X()
```
References:
- https://arxiv.org/abs/2006.15139
- https://zenodo.org/records/4288677
"""

__name__ = "qm7x"

__energy_methods__ = ["pbe0/mbd", "dft3b"]
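
The QM7X docstring's figures can be cross-checked with a little arithmetic. This sketch assumes, beyond what the docstring states, that each (meta-)stable structure is counted once alongside its 100 off-equilibrium structures; the total conformer count is the one given in the README table.

```python
N_MOLECULES = 6_950         # unique molecules (docstring and README)
N_CONFORMERS = 4_195_237    # total conformers (README table)
OFF_EQ_PER_STRUCTURE = 100  # off-equilibrium structures per stable structure (docstring)

# Assumption: every (meta-)stable structure contributes itself plus its
# 100 off-equilibrium structures, i.e. 101 conformers in total.
n_stable, remainder = divmod(N_CONFORMERS, OFF_EQ_PER_STRUCTURE + 1)

print(n_stable, remainder)
print(f"{n_stable / N_MOLECULES:.2f} stable structures per molecule on average")
```

Under that assumption the total divides evenly, giving roughly six (meta-)stable structures per molecule.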
41 changes: 17 additions & 24 deletions openqdc/datasets/sn2_rxn.py
@@ -4,6 +4,23 @@


class SN2RXN(BaseDataset):
"""
This dataset probes chemical reactions of methyl halides with halide anions, i.e.
X- + CH3Y -> CH3X + Y-, and contains structures for all possible combinations of
X,Y = F, Cl, Br, I. It contains energy and forces for 452,709 conformations calculated
at the DSD-BLYP-D3(BJ)/def2-TZVP level of theory.
Usage:
```python
from openqdc.datasets import SN2RXN
dataset = SN2RXN()
```
References:
- https://doi.org/10.1021/acs.jctc.9b00181
- https://zenodo.org/records/2605341
"""

__name__ = "sn2_rxn"

__energy_methods__ = [
@@ -33,30 +50,6 @@ def __smiles_converter__(self, x):

def read_raw_entries(self):
raw_path = p_join(self.root, "sn2_rxn.h5")

# raw_path = p_join(self.root, "sn2_reactions.npz")
# data = np.load(raw_path)

# # as example for accessing individual entries, print the data for entry idx=0
# idx = 0
# print("Data for entry " + str(idx)+":")
# print("Number of atoms")
# print(data["N"][idx])
# print("Energy [eV]")
# print(data["E"][idx])
# print("Total charge")
# print(data["Q"][idx])
# print("Dipole moment vector (with respect to [0.0 0.0 0.0]) [eA]")
# print(data["D"][idx,:])
# print("Nuclear charges")
# print(data["Z"][idx,:data["N"][idx]])
# print("Cartesian coordinates [A]")
# print(data["R"][idx,:data["N"][idx],:])
# print("Forces [eV/A]")
# print(data["F"][idx,:data["N"][idx],:])

# exit()

samples = read_qc_archive_h5(raw_path, "sn2_rxn", self.energy_target_names, self.force_target_names)

return samples
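
The SN2RXN docstring describes reactions X- + CH3Y -> CH3X + Y- for all combinations of X,Y = F, Cl, Br, I. As a small illustration of that reaction space, the sketch below enumerates the ordered (X, Y) pairs. The function name `sn2_reactions` is hypothetical, and the enumeration says nothing about how the dataset itself groups or names its structures.

```python
from itertools import product

HALOGENS = ("F", "Cl", "Br", "I")


def sn2_reactions(halogens=HALOGENS):
    """Return 'X- + CH3Y -> CH3X + Y-' strings for every ordered (X, Y) pair."""
    return [f"{x}- + CH3{y} -> CH3{x} + {y}-" for x, y in product(halogens, repeat=2)]


reactions = sn2_reactions()
print(len(reactions))  # number of ordered (X, Y) pairs
print(reactions[1])
```

Pairs with X equal to Y are included here because the docstring says "all possible combinations"; they correspond to identity exchange reactions.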
17 changes: 17 additions & 0 deletions openqdc/datasets/solvated_peptides.py
@@ -4,6 +4,23 @@


class SolvatedPeptides(BaseDataset):
"""
The solvated protein fragments dataset probes many-body intermolecular
interactions between "protein fragments" and water molecules.
It contains energy and forces for 2,731,180 structures calculated
at the revPBE-D3(BJ)/def2-TZVP level of theory.
Usage:
```python
from openqdc.datasets import SolvatedPeptides
dataset = SolvatedPeptides()
```
References:
- https://doi.org/10.1021/acs.jctc.9b00181
- https://zenodo.org/records/2605372
"""

__name__ = "solvated_peptides"

__energy_methods__ = [
4 changes: 2 additions & 2 deletions openqdc/datasets/spice.py
@@ -34,9 +34,9 @@ def read_record(r):

class Spice(BaseDataset):
"""
Spice Dataset consists of 1.1 million conformations for a diverse set of 19k unique molecules consisting of
The Spice dataset consists of 1.1 million conformations for a diverse set of 19k unique molecules consisting of
small molecules, dimers, dipeptides, and solvated amino acids. It consists of both forces and energies calculated
at {\omega}B97M-D3(BJ)/def2-TZVPPD level of theory.
at the {\omega}B97M-D3(BJ)/def2-TZVPPD level of theory.
Usage:
```python
17 changes: 17 additions & 0 deletions openqdc/datasets/tmqm.py
@@ -45,6 +45,23 @@ def read_xyz(fname, e_map):


class TMQM(BaseDataset):
"""
The tmQM dataset contains the geometries of a large transition metal-organic
compound space with a large variety of organic ligands and 30 transition metals.
It contains energy labels for 86,665 mononuclear complexes calculated
at the TPSSh-D3BJ/def2-SVP DFT level of theory.
Usage:
```python
from openqdc.datasets import TMQM
dataset = TMQM()
```
References:
- https://pubs.acs.org/doi/10.1021/acs.jcim.0c01041
- https://github.com/bbskjelstad/tmqm
"""

__name__ = "tmqm"

__energy_methods__ = ["tpssh/def2-tzvp"]
16 changes: 16 additions & 0 deletions openqdc/datasets/transition1x.py
@@ -37,6 +37,22 @@ def read_record(r, group):


class Transition1X(BaseDataset):
"""
The Transition1x dataset contains structures from 10k organic reaction pathways of various types.
It contains DFT energy and force labels for 9.6 million conformers calculated at the
wB97x/6-31G(d) level of theory.
Usage:
```python
from openqdc.datasets import Transition1X
dataset = Transition1X()
```
References:
- https://www.nature.com/articles/s41597-022-01870-w
- https://gitlab.com/matschreiner/Transition1x
"""

__name__ = "transition1x"

__energy_methods__ = [
17 changes: 17 additions & 0 deletions openqdc/datasets/waterclusters3_30.py
@@ -49,6 +49,23 @@ def read_xyz(fname, n_waters):


class WaterClusters(BaseDataset):
"""
The WaterClusters dataset contains putative minima and low energy networks for water
clusters of sizes n = 3 - 30. The cluster structures are derived and labeled with
the TTM2.1-F ab-initio based interaction potential for water.
It contains approximately 4.5 million structures.
Usage:
```python
from openqdc.datasets import WaterClusters
dataset = WaterClusters()
```
References:
- https://doi.org/10.1063/1.5128378
- https://sites.uw.edu/wdbase/database-of-water-clusters/
"""

__name__ = "waterclusters3_30"

# Energy in hartree, all zeros by default
2 changes: 2 additions & 0 deletions openqdc/raws/pubchemqc.py
@@ -1,3 +1,5 @@
"""Download functionality for PubChemQC."""

import hashlib
import os
import pickle as pkl
2 changes: 2 additions & 0 deletions openqdc/utils/atomization_energies.py
@@ -1,3 +1,5 @@
"""Look-up tables for isolated atom energies."""

from typing import Dict, Tuple

import numpy as np