Release #76

Merged
merged 160 commits into from
Jun 8, 2024
9a11358
WIP
Mar 22, 2024
9745b4c
WIP
Mar 22, 2024
8835b8f
Moved function to utils.io
Mar 22, 2024
9a2e1d8
Additional dependencies
Mar 22, 2024
6dd9be6
Removed Mixin
Mar 22, 2024
78c9f15
WIP Preprocess + property
Mar 22, 2024
55a6692
Refactored Descriptor class
Mar 22, 2024
5b6ec7e
Removed not used funcs
Mar 22, 2024
cbb494c
Added support to get torch or jax tensors
Mar 24, 2024
e22651f
Added support to get torch or jax tensors
Mar 24, 2024
0f59a04
undo change
Mar 24, 2024
29929cf
Added transform functionality test
Mar 24, 2024
736c12a
Added ACSF
Mar 25, 2024
73a3443
Conversion dataset to extxyz
Mar 25, 2024
1c77666
Updated code based on comments
Mar 25, 2024
327e363
Added list check from convert dict keys
Mar 25, 2024
5c16f79
First implementation XYZ reader
Mar 25, 2024
8503b98
Import fix
Mar 25, 2024
5139a57
test skipping if package noot present
Mar 25, 2024
3fb274c
XYZDataset, FromFileDataset, Write interaction xyz
Mar 25, 2024
3b4823e
Fixes
Mar 25, 2024
356adb1
Many-body Tensor Representation
Mar 25, 2024
20caa4f
Included dscribe in main deps, removed torch and jax
Mar 25, 2024
38675f3
wip
Mar 26, 2024
4dd2af8
MTBR incorrect signature Fix (Thanks Danny)
Mar 26, 2024
fa141f5
Added jax/tensor support to interaction datasets
Mar 26, 2024
31beb71
refactor interaction and initial testing
Mar 26, 2024
dccf676
minor changes
Mar 26, 2024
2ab64aa
dummy modification
Mar 26, 2024
5197e32
removed redundant line
Mar 26, 2024
d10b15b
minor change
Mar 26, 2024
cb30c16
State Manager and Chain of Management
Mar 27, 2024
ce8e2b5
Adressed comments + xyz file tests
Mar 27, 2024
dc74dd6
create method enums and refactor the rest accordingly
prtos Mar 28, 2024
c34e53f
adding new files
prtos Mar 28, 2024
ef4581d
fix import issues and update API doc files
prtos Mar 28, 2024
571b706
WIP
Mar 28, 2024
e6d51b7
Separated Statistics and Atom Energies
Mar 28, 2024
05d39c9
ATOM_TABLE import fix
prtos Mar 28, 2024
189ab90
undo changes in interaction dataset, and minor change in shape
Mar 29, 2024
5ee18c2
QmInteractionMethod added
prtos Mar 29, 2024
69c081f
rename QmMethod classes
prtos Mar 29, 2024
7a20193
WIP
Mar 30, 2024
b97cf25
adding a missing file
prtos Apr 1, 2024
4eff5f4
pre-commit
Apr 1, 2024
282dc91
changed super class to BaseInteractionDataset
Apr 2, 2024
cef2b35
Parallelized function, better calculate specs, docstrings
Apr 2, 2024
dbdd985
descriptor tests
Apr 2, 2024
c7ab437
Regression + Statistics abstraction
Apr 2, 2024
5cebb80
Docs+names
Apr 2, 2024
e3c4fcf
added correction method
prtos Apr 2, 2024
fe4b86a
Regressor tests
Apr 2, 2024
b968257
Fixed test
Apr 2, 2024
096b1a0
:/
Apr 2, 2024
c0b6131
Except linalgerror
Apr 2, 2024
a0ad3ab
Merge branch 'refactoring_by_P' of https://github.com/OpenDrugDiscove…
prtos Apr 2, 2024
fe2d9bd
added correction method
prtos Apr 2, 2024
496267f
fixed pre-commits
Apr 2, 2024
b4bb0f3
missing import
Apr 2, 2024
085b933
str fix in cli
Apr 2, 2024
5fe28ce
updated python version since strenum from py>3.11
Apr 2, 2024
05a6dbf
py3.8 compatibility, manual fixes to atom energies
Apr 3, 2024
65a8a12
pkgutils
Apr 3, 2024
3b4199f
some debugging
Apr 3, 2024
2513695
Merge pull request #75 from OpenDrugDiscovery/corrections
FNTwin Apr 3, 2024
6eabc0d
Merge branch 'refactor' into pattern
Apr 3, 2024
2f6eddb
Solved merged issues, added NONE PotentialMethod
Apr 3, 2024
9574974
Merge pull request #72 from OpenDrugDiscovery/refactoring_by_P
FNTwin Apr 3, 2024
701ef1e
Merge branch 'release' into testing
Apr 3, 2024
afea053
further simplified and rebase
Apr 3, 2024
34922c3
Adressed comments, fixed NullEnergy e0s_matrix
Apr 4, 2024
71113c0
Updated stats vector shape to atleast_2d
Apr 4, 2024
955e787
Solved merge issues, added some fixes
Apr 4, 2024
1f9bb94
fixes to xyz
Apr 4, 2024
41a52d6
Added log message
Apr 4, 2024
247b0e1
Merge pull request #74 from OpenDrugDiscovery/pattern
FNTwin Apr 4, 2024
e331cc6
Merge branch 'release' into dataloader
Apr 4, 2024
1c10566
Updated array stuff for xyz dataset
Apr 4, 2024
f1769b3
Merge branch 'release' into dataloader
Apr 4, 2024
ba22ee1
fix bug during rebase and tests
Apr 4, 2024
ac593e3
array test debug
Apr 4, 2024
6f0d46f
undo test change and reset state
Apr 4, 2024
a8d0016
cleaner variant
Apr 4, 2024
ac299a5
Merge pull request #55 from OpenDrugDiscovery/dataloader
shenoynikhil Apr 5, 2024
40d900d
simplified component-wise-force stats calculation and bug-fix
Apr 5, 2024
a21cb18
Loading stats with the right format
Apr 5, 2024
23f8a8b
Bug fix in convert_array for interaction
Apr 5, 2024
d5a139b
Better stats conversion, fixed a reference leak
FNTwin Apr 5, 2024
908ec35
Test dataset
FNTwin Apr 5, 2024
55d9e68
Merge branch 'simplify' of https://github.com/OpenDrugDiscovery/openQ…
FNTwin Apr 5, 2024
ebc2adf
fixes
Apr 5, 2024
7ffd0b1
Merge remote-tracking branch 'origin/release' into testing
Apr 5, 2024
a9c8f66
removed ravel
FNTwin Apr 5, 2024
a71f4d7
Merge pull request #78 from OpenDrugDiscovery/simplify
FNTwin Apr 5, 2024
d15e9cf
Merge remote-tracking branch 'origin/release' into testing
Apr 5, 2024
ed8e264
Updated metcalf
Apr 5, 2024
18bc79c
bug fix and simplifying interaction dataset
Apr 6, 2024
2a6e3ef
Updated tests for interaction datasets
Apr 6, 2024
7493273
removed stale stats in dummy interaction
Apr 6, 2024
ed73e7d
changes based on comments
Apr 6, 2024
0359022
Clean metcalf
FNTwin Apr 6, 2024
33fa342
Simplification
FNTwin Apr 6, 2024
cd486a8
cleaned des
FNTwin Apr 6, 2024
80d7371
Simplified des dataset
FNTwin Apr 6, 2024
f3d205c
removed redundant dataset files
FNTwin Apr 6, 2024
da4fece
DES inerithance
FNTwin Apr 6, 2024
71ff741
Removed des and improved des naming
FNTwin Apr 6, 2024
f6e12e1
DES fixes
FNTwin Apr 6, 2024
3328a65
Removed comments
FNTwin Apr 6, 2024
8b28d59
X40 and L70
FNTwin Apr 6, 2024
8595fd8
Safe opening
FNTwin Apr 6, 2024
ca1b4af
Moved X40 in L7 and removed x40.py
FNTwin Apr 6, 2024
4bec82d
Moved Yaml utils to _utils.py, L7 + X40 interface
FNTwin Apr 7, 2024
a5ced0a
Merge testing + Add imports
FNTwin Apr 8, 2024
a21963e
Merge pull request #79 from OpenDrugDiscovery/interaction_impr
shenoynikhil Apr 8, 2024
3303f95
better convert function and n_body_first to ptr
Apr 12, 2024
c8d245f
Preprocess cli + optional upload to preprocess
FNTwin Apr 13, 2024
6f033cf
Updated splinter reading from -1 to nan
Apr 15, 2024
3d81df2
Merge branch 'testing' into local_fetch
Apr 16, 2024
6027ab0
Cli exception
Apr 16, 2024
970082d
Fixes to x40,l7 preproc
Apr 16, 2024
486a59f
atom.txt packaging
Apr 16, 2024
77e5e26
Added init exc F405, F401 to toml
Apr 16, 2024
69df015
Datasets from data generation
Apr 17, 2024
b0d8e0c
Fixes for uploading
Apr 18, 2024
798f861
Append to extend, metcalf
Apr 18, 2024
565dc26
Dummy fix
Apr 18, 2024
dcc1b6b
SpiceVL2
Apr 18, 2024
c88e18e
Merge pull request #81 from OpenDrugDiscovery/spicevl2
FNTwin Apr 18, 2024
7f7b651
WIP float64 conv
Apr 22, 2024
828e765
fix small bug with DES subsets
mcneela Apr 22, 2024
0b9404e
Updated to float64
Apr 22, 2024
95f926d
Interaction float32
Apr 22, 2024
d2cd5be
updated DES dataset subset handling
mcneela Apr 23, 2024
7a82b59
Updated spiceV2 subsets
Apr 23, 2024
82de349
Merge pull request #82 from OpenDrugDiscovery/subset-fix
mcneela Apr 23, 2024
603496b
Updated ani read_raw_energies
Apr 23, 2024
7eb6a1e
Fixes + MD22
Apr 24, 2024
2a292ab
Remove DataConfig WIP
Apr 24, 2024
be82226
Fixed gdml read_raw_entries
Apr 24, 2024
6a65923
Fixed comp6 read_raw_entries
Apr 24, 2024
50823d0
Added links
Apr 24, 2024
22d0bf5
Logging, fixes to qmugs
Apr 24, 2024
051c084
Removed wip files
Apr 26, 2024
703cca3
Removed dataloader, converted mmap test files
Apr 26, 2024
0c3f54a
Merge pull request #83 from OpenDrugDiscovery/float64
FNTwin Apr 26, 2024
cb6dbaf
Conversion of en fixes + datastructure for e0s_dict and retriaval of …
FNTwin May 1, 2024
da767ef
Tests
FNTwin May 1, 2024
b8791d6
Added _original_unit to xyz
FNTwin May 1, 2024
6abe893
Disable mkdocs
May 1, 2024
8d21e15
I hate mkdocs errors
May 1, 2024
46bc652
mkdocs action
May 1, 2024
2f4b692
Docstrings + naming
FNTwin May 2, 2024
dcd1bea
Merge pull request #85 from OpenDrugDiscovery/atom_ener_structure
FNTwin May 2, 2024
2669d21
docstring
FNTwin May 2, 2024
fcaed00
Issue fix
May 6, 2024
34f36b0
fix metcalf dataset energies and reupload
mcneela May 30, 2024
e85839a
Merge pull request #91 from OpenDrugDiscovery/fix-metcalf
mcneela Jun 1, 2024
318e9c5
Merge pull request #84 from OpenDrugDiscovery/downloader
prtos Jun 8, 2024
10dc009
Merge pull request #88 from OpenDrugDiscovery/forcemaskfix
prtos Jun 8, 2024
9 changes: 6 additions & 3 deletions .github/workflows/test.yml
@@ -47,8 +47,11 @@ jobs:
- name: Install library
run: python -m pip install --no-deps .

- name: Check directory
run: ls

- name: Run tests
run: pytest
run: python -m pytest

- name: Test building the doc
run: mkdocs build
#- name: Test building the doc
# run: mkdocs build
5 changes: 5 additions & 0 deletions README.md
@@ -90,3 +90,8 @@ We also provide support for the following publicly available QM Noncovalent Inte
| [Splinter](https://www.nature.com/articles/s41597-023-02443-1) |
| [X40](https://pubs.acs.org/doi/10.1021/ct300647k) |
| [L7](https://pubs.acs.org/doi/10.1021/ct400036b) |

# How to cite
All data presented in OpenQDC have already been published in scientific journals; the full reference to the respective paper is attached to each dataset class. When citing data obtained from OpenQDC, you should cite both the original paper(s) the data come from and our paper on OpenQDC itself. The reference is:

ADD REF HERE LATER
5 changes: 0 additions & 5 deletions docs/API/isolated_atom_energies.md

This file was deleted.

3 changes: 3 additions & 0 deletions docs/API/methods.md
@@ -0,0 +1,3 @@
# QM Methods

::: openqdc.methods
11 changes: 6 additions & 5 deletions docs/tutorials/usage.ipynb
@@ -657,7 +657,7 @@
"\n",
"$U(A_1, A_2, ...) = \\sum_{i=1}^N e_0(A_i) + e(A_1, A_2, ...)$\n",
"\n",
"The isolated atoms energies are automatically used inside the datasets for the correct level of theory, but you can also use them directly by accessing the IsolatedAtomEnergyFactor class."
"The isolated atom energies are automatically associated with the correct level of theory, and you can access them as follows:"
]
},
{
@@ -715,10 +715,11 @@
}
],
"source": [
"from openqdc.utils.atomization_energies import IsolatedAtomEnergyFactory\n",
"from openqdc.methods import QmMethod\n",
"\n",
"# Get the hasmap of isolated atom energies for the b3lyp/6-31g* method\n",
"IsolatedAtomEnergyFactory.get(\"b3lyp/6-31g*\")"
"# Get the b3lyp/6-31g* method\n",
"method = QmMethod.B3LYP_6_31G_D\n",
"method.atom_energies_dict"
]
},
{
@@ -745,7 +746,7 @@
],
"source": [
"# Get the matrix of atomization energies for the b3lyp/6-31g* method\n",
"IsolatedAtomEnergyFactory.get_matrix(\"b3lyp/6-31g*\")"
"method.atom_energies_matrix"
]
},
{
Expand Down
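The notebook change above swaps a string-keyed factory lookup for attributes read off an enum member. The pattern can be sketched as follows. This is not OpenQDC's implementation: `DemoMethod`, `_DEMO_E0S`, the level names, and the energy values are all placeholders invented for illustration.

```python
from enum import Enum

# Placeholder values for illustration only; these are NOT real isolated atom energies.
_DEMO_E0S = {
    "demo/level-a": {("H", 0): -0.5, ("C", 0): -37.8},
    "demo/level-b": {("H", 0): -0.4},
}


class DemoMethod(Enum):
    """Hypothetical stand-in for an enum of QM methods."""

    LEVEL_A = "demo/level-a"
    LEVEL_B = "demo/level-b"

    @property
    def atom_energies_dict(self):
        # Look up the energies tied to this member's level of theory.
        return _DEMO_E0S[self.value]


method = DemoMethod.LEVEL_A
print(method.atom_energies_dict[("H", 0)])
```

One likely benefit of this design is that a misspelled method name fails at attribute access on the enum rather than deep inside a string lookup.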
4 changes: 2 additions & 2 deletions mkdocs.yml
@@ -12,8 +12,8 @@ docs_dir: "docs"
nav:
- Overview: index.md
- Available Datasets: datasets.md
- Tutorials:
- Really hard example: tutorials/usage.ipynb
#- Tutorials:
# #- Really hard example: tutorials/usage.ipynb
- API:
- Datasets: API/available_datasets.md
- Isolated Atoms Energies: API/isolated_atom_energies.md
Expand Down
80 changes: 55 additions & 25 deletions openqdc/__init__.py
@@ -1,39 +1,61 @@
import importlib
import os
from typing import TYPE_CHECKING # noqa F401
from typing import TYPE_CHECKING

# The below lazy import logic is coming from openff-toolkit:
# https://github.com/openforcefield/openff-toolkit/blob/b52879569a0344878c40248ceb3bd0f90348076a/openff/toolkit/__init__.py#L44


# Dictionary of objects to lazily import; maps the object's name to its module path
def get_project_root():
return os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


_lazy_imports_obj = {
"__version__": "openqdc._version",
"BaseDataset": "openqdc.datasets.base",
# POTENTIAL
"ANI1": "openqdc.datasets.potential.ani",
"ANI1CCX": "openqdc.datasets.potential.ani",
"ANI1CCX_V2": "openqdc.datasets.potential.ani",
"ANI1X": "openqdc.datasets.potential.ani",
"Spice": "openqdc.datasets.potential.spice",
"SpiceV2": "openqdc.datasets.potential.spice",
"SpiceVL2": "openqdc.datasets.potential.spice",
"GEOM": "openqdc.datasets.potential.geom",
"QMugs": "openqdc.datasets.potential.qmugs",
"QMugs_V2": "openqdc.datasets.potential.qmugs",
"ISO17": "openqdc.datasets.potential.iso_17",
"COMP6": "openqdc.datasets.potential.comp6",
"GDML": "openqdc.datasets.potential.gdml",
"Molecule3D": "openqdc.datasets.potential.molecule3d",
"OrbnetDenali": "openqdc.datasets.potential.orbnet_denali",
"SN2RXN": "openqdc.datasets.potential.sn2_rxn",
"QM7X": "openqdc.datasets.potential.qm7x",
"QM7X_V2": "openqdc.datasets.potential.qm7x",
"NablaDFT": "openqdc.datasets.potential.nabladft",
"SolvatedPeptides": "openqdc.datasets.potential.solvated_peptides",
"WaterClusters": "openqdc.datasets.potential.waterclusters3_30",
"TMQM": "openqdc.datasets.potential.tmqm",
"Dummy": "openqdc.datasets.potential.dummy",
"PCQM_B3LYP": "openqdc.datasets.potential.pcqm",
"PCQM_PM6": "openqdc.datasets.potential.pcqm",
"RevMD17": "openqdc.datasets.potential.revmd17",
"MD22": "openqdc.datasets.potential.md22",
"Transition1X": "openqdc.datasets.potential.transition1x",
"MultixcQM9": "openqdc.datasets.potential.multixcqm9",
"MultixcQM9_V2": "openqdc.datasets.potential.multixcqm9",
# INTERACTION
"DES5M": "openqdc.datasets.interaction.des",
"DES370K": "openqdc.datasets.interaction.des",
"DESS66": "openqdc.datasets.interaction.des",
"DESS66x8": "openqdc.datasets.interaction.des",
"L7": "openqdc.datasets.interaction.l7",
"X40": "openqdc.datasets.interaction.x40",
"Metcalf": "openqdc.datasets.interaction.metcalf",
"Splinter": "openqdc.datasets.interaction.splinter",
# DEBUG
"Dummy": "openqdc.datasets.potential.dummy",
# ALL
"AVAILABLE_DATASETS": "openqdc.datasets",
"AVAILABLE_POTENTIAL_DATASETS": "openqdc.datasets.potential",
"AVAILABLE_INTERACTION_DATASETS": "openqdc.datasets.interaction",
@@ -68,26 +90,34 @@ def __dir__():
if TYPE_CHECKING or os.environ.get("OPENQDC_DISABLE_LAZY_LOADING", "0") == "1":
# These types are imported lazily at runtime, but we need to tell type
# checkers what they are.
from ._version import __version__ # noqa
from .datasets import AVAILABLE_DATASETS # noqa
from .datasets.base import BaseDataset # noqa
from .datasets.potential.ani import ANI1, ANI1CCX, ANI1X # noqa
from .datasets.potential.comp6 import COMP6 # noqa
from .datasets.potential.dummy import Dummy # noqa
from .datasets.potential.gdml import GDML # noqa
from .datasets.potential.geom import GEOM # noqa
from .datasets.potential.iso_17 import ISO17 # noqa
from .datasets.potential.molecule3d import Molecule3D # noqa
from .datasets.potential.multixcqm9 import MultixcQM9 # noqa
from .datasets.potential.nabladft import NablaDFT # noqa
from .datasets.potential.orbnet_denali import OrbnetDenali # noqa
from .datasets.potential.pcqm import PCQM_B3LYP, PCQM_PM6 # noqa
from .datasets.potential.qm7x import QM7X # noqa
from .datasets.potential.qmugs import QMugs # noqa
from .datasets.potential.revmd17 import RevMD17 # noqa
from .datasets.potential.sn2_rxn import SN2RXN # noqa
from .datasets.potential.solvated_peptides import SolvatedPeptides # noqa
from .datasets.potential.spice import Spice, SpiceV2 # noqa
from .datasets.potential.tmqm import TMQM # noqa
from .datasets.potential.transition1x import Transition1X # noqa
from .datasets.potential.waterclusters3_30 import WaterClusters # noqa
from ._version import __version__
from .datasets import AVAILABLE_DATASETS
from .datasets.base import BaseDataset

# INTERACTION
from .datasets.interaction.des import DES5M, DES370K, DESS66, DESS66x8
from .datasets.interaction.l7 import L7
from .datasets.interaction.metcalf import Metcalf
from .datasets.interaction.splinter import Splinter
from .datasets.interaction.x40 import X40
from .datasets.potential.ani import ANI1, ANI1CCX, ANI1CCX_V2, ANI1X
from .datasets.potential.comp6 import COMP6
from .datasets.potential.dummy import Dummy
from .datasets.potential.gdml import GDML
from .datasets.potential.geom import GEOM
from .datasets.potential.iso_17 import ISO17
from .datasets.potential.md22 import MD22
from .datasets.potential.molecule3d import Molecule3D
from .datasets.potential.multixcqm9 import MultixcQM9, MultixcQM9_V2
from .datasets.potential.nabladft import NablaDFT
from .datasets.potential.orbnet_denali import OrbnetDenali
from .datasets.potential.pcqm import PCQM_B3LYP, PCQM_PM6
from .datasets.potential.qm7x import QM7X, QM7X_V2
from .datasets.potential.qmugs import QMugs, QMugs_V2
from .datasets.potential.revmd17 import RevMD17
from .datasets.potential.sn2_rxn import SN2RXN
from .datasets.potential.solvated_peptides import SolvatedPeptides
from .datasets.potential.spice import Spice, SpiceV2, SpiceVL2
from .datasets.potential.tmqm import TMQM
from .datasets.potential.transition1x import Transition1X
from .datasets.potential.waterclusters3_30 import WaterClusters
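The `_lazy_imports_obj` table above maps each public name to the module that defines it. The lookup code itself is not part of the visible hunks, so the following is only a sketch of how such a table is commonly consumed, using the module-level `__getattr__` hook from PEP 562. The mapping here points at standard-library modules, and `lazy_demo` is a throwaway module created for the example.

```python
import importlib
import types

# Hypothetical mapping for illustration: object name -> module path (stdlib only).
_lazy_imports_obj = {
    "sqrt": "math",
    "OrderedDict": "collections",
}


def _lazy_getattr(name):
    """Import the owning module on first access and return the requested object."""
    module_path = _lazy_imports_obj.get(name)
    if module_path is None:
        raise AttributeError(f"module has no attribute {name!r}")
    module = importlib.import_module(module_path)
    return getattr(module, name)


# PEP 562: a module-level __getattr__ is consulted when normal lookup fails.
demo = types.ModuleType("lazy_demo")
demo.__getattr__ = _lazy_getattr

print(demo.sqrt(9.0))
```

In a real package the function would simply be named `__getattr__` at the top of `__init__.py`; the `types.ModuleType` object is used here only to keep the example self-contained.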
92 changes: 77 additions & 15 deletions openqdc/cli.py
@@ -3,11 +3,15 @@
import typer
from loguru import logger
from prettytable import PrettyTable
from rich import print
from typing_extensions import Annotated

from openqdc import AVAILABLE_DATASETS, AVAILABLE_POTENTIAL_DATASETS
from openqdc.raws.config_factory import DataConfigFactory
from openqdc.raws.fetch import DataDownloader
from openqdc.datasets import COMMON_MAP_POTENTIALS # noqa
from openqdc.datasets import (
AVAILABLE_DATASETS,
AVAILABLE_INTERACTION_DATASETS,
AVAILABLE_POTENTIAL_DATASETS,
)

app = typer.Typer(help="OpenQDC CLI")

@@ -20,10 +24,12 @@ def exist_dataset(dataset):


def format_entry(empty_dataset):
if len(empty_dataset.__energy_methods__) > 10:
entry = ",".join(empty_dataset.__energy_methods__[:10]) + "..."
energy_methods = [str(x) for x in empty_dataset.__energy_methods__]
max_num_to_display = 6
if len(energy_methods) > 6:
entry = ",".join(energy_methods[:max_num_to_display]) + "..."
else:
entry = ",".join(empty_dataset.__energy_methods__[:10])
entry = ",".join(energy_methods[:max_num_to_display])
return entry
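Review note on `format_entry`: the new version compares against a literal `6` while slicing with `max_num_to_display`, so the two could drift apart if only one is edited. A standalone variant that uses a single parameter for both might look like this. It takes a plain list rather than a dataset object, which is a simplification made for the example.

```python
def format_entry(energy_methods, max_num_to_display=6):
    """Join at most `max_num_to_display` names, adding "..." when some are dropped."""
    names = [str(x) for x in energy_methods]
    entry = ",".join(names[:max_num_to_display])
    if len(names) > max_num_to_display:
        entry += "..."
    return entry


print(format_entry(["a", "b", "c"]))
print(format_entry(list("abcdefgh")))
```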


@@ -65,7 +71,7 @@ def datasets():
table = PrettyTable(["Name", "Type of Energy", "Forces", "Level of theory"])
for dataset in AVAILABLE_DATASETS:
empty_dataset = AVAILABLE_DATASETS[dataset].no_init()
has_forces = False if not empty_dataset.__force_methods__ else True
has_forces = False if not any(empty_dataset.force_mask) else True
en_type = "Potential" if dataset in AVAILABLE_POTENTIAL_DATASETS else "Interaction"
table.add_row(
[
@@ -80,22 +86,78 @@ def datasets():


@app.command()
def fetch(datasets: List[str]):
def fetch(
datasets: List[str],
overwrite: Annotated[
bool,
typer.Option(
help="Whether to overwrite or force the re-download of the files.",
),
] = False,
cache_dir: Annotated[
Optional[str],
typer.Option(
help="Path to the cache. If not provided, the default cache directory (.cache/openqdc/) will be used.",
),
] = None,
):
"""
Download the raw datasets files from the main openQDC hub.
Special case: if the dataset is "all", all available datasets will be downloaded.

overwrite: bool = False,
If True, the files will be re-downloaded and overwritten.
cache_dir: Optional[str] = None,
Path to the cache. If not provided, the default cache directory will be used.
Special case: if the dataset is "all", "potential", "interaction".
all: all available datasets will be downloaded.
potential: all the potential datasets will be downloaded
interaction: all the interaction datasets will be downloaded
Example:
openqdc fetch Spice
"""
if datasets[0] == "all":
dataset_names = DataConfigFactory.available_datasets
if datasets[0].lower() == "all":
dataset_names = AVAILABLE_DATASETS
elif datasets[0].lower() == "potential":
dataset_names = AVAILABLE_POTENTIAL_DATASETS
elif datasets[0].lower() == "interaction":
dataset_names = AVAILABLE_INTERACTION_DATASETS
else:
dataset_names = datasets

for dataset_name in dataset_names:
dd = DataDownloader()
dd.from_name(dataset_name)
for dataset in list(map(lambda x: x.lower().replace("_", ""), dataset_names)):
if exist_dataset(dataset):
try:
AVAILABLE_DATASETS[dataset].fetch(cache_dir, overwrite)
except Exception as e:
logger.error(f"Something unexpected happened while fetching {dataset}: {repr(e)}")


@app.command()
def preprocess(
datasets: List[str],
overwrite: Annotated[
bool,
typer.Option(
help="Whether to overwrite or force the re-download of the datasets.",
),
] = True,
upload: Annotated[
bool,
typer.Option(
help="Whether to try the upload to the remote storage.",
),
] = False,
):
"""
Preprocess a raw dataset (previously fetched) into an openqdc dataset and optionally push it to remote.
"""
for dataset in list(map(lambda x: x.lower().replace("_", ""), datasets)):
if exist_dataset(dataset):
logger.info(f"Preprocessing {AVAILABLE_DATASETS[dataset].__name__}")
try:
AVAILABLE_DATASETS[dataset].no_init().preprocess(upload=upload, overwrite=overwrite)
except Exception as e:
logger.error(f"Error while preprocessing {dataset}. {e}. Did you fetch the dataset first?")
raise e


if __name__ == "__main__":
Expand Down
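The `fetch` command above treats its first argument as an optional group selector ("all", "potential", "interaction") and then normalizes every name by lower-casing it and removing underscores. That resolution step can be isolated into a small helper, sketched below. `resolve_names`, `normalize`, and the two registries are hypothetical; the real registries map keys to dataset classes.

```python
def normalize(name):
    """Lower-case a dataset name and drop underscores, as the CLI loop does."""
    return name.lower().replace("_", "")


def resolve_names(requested, potential, interaction):
    """Expand "all", "potential", or "interaction" into concrete dataset keys."""
    groups = {
        "all": {**potential, **interaction},
        "potential": potential,
        "interaction": interaction,
    }
    selector = requested[0].lower()
    names = groups[selector] if selector in groups else requested
    return [normalize(n) for n in names]


# Hypothetical registries for illustration.
potential = {"spice": object, "qm7xv2": object}
interaction = {"des370k": object}

print(resolve_names(["all"], potential, interaction))
print(resolve_names(["QM7X_V2", "Spice"], potential, interaction))
```

Keeping the resolution separate from the download loop makes the selector logic easy to unit-test without touching the network.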
28 changes: 26 additions & 2 deletions openqdc/datasets/__init__.py
@@ -1,4 +1,28 @@
from .interaction import AVAILABLE_INTERACTION_DATASETS # noqa
from .potential import AVAILABLE_POTENTIAL_DATASETS # noqa
from .interaction import *
from .potential import *

AVAILABLE_DATASETS = {**AVAILABLE_POTENTIAL_DATASETS, **AVAILABLE_INTERACTION_DATASETS}


def _level_of_theory_overlap(dataset_collection):
import itertools
from itertools import groupby

dataset_map = {}
for dataset in dataset_collection:
dataset_map[dataset.lower().replace("_", "")] = dataset_collection[dataset].no_init().energy_methods

common_values_dict = {}

for key, values in dataset_map.items():
for value in values:
if value in common_values_dict:
common_values_dict[value].append(key)
else:
common_values_dict[value] = [key]

return dict(filter(lambda x: len(x[1]) > 1, common_values_dict.items()))


COMMON_MAP_POTENTIALS = _level_of_theory_overlap(AVAILABLE_POTENTIAL_DATASETS)
COMMON_MAP_INTERACTIONS = _level_of_theory_overlap(AVAILABLE_INTERACTION_DATASETS)
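`_level_of_theory_overlap` builds an inverted index from energy method to the datasets that provide it, then keeps only methods shared by at least two datasets. The two `itertools` imports in the lines shown do not appear to be used. A compact equivalent, assuming the input is already a plain mapping from dataset name to a list of method labels (the names below are made up), could be:

```python
from collections import defaultdict


def level_of_theory_overlap(methods_by_dataset):
    """Return {method: [datasets...]} for methods shared by two or more datasets."""
    datasets_by_method = defaultdict(list)
    for dataset, methods in methods_by_dataset.items():
        for method in methods:
            datasets_by_method[method].append(dataset)
    return {m: ds for m, ds in datasets_by_method.items() if len(ds) > 1}


# Hypothetical input: dataset name -> list of energy-method labels.
example = {
    "set_a": ["method_x", "method_y"],
    "set_b": ["method_y"],
    "set_c": ["method_y", "method_z"],
}

print(level_of_theory_overlap(example))
```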