Add preprocessing training data #344

Merged
merged 6 commits into from
Nov 14, 2024
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -114,6 +114,7 @@ janus phonons
janus eos
janus train
janus descriptors
janus preprocess
```

For example, a single point calculation (using the [MACE-MP](https://github.com/ACEsuit/mace-mp) "small" force-field) can be performed by running:
20 changes: 20 additions & 0 deletions docs/source/apidoc/janus_core.rst
@@ -146,6 +146,16 @@ janus\_core.cli.phonons module
   :undoc-members:
   :show-inheritance:

janus\_core.cli.preprocess module
---------------------------------

.. automodule:: janus_core.cli.preprocess
   :members:
   :special-members:
   :private-members:
   :undoc-members:
   :show-inheritance:

janus\_core.cli.singlepoint module
----------------------------------

@@ -289,6 +299,16 @@ janus\_core.processing.symmetry module
   :undoc-members:
   :show-inheritance:

janus\_core.training.preprocess module
--------------------------------------

.. automodule:: janus_core.training.preprocess
   :members:
   :special-members:
   :private-members:
   :undoc-members:
   :show-inheritance:

janus\_core.training.train module
---------------------------------

23 changes: 22 additions & 1 deletion docs/source/user_guide/command_line.rst
@@ -428,7 +428,7 @@ Training and fine-tuning MLIPs
------------------------------

.. note::
    Currently only MACE models are supported. See the `MACE CLI <https://github.com/ACEsuit/mace/blob/main/mace/cli/run_train.py>`_ for further configuration details
    Currently only MACE models are supported. See the `MACE run_train CLI <https://github.com/ACEsuit/mace/blob/main/mace/cli/run_train.py>`_ for further configuration details

Models can be trained by passing a configuration file to the MLIP's command line interface:

@@ -446,6 +446,27 @@ Foundational models can also be fine-tuned, by including the ``foundation_model``
    janus train --mlip-config /path/to/fine/tuning/config.yml --fine-tune


Preprocessing training data
----------------------------

.. note::
    Currently only MACE models are supported. See the `MACE preprocess_data CLI <https://github.com/ACEsuit/mace/blob/main/mace/cli/preprocess_data.py>`_ for further configuration details

Large datasets, which may not fit into GPU memory, can be preprocessed,
converting xyz training, test, and validation files into HDF5 files that can then be used for on-line data loading.

This can be done by passing a configuration file to the MLIP's command line interface:

.. code-block:: bash

    janus preprocess --mlip-config /path/to/preprocessing/config.yml

For MACE, this will create separate folders for ``train``, ``val`` and ``test`` HDF5 data files, when relevant,
and save the statistics of your data in ``statistics.json``, if requested.

Additionally, a log file, ``preprocess-log.yml``, and summary file, ``preprocess-summary.yml``, will be generated.
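
When the command is scripted rather than typed, the argument list can be assembled programmatically. The following is a minimal sketch, not part of this PR: `build_preprocess_command` is a hypothetical helper, and the `--no-tracker` flag assumes Typer's default behaviour of exposing boolean options as `--tracker/--no-tracker`.

```python
from __future__ import annotations

import shlex
from pathlib import Path


def build_preprocess_command(
    mlip_config: str | Path,
    log: str | Path | None = None,
    summary: str | Path | None = None,
    tracker: bool = True,
) -> list[str]:
    """Assemble an argument list for ``janus preprocess`` (hypothetical helper)."""
    cmd = ["janus", "preprocess", "--mlip-config", str(mlip_config)]
    if log is not None:
        # Overrides the default log file, preprocess-log.yml
        cmd += ["--log", str(log)]
    if summary is not None:
        # Overrides the default summary file, preprocess-summary.yml
        cmd += ["--summary", str(summary)]
    if not tracker:
        # Assumption: Typer generates --no-tracker for the boolean option
        cmd.append("--no-tracker")
    return cmd


if __name__ == "__main__":
    command = build_preprocess_command("config.yml", log="my-log.yml", tracker=False)
    # Print a shell-quoted version of the command for inspection
    print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run` in an environment where `janus` is installed.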


Calculate descriptors
---------------------

2 changes: 2 additions & 0 deletions janus_core/cli/janus.py
@@ -15,6 +15,7 @@
from janus_core.cli.geomopt import geomopt
from janus_core.cli.md import md
from janus_core.cli.phonons import phonons
from janus_core.cli.preprocess import preprocess
from janus_core.cli.singlepoint import singlepoint
from janus_core.cli.train import train

@@ -30,6 +31,7 @@
app.command(help="Calculate equation of state.")(eos)
app.command(help="Calculate MLIP descriptors.")(descriptors)
app.command(help="Running training for an MLIP.")(train)
app.command(help="Running preprocessing for an MLIP.")(preprocess)


@app.callback(invoke_without_command=True, help="")
74 changes: 74 additions & 0 deletions janus_core/cli/preprocess.py
@@ -0,0 +1,74 @@
# noqa: I002, FA102
"""Set up MLIP preprocessing commandline interface."""

# Issues with future annotations and typer
# c.f. https://github.com/maxb2/typer-config/issues/295
# from __future__ import annotations

from pathlib import Path
from typing import Annotated

from typer import Option, Typer

app = Typer()


@app.command()
def preprocess(
    mlip_config: Annotated[
        Path, Option(help="Configuration file to pass to MLIP CLI.")
    ],
    log: Annotated[Path, Option(help="Path to save logs to.")] = Path(
        "preprocess-log.yml"
    ),
    tracker: Annotated[
        bool, Option(help="Whether to save carbon emissions of calculation")
    ] = True,
    summary: Annotated[
        Path,
        Option(
            help=(
                "Path to save summary of inputs, start/end time, and carbon emissions."
            )
        ),
    ] = Path("preprocess-summary.yml"),
):
    """
    Convert training data to hdf5 by passing a configuration file to the MLIP's CLI.

    Parameters
    ----------
    mlip_config : Path
        Configuration file to pass to MLIP CLI.
    log : Path
        Path to write logs to. Default is Path("preprocess-log.yml").
    tracker : bool
        Whether to save carbon emissions of calculation in log file and summary.
        Default is True.
    summary : Path
        Path to save summary of inputs, start/end time, and carbon emissions. Default
        is Path("preprocess-summary.yml").
    """
    from janus_core.cli.utils import carbon_summary, end_summary, start_summary
    from janus_core.training.preprocess import preprocess as run_preprocess

    inputs = {"mlip_config": str(mlip_config)}

    # Save summary information before preprocessing begins
    start_summary(command="preprocess", summary=summary, inputs=inputs)

    log_kwargs = {"filemode": "w"}
    if log:
        log_kwargs["filename"] = log

    # Run preprocessing
    run_preprocess(
        mlip_config, attach_logger=True, log_kwargs=log_kwargs, track_carbon=tracker
    )

    # Save carbon summary
    if tracker:
        carbon_summary(summary=summary, log=log)

    # Save time after preprocessing has finished
    end_summary(summary)
23 changes: 23 additions & 0 deletions janus_core/helpers/utils.py
@@ -409,3 +409,26 @@ def track_progress(sequence: Sequence | Iterable, description: str) -> Iterable:

    with progress:
        yield from progress.track(sequence, description=description)


def check_files_exist(config: dict, req_file_keys: Sequence[PathLike]) -> None:
    """
    Check files specified in a dictionary read from a configuration file exist.

    Parameters
    ----------
    config : dict
        Dictionary read from configuration file.
    req_file_keys : Sequence[PathLike]
        Files that must exist if defined in the configuration file.

    Raises
    ------
    FileNotFoundError
        If a key from `req_file_keys` is in the configuration file, but the
        file corresponding to the configuration value does not exist.
    """
    for file_key in config.keys() & req_file_keys:
        # Only check if file key is in the configuration file
        if not Path(config[file_key]).exists():
            raise FileNotFoundError(f"{config[file_key]} does not exist")
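
As an illustration of how this helper behaves, here is a minimal standalone sketch. It copies the function body so that it runs without `janus_core`; the temporary file names and the sample configuration values are made up for the example. Keys absent from the configuration are ignored, and only keys that are both requested and defined are checked.

```python
import tempfile
from pathlib import Path


def check_files_exist(config: dict, req_file_keys) -> None:
    """Standalone copy of the helper, for illustration only."""
    # dict.keys() is set-like, so `&` keeps only keys present in both collections
    for file_key in config.keys() & req_file_keys:
        if not Path(config[file_key]).exists():
            raise FileNotFoundError(f"{config[file_key]} does not exist")


keys = ("train_file", "test_file", "valid_file")

with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "train.xyz"  # hypothetical file name
    train.write_text("")

    # "train_file" exists and the other keys are not defined: nothing is raised
    check_files_exist({"train_file": str(train), "r_max": 4.0}, keys)

    # "valid_file" is defined but the file is missing: FileNotFoundError is raised
    try:
        check_files_exist({"valid_file": str(Path(tmp) / "missing.xyz")}, keys)
    except FileNotFoundError as err:
        print(f"Caught: {err}")
```

Because the intersection is a set, the iteration order is not guaranteed. If several files are missing, which one is reported first may vary.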
87 changes: 87 additions & 0 deletions janus_core/training/preprocess.py
@@ -0,0 +1,87 @@
"""Preprocess MLIP training data."""

from __future__ import annotations

from collections.abc import Sequence
from typing import Any

from mace.cli.preprocess_data import run
from mace.tools import build_preprocess_arg_parser as mace_parser
import yaml

from janus_core.helpers.janus_types import PathLike
from janus_core.helpers.log import config_logger, config_tracker
from janus_core.helpers.utils import check_files_exist, none_to_dict


def preprocess(
    mlip_config: PathLike,
    req_file_keys: Sequence[PathLike] = ("train_file", "test_file", "valid_file"),
    attach_logger: bool = False,
    log_kwargs: dict[str, Any] | None = None,
    track_carbon: bool = True,
    tracker_kwargs: dict[str, Any] | None = None,
) -> None:
    """
    Convert training data to hdf5 by passing a configuration file to the MLIP's CLI.

    Currently only supports MACE models, but this can be extended by replacing the
    argument parsing.

    Parameters
    ----------
    mlip_config : PathLike
        Configuration file to pass to MLIP.
    req_file_keys : Sequence[PathLike]
        List of files that must exist if defined in the configuration file.
        Default is ("train_file", "test_file", "valid_file").
    attach_logger : bool
        Whether to attach a logger. Default is False.
    log_kwargs : dict[str, Any] | None
        Keyword arguments to pass to `config_logger`. Default is {}.
    track_carbon : bool
        Whether to track carbon emissions of calculation. Default is True.
    tracker_kwargs : dict[str, Any] | None
        Keyword arguments to pass to `config_tracker`. Default is {}.
    """
    log_kwargs, tracker_kwargs = none_to_dict(log_kwargs, tracker_kwargs)

    # Validate inputs
    with open(mlip_config, encoding="utf8") as file:
        options = yaml.safe_load(file)
    check_files_exist(options, req_file_keys)

    # Configure logging
    if attach_logger:
        log_kwargs.setdefault("filename", "preprocess-log.yml")
    log_kwargs.setdefault("name", __name__)
    logger = config_logger(**log_kwargs)
    tracker = config_tracker(logger, track_carbon, **tracker_kwargs)

    if logger and "foundation_model" in options:
        logger.info("Fine tuning model: %s", options["foundation_model"])

    # Parse options from config, as MACE cannot read config file yet
    args = []
    for key, value in options.items():
        if isinstance(value, bool):
            if value is True:
                args.append(f"--{key}")
        else:
            args.append(f"--{key}")
            args.append(f"{value}")

    mlip_args = mace_parser().parse_args(args)

    if logger:
        logger.info("Starting preprocessing")
    if tracker:
        tracker.start_task("Preprocessing")

    run(mlip_args)

    if logger:
        logger.info("Preprocessing complete")
    if tracker:
        tracker.stop_task()
        tracker.stop()
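
The conversion from configuration options to command-line arguments above can be tried in isolation. The sketch below is illustrative only: `config_to_args` is a hypothetical name for the loop extracted into a function, and the `argparse` parser is a stand-in for MACE's preprocessing parser that declares just three of the options from the test configuration.

```python
from __future__ import annotations

import argparse


def config_to_args(options: dict) -> list[str]:
    """Turn a flat config dictionary into a list of CLI arguments."""
    args = []
    for key, value in options.items():
        if isinstance(value, bool):
            # True booleans become bare flags; False booleans are omitted
            if value:
                args.append(f"--{key}")
        else:
            # All other values are passed as "--key value" string pairs
            args.extend([f"--{key}", f"{value}"])
    return args


# Stand-in parser (assumption): the real code uses MACE's preprocessing parser
parser = argparse.ArgumentParser()
parser.add_argument("--r_max", type=float)
parser.add_argument("--batch_size", type=int)
parser.add_argument("--compute_statistics", action="store_true")

options = {"r_max": 4.0, "batch_size": 4, "compute_statistics": False}
parsed = parser.parse_args(config_to_args(options))
print(parsed)
```

Dropping `False` booleans suits flags that are off by default. Note that non-scalar values such as lists would be converted with `str()`, so a configuration containing them may need separate handling.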
51 changes: 15 additions & 36 deletions janus_core/training/train.py
@@ -2,47 +2,26 @@

from __future__ import annotations

from pathlib import Path
from collections.abc import Sequence
from typing import Any

try:
    from mace.cli.run_train import run as run_train
except ImportError as e:
    raise NotImplementedError("Please update MACE to use this module.") from e
from mace.cli.run_train import run
from mace.tools import build_default_arg_parser as mace_parser
import yaml

from janus_core.helpers.janus_types import PathLike
from janus_core.helpers.log import config_logger, config_tracker
from janus_core.helpers.utils import none_to_dict


def check_files_exist(config: dict, req_file_keys: list[PathLike]) -> None:
    """
    Check files specified in the MLIP configuration file exist.

    Parameters
    ----------
    config : dict
        MLIP configuration file options.
    req_file_keys : list[Pathlike]
        List of files that must exist if defined in the configuration file.

    Raises
    ------
    FileNotFoundError
        If a key from `req_file_keys` is in the configuration file, but the
        file corresponding to the configuration value do not exist.
    """
    for file_key in req_file_keys:
        # Only check if file key is in the configuration file
        if file_key in config and not Path(config[file_key]).exists():
            raise FileNotFoundError(f"{config[file_key]} does not exist")
from janus_core.helpers.utils import check_files_exist, none_to_dict


def train(
    mlip_config: PathLike,
    req_file_keys: list[PathLike] | None = None,
    req_file_keys: Sequence[PathLike] = (
        "train_file",
        "test_file",
        "valid_file",
        "statistics_file",
    ),
    attach_logger: bool = False,
    log_kwargs: dict[str, Any] | None = None,
    track_carbon: bool = True,
@@ -58,9 +37,9 @@ def train(
    ----------
    mlip_config : PathLike
        Configuration file to pass to MLIP.
    req_file_keys : list[PathLike] | None
    req_file_keys : Sequence[PathLike]
        List of files that must exist if defined in the configuration file.
        Default is ["train_file", "test_file", "valid_file", "statistics_file"].
        Default is ("train_file", "test_file", "valid_file", "statistics_file").
    attach_logger : bool
        Whether to attach a logger. Default is False.
    log_kwargs : dict[str, Any] | None
@@ -72,9 +51,6 @@ def train(
    """
    log_kwargs, tracker_kwargs = none_to_dict(log_kwargs, tracker_kwargs)

    if req_file_keys is None:
        req_file_keys = ["train_file", "test_file", "valid_file", "statistics_file"]

    # Validate inputs
    with open(mlip_config, encoding="utf8") as file:
        options = yaml.safe_load(file)
@@ -92,11 +68,14 @@ def train(

    # Path must be passed as a string
    mlip_args = mace_parser().parse_args(["--config", str(mlip_config)])

    if logger:
        logger.info("Starting training")
    if tracker:
        tracker.start_task("Training")
    run_train(mlip_args)

    run(mlip_args)

    if logger:
        logger.info("Training complete")
    if tracker:
11 changes: 11 additions & 0 deletions tests/data/mlip_preprocess.yml
@@ -0,0 +1,11 @@
train_file: "tests/data/mlip_train.xyz"
valid_file: "tests/data/mlip_valid.xyz"
test_file: "tests/data/mlip_test.xyz"
energy_key: 'dft_energy'
forces_key: 'dft_forces'
stress_key: 'dft_stress'
r_max: 4.0
scaling: 'rms_forces_scaling'
batch_size: 4
seed: 2024
compute_statistics: False