Add preprocessing training data (#344)
* Tidy training

* Add preprocessing

* Refactor file check

* Fix preprocess config

* Add preprocess docs

* Apply suggestions from code review

Co-authored-by: Jacob Wilkins <[email protected]>

---------

Co-authored-by: Jacob Wilkins <[email protected]>
ElliottKasoar and oerc0122 authored Nov 14, 2024
1 parent bdeaca9 commit ca1d799
Showing 12 changed files with 525 additions and 94 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -114,6 +114,7 @@ janus phonons
janus eos
janus train
janus descriptors
janus preprocess
```

For example, a single point calculation (using the [MACE-MP](https://github.com/ACEsuit/mace-mp) "small" force-field) can be performed by running:
20 changes: 20 additions & 0 deletions docs/source/apidoc/janus_core.rst
@@ -146,6 +146,16 @@ janus\_core.cli.phonons module
   :undoc-members:
   :show-inheritance:

janus\_core.cli.preprocess module
---------------------------------

.. automodule:: janus_core.cli.preprocess
   :members:
   :special-members:
   :private-members:
   :undoc-members:
   :show-inheritance:

janus\_core.cli.singlepoint module
----------------------------------

@@ -289,6 +299,16 @@ janus\_core.processing.symmetry module
   :undoc-members:
   :show-inheritance:

janus\_core.training.preprocess module
--------------------------------------

.. automodule:: janus_core.training.preprocess
   :members:
   :special-members:
   :private-members:
   :undoc-members:
   :show-inheritance:

janus\_core.training.train module
---------------------------------

23 changes: 22 additions & 1 deletion docs/source/user_guide/command_line.rst
@@ -428,7 +428,7 @@ Training and fine-tuning MLIPs
------------------------------

.. note::
-    Currently only MACE models are supported. See the `MACE CLI <https://github.com/ACEsuit/mace/blob/main/mace/cli/run_train.py>`_ for further configuration details
+    Currently only MACE models are supported. See the `MACE run_train CLI <https://github.com/ACEsuit/mace/blob/main/mace/cli/run_train.py>`_ for further configuration details

Models can be trained by passing a configuration file to the MLIP's command line interface:

@@ -446,6 +446,27 @@ Foundational models can also be fine-tuned, by including the ``foundation_model``
    janus train --mlip-config /path/to/fine/tuning/config.yml --fine-tune

Preprocessing training data
----------------------------

.. note::
    Currently only MACE models are supported. See the `MACE preprocess_data CLI <https://github.com/ACEsuit/mace/blob/main/mace/cli/preprocess_data.py>`_ for further configuration details

Large datasets, which may not fit into GPU memory, can be preprocessed,
converting xyz training, test, and validation files into HDF5 files that can then be used for on-line data loading.

This can be done by passing a configuration file to the MLIP's command line interface:

.. code-block:: bash

    janus preprocess --mlip-config /path/to/preprocessing/config.yml

For MACE, this will create separate folders for ``train``, ``val`` and ``test`` HDF5 data files, when relevant,
as well as saving the statistics of your data in ``statistics.json``, if requested.

Additionally, a log file, ``preprocess-log.yml``, and summary file, ``preprocess-summary.yml``, will be generated.
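
Review note: the outputs listed in this section can be confirmed after a run with a small existence check. The following is a minimal sketch and not part of janus-core. `missing_outputs` is a hypothetical helper, and it assumes that every output was written relative to a single directory, which in practice depends on the options in the configuration file and on where the command was run.

```python
from pathlib import Path


def missing_outputs(root, splits=("train", "val", "test"), expect_statistics=False):
    """Return the expected preprocessing outputs that are absent under root.

    Hypothetical helper: assumes every output is written relative to root.
    """
    root = Path(root)
    # Folder and file names are taken from the documentation text above
    expected = [*splits, "preprocess-log.yml", "preprocess-summary.yml"]
    if expect_statistics:
        # statistics.json is only written if statistics were requested
        expected.append("statistics.json")
    return [name for name in expected if not (root / name).exists()]


# Check the current directory, expecting no test split and no statistics
print(missing_outputs(".", splits=("train", "val")))
```

An empty list means that everything expected was found. Pass only the splits that the configuration file actually defines, since folders are created only "when relevant".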


Calculate descriptors
---------------------

2 changes: 2 additions & 0 deletions janus_core/cli/janus.py
@@ -15,6 +15,7 @@
from janus_core.cli.geomopt import geomopt
from janus_core.cli.md import md
from janus_core.cli.phonons import phonons
from janus_core.cli.preprocess import preprocess
from janus_core.cli.singlepoint import singlepoint
from janus_core.cli.train import train

@@ -30,6 +31,7 @@
app.command(help="Calculate equation of state.")(eos)
app.command(help="Calculate MLIP descriptors.")(descriptors)
app.command(help="Running training for an MLIP.")(train)
app.command(help="Running preprocessing for an MLIP.")(preprocess)


@app.callback(invoke_without_command=True, help="")
74 changes: 74 additions & 0 deletions janus_core/cli/preprocess.py
@@ -0,0 +1,74 @@
# noqa: I002, FA102
"""Set up MLIP preprocessing commandline interface."""

# Issues with future annotations and typer
# c.f. https://github.com/maxb2/typer-config/issues/295
# from __future__ import annotations

from pathlib import Path
from typing import Annotated

from typer import Option, Typer

app = Typer()


@app.command()
def preprocess(
    mlip_config: Annotated[
        Path, Option(help="Configuration file to pass to MLIP CLI.")
    ],
    log: Annotated[Path, Option(help="Path to save logs to.")] = Path(
        "preprocess-log.yml"
    ),
    tracker: Annotated[
        bool, Option(help="Whether to save carbon emissions of calculation")
    ] = True,
    summary: Annotated[
        Path,
        Option(
            help=(
                "Path to save summary of inputs, start/end time, and carbon emissions."
            )
        ),
    ] = Path("preprocess-summary.yml"),
):
    """
    Convert training data to hdf5 by passing a configuration file to the MLIP's CLI.

    Parameters
    ----------
    mlip_config : Path
        Configuration file to pass to MLIP CLI.
    log : Optional[Path]
        Path to write logs to. Default is Path("preprocess-log.yml").
    tracker : bool
        Whether to save carbon emissions of calculation in log file and summary.
        Default is True.
    summary : Optional[Path]
        Path to save summary of inputs, start/end time, and carbon emissions. Default
        is Path("preprocess-summary.yml").
    """
    from janus_core.cli.utils import carbon_summary, end_summary, start_summary
    from janus_core.training.preprocess import preprocess as run_preprocess

    inputs = {"mlip_config": str(mlip_config)}

    # Save summary information before preprocessing begins
    start_summary(command="preprocess", summary=summary, inputs=inputs)

    log_kwargs = {"filemode": "w"}
    if log:
        log_kwargs["filename"] = log

    # Run preprocessing
    run_preprocess(
        mlip_config, attach_logger=True, log_kwargs=log_kwargs, track_carbon=tracker
    )

    # Save carbon summary
    if tracker:
        carbon_summary(summary=summary, log=log)

    # Save time after preprocessing has finished
    end_summary(summary)
23 changes: 23 additions & 0 deletions janus_core/helpers/utils.py
@@ -409,3 +409,26 @@ def track_progress(sequence: Sequence | Iterable, description: str) -> Iterable:

    with progress:
        yield from progress.track(sequence, description=description)


def check_files_exist(config: dict, req_file_keys: Sequence[PathLike]) -> None:
    """
    Check files specified in a dictionary read from a configuration file exist.

    Parameters
    ----------
    config : dict
        Dictionary read from configuration file.
    req_file_keys : Sequence[PathLike]
        Files that must exist if defined in the configuration file.

    Raises
    ------
    FileNotFoundError
        If a key from `req_file_keys` is in the configuration file, but the
        file corresponding to the configuration value does not exist.
    """
    for file_key in config.keys() & req_file_keys:
        # Only check if file key is in the configuration file
        if not Path(config[file_key]).exists():
            raise FileNotFoundError(f"{config[file_key]} does not exist")
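
Review note: the refactored check relies on dictionary key views supporting set intersection with any iterable, so only required keys that are actually present in the configuration are inspected. The sketch below reproduces the function body from this hunk without type hints so that it runs on its own. The file names and the temporary directory are illustrative assumptions, not part of the commit.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def check_files_exist(config, req_file_keys):
    """Raise FileNotFoundError if a required key is set but its file is missing."""
    # Intersecting the key view with req_file_keys keeps only the required
    # keys that are defined in the configuration
    for file_key in config.keys() & req_file_keys:
        if not Path(config[file_key]).exists():
            raise FileNotFoundError(f"{config[file_key]} does not exist")


with TemporaryDirectory() as tmp:
    train = Path(tmp) / "train.xyz"  # illustrative file name
    train.touch()

    # "test_file" is not defined in the configuration, so it is skipped
    check_files_exist(
        {"train_file": str(train), "r_max": 4.0}, ("train_file", "test_file")
    )

    # "valid_file" is defined but points at a missing file, so this raises
    try:
        check_files_exist(
            {"valid_file": str(Path(tmp) / "missing.xyz")}, ("valid_file",)
        )
    except FileNotFoundError as err:
        print(err)
```

Because the intersection is a set, the order in which keys are checked is not guaranteed. If several files are missing, the one named in the error message may vary between runs.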
87 changes: 87 additions & 0 deletions janus_core/training/preprocess.py
@@ -0,0 +1,87 @@
"""Preprocess MLIP training data."""

from __future__ import annotations

from collections.abc import Sequence
from typing import Any

from mace.cli.preprocess_data import run
from mace.tools import build_preprocess_arg_parser as mace_parser
import yaml

from janus_core.helpers.janus_types import PathLike
from janus_core.helpers.log import config_logger, config_tracker
from janus_core.helpers.utils import check_files_exist, none_to_dict


def preprocess(
    mlip_config: PathLike,
    req_file_keys: Sequence[PathLike] = ("train_file", "test_file", "valid_file"),
    attach_logger: bool = False,
    log_kwargs: dict[str, Any] | None = None,
    track_carbon: bool = True,
    tracker_kwargs: dict[str, Any] | None = None,
) -> None:
    """
    Convert training data to hdf5 by passing a configuration file to the MLIP's CLI.

    Currently only supports MACE models, but this can be extended by replacing the
    argument parsing.

    Parameters
    ----------
    mlip_config : PathLike
        Configuration file to pass to MLIP.
    req_file_keys : Sequence[PathLike]
        List of files that must exist if defined in the configuration file.
        Default is ("train_file", "test_file", "valid_file").
    attach_logger : bool
        Whether to attach a logger. Default is False.
    log_kwargs : dict[str, Any] | None
        Keyword arguments to pass to `config_logger`. Default is {}.
    track_carbon : bool
        Whether to track carbon emissions of calculation. Default is True.
    tracker_kwargs : dict[str, Any] | None
        Keyword arguments to pass to `config_tracker`. Default is {}.
    """
    log_kwargs, tracker_kwargs = none_to_dict(log_kwargs, tracker_kwargs)

    # Validate inputs
    with open(mlip_config, encoding="utf8") as file:
        options = yaml.safe_load(file)
        check_files_exist(options, req_file_keys)

    # Configure logging
    if attach_logger:
        log_kwargs.setdefault("filename", "preprocess-log.yml")
    log_kwargs.setdefault("name", __name__)
    logger = config_logger(**log_kwargs)
    tracker = config_tracker(logger, track_carbon, **tracker_kwargs)

    if logger and "foundation_model" in options:
        logger.info("Fine tuning model: %s", options["foundation_model"])

    # Parse options from config, as MACE cannot read config file yet
    args = []
    for key, value in options.items():
        if isinstance(value, bool):
            if value is True:
                args.append(f"--{key}")
        else:
            args.append(f"--{key}")
            args.append(f"{value}")

    mlip_args = mace_parser().parse_args(args)

    if logger:
        logger.info("Starting preprocessing")
    if tracker:
        tracker.start_task("Preprocessing")

    run(mlip_args)

    if logger:
        logger.info("Preprocessing complete")
    if tracker:
        tracker.stop_task()
        tracker.stop()
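
Review note: the loop that turns the YAML options into command-line arguments can be exercised in isolation. The following is a minimal sketch of that conversion. `config_to_args` is a hypothetical name for the inline loop, and the `argparse` parser is a small stand-in with assumed options and defaults, not the parser returned by MACE's `build_preprocess_arg_parser`.

```python
import argparse


def config_to_args(options):
    """Flatten a configuration mapping into a CLI-style argument list."""
    args = []
    for key, value in options.items():
        if isinstance(value, bool):
            # Booleans are treated as flags: emitted when True, omitted when False
            if value:
                args.append(f"--{key}")
        else:
            # Everything else becomes "--key value", with the value as a string
            args.extend([f"--{key}", str(value)])
    return args


# Hypothetical stand-in parser; option names follow tests/data/mlip_preprocess.yml,
# while the types and defaults are assumptions made for this sketch
parser = argparse.ArgumentParser()
parser.add_argument("--train_file", type=str)
parser.add_argument("--r_max", type=float, default=5.0)
parser.add_argument("--batch_size", type=int, default=16)
parser.add_argument("--compute_statistics", action="store_true")

options = {
    "train_file": "tests/data/mlip_train.xyz",
    "r_max": 4.0,
    "batch_size": 4,
    "compute_statistics": False,
}
args = config_to_args(options)
print(args)
# ['--train_file', 'tests/data/mlip_train.xyz', '--r_max', '4.0', '--batch_size', '4']
namespace = parser.parse_args(args)
```

One consequence visible in the sketch is that a `False` value is dropped entirely, so it cannot switch off an option whose default is true. List values would also be passed as a single stringified argument, which may matter if list-valued options are added to the configuration.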
51 changes: 15 additions & 36 deletions janus_core/training/train.py
@@ -2,47 +2,26 @@

 from __future__ import annotations

-from pathlib import Path
+from collections.abc import Sequence
 from typing import Any

-try:
-    from mace.cli.run_train import run as run_train
-except ImportError as e:
-    raise NotImplementedError("Please update MACE to use this module.") from e
+from mace.cli.run_train import run
 from mace.tools import build_default_arg_parser as mace_parser
 import yaml

 from janus_core.helpers.janus_types import PathLike
 from janus_core.helpers.log import config_logger, config_tracker
-from janus_core.helpers.utils import none_to_dict
-
-
-def check_files_exist(config: dict, req_file_keys: list[PathLike]) -> None:
-    """
-    Check files specified in the MLIP configuration file exist.
-
-    Parameters
-    ----------
-    config : dict
-        MLIP configuration file options.
-    req_file_keys : list[Pathlike]
-        List of files that must exist if defined in the configuration file.
-
-    Raises
-    ------
-    FileNotFoundError
-        If a key from `req_file_keys` is in the configuration file, but the
-        file corresponding to the configuration value do not exist.
-    """
-    for file_key in req_file_keys:
-        # Only check if file key is in the configuration file
-        if file_key in config and not Path(config[file_key]).exists():
-            raise FileNotFoundError(f"{config[file_key]} does not exist")
+from janus_core.helpers.utils import check_files_exist, none_to_dict


 def train(
     mlip_config: PathLike,
-    req_file_keys: list[PathLike] | None = None,
+    req_file_keys: Sequence[PathLike] = (
+        "train_file",
+        "test_file",
+        "valid_file",
+        "statistics_file",
+    ),
     attach_logger: bool = False,
     log_kwargs: dict[str, Any] | None = None,
     track_carbon: bool = True,
@@ -58,9 +37,9 @@ def train(
     ----------
     mlip_config : PathLike
         Configuration file to pass to MLIP.
-    req_file_keys : list[PathLike] | None
+    req_file_keys : Sequence[PathLike]
         List of files that must exist if defined in the configuration file.
-        Default is ["train_file", "test_file", "valid_file", "statistics_file"].
+        Default is ("train_file", "test_file", "valid_file", "statistics_file").
     attach_logger : bool
         Whether to attach a logger. Default is False.
     log_kwargs : dict[str, Any] | None
@@ -72,9 +51,6 @@ def train(
     """
     log_kwargs, tracker_kwargs = none_to_dict(log_kwargs, tracker_kwargs)

-    if req_file_keys is None:
-        req_file_keys = ["train_file", "test_file", "valid_file", "statistics_file"]
-
     # Validate inputs
     with open(mlip_config, encoding="utf8") as file:
         options = yaml.safe_load(file)
@@ -92,11 +68,14 @@ def train(

     # Path must be passed as a string
     mlip_args = mace_parser().parse_args(["--config", str(mlip_config)])
+
     if logger:
         logger.info("Starting training")
     if tracker:
         tracker.start_task("Training")
-    run_train(mlip_args)
+
+    run(mlip_args)
+
     if logger:
         logger.info("Training complete")
     if tracker:
11 changes: 11 additions & 0 deletions tests/data/mlip_preprocess.yml
@@ -0,0 +1,11 @@
train_file: "tests/data/mlip_train.xyz"
valid_file: "tests/data/mlip_valid.xyz"
test_file: "tests/data/mlip_test.xyz"
energy_key: 'dft_energy'
forces_key: 'dft_forces'
stress_key: 'dft_stress'
r_max: 4.0
scaling: 'rms_forces_scaling'
batch_size: 4
seed: 2024
compute_statistics: False