A lightweight tool for managing ML experiments.
Forge makes it easier to configure experiments and simplifies model inspection and evaluation through smart checkpoints. With Forge, you configure and build your dataset and model in separate files and load them easily in an experiment script or a Jupyter notebook. Once the model is trained, it can be restored from a snapshot (together with the corresponding dataset) without access to the original config files.
- Write a data config (example here).
- Write a model config (example here).
- Run the training script (example here). Typically, you would copy the example train script to your project and customize it with any additional logging/setup required; a rough sketch of what such a script does is shown below.
- (Optional) Analyze the trained model in a notebook (example here) or in another script (example here).
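To make the workflow concrete, here is a minimal sketch of the training-script side. The helpers `forge.config()` (to parse flags) and `forge.load()` (to call a config file's `load` function), as well as the config paths, are assumptions made for illustration; the linked example train script is the authoritative reference.

```python
# Hypothetical training-script skeleton; forge.config and forge.load are
# assumptions here -- see the example train script for the real API.
import forge
from forge import flags

# Which config files to use; the paths are placeholders.
flags.DEFINE_string('data_config', 'configs/my_data.py', 'Path to the data config file.')
flags.DEFINE_string('model_config', 'configs/my_model.py', 'Path to the model config file.')

# Parse all flags defined so far (assumed helper).
config = forge.config()

# The data config's `load` returns a dict...
data = forge.load(config.data_config, config)
# ...whose entries are passed as keyword arguments to the model config's `load`.
model = forge.load(config.model_config, config, **data)

# From here on: build the optimizer, run the training loop, save checkpoints, etc.
```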
Dataset and model config files are generic (separate) Python scripts that define a `load` function. The data config's `load` should return a `dict`, which is passed as keyword arguments to the model config's `load`.
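As an illustration, a pair of minimal config files might look like the sketch below. The exact `load` signature, in particular the leading `config` argument (assumed to hold the parsed flags, which are explained next), is an assumption, and the dummy data and stand-in model are placeholders; the linked example configs show the real thing.

```python
# my_data.py -- a hypothetical minimal data config (not forge's own example).
import numpy as np
from forge import flags

flags.DEFINE_integer('batch_size', 32, 'Number of examples per batch.')

def load(config, **unused_kwargs):
    """Builds the dataset. The returned dict is forwarded to the model config."""
    # Dummy arrays for illustration; a real config would read data from disk
    # and batch it according to config.batch_size.
    train_imgs = np.random.rand(config.batch_size * 10, 28, 28).astype(np.float32)
    train_labels = np.random.randint(0, 10, size=config.batch_size * 10)
    return {'train_imgs': train_imgs, 'train_labels': train_labels}


# my_model.py -- a hypothetical minimal model config.
from forge import flags

flags.DEFINE_integer('n_hidden', 128, 'Number of hidden units.')

def load(config, train_imgs, train_labels, **unused_kwargs):
    """Receives the entries of the data config's dict as keyword arguments."""
    # A real config would build a network and a training loss here;
    # a trivial stand-in keeps the sketch self-contained.
    n_classes = int(train_labels.max()) + 1
    model = {'n_classes': n_classes, 'n_hidden': config.n_hidden}
    return model
```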
Both config files and any scripts use `forge.flags` for configuration. `forge.flags` is based on an older implementation of abseil flags. Forge does not take TensorFlow flags into account, so it is best to use `forge.flags` instead.
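Because `forge.flags` follows the older abseil-style flags API, flag definitions look roughly like this (the specific flags below are made up for illustration):

```python
# Hypothetical flag definitions in the abseil style that forge.flags mirrors.
from forge import flags

flags.DEFINE_string('optimizer', 'adam', 'Name of the optimizer to use.')
flags.DEFINE_float('learning_rate', 1e-4, 'Initial learning rate.')
flags.DEFINE_integer('train_iters', 100000, 'Number of training iterations.')
flags.DEFINE_boolean('debug', False, 'Runs a smaller configuration for debugging.')
```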
The training script relies on the `checkpoint_dir` and `run_name` flags, which specify where model checkpoints should be kept. For every run, a job-specific folder is created under `checkpoint_dir/run_name/#`, where `#` is a number. All config flags and the dataset/model configs are stored in the job folder, so the corresponding job can easily be resumed later by passing the `resume` flag. It is also easy to load a model checkpoint in another script or a Jupyter notebook.
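Forge creates and numbers these job folders itself; the snippet below (with placeholder paths) only illustrates the naming scheme described above and is not Forge's implementation.

```python
# Illustration of the checkpoint_dir/run_name/# layout (not forge's code).
import os

def next_job_dir(checkpoint_dir, run_name):
    """Creates and returns checkpoint_dir/run_name/#, where # is the next run number."""
    run_dir = os.path.join(checkpoint_dir, run_name)
    os.makedirs(run_dir, exist_ok=True)
    taken = [int(d) for d in os.listdir(run_dir) if d.isdigit()]
    job_dir = os.path.join(run_dir, str(max(taken, default=0) + 1))
    os.makedirs(job_dir)
    return job_dir

print(next_job_dir('checkpoints', 'mnist_baseline'))  # e.g. checkpoints/mnist_baseline/1
print(next_job_dir('checkpoints', 'mnist_baseline'))  # e.g. checkpoints/mnist_baseline/2
```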
Feature requests and contributions in the form of pull requests are welcome.
Adam R. Kosiorek