[brief] Updates documentation files.
[detailed]
- Mostly fixing typos as well as adding the new section for the
  checkpoint converter.
marovira committed Jul 18, 2024
1 parent f64926f commit 6ba9371
Showing 2 changed files with 27 additions and 9 deletions.
25 changes: 21 additions & 4 deletions docs/source/quick-ref.rst
@@ -104,7 +104,7 @@ Samplers
A critical component of ensuring reproducibility is having a way to guarantee that the
order in which batches are retrieved from the dataset stays the same even if a training
run is stopped.
PyTorch does not provide a built-in system to allow this, so Helios implements this
through the :py:class:`~helios.data.samplers.ResumableSampler` base class. The goal is to
provide a way to do the following:

#. The sampler must have a way of setting the starting iteration. For example, suppose
@@ -229,8 +229,8 @@ been assigned when the trainer was created, then the following logic applies:
* If the function returns true, then the early stop counter resets to 0 and training
continues.
* If the function returns false, then the early stop counter increases by one. If the
  counter is greater than or equal to the value given to ``early_stop_cycles``, then
  training stops.

.. note::
If you wish to use the early stop system, you **must** assign ``early_stop_cycles``.
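
To make the counter logic above concrete, here is a minimal, hypothetical sketch of the
bookkeeping (``run_validation_cycles`` and ``validate_fn`` are illustrative stand-ins,
not part of the Helios API, and this is not the trainer's actual loop):

.. code-block:: python

   import itertools

   def run_validation_cycles(validate_fn, early_stop_cycles: int) -> None:
       """Illustrative early-stop bookkeeping; not Helios's implementation."""
       early_stop_count = 0
       for cycle in itertools.count():
           improved = validate_fn(cycle)  # user-assigned check: did validation improve?
           if improved:
               early_stop_count = 0  # reset the counter and keep training
           else:
               early_stop_count += 1
               if early_stop_count >= early_stop_cycles:
                   break  # counter reached early_stop_cycles: stop training
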
@@ -300,7 +300,7 @@ handling, which results in the following data being stored in the

* :py:attr:`~helios.trainer.TrainingState.current_iteration` and
:py:attr:`~helios.trainer.TrainingState.global_iteration` will both have the same value,
  which will correspond to :math:`n_e \cdot n_i` where :math:`n_e` is the current epoch
number and :math:`n_i` is the batch number in the dataset.
* :py:attr:`~helios.trainer.TrainingState.global_epoch` will contain the current epoch
number.
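
As a quick sanity check of the arithmetic above (the numbers are made up and the variable
names simply mirror the math notation; they are not part of the Helios API):

.. code-block:: python

   n_e = 3    # current epoch number (three full epochs completed)
   n_i = 500  # number of batches in the dataset

   # Both counters advance once per batch, so after n_e full epochs:
   assert n_e * n_i == 1500  # value of current_iteration and global_iteration
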
@@ -385,6 +385,7 @@ keys:
:py:meth:`~helios.model.model.Model.state_dict`. Note that by default this is an empty
dictionary.
* ``rng``: contains the state of the supported RNGs.
* ``version``: contains the version of Helios used to generate the checkpoint.

The following keys may optionally appear in the dictionary:

@@ -435,6 +436,22 @@ be:
If distributed training is used, then only the process with *global rank* 0 will save
checkpoints.
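
Assuming the checkpoint is a regular PyTorch file written with ``torch.save`` (typical
for PyTorch-based trainers, but worth verifying for your setup), you can inspect the keys
described above directly. The path below is hypothetical:

.. code-block:: python

   import torch

   # Hypothetical path; point this at one of your own checkpoints.
   chkpt = torch.load("chkpt/epoch_10.pth", map_location="cpu")

   print(sorted(chkpt.keys()))  # expect the keys listed above, e.g. "rng" and "version"
   print(chkpt["version"])      # Helios version that generated the checkpoint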

Migrating Checkpoints
---------------------

The ``version`` key stored in the checkpoints generated by Helios acts as a fail-safe to
prevent future changes from breaking previously generated checkpoints. Helios *guarantees*
compatibility between checkpoints generated within the same major revision. In other
words, checkpoints generated by version 1.0 will be compatible with version 1.1.
Compatibility between major versions is **not** guaranteed. Should you wish to migrate
your checkpoints to a newer version of Helios, you may do so by either manually calling
:py:func:`~helios.chkpt_migrator.migrate_checkpoints_to_current_version` or by using the
script directly from the command line as follows:

.. code-block:: sh

   python -m helios.chkpt_migrator <chkpt-root>
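
If you prefer to run the migration from Python, the sketch below assumes that the
function accepts the checkpoint root directory, mirroring the ``<chkpt-root>`` argument
of the command above; verify the exact signature against the API reference:

.. code-block:: python

   from helios import chkpt_migrator

   # Assumed call: pass the same checkpoint root as the CLI's <chkpt-root> argument.
   chkpt_migrator.migrate_checkpoints_to_current_version("path/to/chkpt-root")
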
.. _logging:

11 changes: 6 additions & 5 deletions docs/source/tutorial.rst
@@ -437,8 +437,8 @@ Now let's look at the logging code:
Let's examine each part independently:

#. The call to ``super().on_training_batch_end`` will automatically gather any tensors
   stored in the ``_loss_items`` dictionary if we're in distributed mode, so we don't have
   to manually do it ourselves.
#. When the :py:class:`~helios.trainer.Trainer` is created, we can specify the interval at
which logging should occur. Since
   :py:meth:`~helios.model.model.Model.on_training_batch_end` is called at the end of
@@ -533,9 +533,10 @@ do this, we're going to assign these fields before validation starts:
self._val_scores["total"] = 0
self._val_scores["correct"] = 0

Calling :py:meth:`~helios.model.model.Model.on_validation_start` on the base class
automatically clears out the ``_val_scores`` dictionary to ensure we don't accidentally
overwrite or overlap values. After setting the fields we care about, let's perform the
validation step:

.. code-block:: python
