[brief] Updates documentation files.
[detailed]
- Mostly fixing typos as well as adding the new section for the
  checkpoint converter.
marovira committed Jul 18, 2024
1 parent f64926f commit 6ba9371
Showing 2 changed files with 27 additions and 9 deletions.
25 changes: 21 additions & 4 deletions docs/source/quick-ref.rst
@@ -104,7 +104,7 @@ Samplers
A critical component of ensuring reproducibility is having a way to guarantee that the
order in which batches are retrieved from the dataset stays the same even if a training
run is stopped.
PyTorch does not provide a built-in system to allow this, so Helios implements this
through the :py:class:`~helios.data.samplers.ResumableSampler` base class. The goal is to
provide a way to do the following:

#. The sampler must have a way of setting the starting iteration. For example, suppose
@@ -229,8 +229,8 @@ been assigned when the trainer was created, then the following logic applies:
* If the function returns true, then the early stop counter resets to 0 and training
continues.
* If the function returns false, then the early stop counter increases by one. If the
  counter is greater than or equal to the value given to ``early_stop_cycles``, then
  training stops.

.. note::
If you wish to use the early stop system, you **must** assign ``early_stop_cycles``.
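
To make the counter logic above concrete, here is a minimal, hypothetical sketch of the
bookkeeping (``run_validation_cycles`` and ``validate_fn`` are illustrative stand-ins,
not part of the Helios API, and this is not the trainer's actual loop):

.. code-block:: python

   import itertools

   def run_validation_cycles(validate_fn, early_stop_cycles: int) -> None:
       """Illustrative early-stop bookkeeping; not Helios's implementation."""
       early_stop_count = 0
       for cycle in itertools.count():
           improved = validate_fn(cycle)  # user-assigned check: did validation improve?
           if improved:
               early_stop_count = 0  # reset the counter and keep training
           else:
               early_stop_count += 1
               if early_stop_count >= early_stop_cycles:
                   break  # counter reached early_stop_cycles: stop training
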
@@ -300,7 +300,7 @@ handling, which results in the following data being stored in the

* :py:attr:`~helios.trainer.TrainingState.current_iteration` and
:py:attr:`~helios.trainer.TrainingState.global_iteration` will both have the same value,
  which will correspond to :math:`n_e \cdot n_i` where :math:`n_e` is the current epoch
number and :math:`n_i` is the batch number in the dataset.
* :py:attr:`~helios.trainer.TrainingState.global_epoch` will contain the current epoch
number.
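
As a quick sanity check of the arithmetic above (the numbers are made up and the variable
names simply mirror the math notation; they are not part of the Helios API):

.. code-block:: python

   n_e = 3    # current epoch number (three full epochs completed)
   n_i = 500  # number of batches in the dataset

   # Both counters advance once per batch, so after n_e full epochs:
   assert n_e * n_i == 1500  # value of current_iteration and global_iteration
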
@@ -385,6 +385,7 @@ keys:
:py:meth:`~helios.model.model.Model.state_dict`. Note that by default this is an empty
dictionary.
* ``rng``: contains the state of the supported RNGs.
* ``version``: contains the version of Helios used to generate the checkpoint.

The following keys may optionally appear in the dictionary:

@@ -435,6 +436,22 @@ be:
If distributed training is used, then only the process with *global rank* 0 will save
checkpoints.
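
Assuming the checkpoint is a regular PyTorch file written with ``torch.save`` (typical
for PyTorch-based trainers, but worth verifying for your setup), you can inspect the keys
described above directly. The path below is hypothetical:

.. code-block:: python

   import torch

   # Hypothetical path; point this at one of your own checkpoints.
   chkpt = torch.load("chkpt/epoch_10.pth", map_location="cpu")

   print(sorted(chkpt.keys()))  # expect the keys listed above, e.g. "rng" and "version"
   print(chkpt["version"])      # Helios version that generated the checkpoint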

Migrating Checkpoints
---------------------

The ``version`` key stored in the checkpoints generated by Helios acts as a fail-safe to
prevent future changes from breaking previously generated checkpoints. Helios *guarantees*
compatibility between checkpoints generated within the same major revision. In other
words, checkpoints generated by version 1.0 will be compatible with version 1.1.
Compatibility between major versions is **not** guaranteed. Should you wish to migrate
your checkpoints to a newer version of Helios, you may do so by either manually calling
:py:func:`~helios.chkpt_migrator.migrate_checkpoints_to_current_version` or by using the
script directly from the command line as follows:

.. code-block:: sh

   python -m helios.chkpt_migrator <chkpt-root>
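
If you prefer to run the migration from Python, the sketch below assumes that the
function accepts the checkpoint root directory, mirroring the ``<chkpt-root>`` argument
of the command above; verify the exact signature against the API reference:

.. code-block:: python

   from helios import chkpt_migrator

   # Assumed call: pass the same checkpoint root as the CLI's <chkpt-root> argument.
   chkpt_migrator.migrate_checkpoints_to_current_version("path/to/chkpt-root")
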
.. _logging:

11 changes: 6 additions & 5 deletions docs/source/tutorial.rst
@@ -437,8 +437,8 @@ Now let's look at the logging code:
Let's examine each part independently:

#. The call to ``super().on_training_batch_end`` will automatically gather any tensors
   stored in the ``_loss_items`` dictionary if we're in distributed mode, so we don't have
   to manually do it ourselves.
#. When the :py:class:`~helios.trainer.Trainer` is created, we can specify the interval at
which logging should occur. Since
   :py:meth:`~helios.model.model.Model.on_training_batch_end` is called at the end of
@@ -533,9 +533,10 @@ do this, we're going to assign these fields before validation starts:
self._val_scores["total"] = 0
self._val_scores["correct"] = 0

Calling :py:meth:`~helios.model.model.Model.on_validation_start` on the base class
automatically clears out the ``_val_scores`` dictionary to ensure we don't accidentally
overwrite or overlap values. After setting the fields we care about, let's perform the
validation step:

.. code-block:: python
