diff --git a/docs/source/quick-ref.rst b/docs/source/quick-ref.rst
index 8dd5dc8..bdac8a4 100644
--- a/docs/source/quick-ref.rst
+++ b/docs/source/quick-ref.rst
@@ -104,7 +104,7 @@ Samplers
 A critical component of ensuring reproducibility is to have a way for the order in which
 batches are retrieved from the dataset stays the same even if a training run is stopped.
 PyTorch does not provide a built-in system to allow this, so Helios implements this
-through the :py:class:`~helios.data.samplers.ResumabeSampler` base class. The goal is to
+through the :py:class:`~helios.data.samplers.ResumableSampler` base class. The goal is to
 provide a way to do the following:
 
 #. The sampler must have a way of setting the starting iteration. For example, suppose
@@ -229,8 +229,8 @@ been assigned when the trainer was created, then the following logic applies:
 * If the function returns true, then the early stop counter resets to 0 and training
   continues.
 * If the function returns false, then the early stop counter increases by one. If the
-  counter greater than or equal to the value given to ``early_stop_cycles``, then training
-  stops.
+  counter is greater than or equal to the value given to ``early_stop_cycles``, then
+  training stops.
 
 .. note::
    If you wish to use the early stop system, you **must** assign ``early_stop_cycles``.
@@ -300,7 +300,7 @@ handling, which results in the following data being stored in the
 
 * :py:attr:`~helios.trainer.TrainingState.current_iteration` and
   :py:attr:`~helios.trainer.TrainingState.global_iteration` will both have the same value,
-  which woill correspond to :math:`n_e \cdot n_i` where :math:`n_e` is the current epoch
+  which will correspond to :math:`n_e \cdot n_i` where :math:`n_e` is the current epoch
   number and :math:`n_i` is the batch number in the dataset.
 * :py:attr:`~helios.trainer.TrainingState.global_epoch` will contain the current epoch
   number.
@@ -385,6 +385,7 @@ keys:
   :py:meth:`~helios.model.model.Model.state_dict`. Note that by default this is an
   empty dictionary.
 * ``rng``: contains the state of the supported RNGs.
+* ``version``: contains the version of Helios used to generate the checkpoint.
 
 The following keys may optionally appear in the dictionary:
 
@@ -435,6 +436,22 @@ be:
 
    If distributed training is used, then only the process with *global rank* 0 will save
    checkpoints.
 
+Migrating Checkpoints
+---------------------
+
+The ``version`` key stored in the checkpoints generated by Helios acts as a fail-safe to
+prevent future changes from breaking previously generated checkpoints. Helios *guarantees*
+compatibility between checkpoints generated within the same major revision. In other
+words, checkpoints generated by version 1.0 will be compatible with version 1.1.
+Compatibility between major versions is **not** guaranteed. Should you wish to migrate
+your checkpoints to a newer version of Helios, you may do so by either manually calling
+:py:func:`~helios.chkpt_migrator.migrate_checkpoints_to_current_version` or by using the
+script directly from the command line as follows:
+
+.. code-block:: sh
+
+   python -m helios.chkpt_migrator
+
 .. _logging:
 
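The early-stop rule amended in the quick-ref hunk above reduces to simple counter
bookkeeping. Here is a minimal, self-contained sketch of that rule; the function name and
signature are illustrative only, not Helios internals:

.. code-block:: python

   def early_stop_update(
       improved: bool, counter: int, early_stop_cycles: int
   ) -> tuple[bool, int]:
       """Return (should_stop, new_counter) after one validation cycle."""
       if improved:
           # The validation callback returned true: reset the counter and
           # keep training.
           return False, 0
       # The callback returned false: one more cycle without improvement.
       counter += 1
       # Stop once early_stop_cycles consecutive cycles fail to improve.
       return counter >= early_stop_cycles, counter
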
diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst
index cda8e83..12bc296 100644
--- a/docs/source/tutorial.rst
+++ b/docs/source/tutorial.rst
@@ -437,8 +437,8 @@ Now let's look at the logging code:
 Let's examine each part independently:
 
 #. The call to ``super().on_training_batch_end`` will automatically gather any tensors
-   stored so in the ``_loss_items`` dictionary if we're in distributed mode, so we don't
-   have to manually do it ourselves.
+   stored in the ``_loss_items`` dictionary if we're in distributed mode, so we don't have
+   to manually do it ourselves.
 #. When the :py:class:`~helios.trainer.Trainer` is created, we can specify the interval
    at which logging should occur. Since
    :py:meth:`~helios.model.model.Model.on_training_batch_end` is called on at the end of
@@ -533,9 +533,10 @@ do this, we're going to assign these fields before validation starts:
 
         self._val_scores["total"] = 0
         self._val_scores["correct"] = 0
 
-Calling ``on_validatioN_start`` on the base class automatically clears out the
-``_val_scores`` dictionary to ensure we don't accidentally over-write or overlap values.
-After setting the fields we care about, let's perform the validation step:
+Calling :py:meth:`~helios.model.model.Model.on_validation_start` on the base class
+automatically clears out the ``_val_scores`` dictionary to ensure we don't accidentally
+overwrite or overlap values. After setting the fields we care about, let's perform the
+validation step:
 
 .. code-block:: python
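
The ``total`` and ``correct`` fields initialized above feed a running accuracy
computation during the validation step. A minimal sketch of that bookkeeping, assuming a
classification setup; the helper name and the ``(logits, labels)`` layout are assumptions
rather than the tutorial's verbatim code:

.. code-block:: python

   import torch

   # Hypothetical helper mirroring the "total"/"correct" bookkeeping set up
   # in on_validation_start; logits is (N, C), labels is (N,).
   def update_val_scores(
       val_scores: dict[str, int], logits: torch.Tensor, labels: torch.Tensor
   ) -> None:
       preds = logits.argmax(dim=1)  # predicted class per sample
       val_scores["total"] += labels.numel()  # samples seen so far
       val_scores["correct"] += (preds == labels).sum().item()

Once validation finishes, the accuracy is then simply
``val_scores["correct"] / val_scores["total"]``.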