All About Configs

This deep-dive will guide you through writing configs for running recipes.

.. grid:: 2

    .. grid-item-card:: :octicon:`mortar-board;1em;` What this deep-dive will cover

      * How to write a YAML config and run a recipe with it
      * How to use :code:`instantiate` and :code:`parse` APIs
      * How to effectively use configs and CLI overrides for running recipes

    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites

      * Be familiar with the :ref:`overview of torchtune<overview_label>`
      * Make sure to :ref:`install torchtune<install_label>`
      * Understand the :ref:`fundamentals of recipes<recipe_deepdive>`


Where do parameters live?

There are two primary entry points for you to configure parameters: configs and CLI overrides. Configs are YAML files that define all the parameters needed to run a recipe in a single location. They are the single source of truth for reproducing a run. Config parameters can be overridden on the command line using tune for quick changes and experimentation without modifying the config.

Writing configs

Configs serve as the primary entry point for running recipes in torchtune. They are expected to be YAML files and they simply list out values for parameters you want to define for a particular run.

seed: null
shuffle: True
device: cuda
dtype: fp32
enable_fsdp: True
...

Many fields will require specifying torchtune objects with associated keyword arguments as parameters. Models, datasets, optimizers, and loss functions are common examples of this. You can do this using the _component_ subfield. In _component_, you need to specify the dotpath of the object you wish to instantiate in the recipe. The dotpath is the exact path you would use to import the object normally in a Python file. For example, to specify the :func:`~torchtune.datasets.alpaca_dataset` in your config with custom arguments:

dataset:
  _component_: torchtune.datasets.alpaca_dataset
  train_on_input: False

Here, we are changing the default value for train_on_input from True to False.

Once you've specified the _component_ in your config, you can create an instance of the specified object in your recipe's setup like so:

from torchtune import config

# Access the dataset field and create the object instance
dataset = config.instantiate(cfg.dataset)

This will automatically use any keyword arguments specified in the fields under dataset.
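To make the _component_ mechanism concrete, here is a minimal sketch of the idea using only the standard library. It is not torchtune's implementation: :code:`resolve_dotpath` and :code:`sketch_instantiate` are hypothetical helpers, and :code:`fractions.Fraction` stands in for a torchtune component so the snippet runs without torchtune installed.

```python
import importlib
from typing import Any, Dict


def resolve_dotpath(dotpath: str) -> Any:
    """Import the object named by a dotpath such as 'package.module.name'."""
    module_name, _, attr_name = dotpath.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


def sketch_instantiate(node: Dict[str, Any]) -> Any:
    """Call the object under '_component_' with the remaining fields as kwargs."""
    kwargs = {key: value for key, value in node.items() if key != "_component_"}
    target = resolve_dotpath(node["_component_"])
    return target(**kwargs)


# Stand-in config node (assumption): a real config would name a torchtune object.
node = {"_component_": "fractions.Fraction", "numerator": 3, "denominator": 4}
fraction = sketch_instantiate(node)
print(fraction)  # 3/4
```

The point of the sketch is that the dotpath is imported exactly as you would import it in Python, and every sibling field becomes a keyword argument.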

As written, the preceding example will actually throw an error. If you look at the signature of :func:`~torchtune.datasets.alpaca_dataset`, you'll notice that we're missing a required positional argument, the tokenizer. Since this is another configurable torchtune object, let's understand how to handle this by taking a look at the :func:`~torchtune.config.instantiate` API.

def instantiate(
    config: DictConfig,
    *args: Any,
    **kwargs: Any,
)

:func:`~torchtune.config.instantiate` also accepts positional and keyword arguments and automatically passes them, together with the fields from the config, when creating the object. This means we can not only pass in the tokenizer, but also add keyword arguments that are not specified in the config if we'd like:

# Tokenizer is needed for the dataset, configure it first
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  path: /tmp/tokenizer.model

dataset:
  _component_: torchtune.datasets.alpaca_dataset

# Note the API of the tokenizer we specified - we need to pass in a path
def llama2_tokenizer(path: str) -> Llama2Tokenizer:

# Note the API of the dataset we specified - we need to pass in a model tokenizer
# and any optional keyword arguments
def alpaca_dataset(
    tokenizer: ModelTokenizer,
    train_on_input: bool = True,
    max_seq_len: int = 512,
) -> SFTDataset:

from torchtune import config

# Since we've already specified the path in the config, we don't need to pass
# it in
tokenizer = config.instantiate(cfg.tokenizer)
# We pass in the instantiated tokenizer as the first required argument, then
# we change an optional keyword argument
dataset = config.instantiate(
    cfg.dataset,
    tokenizer,
    train_on_input=False,
)

Note that additional keyword arguments will overwrite any duplicated keys in the config.
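The precedence rule can be illustrated with a small standalone sketch. The names here are assumptions for illustration only: :code:`call_with_overrides` is a hypothetical helper and :code:`toy_dataset` is a stand-in that mimics the shape of the dataset builder's signature; neither is part of torchtune.

```python
from typing import Any, Callable, Dict


def call_with_overrides(
    target: Callable[..., Any],
    config_kwargs: Dict[str, Any],
    *args: Any,
    **kwargs: Any,
) -> Any:
    """Merge config fields with call-site kwargs; call-site kwargs win on conflict."""
    merged = {**config_kwargs, **kwargs}
    return target(*args, **merged)


def toy_dataset(
    tokenizer: str,
    train_on_input: bool = True,
    max_seq_len: int = 512,
) -> Dict[str, Any]:
    """Hypothetical stand-in that simply echoes the arguments it received."""
    return {
        "tokenizer": tokenizer,
        "train_on_input": train_on_input,
        "max_seq_len": max_seq_len,
    }


# Fields as they might appear under `dataset:` in a config.
config_kwargs = {"train_on_input": True, "max_seq_len": 1024}

# The positional argument fills `tokenizer`; the keyword overrides the config value.
result = call_with_overrides(
    toy_dataset, config_kwargs, "my-tokenizer", train_on_input=False
)
print(result["train_on_input"], result["max_seq_len"])  # False 1024
```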

Referencing other config fields with interpolations

Sometimes you need to use the same value more than once for multiple fields. You can use interpolations to reference another field, and :func:`~torchtune.config.instantiate` will automatically resolve it for you.

output_dir: /tmp/alpaca-llama2-finetune
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ${output_dir}
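As a rough picture of what resolving an interpolation means, the sketch below substitutes :code:`${key}` references in a plain dictionary. It is a simplified illustration, not the resolver torchtune uses: :code:`resolve` is a hypothetical helper and it only handles references to top-level keys.

```python
import re
from typing import Any, Dict

# Matches ${name} and captures the name between the braces.
_INTERPOLATION = re.compile(r"\$\{([^}]+)\}")


def resolve(value: Any, root: Dict[str, Any]) -> Any:
    """Recursively replace ${key} in strings with root[key] (top-level keys only)."""
    if isinstance(value, dict):
        return {key: resolve(item, root) for key, item in value.items()}
    if isinstance(value, str):
        return _INTERPOLATION.sub(lambda match: str(root[match.group(1)]), value)
    return value


cfg = {
    "output_dir": "/tmp/alpaca-llama2-finetune",
    "metric_logger": {
        "_component_": "torchtune.training.metric_logging.DiskLogger",
        "log_dir": "${output_dir}",
    },
}
resolved = resolve(cfg, cfg)
print(resolved["metric_logger"]["log_dir"])  # /tmp/alpaca-llama2-finetune
```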

Validating your config

We provide a convenient CLI utility, :ref:`tune validate<validate_cli_label>`, to quickly verify that your config is well-formed and all components can be instantiated properly. You can also pass in overrides if you want to test out the exact commands you will run your experiments with. If any parameters are not well-formed, :ref:`tune validate<validate_cli_label>` will list out all the locations where an error was found.

tune cp llama2/7B_lora_single_device ./my_config.yaml
tune validate ./my_config.yaml
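This guide does not describe how tune validate performs its checks. As an illustration of one check a validator of this kind could make, the sketch below tests whether a set of config fields fits a callable's signature. Everything here is an assumption for illustration: :code:`find_signature_errors` and :code:`toy_dataset` are hypothetical, and :code:`bind_partial` is used so that positional arguments supplied later in the recipe (such as the tokenizer) are not reported as missing.

```python
import inspect
from typing import Any, Callable, Dict, List


def find_signature_errors(
    target: Callable[..., Any], kwargs: Dict[str, Any]
) -> List[str]:
    """Return a list of messages for kwargs that do not fit target's signature."""
    errors: List[str] = []
    try:
        # bind_partial tolerates missing required arguments but rejects unknown ones.
        inspect.signature(target).bind_partial(**kwargs)
    except TypeError as exc:
        errors.append(f"{target.__name__}: {exc}")
    return errors


def toy_dataset(tokenizer: str, train_on_input: bool = True, max_seq_len: int = 512):
    """Hypothetical stand-in for a dataset builder."""
    return None


ok = find_signature_errors(toy_dataset, {"train_on_input": False})
bad = find_signature_errors(toy_dataset, {"train_on_inputs": False})  # misspelled key
print(len(ok), len(bad))  # 0 1
```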

Best practices for writing configs

Let's discuss some guidelines for writing configs to get the most out of them.

Airtight configs

While it may be tempting to put as much as you can in the config to give you maximum flexibility in switching parameters for your experiments, we encourage you to only include fields in the config that will be used or instantiated in the recipe. This ensures full clarity on the options a recipe was run with and will make it significantly easier to debug.

# don't do this
alpaca_dataset:
  _component_: torchtune.datasets.alpaca_dataset
slimorca_dataset:
  ...

# do this
dataset:
  # change this in config or override when needed
  _component_: torchtune.datasets.alpaca_dataset

Use public APIs only

If a component you wish to specify in a config is located in a private file, use the public dotpath in your config. These components are typically exposed in their parent module's __init__.py file. This way, you can guarantee the stability of the API you are using in your config. No module in your component dotpath should begin with an underscore.

# don't do this
dataset:
  _component_: torchtune.datasets._alpaca.alpaca_dataset

# do this
dataset:
  _component_: torchtune.datasets.alpaca_dataset
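One way to apply this guideline mechanically is to check whether any segment of the dotpath starts with an underscore, which is the usual Python convention for private modules. The helper below, :code:`uses_private_path`, is a hypothetical name introduced for illustration and is not part of torchtune.

```python
def uses_private_path(dotpath: str) -> bool:
    """Return True if any segment of the dotpath starts with an underscore."""
    return any(segment.startswith("_") for segment in dotpath.split("."))


private = uses_private_path("torchtune.datasets._alpaca.alpaca_dataset")
public = uses_private_path("torchtune.datasets.alpaca_dataset")
print(private, public)  # True False
```

Note that an underscore inside a name, as in :code:`alpaca_dataset`, is fine; only a leading underscore marks a segment as private.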

Command-line overrides

Configs are the primary location to collect all your parameters to run a recipe, but sometimes you may want to quickly try different values without having to update the config itself. To enable quick experimentation, you can specify override values for parameters in your config via the tune command. These should be specified as key-value pairs: k1=v1 k2=v2 ...

For example, to run the :ref:`LoRA single-device finetuning <lora_finetune_recipe_label>` recipe with custom model and tokenizer directories, you can provide overrides:

tune run lora_finetune_single_device \
--config llama2/7B_lora_single_device \
checkpointer.checkpoint_dir=/home/my_model_checkpoint \
checkpointer.checkpoint_files=['file_1','file_2'] \
tokenizer.path=/home/my_tokenizer_path
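The dot notation in these overrides maps onto the nesting of the YAML config. The sketch below shows that mapping on a plain dictionary. It is an illustration under simplifying assumptions rather than torchtune's parser: :code:`apply_overrides` is a hypothetical helper, and values are kept as strings instead of being converted to lists, numbers, or booleans.

```python
from typing import Any, Dict, List


def apply_overrides(cfg: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply 'a.b.c=value' strings to a nested dict, creating levels as needed."""
    for item in overrides:
        key, _, value = item.partition("=")
        *parents, leaf = key.split(".")
        node = cfg
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value  # kept as a string in this sketch
    return cfg


cfg = {
    "checkpointer": {"checkpoint_dir": "/tmp/ckpt"},
    "tokenizer": {"path": "/tmp/tokenizer.model"},
}
apply_overrides(
    cfg,
    [
        "checkpointer.checkpoint_dir=/home/my_model_checkpoint",
        "tokenizer.path=/home/my_tokenizer_path",
    ],
)
print(cfg["tokenizer"]["path"])  # /home/my_tokenizer_path
```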

Overriding components

If you would like to override a class or function in the config that is instantiated via the _component_ field, you can do so by assigning to the parameter name directly. Any nested fields in the components can be overridden with dot notation.

dataset:
  _component_: torchtune.datasets.alpaca_dataset

# Change to slimorca_dataset and set train_on_input to True
tune run lora_finetune_single_device --config my_config.yaml \
dataset=torchtune.datasets.slimorca_dataset dataset.train_on_input=True
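A way to picture the effect of this command on the config is sketched below. It assumes that assigning to the parameter name replaces only the _component_ entry and leaves sibling fields in place; that behavior is an assumption of the sketch, and :code:`override_component` is a hypothetical helper rather than a torchtune function.

```python
from typing import Any, Dict


def override_component(cfg: Dict[str, Any], key: str, dotpath: str) -> Dict[str, Any]:
    """Point cfg[key]['_component_'] at a new dotpath, keeping other fields."""
    node = cfg.get(key)
    if isinstance(node, dict) and "_component_" in node:
        node["_component_"] = dotpath
    else:
        cfg[key] = dotpath
    return cfg


cfg = {"dataset": {"_component_": "torchtune.datasets.alpaca_dataset"}}

# Equivalent of: dataset=torchtune.datasets.slimorca_dataset dataset.train_on_input=True
override_component(cfg, "dataset", "torchtune.datasets.slimorca_dataset")
cfg["dataset"]["train_on_input"] = True
print(cfg["dataset"]["_component_"])  # torchtune.datasets.slimorca_dataset
```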

Removing config fields

You may need to remove certain parameters from the config when you use overrides to switch to a component that requires different keyword arguments. You can do so by using the ~ flag and specifying the dotpath of the config field you would like to remove. For example, if you want to override a built-in config and use the bitsandbytes.optim.PagedAdamW8bit optimizer, you may need to delete parameters like foreach which are specific to PyTorch optimizers. Note that this example requires that you have bitsandbytes installed.

# In configs/llama3/8B_full.yaml
optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5
  foreach: False

# Change to PagedAdamW8bit and remove foreach
tune run --nproc_per_node 4 full_finetune_distributed --config llama3/8B_full \
optimizer=bitsandbytes.optim.PagedAdamW8bit ~optimizer.foreach
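The effect of the ~ flag on the config can be sketched on a plain dictionary as follows. This is an illustration only: :code:`remove_field` is a hypothetical helper, not torchtune's implementation, and the dictionary mirrors the optimizer section shown above.

```python
from typing import Any, Dict


def remove_field(cfg: Dict[str, Any], token: str) -> Dict[str, Any]:
    """Delete the field named by a '~a.b.c' token from a nested dict."""
    dotpath = token.lstrip("~")
    *parents, leaf = dotpath.split(".")
    node = cfg
    for name in parents:
        node = node[name]
    del node[leaf]
    return cfg


cfg = {
    "optimizer": {
        "_component_": "torch.optim.AdamW",
        "lr": 2e-5,
        "foreach": False,
    }
}

# Swap the component, then drop the field the new optimizer may not accept.
cfg["optimizer"]["_component_"] = "bitsandbytes.optim.PagedAdamW8bit"
remove_field(cfg, "~optimizer.foreach")
print(sorted(cfg["optimizer"]))  # ['_component_', 'lr']
```

Fields that are not removed, such as :code:`lr`, are still passed to the new component.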