Add FL plan description to documentation #872

Merged
merged 3 commits into from
Aug 29, 2023
54 changes: 40 additions & 14 deletions docs/running_the_federation.rst
@@ -720,39 +720,65 @@ Federated Learning Plan (FL Plan) Settings
Use the Federated Learning plan (FL plan) to modify the federation workspace to your requirements in an **aggregator-based workflow**.


Before participants agree to take part in an experiment, everyone should know ahead of time both what code is going to run on their infrastructure and exactly what information on their system will be accessed. The federated learning (FL) plan captures all of the information needed to decide whether to participate in an experiment, in addition to the runtime details needed to load the code and make remote connections.
The FL plan is described by the **plan.yaml** file located in the **plan** directory of the workspace.


Each YAML top-level section contains the following subsections:

- ``template``: The name of the class, including top-level package names. An instance of this class is created when the plan is initialized.
- ``settings``: The arguments that are passed to the class constructor.
- ``defaults``: The file that contains default settings for this subsection.
  Any setting from the defaults file can be overridden in the **plan.yaml** file.
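
For orientation, the sketch below shows the generic shape of a single top-level section; the section name, class path, and values are placeholders rather than real defaults.

.. code-block:: yaml

    # Illustrative shape of one plan.yaml section (names and values are placeholders)
    some_component:
      defaults: plan/defaults/some_component.yaml   # file with default settings for this section
      template: openfl.component.SomeComponent      # class instantiated when the plan is initialized
      settings:
        some_argument: some_value                   # passed to the class constructor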

The following is an example of a **plan.yaml**:

.. literalinclude:: ../openfl-workspace/torch_cnn_mnist/plan/plan.yaml
:language: yaml


Configurable Settings
^^^^^^^^^^^^^^^^^^^^^

- :class:`Aggregator <openfl.component.Aggregator>`
`openfl.component.Aggregator <https://github.com/intel/openfl/blob/develop/openfl/component/aggregator/aggregator.py>`_
Defines the settings for the aggregator, which is the model owner in the experiment. While models can be trained from scratch, in many cases the federation performs fine-tuning of a previously trained model. For this reason, pre-trained weights for the model are stored in protobuf files on the aggregator node and passed to collaborator nodes during initialization. The settings for the aggregator include the following (an illustrative snippet appears after this list):

- :code:`init_state_path`: (str:path) Defines the weight protobuf file path from which the experiment's initial weights will be loaded. These weights are generated with the :code:`fx plan initialize` command.
- :code:`best_state_path`: (str:path) Defines the weight protobuf file path where the highest-accuracy model found during the experiment will be saved.
- :code:`last_state_path`: (str:path) Defines the weight protobuf file path where the model from the last completed round of the experiment will be saved.
- :code:`rounds_to_train`: (int) Specifies the number of rounds in a federation. A federated learning round is defined as one complete iteration in which the collaborators train the model and send the updated model weights back to the aggregator to form a new global model. Within a round, collaborators can train the model for multiple iterations, called epochs.
- :code:`write_logs`: (boolean) Enables the metric logging callback. By default, logging is done through `tensorboard <https://www.tensorflow.org/tensorboard/get_started>`_, but users can also provide a custom metric logging function for each task.
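
A minimal sketch of an ``aggregator`` section with these settings, assuming the top-level key matches the component name; the file paths and values are illustrative, not shipped defaults:

.. code-block:: yaml

    # Illustrative aggregator section (paths and values are placeholders)
    aggregator:
      defaults: plan/defaults/aggregator.yaml
      template: openfl.component.Aggregator
      settings:
        init_state_path: save/init.pbuf   # produced by `fx plan initialize`
        best_state_path: save/best.pbuf   # highest-accuracy model
        last_state_path: save/last.pbuf   # model from the last completed round
        rounds_to_train: 10
        write_logs: true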


- :class:`Collaborator <openfl.component.Collaborator>`
`openfl.component.Collaborator <https://github.com/intel/openfl/blob/develop/openfl/component/collaborator/collaborator.py>`_
Defines the settings for the collaborator, which is the data owner in the experiment. The settings for the collaborator include the following (an illustrative snippet appears after this list):

- :code:`delta_updates`: (boolean) Determines whether the difference in model weights between the current and previous round is sent (True), or whether whole checkpoints are sent (False). Setting :code:`delta_updates` to True leads to higher sparsity in the model weights sent across, which may improve compression ratios.
- :code:`opt_treatment`: (str) Defines the optimizer state treatment policy. Valid options are: 'RESET' (reinitialize the optimizer every round; default), 'CONTINUE_LOCAL' (keep the local optimizer state between rounds), and 'CONTINUE_GLOBAL' (aggregate the optimizer state every round).
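
A minimal sketch of a ``collaborator`` section with these settings; the values are illustrative:

.. code-block:: yaml

    # Illustrative collaborator section (values are placeholders)
    collaborator:
      defaults: plan/defaults/collaborator.yaml
      template: openfl.component.Collaborator
      settings:
        delta_updates: false    # send whole checkpoints rather than weight deltas
        opt_treatment: RESET    # reinitialize the optimizer every round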


- :class:`Data Loader <openfl.federated.data.loader.DataLoader>`
`openfl.federated.data.loader.DataLoader <https://github.com/intel/openfl/blob/develop/openfl/federated/data/loader.py>`_
Defines the data loader class that provides access to the local dataset. It implements a train loader and a validation loader that take in the train dataset and the validation dataset, respectively. The settings for the data loader include the following (an illustrative snippet appears after this list):

- :code:`collaborator_count`: (int) The number of collaborators participating in the federation
- :code:`data_group_name`: (str) The name of the dataset
- :code:`batch_size`: (int) The size of the training or validation batch
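
A minimal sketch of a ``data_loader`` section with these settings; the template class is a hypothetical workspace-local loader and the values are illustrative:

.. code-block:: yaml

    # Illustrative data_loader section (template class and values are placeholders)
    data_loader:
      defaults: plan/defaults/data_loader.yaml
      template: src.dataloader.MyDataLoader   # hypothetical class derived from DataLoader
      settings:
        collaborator_count: 2
        data_group_name: mnist
        batch_size: 256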


- :class:`Task Runner <openfl.federated.task.runner.TaskRunner>`
`openfl.federated.task.runner.TaskRunner <https://github.com/intel/openfl/blob/develop/openfl/federated/task/runner.py>`_
Defines the model, the training/validation functions, and how to extract and set tensors from the model weights and the optimizer dictionary. Depending on the AI framework in use (for example, PyTorch or TensorFlow), users can select among pre-defined task runners.
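
A minimal sketch of a ``task_runner`` section; the template class and the hyperparameter shown are hypothetical examples:

.. code-block:: yaml

    # Illustrative task_runner section (template class and settings are placeholders)
    task_runner:
      defaults: plan/defaults/task_runner.yaml
      template: src.taskrunner.MyTaskRunner   # hypothetical class derived from TaskRunner
      settings:
        learning_rate: 0.001                  # example framework-specific hyperparameter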


- :class:`Assigner <openfl.component.Assigner>`
`openfl.component.Assigner <https://github.com/intel/openfl/blob/develop/openfl/component/assigner/assigner.py>`_
Defines the tasks that are sent to the collaborators from the aggregator. There are three default tasks that can be given to each collaborator (an illustrative snippet appears after this list):

- :code:`aggregated_model_validation`: (str) Perform validation on the aggregated global model sent by the aggregator.
- :code:`train`: (str) Perform training on the global model.
- :code:`locally_tuned_model_validation`: (str) Perform validation on the model that was locally trained by the collaborator.
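
A minimal sketch of an ``assigner`` section listing the three default tasks; the ``task_groups`` layout shown here is an assumption and should be checked against the plan.yaml shipped with the workspace template:

.. code-block:: yaml

    # Illustrative assigner section (task_groups layout is an assumption, not a shipped default)
    assigner:
      defaults: plan/defaults/assigner.yaml
      template: openfl.component.Assigner
      settings:
        task_groups:
          - name: train_and_validate
            percentage: 1.0
            tasks:
              - aggregated_model_validation
              - train
              - locally_tuned_model_validation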


Tasks