
Saving and loading the trained model after the end of a federated experiment #1139

Open
enrico310786 opened this issue Nov 12, 2024 · 6 comments
Labels
question Further information is requested

Comments

@enrico310786

Hi,

I ran some federated learning experiments using the PyTorch_TinyImageNet tutorial at this link: https://github.com/securefederatedai/openfl/tree/develop/openfl-tutorials/interactive_api/PyTorch_TinyImageNet

Everything works: I have one director and two envoys. The director runs on one server and the two envoys run on two different servers. During training I can see the accuracy growing.

My questions are:

  1. Where are the best model and its weights saved?
  2. In which format?
  3. How can I load the model and use it in inference mode?

I noticed that once the experiment has ended, there is a file called "model_obj.pkl" in the workspace folder. I load the file with

import pickle

# Load the serialized ModelInterface from the workspace folder
path_model_pkl = "model_obj.pkl"
with open(path_model_pkl, 'rb') as f:
    model_interface = pickle.load(f)
model = model_interface.model

but if I apply this model to the images of the test set of one of the two envoys, I do not obtain results compatible with a trained model. So I think this is not the best trained model. Where is it stored at the end of the experiment?
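
For reference, a minimal sketch of how I run inference with the loaded model (the input below is only a placeholder for a preprocessed test image):

import torch

# Put the unpickled model in evaluation mode and run a forward pass
# without tracking gradients.
model.eval()
with torch.no_grad():
    dummy_batch = torch.randn(1, 3, 64, 64)  # placeholder for a preprocessed TinyImageNet batch
    predictions = model(dummy_batch)
print(predictions.shape)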

Thanks

@kta-intel
Collaborator

Hi @enrico310786 !

Short answer:
You can access the best model with:

best_model = fl_experiment.get_best_model()
best_model.state_dict()

Then save it in its native torch format (e.g. .pt or .pth) and use it for inference as you normally would.
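
A minimal sketch of that save/load round trip (the file name is arbitrary, and build_model() is just a placeholder for constructing the same architecture you trained with):

import torch

# Retrieve the best model from the experiment and save its weights
# in native PyTorch format.
best_model = fl_experiment.get_best_model()
torch.save(best_model.state_dict(), "best_model.pth")

# Later, for inference: rebuild the same architecture and load the weights.
model = build_model()  # placeholder: construct the same nn.Module used for training
model.load_state_dict(torch.load("best_model.pth"))
model.eval()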

Long answer:
OpenFL's Interactive API is actually being deprecated. There are active efforts to consolidate our API, and while the director/envoy concept will likely still exist in some capacity, for now it is recommended that you use either the Task Runner API (quickstart), where the model is saved as a .pbuf in your workspace and can be converted to its native format with fx model save, or the Workflow API (quickstart), which gives you the flexibility to define how you want to handle your model at each stage.

@kta-intel added the question (Further information is requested) label on Nov 12, 2024
@enrico310786
Author

enrico310786 commented Nov 13, 2024

Hi @kta-intel,

thank you for your answer. Regarding the long answer, I will try to move to the Task Runner API or the Workflow API. Regarding the short answer, I noted the following facts:

  1. I defined the model architecture for a regression task as
import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(13, 150)
        self.fc2 = nn.Linear(150, 50)
        self.fc3 = nn.Linear(50, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model_net = SimpleNN()

and the ModelInterface with

framework_adapter = 'openfl.plugins.frameworks_adapters.pytorch_adapter.FrameworkAdapterPlugin'
model_interface = ModelInterface(model=model_net, optimizer=optimizer, framework_plugin=framework_adapter)

I also defined an initial_model object as

from copy import deepcopy
initial_model = deepcopy(model_net)

and printing the initial_model weights I obtain

tensor([[-0.0021, 0.1488, -0.2283, ..., -0.0838, -0.0545, -0.2650],
[-0.1837, -0.1143, 0.0103, ..., 0.2075, -0.0447, 0.0293],
[ 0.2511, -0.2573, -0.1746, ..., -0.1619, 0.2384, 0.1238],
...,
[-0.2398, 0.2194, -0.1492, ..., -0.1561, -0.0217, 0.2169],
[ 0.0238, 0.1927, -0.0021, ..., 0.1863, 0.0120, 0.1169],
[ 0.1160, -0.2394, -0.2438, ..., 0.2573, 0.2502, -0.1769]]).....

  2. Once the experiment is finished, printing the weights of fl_experiment.get_best_model() gives me the same weights as the initial_model. Shouldn't the weights be different, since the model is now trained? Furthermore, if I use fl_experiment.get_best_model() on the test set of one envoy, I obtain bad results (high MSE and low R^2). All these facts indicate to me that fl_experiment.get_best_model() is not the best trained model.

  3. If I use fl_experiment.get_last_model() instead, the weights are now different from those of the initial_model
    tensor([[ 2.7470, 1.7603, 0.4984, ..., 0.4626, -2.6358, -1.7808],
    [ 2.5191, -0.3715, 2.2026, ..., -0.9344, -0.8067, -0.1721],
    [ 0.7177, -0.7920, 0.3306, ..., -1.0026, -1.1008, -0.2933],
    ...,
    [ 1.1725, -0.4773, 1.3435, ..., -1.2107, -0.7849, -0.0271],
    [ 2.8732, 0.3654, 1.6125, ..., -0.7965, -0.7755, -0.0415],
    [ 4.6579, 0.2192, -0.2842, ..., -1.1465, -1.3399, -0.7404]])....

and applying fl_experiment.get_last_model() to the test set of one envoy, I obtain good results (low MSE and high R^2). But I think that fl_experiment.get_last_model() is just the latest model from the final round, not the best one.

Why does fl_experiment.get_best_model() give me the initial_model weights and not those of the best one?
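
For completeness, this is roughly how I make the comparison explicit; the helper below simply checks whether every parameter tensor of two models is identical:

import torch

# Compare two models parameter by parameter; True only if every tensor matches.
def same_weights(model_a, model_b):
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    return all(torch.equal(state_a[key], state_b[key]) for key in state_a)

print(same_weights(initial_model, fl_experiment.get_best_model()))  # True in my runs
print(same_weights(initial_model, fl_experiment.get_last_model()))  # False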

Thanks again,
Enrico

@kta-intel
Collaborator

Thanks for the investigation.

Hmm, this might be a bug (or at the very least, insufficient checkpointing logic). My suspicion is that the Interactive API backend uses a loss criterion on the validation step to check for the best model, but since the validate function measures accuracy, it marks the higher value as worse and never saves it. On the other hand, as you said, .get_last_model() is just the running set of weights from the latest round, so the training is reflected there, albeit not the best state.

This is actually more of an educated guess based on similar issues in the past. I have to dive a bit deeper to confirm, though.
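
To illustrate the suspicion with a simplified, purely hypothetical sketch (this is not the actual OpenFL code): if the selection logic compares lower-is-better, as for a loss, but receives an accuracy-like metric that keeps increasing, the initial checkpoint is never replaced:

# Hypothetical illustration of the suspected behavior, NOT the actual OpenFL code.
initial_metric = 0.10                      # accuracy of the untrained model
round_metrics = [0.35, 0.52, 0.68, 0.74]   # accuracy after each training round

best_metric = initial_metric
best_round = 0                             # 0 == initial weights

for round_number, metric in enumerate(round_metrics, start=1):
    if metric < best_metric:               # correct for a loss, wrong for accuracy
        best_metric = metric
        best_round = round_number

print(best_round)  # -> 0: the "best" model is still the initial one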

@teoparvanov
Collaborator

Hi @enrico310786, as @kta-intel mentioned earlier, we are in the process of deprecating Interactive API. Is the issue also reproducible with Task Runner API, or Workflow API?

@enrico310786
Author

Hi @teoparvanov, I don't know. So far I just used the Interactive API framework.

@teoparvanov
Collaborator

teoparvanov commented Nov 18, 2024

@enrico310786, here are a couple of resources to get you started with the main OpenFL APIs:

  • Task Runner API: check out this blog post that walks you through the process of creating a federation from a "central" ML model.
  • Workflow API: read the docs, and try out the 101 tutorial

Please keep us posted on how this is going. The Slack #onboarding channel is a good place to get additional support.
