First pass at optionally shutting down server (and remaining clients) on failures #280
Conversation
…nt of none values.
Looks good, a couple questions but other than that I think this is a much needed feature
def _terminate_after_unacceptable_failures(self, timeout: Optional[float]) -> None:
    assert not self.accept_failures
    # First we shutdown all clients involved in the FL training/evaluation if they can be.
    self.disconnect_all_clients(timeout=timeout)
Just curious, do servers have their own shutdown methods that should be called here before raising an exception?
From what I can tell they don't. In the base server code they just send out this signal and let the program finish as the termination. However, it is worth calling our own shutdown method before throwing the exception. So I'll add that.
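Roughly, the termination path being discussed might look like the sketch below. This is illustrative only: the shutdown stub, exception type, and message are assumptions based on this thread, not the actual implementation.

# A rough sketch of the termination path agreed on in this thread, not the
# actual implementation: disconnect clients, run the server's own cleanup,
# then raise so the process exits with an error.
from typing import Optional


class FailureIntolerantServerSketch:
    accept_failures: bool = False

    def disconnect_all_clients(self, timeout: Optional[float]) -> None:
        """Send disconnect/shutdown instructions to all remaining clients."""

    def shutdown(self) -> None:
        """Run server-side cleanup (e.g. reporters, checkpointers). Assumed name."""

    def _terminate_after_unacceptable_failures(self, timeout: Optional[float]) -> None:
        assert not self.accept_failures
        # First shut down all clients involved in FL training/evaluation, if possible.
        self.disconnect_all_clients(timeout=timeout)
        # Then run the server's own shutdown before terminating with an error.
        self.shutdown()
        raise RuntimeError("Client failures encountered and accept_failures is False. Terminating server.")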
reporters,
checkpointer,
server_name=server_name,
client_manager, strategy, reporters, checkpointer, server_name=server_name, accept_failures=accept_failures
Will this affect how this server should do checkpointing? Currently it saves the server state after the evaluate round. So if the fit round has no failures, but the evaluate round does, we lose the parameters from the successful fit round. Is this the behaviour we want, or should we move the per-round checkpointing to occur after the fit round but before the evaluate round?
That's a good question. My initial thought is that it's alright if we lose a single FL round due to failures on the evaluation side. My reasoning would be that even though we finished a train cycle, we never actually got to see what the "results" of that cycle and aggregation were for the clients. So re-starting from those trained parameters is sort of "blind." That is, we don't "know" how those parameters actually did in that round. Just my initial thoughts.
Yeah, that makes sense. I was thinking the same thing, which is why I was unsure. I think John is the one who primarily uses the checkpointing server, so maybe he'll have an opinion.
I'm fine with whatever though
Sounds good. We'll see what @jewelltaylor thinks 🙂
Agree with you two, we should only consider the round complete after evaluation
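Schematically, the agreed ordering looks like the sketch below; the method and attribute names here are illustrative stand-ins rather than the repo's actual API, and only the checkpoint-after-evaluation placement is the point.

# Schematic only: a round counts as complete, and is checkpointed, only after
# evaluation succeeds. Names below (fit_round, evaluate_round, checkpointer,
# save_server_state) are illustrative stand-ins for the actual methods.
def run_round_sketch(server, server_round, timeout=None):
    fit_result = server.fit_round(server_round, timeout)
    if fit_result is None and not server.accept_failures:
        server._terminate_after_unacceptable_failures(timeout)

    evaluate_result = server.evaluate_round(server_round, timeout)
    if evaluate_result is None and not server.accept_failures:
        server._terminate_after_unacceptable_failures(timeout)

    # Checkpoint only here, so saved state always corresponds to parameters
    # whose evaluation results have actually been observed.
    server.checkpointer.save_server_state(server_round)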
if isinstance(failure, BaseException):
    log(
        ERROR,
        "An exception was returned instead of any failed results. As such the client ID is unknown. "
This might be a bit of overkill, so feel free to ignore this suggestion. But could you not technically determine the client IDs of the clients that returned exceptions by process of elimination? If you have multiple exceptions you won't know which exception goes with which client, but you could at least log a list of clients that failed.
Yup. That's a fair point. I'll log a list of clients that failed. We won't know which corresponds to which exception, but it's reasonable to at least gather them.
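For reference, the process-of-elimination idea might look like the hypothetical helper below. It assumes the client manager's membership is stable for the round, which the next comment explains is not actually the case.

# Hypothetical helper for the process-of-elimination idea: any client known
# to the client manager that has no successful result this round is presumed
# to have failed.
from typing import List


def infer_failed_client_ids(known_cids: List[str], successful_cids: List[str]) -> List[str]:
    return sorted(set(known_cids) - set(successful_cids))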
So it turns out this is harder than it seems at first. Basically, if a client returns only an exception, it is immediately removed from the client manager, and clients aren't necessarily connected to the server yet when the first fit_round starts (for example). So the representation of these CIDs in the client manager is a bit asynchronous, which makes this reporting really difficult. I'm going to punt on it for now and we can try to be more verbose later.
@@ -663,7 +708,7 @@ def _get_initial_parameters(self, server_round: int, timeout: Optional[float]) -
    log(INFO, "Received initial parameters from one random client")
else:
    log(
        WARN,
Moved to WARNING so we can drop a duplicate import.
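In other words, the call now uses the standard library's WARNING level directly; a small sketch is below, with an illustrative warning message rather than the actual text.

# Sketch of the logging change: use logging.WARNING directly so a separate
# WARN import is no longer needed. The warning message here is illustrative.
from logging import INFO, WARNING

from flwr.common.logger import log

log(INFO, "Received initial parameters from one random client")
log(WARNING, "Failed to receive initial parameters from the client. Empty initial parameters will be used.")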
LGTM! It will be nice to not have to deal with the zombie processes :)
PR Type
Feature/Fix
Short Description
Clickup Ticket(s): N/A
There is an odd default behavior in Flower servers wherein, if all clients die due to an exception, for example, the server process essentially becomes a "zombie." That is, in spite of receiving failure messages from clients, its process persists indefinitely, waiting for client messages to continue training.
This can be quite troublesome for debugging, testing, or just experimentation because the persisting server can interfere with future runs if not manually killed (via htop or other process monitor). This is because it will still be listening on the default port and may intercept messages from clients that are spun up in a new process kicked off by the user.
At first blush, we might use the server config timeout to help address this issue. While this is possible, it means that you need a fairly reasonable estimate of how long training will take on the client side, which a user may not have. So I implemented an option in our base server to at least say that the server doesn't tolerate failures in the FL process. If a failure is received, the server will send kill signals to any remaining clients and terminate its own process with an error.
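Concretely, the flag threads through the server constructor along these lines. This is a hedged sketch based on the arguments visible in this PR's diff; the class name below is hypothetical, not the repo's actual server class.

# A hedged sketch of how the new flag could be threaded through a server
# subclass, mirroring the constructor call shown in this PR. Only the
# accept_failures argument is the point; the class name is hypothetical.
from typing import Optional

from flwr.server import Server
from flwr.server.client_manager import ClientManager
from flwr.server.strategy import Strategy


class FailureIntolerantServer(Server):
    def __init__(
        self,
        client_manager: ClientManager,
        strategy: Optional[Strategy] = None,
        accept_failures: bool = True,
    ) -> None:
        super().__init__(client_manager=client_manager, strategy=strategy)
        # When False, any client failure causes the server to disconnect the
        # remaining clients and terminate its own process with an error.
        self.accept_failures = accept_failures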
Note: I didn't disable accept_failures in our examples. However, if we think it makes sense to do so, at least in the places where we are training with 100% of available clients, I'm happy to add that option.
Tests Added
I tested this on our Basic Example by throwing an exception right at the start of training on each client. See attachment for what it looks like on the server side.
server.txt
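For reference, the failure can be simulated with a client like the one below. This is an illustrative standalone client, not the actual Basic Example code used for the test.

# Illustrative only: a NumPyClient that fails immediately at the start of
# training, used to exercise the accept_failures=False termination path.
import flwr as fl


class AlwaysFailingClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        return []

    def fit(self, parameters, config):
        # Fail right at the start of training.
        raise RuntimeError("Simulated client failure at the start of training.")

    def evaluate(self, parameters, config):
        return 0.0, 0, {}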