Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First pass at optionally shutting down server (and remaining clients) on failures #280

Merged
merged 5 commits into from
Nov 11, 2024

Conversation

emersodb
Copy link
Collaborator

@emersodb emersodb commented Nov 8, 2024

PR Type

Feature/Fix

Short Description

Clickup Ticket(s): N/A

There is an odd default behavior in flower servers wherein, if all clients die due to an exception for example, the server process essentially becomes a "zombie." That is, in spite of receiving failure messages from clients, its process persists indefinitely waiting for client messages to continue training.

This can be quite troublesome for debugging, testing, or just experimentation because the persisting server can interfere with future runs if not manually killed (via htop or other process monitor). This is because it will still be listening on the default port and may intercept messages from clients that are spun up in a new process kicked off by the user.

At first blush, we might use the server config timeout to help address this issue. While this is possible, it means that you need to have a fairly reasonable estimate of how long training will take on the client sides, which a user may not have. So I implemented an option in our base server to at least say that the server doesn't tolerate failures in the FL process. If a failure is received, the server will send kill signals to any remaining clients and terminate its processes with an error.

Note: I didn't disable accept_failures in our examples. However, if we think it makes sense to do so, at least in the places where we are training with 100% of available clients, I'm happy to add that option.

Tests Added

I tested this on our Basic Example by throwing an exception right at the start of training on each client. See attachment for what it looks like on the server side.
server.txt

Copy link
Collaborator

@scarere scarere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, a couple questions but other than that I think this is a much needed feature

fl4health/server/base_server.py Outdated Show resolved Hide resolved
def _terminate_after_unacceptable_failures(self, timeout: Optional[float]) -> None:
assert not self.accept_failures
# First we shutdown all clients involved in the FL training/evaluation if they can be.
self.disconnect_all_clients(timeout=timeout)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, do servers have their own shutdown methods that should be called here before raising an exception?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell they don't. In the base server code they just send out this signal and let the program finish as the termination. However, it is worth calling our own shutdown method before throwing the exception. So I'll add that.

reporters,
checkpointer,
server_name=server_name,
client_manager, strategy, reporters, checkpointer, server_name=server_name, accept_failures=accept_failures
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this affect how this server should do checkpointing? Currently it saves the server state after the evaluate round. So if the fit round has no failures, but the evaluate round does, we lose the parameters after the successful fit round. Is this the behaviour we want or should we move the per round checkpointing to occer after the fit round but before the evaluate round?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. My initial thought is that it's alright if we lose a single FL round due to failures on the evaluation side. My reasoning would be that even though we finished a train cycle, we never actually got to see what the "results" of that cycle and aggregation were for the clients. So re-starting from those trained parameters is sort of "blind." That is, we don't "know" how those parameters actually did in that round. Just my initial thoughts.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes sense. I was thinking the same thing which is why I was unsure. I think John is the one who primarily uses the checkpointing server so maybe he'll have an opinion

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with whatever though

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. We'll see what @jewelltaylor thinks 🙂

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with you two, we should only consider the round complete after evaluation

fl4health/server/scaffold_server.py Show resolved Hide resolved
if isinstance(failure, BaseException):
log(
ERROR,
"An exception was returned instead of any failed results. As such the client ID is unknown. "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a bit overkill so feel free to ignore this suggestion. But could you not technically determine the client ids of the clients that returned exceptions by process of elimination? If you have multiple exceptions you won't know which exception goes with which client, but you could at least log a list of clients that failed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. That's a fair point. I'll log a list of clients that failed. We won't know which corresponds to which exception, but it's reasonable to at least gather them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it turns out this is harder than it seems at first. Basically, if they return only an exception, they are immediately removed from the client manager and they aren't necessarily connected to the server yet when the first fit_round starts (for example). So the representation of these CIDs in the client manager is a bit asynchronous, which makes these reporting really difficult. I'm going to just punt on it for now and we can try to be more verbose later.

@emersodb emersodb requested a review from scarere November 11, 2024 17:09
@@ -663,7 +708,7 @@ def _get_initial_parameters(self, server_round: int, timeout: Optional[float]) -
log(INFO, "Received initial parameters from one random client")
else:
log(
WARN,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to WARNING so we can drop a duplicate import.

Copy link
Collaborator

@jewelltaylor jewelltaylor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! It will be nice to not have to deal the zombie processes :)

@emersodb emersodb merged commit 986350c into main Nov 11, 2024
6 checks passed
@emersodb emersodb deleted the dbe/server_shutdowns_on_client_failures branch November 11, 2024 17:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants