First pass at optionally shutting down server (and remaining clients) on failures #280
Conversation
…nt of none values.
Looks good, a couple questions but other than that I think this is a much needed feature
def _terminate_after_unacceptable_failures(self, timeout: Optional[float]) -> None:
    assert not self.accept_failures
    # First we shutdown all clients involved in the FL training/evaluation if they can be.
    self.disconnect_all_clients(timeout=timeout)
Just curious, do servers have their own shutdown methods that should be called here before raising an exception?
From what I can tell they don't. In the base server code they just send out this signal and let the program finish as the termination. However, it is worth calling our own shutdown method before throwing the exception. So I'll add that.
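Roughly, the termination path being discussed might look like the sketch below. This is illustrative only: the shutdown stub, exception type, and message are assumptions based on this thread, not the actual implementation.

# A rough sketch of the termination path agreed on in this thread, not the
# actual implementation: disconnect clients, run the server's own cleanup,
# then raise so the process exits with an error.
from typing import Optional


class FailureIntolerantServerSketch:
    accept_failures: bool = False

    def disconnect_all_clients(self, timeout: Optional[float]) -> None:
        """Send disconnect/shutdown instructions to all remaining clients."""

    def shutdown(self) -> None:
        """Run server-side cleanup (e.g. reporters, checkpointers). Assumed name."""

    def _terminate_after_unacceptable_failures(self, timeout: Optional[float]) -> None:
        assert not self.accept_failures
        # First shut down all clients involved in FL training/evaluation, if possible.
        self.disconnect_all_clients(timeout=timeout)
        # Then run the server's own shutdown before terminating with an error.
        self.shutdown()
        raise RuntimeError("Client failures encountered and accept_failures is False. Terminating server.")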
reporters,
checkpointer,
server_name=server_name,
client_manager, strategy, reporters, checkpointer, server_name=server_name, accept_failures=accept_failures
Will this affect how this server should do checkpointing? Currently it saves the server state after the evaluate round. So if the fit round has no failures, but the evaluate round does, we lose the parameters from the successful fit round. Is this the behaviour we want, or should we move the per-round checkpointing to occur after the fit round but before the evaluate round?
That's a good question. My initial thought is that it's alright if we lose a single FL round due to failures on the evaluation side. My reasoning would be that even though we finished a train cycle, we never actually got to see what the "results" of that cycle and aggregation were for the clients. So re-starting from those trained parameters is sort of "blind." That is, we don't "know" how those parameters actually did in that round. Just my initial thoughts.
Yeah, that makes sense. I was thinking the same thing, which is why I was unsure. I think John is the one who primarily uses the checkpointing server, so maybe he'll have an opinion.
I'm fine with whatever though
Sounds good. We'll see what @jewelltaylor thinks 🙂
Agree with you two, we should only consider the round complete after evaluation
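Schematically, the agreed ordering looks like the sketch below; the method and attribute names here are illustrative stand-ins rather than the repo's actual API, and only the checkpoint-after-evaluation placement is the point.

# Schematic only: a round counts as complete, and is checkpointed, only after
# evaluation succeeds. Names below (fit_round, evaluate_round, checkpointer,
# save_server_state) are illustrative stand-ins for the actual methods.
def run_round_sketch(server, server_round, timeout=None):
    fit_result = server.fit_round(server_round, timeout)
    if fit_result is None and not server.accept_failures:
        server._terminate_after_unacceptable_failures(timeout)

    evaluate_result = server.evaluate_round(server_round, timeout)
    if evaluate_result is None and not server.accept_failures:
        server._terminate_after_unacceptable_failures(timeout)

    # Checkpoint only here, so saved state always corresponds to parameters
    # whose evaluation results have actually been observed.
    server.checkpointer.save_server_state(server_round)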
if isinstance(failure, BaseException):
    log(
        ERROR,
        "An exception was returned instead of any failed results. As such the client ID is unknown. "
This might be a bit of overkill, so feel free to ignore this suggestion. But could you not technically determine the client IDs of the clients that returned exceptions by process of elimination? If you have multiple exceptions you won't know which exception goes with which client, but you could at least log a list of clients that failed.
Yup. That's a fair point. I'll log a list of clients that failed. We won't know which corresponds to which exception, but it's reasonable to at least gather them.
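For reference, the process-of-elimination idea might look like the hypothetical helper below. It assumes the client manager's membership is stable for the round, which the next comment explains is not actually the case.

# Hypothetical helper for the process-of-elimination idea: any client known
# to the client manager that has no successful result this round is presumed
# to have failed.
from typing import List


def infer_failed_client_ids(known_cids: List[str], successful_cids: List[str]) -> List[str]:
    return sorted(set(known_cids) - set(successful_cids))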
So it turns out this is harder than it seems at first. Basically, if a client returns only an exception, it is immediately removed from the client manager, and clients aren't necessarily connected to the server yet when the first fit_round starts (for example). So the representation of these CIDs in the client manager is a bit asynchronous, which makes this reporting really difficult. I'm going to punt on it for now and we can try to be more verbose later.
@@ -663,7 +708,7 @@ def _get_initial_parameters(self, server_round: int, timeout: Optional[float]) -
    log(INFO, "Received initial parameters from one random client")
else:
    log(
        WARN,
Moved to WARNING so we can drop a duplicate import.
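In other words, the call now uses the standard library's WARNING level directly; a small sketch is below, with an illustrative warning message rather than the actual text.

# Sketch of the logging change: use logging.WARNING directly so a separate
# WARN import is no longer needed. The warning message here is illustrative.
from logging import INFO, WARNING

from flwr.common.logger import log

log(INFO, "Received initial parameters from one random client")
log(WARNING, "Failed to receive initial parameters from the client. Empty initial parameters will be used.")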
LGTM! It will be nice to not have to deal with the zombie processes :)
PR Type
Feature/Fix
Short Description
Clickup Ticket(s): N/A
There is an odd default behavior in Flower servers wherein, if all clients die due to an exception, for example, the server process essentially becomes a "zombie." That is, in spite of receiving failure messages from clients, its process persists indefinitely, waiting for client messages to continue training.
This can be quite troublesome for debugging, testing, or just experimentation because the persisting server can interfere with future runs if not manually killed (via htop or other process monitor). This is because it will still be listening on the default port and may intercept messages from clients that are spun up in a new process kicked off by the user.
At first blush, we might use the server config timeout to help address this issue. While this is possible, it means that you need a fairly reasonable estimate of how long training will take on the client side, which a user may not have. So I implemented an option in our base server to at least say that the server doesn't tolerate failures in the FL process. If a failure is received, the server will send kill signals to any remaining clients and terminate its own process with an error.
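Concretely, the flag threads through the server constructor along these lines. This is a hedged sketch based on the arguments visible in this PR's diff; the class name below is hypothetical, not the repo's actual server class.

# A hedged sketch of how the new flag could be threaded through a server
# subclass, mirroring the constructor call shown in this PR. Only the
# accept_failures argument is the point; the class name is hypothetical.
from typing import Optional

from flwr.server import Server
from flwr.server.client_manager import ClientManager
from flwr.server.strategy import Strategy


class FailureIntolerantServer(Server):
    def __init__(
        self,
        client_manager: ClientManager,
        strategy: Optional[Strategy] = None,
        accept_failures: bool = True,
    ) -> None:
        super().__init__(client_manager=client_manager, strategy=strategy)
        # When False, any client failure causes the server to disconnect the
        # remaining clients and terminate its own process with an error.
        self.accept_failures = accept_failures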
Note: I didn't disable accept_failures in our examples. However, if we think it makes sense to do so, at least in the places where we are training with 100% of available clients, I'm happy to add that option.
Tests Added
I tested this on our Basic Example by throwing an exception right at the start of training on each client. See attachment for what it looks like on the server side.
server.txt
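For reference, the failure can be simulated with a client like the one below. This is an illustrative standalone client, not the actual Basic Example code used for the test.

# Illustrative only: a NumPyClient that fails immediately at the start of
# training, used to exercise the accept_failures=False termination path.
import flwr as fl


class AlwaysFailingClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        return []

    def fit(self, parameters, config):
        # Fail right at the start of training.
        raise RuntimeError("Simulated client failure at the start of training.")

    def evaluate(self, parameters, config):
        return 0.0, 0, {}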