
LocalSGD / DiLoCo support #39

Open
d4l3k opened this issue Dec 13, 2024 · 2 comments
Labels: enhancement (New feature or request)

d4l3k commented Dec 13, 2024

This is a tracking issue for adding LocalSGD support to torchft. There has been interest in LocalSGD and it's something we'd like to support.

This should be fairly straightforward: we can run the Manager + quorum in an outer loop and only periodically allreduce a copy of the weights.

Something like:

manager = Manager(...)
model = ...
optimizer = ...
criterion = ...
dataloader_iter = ...
local_steps = ...

while True:
    # inner loop: normal local training steps
    for step in range(local_steps):
        inputs, labels = next(dataloader_iter)
        optimizer.zero_grad()
        criterion(model(inputs), labels).backward()
        optimizer.step()

    # update quorum and PGs (could overlap with the optimizer steps above)
    manager.step()

    # free gradient memory to make room for the averaged weights
    optimizer.zero_grad(set_to_none=True)

    # copy the model weights and start the allreduce mean
    # we need a temporary copy to gracefully handle failures
    params = {}
    for name, param in model.named_parameters():
        copy = param.detach().clone()
        manager.allreduce_grad(copy)
        params[name] = copy

    # this will wait for all transfers to complete successfully
    if manager.should_commit():
        with torch.no_grad():
            for name, param in model.named_parameters():
                param.copy_(params[name])
                del params[name]

DiLoCo should be a small modification of this algorithm: apply a separate outer optimizer to the averaged weights instead of just averaging them.
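
As a rough sketch of what that could look like (not a final design; allreduce_grad/should_commit are reused from the example above, and the lr/momentum values are the DiLoCo paper's outer-optimizer settings):

import torch

# copies of the weights as of the last committed outer step
outer_params = {n: p.detach().clone() for n, p in model.named_parameters()}
# the DiLoCo paper uses SGD with Nesterov momentum as the outer optimizer
outer_opt = torch.optim.SGD(outer_params.values(), lr=0.7, momentum=0.9, nesterov=True)

def diloco_outer_step(manager, model):
    # average the current local weights across the quorum
    averaged = {}
    for name, param in model.named_parameters():
        copy = param.detach().clone()
        manager.allreduce_grad(copy)
        averaged[name] = copy

    if not manager.should_commit():
        return  # something failed; keep the local weights and retry next round

    # pseudo-gradient: how far the averaged weights drifted from the outer weights
    for name, outer_p in outer_params.items():
        outer_p.grad = outer_p - averaged[name]
    outer_opt.step()
    outer_opt.zero_grad(set_to_none=True)

    # reset the local model to the new outer weights for the next inner loop
    with torch.no_grad():
        for name, param in model.named_parameters():
            param.copy_(outer_params[name])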

For efficiency we should probably use the DDP reducer on the parameters directly and copy the underlying Storage to make a backup copy.
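
For illustration, a rough sketch of that backup/rollback (cloning untyped_storage() is just one possible way to snapshot the raw bytes; this isn't an existing torchft helper):

# snapshot the raw storage of each parameter so the live tensors can be
# handed to the reducer / allreduced in place
backups = {
    name: param.untyped_storage().clone()
    for name, param in model.named_parameters()
}

# ... allreduce the parameters in place via the reducer ...

if not manager.should_commit():
    # roll back to the pre-allreduce weights on failure
    for name, param in model.named_parameters():
        param.untyped_storage().copy_(backups[name])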

References:

d4l3k added the enhancement (New feature or request) label Dec 13, 2024
d4l3k changed the title from LocalSGD support to LocalSGD / DiLoCo support Dec 13, 2024
d4l3k commented Dec 17, 2024

One additional point here is when we allow rejoining/recovering. Our current implementation is quite rigid, but with LocalSGD we may want more control over when we detect failing workers as well as when we allow them to recover, to avoid blocking.

https://pytorch.slack.com/archives/C083HHTCU06/p1734466793734549?thread_ts=1734049373.379299&cid=C083HHTCU06

I think for flexibility we should change when we increment the step count:

  1. Rename start_step to get_quorum and add a recover_allowed field to it. That would let us call it multiple times per step, with the first call allowing recovery and the second not.
  2. Move the step incrementing to either should_commit or to a new explicit step/commit method.

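A sketch of the resulting outer loop (get_quorum, recover_allowed, and the explicit commit() are the proposed changes here, not the current API):

while True:
    # first quorum call of the outer step: recovering workers may rejoin here
    manager.get_quorum(recover_allowed=True)

    for step in range(local_steps):
        ...  # inner/local optimizer steps as in the example above

    # second call: re-establish the quorum without allowing recovery, so the
    # weight averaging isn't blocked waiting for a worker to catch up
    manager.get_quorum(recover_allowed=False)

    # ... allreduce a copy of the weights as above ...

    if manager.should_commit():
        # step count is incremented here rather than in start_step/get_quorum
        manager.commit()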

d4l3k commented Dec 18, 2024

For the quorum we have a few options:

  1. Detect early failures and restart via heartbeats somehow.
  2. Add proper join/shrink-only support in quorum (adds complexity to the quorum algorithm).
  3. Add support for multiple quorums per lighthouse. That way we can do tick/tock behavior: workers always join the primary quorum, and the members of the secondary quorum are only healthy workers. This avoids complicating the quorum algorithm, though it will require some fallback logic if we get stuck in the secondary quorum (see the sketch below).
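
A very rough sketch of option 3 from a worker's point of view (the quorum_id argument and a multi-quorum lighthouse are hypothetical, not an existing torchft API):

def tick_tock_quorum(manager):
    # every worker, including recovering ones, always joins the primary quorum
    primary = manager.get_quorum(quorum_id="primary", recover_allowed=True)

    # only already-healthy workers join the secondary quorum, which is the one
    # actually used for the weight averaging
    secondary = manager.get_quorum(quorum_id="secondary", recover_allowed=False)

    if not secondary:
        # fallback if we get stuck in the secondary quorum: fall back to the
        # primary membership and allow recovery this round
        secondary = primary
    return secondary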
