FSDP + torchtitan support #43

d4l3k · 2024-12-16T20:38:19Z

This is a tracking issue for anything related to getting FSDP working with torchtitan.

fegin · 2024-12-18T16:59:15Z

Some thoughts on DeviceMesh:

init_device_mesh is the recommended way to create a DeviceMesh, especially an nD DeviceMesh. However, init_device_mesh can only be used to initialize a world mesh, meaning the mesh has to contain all the ranks. Otherwise, DeviceMesh's inference rules can be wrong and result in incorrect PG creation. More specifically, DeviceMesh uses get_rank() to determine which PG the current rank belongs to.
DeviceMesh.from_group() can be used to supply the PG information manually. This will be correct, but it is impractical to expect users to work out the PGs themselves for an nD DeviceMesh.
Combining init_device_mesh with DeviceMesh.from_group(), as extend_device_mesh() does, is likely to be wrong because init_device_mesh requires the mesh to be a world mesh.
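
To make the get_rank() concern concrete, here is a toy model in plain Python. It is not torch code and not the DeviceMesh implementation: build_mesh and groups_for_rank are hypothetical helpers, and the row-major layout of ranks 0..N-1 is an assumption made for illustration. It sketches how a mesh might locate a rank by its global id and derive the per-dimension groups, and what happens when the mesh does not contain that rank.

```python
from itertools import product


def build_mesh(shape, ranks=None):
    """Map each mesh coordinate to a rank, in row-major order.

    If `ranks` is omitted, assume ranks 0..N-1 (an assumption of this sketch).
    """
    total = 1
    for size in shape:
        total *= size
    ranks = list(range(total)) if ranks is None else list(ranks)
    coords = product(*(range(size) for size in shape))
    return dict(zip(coords, ranks))


def groups_for_rank(mesh, shape, rank):
    """Return, per mesh dimension, the ranks that share a group with `rank`.

    Returns None when `rank` is not in the mesh, i.e. no group can be inferred.
    """
    coord = next((c for c, r in mesh.items() if r == rank), None)
    if coord is None:
        return None
    groups = []
    for dim, size in enumerate(shape):
        # Vary only the coordinate along `dim`, keep the others fixed.
        members = [mesh[coord[:dim] + (i,) + coord[dim + 1:]] for i in range(size)]
        groups.append(members)
    return groups


# World of 8 ranks: 2 replicas x (2 x 2) inner mesh.
world_shape = (2, 2, 2)
world = build_mesh(world_shape)
world_groups = groups_for_rank(world, world_shape, rank=5)

# A per-replica (2 x 2) mesh built from default ranks 0..3 does not contain
# global rank 5, so nothing can be inferred for it.
local_shape = (2, 2)
local_default = build_mesh(local_shape)
local_default_groups = groups_for_rank(local_default, local_shape, rank=5)

# Supplying the replica's actual global ranks by hand gives groups that
# match the inner dimensions of the world mesh.
local_manual = build_mesh(local_shape, ranks=[4, 5, 6, 7])
local_manual_groups = groups_for_rank(local_manual, local_shape, rank=5)

print(world_groups)
print(local_default_groups)
print(local_manual_groups)
```

In this toy model, the manually specified mesh agrees with the world mesh on the inner dimensions, which mirrors the point that manual PG information can be correct but pushes the bookkeeping onto the user.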

The proposed solution is to have TorchFT provide ft_init_device_mesh and lie to DeviceMesh about the replicate dimension. However, it seems this will still be incorrect, because the other dimensions will still get incorrect PG information due to the use of get_rank().

d4l3k assigned fegin Dec 16, 2024
