FSDP + torchtitan support #43

Open
d4l3k opened this issue Dec 16, 2024 · 1 comment

@d4l3k
Member

d4l3k commented Dec 16, 2024

This is a tracking issue for anything related to getting FSDP working with torchtitan.

@fegin
Contributor

fegin commented Dec 18, 2024

Some thoughts on DeviceMesh:

  1. init_device_mesh is the recommended way to create a DeviceMesh, especially an nD DeviceMesh. However, init_device_mesh can only be used to initialize a world mesh, meaning that the mesh has to contain all the ranks. Otherwise, DeviceMesh's inference rules can be wrong and result in incorrect PG creation. More specifically, DeviceMesh uses get_rank() to determine which PG this rank belongs to.

  2. DeviceMesh.from_group() can be used to supply the PG information manually. This will be correct, but it is impractical to expect users to work out the PGs themselves for an nD DeviceMesh.

  3. Combining init_device_mesh with DeviceMesh.from_group(), as in extend_device_mesh(), is likely to be wrong because init_device_mesh requires the mesh to be a world mesh.
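To make the get_rank() concern in the points above concrete, here is a minimal sketch in plain Python. It is a simplified model of "find my rank in the mesh, then derive my group along each dimension", not DeviceMesh's actual code. The helper names (`coordinate_of`, `group_along_dim`) and the job layout (2 replica groups of 4 ranks, each wanting a 2x2 shard/tp mesh) are assumptions made for illustration.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple


def coordinate_of(
    rank: int, mesh_shape: Sequence[int], mesh_ranks: Sequence[int]
) -> Optional[Tuple[int, ...]]:
    """Locate `rank` in a row-major mesh; None means the rank is not in the mesh."""
    coords = product(*(range(n) for n in mesh_shape))
    for coord, member in zip(coords, mesh_ranks):
        if member == rank:
            return coord
    return None


def group_along_dim(
    rank: int, mesh_shape: Sequence[int], mesh_ranks: Sequence[int], dim: int
) -> Optional[List[int]]:
    """Ranks sharing every coordinate with `rank` except along `dim`."""
    mine = coordinate_of(rank, mesh_shape, mesh_ranks)
    if mine is None:
        return None  # the rank cannot be placed, so no group can be inferred
    coords = product(*(range(n) for n in mesh_shape))
    return [
        member
        for coord, member in zip(coords, mesh_ranks)
        if all(c == m for i, (c, m) in enumerate(zip(coord, mine)) if i != dim)
    ]


# Hypothetical job: 2 replica groups x 4 ranks; each replica wants (shard=2, tp=2).
SHAPE = (2, 2)
global_rank = 5  # replica 1, local rank 1

# Non-world mesh built from local ranks 0..3, but looked up by global rank:
local_view = group_along_dim(global_rank, SHAPE, [0, 1, 2, 3], dim=1)

# Same mesh spelled out with this replica's global ranks 4..7:
explicit_view = group_along_dim(global_rank, SHAPE, [4, 5, 6, 7], dim=1)

print(local_view)     # None
print(explicit_view)  # [4, 5]
```

In this model, a mesh that does not contain every global rank only works for the replica whose global ranks happen to match the mesh entries. Every other replica has to be given its global ranks explicitly, which is the manual bookkeeping described in point 2.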

The proposed solution is to have TorchFT provide ft_init_device_mesh and lie to DeviceMesh about the replicate dimension. However, it seems that this will still be incorrect, because the other dimensions will still get incorrect PG information due to the use of get_rank().
