Technical Challenges for Cross-Silo Communication when using gRPC #38

Open
gaow0007 opened this issue Dec 18, 2024 · 2 comments

Comments

@gaow0007

Hi, PrimeIntellect,
Thanks for sharing this awesome work. I am very interested in the cross-silo communication part of it. I noticed that this repo depends on Hivemind, which uses gRPC for communication. However, gRPC imposes a maximum message size. Furthermore, the training task will last for several months, and HTTP/2 might not be a good choice for keeping connections alive that long. Do you have any thoughts on how to address these technical challenges?
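
To make the size concern concrete: gRPC's default limit on a received message is 4 MiB, and it can be raised through the `grpc.max_receive_message_length` channel option. Another common workaround is to split a large buffer into chunks and stream them. The following is only a minimal sketch of that chunking idea; the helper names `split_into_chunks` and `reassemble` are hypothetical and are not part of gRPC or Hivemind.

```python
from typing import Iterable, Iterator

# Assumption: the default gRPC receive limit of 4 MiB. Real deployments may
# configure a different value, and framing overhead would reduce the usable size.
DEFAULT_LIMIT = 4 * 1024 * 1024


def split_into_chunks(payload: bytes, limit: int = DEFAULT_LIMIT) -> Iterator[bytes]:
    """Yield consecutive slices of `payload`, each at most `limit` bytes long."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    for start in range(0, len(payload), limit):
        yield payload[start:start + limit]


def reassemble(chunks: Iterable[bytes]) -> bytes:
    """Concatenate chunks, in order, back into the original payload."""
    return b"".join(chunks)
```

On the sending side each chunk would go out as one message of a streaming RPC, and the receiver would call `reassemble` once the stream ends.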

Looking forward to your reply.

Best,
Wei.

@Jackmin801
Member

Hey Wei,

We have moved most of our implementation efforts to the prime framework. Maybe you can build from there. We don't have the byte size limitation anymore, as we are using gloo & nccl (this might change in the future, but it should still support big params).

As for fault tolerance in training runs that take months, we don't think HTTP/2 is a good fit for collectives. Instead, we recreate the gloo / nccl process groups on failure. We have a working implementation in our repo using ElasticDeviceMesh, but we are currently working on a more robust implementation using torchft. This is still experimental, though; we hope to land it in the prime framework sometime in Jan.
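
The recreate-on-failure pattern can be sketched in plain Python. This is not the ElasticDeviceMesh or torchft code; `run_with_regroup`, `CollectiveError`, and the three callables are hypothetical stand-ins. In a PyTorch setting, the create and destroy hooks would roughly correspond to `torch.distributed.init_process_group` and `torch.distributed.destroy_process_group`.

```python
from typing import Callable, TypeVar

G = TypeVar("G")  # opaque handle for a process group
R = TypeVar("R")  # result type of the collective


class CollectiveError(RuntimeError):
    """Hypothetical stand-in for a failed collective (peer died, timeout, ...)."""


def run_with_regroup(
    create_group: Callable[[], G],
    destroy_group: Callable[[G], None],
    collective: Callable[[G], R],
    max_restarts: int = 3,
) -> R:
    """Run `collective`; on failure, tear the group down, rebuild it, and retry."""
    group = create_group()
    restarts = 0
    while True:
        try:
            return collective(group)
        except CollectiveError:
            if restarts >= max_restarts:
                raise  # give up once the restart budget is exhausted
            restarts += 1
            destroy_group(group)    # drop the broken group
            group = create_group()  # rendezvous again with the surviving peers
```

A real implementation would also need to agree on the new membership and resynchronize model state after a rebuild, which this sketch leaves out.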

Best,
Jackmin

@Jackmin801
Member

Oh also happy to hop on a call to learn more about your work :)
