Hi, PrimeIntellect,
Thanks for sharing your awesome work. I am very interested in the cross-silo communication part of this project. I noticed that this repo depends on Hivemind, which uses gRPC for communication. However, gRPC has a maximum message size limit. Furthermore, the training task will last for several months, and HTTP/2 might not be a good choice for keeping a connection alive that long. Do you have any thoughts on how to address these technical challenges?
Looking forward to your thoughts.
Best,
Wei.
We have moved most of our implementation efforts to the prime framework, so you may want to build from there. We don't have the byte size limitation anymore because we are using gloo and nccl (this might change in the future, but it should still support big params).
As for fault tolerance in training tasks that take months, we don't think HTTP/2 is a good idea for doing collectives. Instead, we recreate the gloo / nccl process groups on failure. We have a working implementation in our repo using ElasticDeviceMesh, and we are currently working on a more robust implementation using torchft. That work is still experimental; we hope to land it in the prime framework sometime in January.
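To make the "recreate the process group on failure" idea concrete, here is a minimal, library-independent sketch of the retry pattern. It is not the ElasticDeviceMesh or torchft implementation. `CollectiveError`, `run_with_rebuild`, and the three callables are hypothetical names for illustration; in a real setup `make_group` and `teardown` would presumably wrap something like `torch.distributed.init_process_group` and `torch.distributed.destroy_process_group`, but that wiring is an assumption and is not shown.

```python
from typing import Callable, TypeVar

G = TypeVar("G")  # whatever object represents the process group
T = TypeVar("T")  # result of one collective step


class CollectiveError(RuntimeError):
    """Hypothetical error standing in for a failed collective (e.g. a peer dropped out)."""


def run_with_rebuild(
    make_group: Callable[[], G],
    teardown: Callable[[G], None],
    step: Callable[[G], T],
    max_rebuilds: int = 3,
) -> T:
    """Run `step` on a group; on failure, tear the group down, rebuild it, and retry."""
    group = make_group()
    rebuilds = 0
    while True:
        try:
            return step(group)
        except CollectiveError:
            if rebuilds >= max_rebuilds:
                raise  # give up after the configured number of rebuilds
            teardown(group)       # drop the broken group
            group = make_group()  # form a fresh group, possibly with fewer peers
            rebuilds += 1
```

The design choice here is that the communication layer is treated as disposable: instead of keeping one connection alive for months, a failed group is discarded and a new one is formed, and the step is retried.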