hetero training mode #239
After reading the code, my understanding is that this is related to the number of GPUs on the server, but I'm not sure whether that's correct. My yaml config is as follows; could you please take a look? Thanks a lot!
Great, thank you so much! I hadn't fully understood the code before, but now it's clear!
The latest version will be released soon, and the README will be updated along with it. Stay tuned!
Sorry to bother you again, but I have a few more questions. Within hetero_process_meshes, the dp values must be equal and the pp values must sum to pipeline_model_parallel_size, but how should the tp values be set? Also, what is the relationship between the number of entries in hetero_device_types and the number of GPUs actually used? A sketch illustrating these constraints follows below.
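For illustration only, here is a hypothetical sketch of such a config. The per-mesh field order [tp, cp, ep, dp, pp], the flat-list layout, and the device-type names are assumptions rather than confirmed FlagScale syntax (the README is authoritative); the sketch only demonstrates the two constraints stated above, namely equal dp across meshes and pp values summing to pipeline_model_parallel_size:

```yaml
# Hypothetical sketch; assumes each mesh is listed as [tp, cp, ep, dp, pp].
pipeline_model_parallel_size: 4          # the pp values below must sum to this
enable_hetero: true
# Two meshes: dp is equal (2 == 2), pp sums to 4 (1 + 3), and tp may differ
# per mesh to suit each device type (mesh 0 uses 4*1*1*2*1 = 8 GPUs,
# mesh 1 uses 2*1*1*2*3 = 12 GPUs).
hetero_process_meshes: [4, 1, 1, 2, 1,   # mesh 0: tp=4, cp=1, ep=1, dp=2, pp=1
                        2, 1, 1, 2, 3]   # mesh 1: tp=2, cp=1, ep=1, dp=2, pp=3
hetero_device_types: ["A800", "BI-V150"] # one entry per mesh (names illustrative)
```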
How should the degree of hetero_process_meshes be set? Are there any rules for choosing the pp values?
@heavyrain-lzy Could you explain how the communication groups between heterogeneous pp stages are created, and what their send/receive logic is? Is there any documentation or diagram for the implementation of build_global_groups in flagscale/train/parallel_context.py?
Hello, I built the environment following the instructions on GitHub and ran the heterogeneous training code, but got the following message:
Please check the relevant configuration on the Transformer Engine (TE) website. You can also build the runtime environment from an NGC image.
I have built from both the 24.11 and 24.10 NGC images and ran into the following problem. Which version do you recommend building with?
Hello, I am also trying to reproduce your experimental results. After building from the NGC 24.11 and 24.12 images following the FlagScale instructions, I get an error saying the transformer_engine library has no pytorch module.
In hetero mode, is enable_hetero: True mandatory? Does the number of GPUs have to be greater than 4? Is a hostfile required? Looking forward to your patient replies! (A sketch of how these options might fit together follows below.)
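For what it's worth, a minimal hypothetical sketch of how the pieces this question touches might fit together. The key paths and the hostfile line format are assumptions based only on the option names mentioned in this thread (enable_hetero, hostfile), so treat the FlagScale README as authoritative:

```yaml
# Hypothetical minimal sketch; the key paths below are assumptions.
experiment:
  runner:
    hostfile: /path/to/hostfile    # assumed: lists the participating nodes
system:
  hetero:
    enable_hetero: true            # the thread suggests this switch is needed
# The hostfile itself (assumed format) would name each node, its GPU count,
# and its device type, e.g.:
#   nodeA slots=8 type=A800
#   nodeB slots=8 type=BI-V150
```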