Hetero training mode #239

drakethree3 · 2024-10-22T09:29:33Z

In hetero mode, is it mandatory to set enable_hetero: True? Does the number of GPUs have to be greater than 4? Is a hostfile required? Looking forward to your answers!

heavyrain-lzy · 2024-10-23T06:19:42Z

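In hetero mode, enable_hetero: True is required.
There is no requirement on the number of GPUs; it depends on the heterogeneous parallel strategy.
For a single-node run, the hostfile can be omitted.
You can try debugging with the following command: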
python run.py --config-path ./examples/aquila/conf --config-name config_hetero

drakethree3 · 2024-10-24T05:52:20Z

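Thanks for your answer! I am using hetero mode with an otherwise normal configuration, and with enable_hetero: True set I get the following error. My server has 4 GPUs. What causes this error?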

heavyrain-lzy · 2024-10-24T06:07:35Z

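> Thanks for your answer! I am using hetero mode with an otherwise normal configuration, and with enable_hetero: True set I get the following error. My server has 4 GPUs. What causes this error?

You can debug at the point where the error is raised, or share the full yaml configuration file so we can help analyze it.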

drakethree3 · 2024-10-24T06:33:04Z

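After reading the code, my understanding is that this is related to the number of GPUs on the server, but I am not sure that is right. My yaml configuration is attached below; could you please take a look? Thanks!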
yaml.zip

heavyrain-lzy · 2024-10-24T06:50:25Z

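> After reading the code, my understanding is that this is related to the number of GPUs on the server, but I am not sure that is right. My yaml configuration is attached below; could you please take a look? Thanks! yaml.zip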

  #enable_hetero: True
  #hetero_device_types: ["A100", "A100", "A100", "A100"]
  hetero_device_types: A100
  hetero_current_device_type: A100
  hetero_pipeline_layer_split: [4, 2]
  #hetero_process_meshes: [4, 1, 1, 2, 1]

-->

  enable_hetero: True
  hetero_device_types: ["A100"]
  hetero_current_device_type: A100
  hetero_pipeline_layer_split: [4, 2]
  hetero_process_meshes: [1, 1, 1, 2, 2] # length must be a multiple of 5: tp1, cp1, ep1, dp1, pp1
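
To make the layout of hetero_process_meshes concrete, here is a minimal Python sketch that splits the flat list into groups of five (tp, cp, ep, dp, pp) and estimates how many GPUs each mesh needs. The names ProcessMesh and parse_process_meshes are hypothetical helpers for illustration, not FlagScale APIs. The device count assumes that expert parallelism is carved out of the data-parallel ranks (as in Megatron-LM) and therefore does not multiply the GPU count; please verify this against arguments.py.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProcessMesh:
    """One mesh from hetero_process_meshes, in the order tp, cp, ep, dp, pp."""
    tp: int
    cp: int
    ep: int
    dp: int
    pp: int

    def num_devices(self) -> int:
        # Assumption: ep reuses dp ranks, so it is not part of the product.
        return self.tp * self.cp * self.dp * self.pp


def parse_process_meshes(flat: List[int]) -> List[ProcessMesh]:
    """Split the flat list into meshes of five values each."""
    if len(flat) == 0 or len(flat) % 5 != 0:
        raise ValueError("hetero_process_meshes length must be a positive multiple of 5")
    return [ProcessMesh(*flat[i:i + 5]) for i in range(0, len(flat), 5)]


meshes = parse_process_meshes([1, 1, 1, 2, 2])
total_devices = sum(m.num_devices() for m in meshes)
print(meshes)
print("GPUs needed (under the assumption above):", total_devices)
```

Under that assumption, the corrected configuration above describes a single mesh with dp=2 and pp=2, which matches a 4-GPU server.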

drakethree3 · 2024-10-24T07:04:53Z

Great, thank you so much! I had not fully understood the code, but now I get it!

heavyrain-lzy · 2024-10-24T07:08:16Z

A new release is coming soon and the README will be updated along with it, so stay tuned!
If your problem is solved, please close the issue!

drakethree3 · 2024-10-24T09:47:36Z

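Sorry, I have a few more questions. In hetero_process_meshes, the dp values must be equal across meshes and the pp values must sum to pipeline_model_parallel_size, but how should tp be set? Also, how does the number of entries in hetero_device_types relate to the number of GPUs actually used?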
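
The two constraints stated in this question can be checked before launching a job. The sketch below does that with a hypothetical helper, check_process_meshes, which is not part of FlagScale. The rules it encodes come from the asker's reading of the code in this thread, so treat them as assumptions until they are confirmed against arguments.py.

```python
from typing import List


def check_process_meshes(flat: List[int], pipeline_model_parallel_size: int) -> List[str]:
    """Return a list of rule violations; an empty list means no problem was found."""
    if len(flat) == 0 or len(flat) % 5 != 0:
        return ["length must be a positive multiple of 5 (tp, cp, ep, dp, pp per mesh)"]

    meshes = [flat[i:i + 5] for i in range(0, len(flat), 5)]
    problems = []

    # Rule from the thread: dp must be the same in every mesh.
    dp_values = {mesh[3] for mesh in meshes}
    if len(dp_values) > 1:
        problems.append(f"dp differs across meshes: {sorted(dp_values)}")

    # Rule from the thread: the pp values must sum to pipeline_model_parallel_size.
    pp_total = sum(mesh[4] for mesh in meshes)
    if pp_total != pipeline_model_parallel_size:
        problems.append(
            f"sum of pp is {pp_total}, expected {pipeline_model_parallel_size}"
        )
    return problems


# The single-mesh configuration suggested earlier in this thread.
print(check_process_meshes([1, 1, 1, 2, 2], pipeline_model_parallel_size=2))
# A made-up two-mesh example that breaks both rules.
print(check_process_meshes([2, 1, 1, 1, 1, 1, 1, 1, 2, 1], pipeline_model_parallel_size=3))
```

The first call prints an empty list; the second reports both a dp mismatch and a pp sum mismatch.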

drakethree3 · 2024-10-24T09:52:08Z

The settings shown in the image below work. Does tp also have to be equal to tensor_model_parallel_size?

heavyrain-lzy · 2024-10-25T03:10:04Z

> The settings shown in the image below work. Does tp also have to be equal to tensor_model_parallel_size?

Use the degrees set in hetero_process_meshes.

drakethree3 · 2024-10-25T03:17:03Z

How should the degrees in hetero_process_meshes be set? Are there any rules for the pp values?

drakethree3 · 2024-10-28T08:37:36Z

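Following up on my previous question: I tested with the provided config_hetero.yaml under example/aquila/conf/, and it also fails with the error below. How can this be solved? Thanks for your help!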

heavyrain-lzy · 2024-10-29T01:25:11Z

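> Following up on my previous question: I tested with the provided config_hetero.yaml under example/aquila/conf/, and it also fails with the error below. How can this be solved? Thanks for your help!

Please configure the yaml according to the argument descriptions in arguments.py, and make sure each argument is used as required.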

tingyecang · 2024-11-25T03:07:25Z

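@heavyrain-lzy How are the communication groups between heterogeneous pp stages created, and what is their send/receive logic? Is there any documentation or diagram describing the implementation of build_global_groups in flagscale/train/parallel_context.py?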

a545394427 · 2024-12-24T09:15:50Z

Hello, I set up the environment following the instructions on GitHub and ran the heterogeneous training code, and I get the following output:
[default6]:Traceback (most recent call last):
[default6]: File "/usr/local/lib/python3.10/dist-packages/transformer_engine-1.14.0.dev0+1975ace-py3.10-linux-x86_64.egg/transformer_engine/pytorch/__init__.py", line 52, in _load_library
[default6]: so_path = next(so_dir.glob(f"{module_name}.*.{extension}"))
[default6]:StopIteration
Even after rebuilding transformer_engine, the required transformer_engine_torch.*.so file still cannot be found. What else needs to be configured in the environment?
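
A StopIteration from next(so_dir.glob(...)) means the glob matched no file. One way to narrow this down is to list which files matching that pattern actually exist near the installed package. The sketch below is a diagnostic aid under stated assumptions: the helper names are hypothetical, the pattern is assumed to be "{module_name}.*.{extension}", and checking both the package directory and its parent is a guess at where the loader may look, not a statement of what transformer_engine does.

```python
import importlib.util
from pathlib import Path
from typing import List, Optional


def find_extension_files(search_dir: Path,
                         module_name: str = "transformer_engine_torch",
                         extension: str = "so") -> List[Path]:
    """List files in search_dir matching '<module_name>.*.<extension>'."""
    return sorted(search_dir.glob(f"{module_name}.*.{extension}"))


def package_dir(package: str) -> Optional[Path]:
    """Locate a top-level package directory without importing the package."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))


te_dir = package_dir("transformer_engine")
if te_dir is None:
    print("transformer_engine is not installed in this Python environment")
else:
    # Assumption: the shared library sits in the package dir or next to it.
    for candidate in (te_dir, te_dir.parent):
        print(candidate, "->", find_extension_files(candidate))
```

If both lists come back empty, the PyTorch extension was most likely not built or not copied into the install location, which points at the build step rather than at FlagScale.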

heavyrain-lzy · 2024-12-24T09:33:31Z

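> 1975ace

Please check the relevant setup instructions in the official TE documentation. You can also build the runtime environment from an NGC image.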

a545394427 · 2024-12-24T09:55:58Z

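> 1975ace
>
> Please check the relevant setup instructions in the official TE documentation. You can also build the runtime environment from an NGC image.

I have built from both the 2411 and 2410 NGC images and ran into the problem below. Which version do you recommend for the build?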

a545394427 · 2024-12-25T06:57:07Z

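> Great, thank you so much! I had not fully understood the code, but now I get it!

Hello, I am also trying to reproduce this experiment. Using the NGC 2411 and 2412 images and building according to the FlagScale instructions, I get an error saying that the transformer_engine library has no pytorch module: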
[default6]: File "/root/FlagScale/megatron/megatron/core/extensions/transformer_engine.py", line 94, in <module>
[default6]: class TELinear(te.pytorch.Linear):
[default6]: ^^^^^^^^^^
[default6]:AttributeError: module 'transformer_engine' has no attribute 'pytorch'
Which image did you use to build your environment?
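
An AttributeError like the one above can occur when the transformer_engine package is importable but its pytorch subpackage is missing or fails to load, for example if the build did not include the PyTorch framework extension (TE's build reads the NVTE_FRAMEWORK environment variable; see the TE installation docs). The sketch below is only a quick check; has_submodule is a hypothetical helper name.

```python
import importlib.util


def has_submodule(package: str, submodule: str) -> bool:
    """Return True if '<package>.<submodule>' can be located.

    Note: find_spec imports the parent package to resolve a dotted name,
    so a parent that is missing or fails to import is reported as False.
    """
    try:
        return importlib.util.find_spec(f"{package}.{submodule}") is not None
    except ImportError:
        return False


print("transformer_engine.pytorch found:", has_submodule("transformer_engine", "pytorch"))
```

If this prints False inside the container, the issue is with the transformer_engine installation itself, independent of the FlagScale configuration.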

drakethree3 closed this as completed Oct 24, 2024

drakethree3 reopened this Oct 24, 2024

aoyulong closed this as completed Jan 6, 2025
