ppoffload does not support gradient_accumulation_fusion #4

Open
Amaris0 opened this issue Dec 23, 2024 · 0 comments
Amaris0 commented Dec 23, 2024

Hello,
I am currently trying to fix the problem that ppoffload does not support gradient_accumulation_fusion.
With ppoffload and ga fusion enabled at the same time, an error is raised in megatron/core/tensor_parallel/layers.py. The cause is that weight.grad is None, i.e. the tensor loses its grad during offload and onload:

```python
if ctx.gradient_accumulation_fusion:
    if wgrad_compute:
        if hasattr(weight, 'main_grad'):
            tmp_grad = weight.main_grad
        else:
            tmp_grad = weight.grad  # weight.grad is None here
        if tmp_grad.dtype == torch.float32:
            fused_weight_gradient_mlp_cuda.wgrad_gemm_accum_fp32(
                total_input, grad_output, tmp_grad
            )
        elif tmp_grad.dtype in (torch.float16, torch.bfloat16):
            fused_weight_gradient_mlp_cuda.wgrad_gemm_accum_fp16(
                total_input, grad_output, tmp_grad
            )
        else:
            raise RuntimeError(
                "Unsupported gradient type for gradient accumulation fusion"
            )
```
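For reference, here is a minimal sketch of how an offload/onload round trip can end up with neither attribute (the offload step below is hypothetical, not the project's actual code): `main_grad` is a plain Python attribute that Megatron's DDP wrapper attaches to one specific tensor object, and `weight.grad` is never populated because the fused wgrad kernel writes into that pre-allocated buffer instead of going through autograd's normal accumulation.

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Hypothetical offload/onload round trip, not the project's actual code.
weight = torch.nn.Parameter(torch.randn(4, 4, device=device))
weight.main_grad = torch.zeros_like(weight)  # buffer normally attached by Megatron's DDP

cpu_copy = weight.detach().cpu()                  # offload: copy to host memory
weight = torch.nn.Parameter(cpu_copy.to(device))  # onload: a *new* tensor object

print(hasattr(weight, 'main_grad'))  # False -> the code above falls back to weight.grad
print(weight.grad)                   # None  -> AttributeError at tmp_grad.dtype
```

Note that in this sketch the attribute is lost even though cpu_copy's storage is never freed, which may be related to my first question below.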

I have tried several approaches to preserve the grad, all of which failed. Even after commenting out both the step that frees GPU memory after offload and the onload step, the same error is raised.
I have two questions and would appreciate your help:
1. Is support for ga fusion together with ppoffload planned? And why does the error still occur even when my offload does not free the tensor?
2. What do the ForwardEmptyBackwardIdentityFunction and ForwardLeftBackwardRightFunction classes in offload.py do? I could not work out their purpose; are they related to the lost grad?
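From the names alone, my guess is that they follow the custom torch.autograd.Function pattern sketched below; the bodies here are assumptions rather than the actual offload.py code, so please correct me if they do something else.

```python
import torch

# Guesswork based only on the class names; not the actual offload.py code.
class ForwardEmptyBackwardIdentityFunction(torch.autograd.Function):
    """Returns a placeholder in forward and passes the gradient through
    unchanged in backward: a node whose only job is to sit in the autograd
    graph, e.g. so that reaching it in backward can trigger an onload hook."""

    @staticmethod
    def forward(ctx, x):
        return torch.empty_like(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output


class ForwardLeftBackwardRightFunction(torch.autograd.Function):
    """Returns the 'left' tensor in forward but routes the incoming gradient
    to the 'right' input: a way to stitch an offloaded tensor and its onloaded
    replacement into one autograd graph."""

    @staticmethod
    def forward(ctx, left, right):
        return left

    @staticmethod
    def backward(ctx, grad_output):
        return None, grad_output
```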
