adalomo: NCCL communicator timeout errors when training the llama 2 70b model #67
Hello, and thanks again for the excellent LOMO series of work. I'm currently trying to train with the collie framework, using the latest dev-branch code. Training a model at the llama 2 13b scale works fine, but with 70b I keep running into NCCL communicator timeouts like the one in the title.
The script I'm using is the instruction-tuning script provided directly under adalomo in this project (file: https://github.com/OpenLMLab/LOMO/blob/main/adalomo/instruction-tuning/train.py), with the model changed to llama2-70b. The only parameter I changed is tp=2, and I pointed the model to my own local path. Have you run into this before, or how should it be configured? My machine is an 8*A100 setup.
Thanks~~
Hi, adalomo and lomo currently only support pure DP or pure TP. Since you are on 8 GPUs, I'd suggest changing dp_size to 8 and setting tp_size to 1.
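For reference, a minimal sketch of what that change could look like, assuming the `CollieConfig` API from collie that the adalomo instruction-tuning script builds on; the model path is a placeholder, and the exact field names and how the config is constructed should be checked against your local train.py and collie version:

```python
# Minimal sketch (assumptions: collie exposes CollieConfig with from_pretrained
# and dp_size / tp_size / pp_size fields, as in the adalomo instruction-tuning
# script; "/path/to/llama2-70b" is a placeholder for your local weights).
from collie import CollieConfig

model_path = "/path/to/llama2-70b"  # hypothetical local path

config = CollieConfig.from_pretrained(model_path)
config.dp_size = 8  # pure data parallelism across all 8 GPUs, as suggested above
config.tp_size = 1  # no tensor parallelism (adalomo/lomo support pure DP or pure TP only)
config.pp_size = 1  # no pipeline parallelism
```

Keeping tp_size and pp_size at 1 makes the run pure data parallelism, which is the configuration the reply recommends for adalomo/lomo on an 8-GPU machine.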