Fix llm with torchtune v0.3 #289
Conversation
I think I've seen this error before, but I'm not sure why it got reported.
Could not reproduce the error above. Not sure where it came from.
I have an issue with the newest version on
This is probably a benchmark timeout: signal 15 is SIGTERM, so milabench most likely killed the bench because it took too long.
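For reference, here is a minimal sketch (not milabench's actual code) of how a SIGTERM kill shows up on the Python side: a process terminated by a signal reports a negative return code, so -15 corresponds to signal 15.

```python
import signal
import subprocess
import time

# Minimal sketch: a process killed with SIGTERM reports returncode == -15,
# which lets a harness distinguish an intentional timeout kill from a crash.
proc = subprocess.Popen(["sleep", "60"])
time.sleep(0.5)
proc.terminate()  # sends SIGTERM, as a timeout handler would
proc.wait()

if proc.returncode < 0 and signal.Signals(-proc.returncode) is signal.SIGTERM:
    print("killed by SIGTERM (signal 15) -- likely an intentional timeout")
else:
    print(f"exited with code {proc.returncode}")
```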
Is it OK to flag the bench as failed then? Where in the code are the exit codes handled, so we can avoid flagging a bench as a failure if milabench killed it on purpose?
We don't actually forward the timeout back as an error, just a message here. And the timeout might be fine in some cases: some accelerators really do not exit cleanly and can hang during the exit, and as long as we have enough observations that is not an issue. The real gap is in how we check for "enough observations" here; we should compare the count against the expected number of observations to make sure we ran everything as expected.
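A rough sketch of that check, assuming a hypothetical per-bench target (the name `EXPECTED_OBSERVATIONS` and the function are illustrative, not milabench's API):

```python
# Minimal sketch: flag a run whose number of collected training-rate
# observations falls short of the expected target, since missing observations
# usually mean the bench hanged or was killed early.
EXPECTED_OBSERVATIONS = 60  # hypothetical target per bench

def check_observations(rates: list[float], expected: int = EXPECTED_OBSERVATIONS) -> bool:
    """Return True if the bench produced the expected number of observations."""
    if len(rates) < expected:
        print(f"only {len(rates)}/{expected} observations -- run may be invalid")
        return False
    return True
```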
Is it possible that in a remote multi-node environment (base milabench node -> [master node, worker node]) this error flags the bench as failed? 2 benches with 2 errors get reported, and the only error I seem to find is the signal 15: https://github.com/mila-iqia/milabench/actions/runs/11220338262/job/31212779116?pr=257 For the rate issue, if we have a low number of iterations / observations, the score at the end would be lower too, right? Would we want to validate the number of observations because the first iterations are slower than the following ones, so having more of them would give a better idea of the performance of the GPU?
In a 2-node setup there are going to be 2 processes; the worker process gets tagged with nolog and is mostly ignored as a result. I think in the case of this run, the error comes from the lack of metrics, and the lack of metrics might be because the job timed out or hanged. It does say that the problem is "no training rate retrieved", and then it also tries to extract the exception that could have caused the issue. Here it is the SIGTERM caused by the timeout.
You can see higher in the log that a timeout was indeed triggered for both nodes.
We have a target number of observations; we drop the highest and the lowest value before computing the average. We mostly want to check that we have the right number of observations, because if we don't, that means something unexpected has happened, and that might invalidate the results.
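For clarity, a minimal sketch of the scoring idea described above (not milabench's exact implementation): drop the single highest and lowest observations, then average the rest.

```python
def trimmed_mean(rates: list[float]) -> float:
    """Average the observations after dropping the lowest and highest value."""
    if len(rates) < 3:
        raise ValueError("need at least 3 observations to trim both extremes")
    trimmed = sorted(rates)[1:-1]  # drop the single lowest and highest value
    return sum(trimmed) / len(trimmed)

# Example: early warm-up iterations are slower, so trimming reduces their impact.
print(trimmed_mean([80.0, 118.0, 120.0, 121.0, 119.0]))  # 119.0
```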
Tested on: