
Fix llm with torchtune v0.3 #289

Merged: 1 commit into mila-iqia:master on Oct 2, 2024

Conversation

@satyaog (Member) commented Sep 23, 2024

Tested on:

```
remote [stdout] Breakdown
remote [stdout] ---------
remote [stdout] bench              | fail |   n | ngpu |       perf |   sem% |   std% | peak_memory |      score | weight
remote [stdout] llm-lora-ddp-gpus  |    0 |   1 |    2 |    5596.19 |   0.3% |   1.8% |       35063 |    5596.19 |   1.00
remote [stdout] llm-lora-ddp-nodes |    0 |   2 |    4 |    1025.44 |   0.5% |   2.9% |       35199 |    1025.44 |   1.00
remote [stdout] llm-lora-single    |    0 |   2 |    1 |    3506.01 |   0.1% |   1.1% |       32995 |    7016.71 |   1.00
```

I don't have the resources to test with Llama 70B.

@satyaog (Member, Author) commented Sep 23, 2024

I think I've seen this error before, but I'm not sure why it got reported:

remote [stdout] llm-lora-ddp-nodes.manager                                                                                                                           
remote [stdout] ==========================                                                                                                                           
remote [stdout]   * no training rate retrieved                                                                                                                       
remote [stdout]   * Error codes = 1                                                                                                                                  
remote [stdout]   * 1 exceptions found                                                                                                                               
remote [stdout]     * 1 x torch.distributed.elastic.multiprocessing.api.SignalException: Process 4083594 got signal: 15                                              
remote [stdout]         | Traceback (most recent call last):                                                                                                         
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/bin/tune", line 8, in <module>                                                
remote [stdout]         |     sys.exit(main())                                                                                                                       
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 49, in main        
remote [stdout]         |     parser.run(args)                                                                                                                       
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 43, in run         
remote [stdout]         |     args.func(args)                                                                                                                        
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torchtune/_cli/run.py", line 183, in _run_cmd    
remote [stdout]         |     self._run_distributed(args)                                                                                                            
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torchtune/_cli/run.py", line 89, in _run_distributed                                                                                                                                                     
remote [stdout]         |     run(args)                                                                                                                              
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run      
remote [stdout]         |     elastic_launch(                                                                                                                        
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__                                                                                                                                                             
remote [stdout]         |     return launch_agent(self._config, self._entrypoint, list(args))                                                                        
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent                                                                                                                                                         
remote [stdout]         |     result = agent.run()                                                                                                                   
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper                                                                                                                                                      
remote [stdout]         |     result = f(*args, **kwargs)                                                                                                            
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run                                                                                                                                                     
remote [stdout]         |     result = self._invoke_run(role)                                                                                                        
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 835, in _invoke_run                                                                                                                                             
remote [stdout]         |     time.sleep(monitor_interval)                                                                                                           
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 79, in _terminate_process_handler                                                                                                                                           
remote [stdout]         |     raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)                                                    
remote [stdout]         | torch.distributed.elastic.multiprocessing.api.SignalException: Process 4083594 got signal: 15                                              

@satyaog (Member, Author) commented Sep 23, 2024

Could not reproduce the error above. Not sure where it came from.

@Delaunay (Collaborator)

I have an issue with the newest version: on llm-full-mp-gpus it times out after a while.
Everything seems to work, but it just takes longer than the previous version.
I will pin torchtune to <0.3 for now.

@Delaunay merged commit 34f56e7 into mila-iqia:master on Oct 2, 2024 (1 of 3 checks passed)
@Delaunay (Collaborator) commented Oct 9, 2024

remote [stdout] | torch.distributed.elastic.multiprocessing.api.SignalException: Process 4083594 got signal: 15

This is probably a benchmark timeout: signal 15 is SIGTERM, so milabench probably killed the bench because it took too long.
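
For illustration, a minimal sketch of how a runner could enforce such a wall-clock limit by sending SIGTERM; this is not milabench's actual implementation, and the command, limit, and function name are placeholders.

```python
# Hypothetical sketch: enforce a wall-clock limit on a benchmark subprocess
# by sending SIGTERM (signal 15), which torch elastic re-raises as the
# SignalException shown in the traceback above.
import signal
import subprocess

TIMEOUT_SECONDS = 3600  # placeholder limit

def run_with_timeout(cmd):
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=TIMEOUT_SECONDS)
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGTERM)  # ask the bench to shut down cleanly
        return proc.wait()
```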

@satyaog (Member, Author) commented Oct 9, 2024

Is it ok to flag the bench as failed then? Where in the code are the exit codes handled, so that a bench isn't flagged as a failure when milabench killed it on purpose?

@Delaunay (Collaborator) commented Oct 9, 2024

We don't actually forward the timeout back as an error, just a message here.

But the timeout might be fine in some cases. Some accelerators really do not exit cleanly and can hang during the exit; as long as we have enough observations, it is not an issue.

There is an issue in how we check whether we have enough observations: we should compare against the expected number of observations to make sure everything ran as expected.
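
As a rough sketch of that check (hypothetical names only, not milabench's real validation code), the idea would be to compare the collected observations against the expected count rather than only checking that some exist:

```python
# Hypothetical sketch: flag a run whose observation count does not match
# the expected target, instead of only checking that metrics are present.
def validate_observations(observations, expected_count):
    errors = []
    if not observations:
        errors.append("no training rate retrieved")
    elif len(observations) != expected_count:
        errors.append(
            f"got {len(observations)} observations, expected {expected_count}; "
            "the run may have timed out or hung"
        )
    return errors
```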

@satyaog (Member, Author) commented Oct 11, 2024

Is it possible that in a remote multi-node env (base milabench node -> [master node, worker node]) this error flags the bench as failed? 2 benches with 2 errors get reported, and the only error I seem to find is the signal 15:

https://github.com/mila-iqia/milabench/actions/runs/11220338262/job/31212779116?pr=257

For the rate issue, if we have a low number of iterations / observations, the score at the end would also be lower, right? Would we want to validate the number of observations because the first iterations are slower than the following ones, so having more of them would give a better idea of the GPU's performance?

@Delaunay (Collaborator) commented Oct 11, 2024

In a 2-node setup there are going to be 2 processes; the worker process gets tagged with nolog and is mostly ignored as a result.

I think in the case of this run, the error comes from the lack of metrics, and the lack of metrics might be because the job timed out or hung.

It does say that the problem is "no training rate retrieved", and then it also tries to extract the exception that could have caused the issue. Here it is the SIGTERM caused by the timeout:

 remote [stdout] llm-full-mp-nodes.manager
remote [stdout] =========================
remote [stdout]   * no training rate retrieved
remote [stdout]   * Error codes = 1
remote [stdout]   * 1 exceptions found
remote [stdout]     * 1 x torch.distributed.elastic.multiprocessing.api.SignalException: Process 30929 got signal: 15
remote [stdout]         | Traceback (most recent call last):
remote [stdout]         |   File "/home/runner/work/milabench/output/venv/torch/bin/tune", line 8, in <module>
remote [stdout]         |     sys.exit(main())

You can see higher up in the log that a timeout was indeed triggered on both nodes:

remote [stdout] llm-full-mp-nodes.node1.nolog [message] Terminating process because it ran for longer than 3600 seconds.
remote [stdout] llm-full-mp-nodes.manager [message] Terminating process because it ran for longer than 3600 seconds.

@Delaunay (Collaborator) commented Oct 11, 2024

> For the rate issue, if we have a low number of iterations / observations, the score at the end would also be lower, right? Would we want to validate the number of observations because the first iterations are slower than the following ones, so having more of them would give a better idea of the GPU's performance?

We have a target number of observations; we drop the highest and the lowest values before averaging.

We mostly want to check that we have the right number of observations, because if we don't, it means something unexpected happened, which might invalidate the results.
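
A minimal sketch of that scoring idea (illustrative only, with made-up names, not the actual milabench code):

```python
# Hypothetical sketch: drop the highest and lowest observations, then average.
def trimmed_mean(observations):
    if len(observations) <= 2:
        raise ValueError("need more than 2 observations to trim both extremes")
    trimmed = sorted(observations)[1:-1]  # drop the lowest and the highest value
    return sum(trimmed) / len(trimmed)
```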
