
Fix llm with torchtune v0.3 #289

Merged: 1 commit into mila-iqia:master on Oct 2, 2024

Conversation

@satyaog (Member) commented Sep 23, 2024

Tested on:

```
remote [stdout] Breakdown
remote [stdout] ---------
remote [stdout] bench              | fail |   n | ngpu |       perf |   sem% |   std% | peak_memory |      score | weight
remote [stdout] llm-lora-ddp-gpus  |    0 |   1 |    2 |    5596.19 |   0.3% |   1.8% |       35063 |    5596.19 |   1.00
remote [stdout] llm-lora-ddp-nodes |    0 |   2 |    4 |    1025.44 |   0.5% |   2.9% |       35199 |    1025.44 |   1.00
remote [stdout] llm-lora-single    |    0 |   2 |    1 |    3506.01 |   0.1% |   1.1% |       32995 |    7016.71 |   1.00
```

I don't have the resources to test with Llama 70B.

@satyaog (Member, Author) commented Sep 23, 2024

I think I've seen this error before, but I'm not sure why it got reported:

remote [stdout] llm-lora-ddp-nodes.manager                                                                                                                           
remote [stdout] ==========================                                                                                                                           
remote [stdout]   * no training rate retrieved                                                                                                                       
remote [stdout]   * Error codes = 1                                                                                                                                  
remote [stdout]   * 1 exceptions found                                                                                                                               
remote [stdout]     * 1 x torch.distributed.elastic.multiprocessing.api.SignalException: Process 4083594 got signal: 15                                              
remote [stdout]         | Traceback (most recent call last):                                                                                                         
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/bin/tune", line 8, in <module>                                                
remote [stdout]         |     sys.exit(main())                                                                                                                       
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 49, in main        
remote [stdout]         |     parser.run(args)                                                                                                                       
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 43, in run         
remote [stdout]         |     args.func(args)                                                                                                                        
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torchtune/_cli/run.py", line 183, in _run_cmd    
remote [stdout]         |     self._run_distributed(args)                                                                                                            
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torchtune/_cli/run.py", line 89, in _run_distributed                                                                                                                                                     
remote [stdout]         |     run(args)                                                                                                                              
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run      
remote [stdout]         |     elastic_launch(                                                                                                                        
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__                                                                                                                                                             
remote [stdout]         |     return launch_agent(self._config, self._entrypoint, list(args))                                                                        
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent                                                                                                                                                         
remote [stdout]         |     result = agent.run()                                                                                                                   
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper                                                                                                                                                      
remote [stdout]         |     result = f(*args, **kwargs)                                                                                                            
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run                                                                                                                                                     
remote [stdout]         |     result = self._invoke_run(role)                                                                                                        
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 835, in _invoke_run                                                                                                                                             
remote [stdout]         |     time.sleep(monitor_interval)                                                                                                           
remote [stdout]         |   File "/tmp/ortizgas/milabench_mn_slurm__a100_x2/venv/torch/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 79, in _terminate_process_handler                                                                                                                                           
remote [stdout]         |     raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)                                                    
remote [stdout]         | torch.distributed.elastic.multiprocessing.api.SignalException: Process 4083594 got signal: 15                                              

@satyaog (Member, Author) commented Sep 23, 2024

Could not reproduce the error above. Not sure where it came from.

@Delaunay (Collaborator)

I have an issue with the newest version: on llm-full-mp-gpus it times out after a while.
Everything seems to work, but it just takes longer than the previous version.
I will pin torchtune to <0.3 for now.

@Delaunay merged commit 34f56e7 into mila-iqia:master on Oct 2, 2024 (1 of 3 checks passed)
@Delaunay (Collaborator) commented Oct 9, 2024

remote [stdout] | torch.distributed.elastic.multiprocessing.api.SignalException: Process 4083594 got signal: 15

This is probably a benchmark timeout: signal 15 is SIGTERM, so milabench probably killed the bench because it took too long.
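
For illustration, a minimal sketch of how a runner could enforce such a wall-clock limit by sending SIGTERM; this is not milabench's actual implementation, and the command, limit, and function name are placeholders.

```python
# Hypothetical sketch: enforce a wall-clock limit on a benchmark subprocess
# by sending SIGTERM (signal 15), which torch elastic re-raises as the
# SignalException shown in the traceback above.
import signal
import subprocess

TIMEOUT_SECONDS = 3600  # placeholder limit

def run_with_timeout(cmd):
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=TIMEOUT_SECONDS)
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGTERM)  # ask the bench to shut down cleanly
        return proc.wait()
```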

@satyaog (Member, Author) commented Oct 9, 2024

Is it ok to flag the bench as failed then? Where in the code are the exit codes handled, so that a bench isn't flagged as a failure when milabench killed it on purpose?

@Delaunay (Collaborator) commented Oct 9, 2024

We don't actually forward the timeout back as an error, just a message here.

But the timeout might be fine in some cases. Some accelerators really do not exit cleanly and can hang during the exit; as long as we have enough observations, it is not an issue.

There is an issue in how we check whether we have enough observations: we should compare against the expected number of observations to make sure everything ran as expected.
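
As a rough sketch of that check (hypothetical names only, not milabench's real validation code), the idea would be to compare the collected observations against the expected count rather than only checking that some exist:

```python
# Hypothetical sketch: flag a run whose observation count does not match
# the expected target, instead of only checking that metrics are present.
def validate_observations(observations, expected_count):
    errors = []
    if not observations:
        errors.append("no training rate retrieved")
    elif len(observations) != expected_count:
        errors.append(
            f"got {len(observations)} observations, expected {expected_count}; "
            "the run may have timed out or hung"
        )
    return errors
```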

@satyaog (Member, Author) commented Oct 11, 2024

Is it possible that in a remote multi-node env (base milabench node -> [master node, worker node]) this error flags the bench as failed? 2 benches with 2 errors get reported, and the only error I seem to find is the signal 15:

https://github.com/mila-iqia/milabench/actions/runs/11220338262/job/31212779116?pr=257

For the rate issue, if we have a low number of iterations / observations, the score at the end would also be lower, right? Would we want to validate the number of observations because the first iterations are slower than the following ones, so having more of them would give a better idea of the GPU's performance?

@Delaunay (Collaborator) commented Oct 11, 2024

In a 2-node setup there are going to be 2 processes; the worker process gets tagged with nolog and is mostly ignored as a result.

I think in the case of this run, the error comes from the lack of metrics, and the lack of metrics might be because the job timed out or hung.

It does say that the problem is "no training rate retrieved", and then it also tries to extract the exception that could have caused the issue. Here it is the SIGTERM caused by the timeout:

 remote [stdout] llm-full-mp-nodes.manager
remote [stdout] =========================
remote [stdout]   * no training rate retrieved
remote [stdout]   * Error codes = 1
remote [stdout]   * 1 exceptions found
remote [stdout]     * 1 x torch.distributed.elastic.multiprocessing.api.SignalException: Process 30929 got signal: 15
remote [stdout]         | Traceback (most recent call last):
remote [stdout]         |   File "/home/runner/work/milabench/output/venv/torch/bin/tune", line 8, in <module>
remote [stdout]         |     sys.exit(main())

You can see higher up in the log that a timeout was indeed triggered on both nodes:

remote [stdout] llm-full-mp-nodes.node1.nolog [message] Terminating process because it ran for longer than 3600 seconds.
remote [stdout] llm-full-mp-nodes.manager [message] Terminating process because it ran for longer than 3600 seconds.

@Delaunay (Collaborator) commented Oct 11, 2024

> For the rate issue, if we have a low number of iterations / observations, the score at the end would also be lower, right? Would we want to validate the number of observations because the first iterations are slower than the following ones, so having more of them would give a better idea of the GPU's performance?

We have a target number of observations; we drop the highest and the lowest values before averaging.

We mostly want to check that we have the right number of observations, because if we don't, it means something unexpected happened, which might invalidate the results.
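
A minimal sketch of that scoring idea (illustrative only, with made-up names, not the actual milabench code):

```python
# Hypothetical sketch: drop the highest and lowest observations, then average.
def trimmed_mean(observations):
    if len(observations) <= 2:
        raise ValueError("need more than 2 observations to trim both extremes")
    trimmed = sorted(observations)[1:-1]  # drop the lowest and the highest value
    return sum(trimmed) / len(trimmed)
```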
