Out of memory while converting the state dict - How to split across GPUs? #73

Closed

mrgohlke opened this issue Oct 19, 2024 · 1 comment

mrgohlke commented Oct 19, 2024

SETUP:
I have a server with 8x V100 GPUs and have successfully used the fsdp_qlora script to train a Llama-3.1-70B model on a custom dataset using hqq_dora. I am now attempting to convert model_state_dict.safetensors to adapter_model.safetensors using a Python script nearly identical to the one in #60, along the lines sketched below.
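Roughly, the script follows this pattern (a simplified sketch, not the exact code from #60; the model name and the adapter-key filter are placeholders for my setup):

```python
import torch
from safetensors.torch import load_file, save_file
from transformers import AutoModelForCausalLM

# Load the base model -- everything currently lands on GPU 0
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",   # placeholder for the actual base model
    torch_dtype=torch.bfloat16,
    device_map={"": 0},           # <-- OOMs at ~70% of loading
)

# Load the trained state dict and keep only the adapter tensors
state_dict = load_file("model_state_dict.safetensors")
adapter_sd = {k: v for k, v in state_dict.items() if "lora" in k or "dora" in k}
save_file(adapter_sd, "adapter_model.safetensors")
```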

ERROR:
A CUDA out-of-memory error occurs roughly 70% of the way through loading the model:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 31.73 GiB of which 10.44 MiB is free. Process 1081625 has 398.00 MiB memory in use. Process 1083739 has 886.00 MiB memory in use. Including non-PyTorch memory, this process has 30.46 GiB memory in use. Of the allocated memory 29.92 GiB is allocated by PyTorch, and 193.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

nvidia-smi shows that the model is being loaded onto GPU 0 only:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-32GB           Off |   00000000:00:10.0 Off |                    0 |
| N/A   35C    P0             66W /  300W |   10296MiB /  32768MiB |     42%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           Off |   00000000:00:11.0 Off |                    0 |
| N/A   30C    P0             40W /  300W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           Off |   00000000:00:1B.0 Off |                    0 |
| N/A   27C    P0             41W /  300W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           Off |   00000000:00:1C.0 Off |                    0 |
| N/A   27C    P0             40W /  300W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  Tesla V100-SXM2-32GB           Off |   00000000:02:0D.0 Off |                    0 |
| N/A   28C    P0             38W /  300W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  Tesla V100-SXM2-32GB           Off |   00000000:02:0E.0 Off |                    0 |
| N/A   28C    P0             39W /  300W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  Tesla V100-SXM2-32GB           Off |   00000000:02:0F.0 Off |                    0 |
| N/A   29C    P0             40W /  300W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  Tesla V100-SXM2-32GB           Off |   00000000:02:10.0 Off |                    0 |
| N/A   30C    P0             40W /  300W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

How can I use multiple GPUs to perform the conversion and avoid the OOM error?
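Two things I am considering, though I'm not sure how either interacts with the HQQ-quantized layers: doing the conversion entirely on CPU, or letting accelerate shard the base model across all eight cards. A rough sketch of both (untested; the max_memory values are guesses to leave headroom on each 32GB card):

```python
import torch
from safetensors.torch import load_file
from transformers import AutoModelForCausalLM

# Option 1: read the trained state dict into CPU RAM instead of GPU 0
state_dict = load_file("model_state_dict.safetensors", device="cpu")

# Option 2: shard the base model across all 8 GPUs via accelerate's device_map
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",                  # placeholder for the base model
    torch_dtype=torch.bfloat16,
    device_map="auto",                           # split layers across GPUs 0-7
    max_memory={i: "28GiB" for i in range(8)},   # guessed per-card budget
)
```

If the conversion only needs to read tensors out of the state dict, option 1 alone might be enough.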

ghsama commented Dec 13, 2024

Can you share the solution for this issue?
