Out of memory while converting the state dict - How to split across GPUs? #73

Closed

mrgohlke opened this issue Oct 19, 2024 · 1 comment

mrgohlke commented Oct 19, 2024

SETUP:
I have a server with 8x V100 GPUs and have successfully used the fsdp_qlora script to train a Llama-3.1-70B model on a custom dataset using hqq_dora. I am now attempting to convert model_state_dict.safetensors to adapter_model.safetensors using a Python script nearly identical to the one in #60, along the lines sketched below.
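Roughly, the script follows this pattern (a simplified sketch, not the exact code from #60; the model name and the adapter-key filter are placeholders for my setup):

```python
import torch
from safetensors.torch import load_file, save_file
from transformers import AutoModelForCausalLM

# Load the base model -- everything currently lands on GPU 0
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",   # placeholder for the actual base model
    torch_dtype=torch.bfloat16,
    device_map={"": 0},           # <-- OOMs at ~70% of loading
)

# Load the trained state dict and keep only the adapter tensors
state_dict = load_file("model_state_dict.safetensors")
adapter_sd = {k: v for k, v in state_dict.items() if "lora" in k or "dora" in k}
save_file(adapter_sd, "adapter_model.safetensors")
```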

ERROR:
A CUDA out-of-memory error occurs roughly 70% of the way through loading the model:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 31.73 GiB of which 10.44 MiB is free. Process 1081625 has 398.00 MiB memory in use. Process 1083739 has 886.00 MiB memory in use. Including non-PyTorch memory, this process has 30.46 GiB memory in use. Of the allocated memory 29.92 GiB is allocated by PyTorch, and 193.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

nvidia-smi shows that the model is being loaded onto GPU 0 only:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-32GB           Off |   00000000:00:10.0 Off |                    0 |
| N/A   35C    P0             66W /  300W |   10296MiB /  32768MiB |     42%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           Off |   00000000:00:11.0 Off |                    0 |
| N/A   30C    P0             40W /  300W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           Off |   00000000:00:1B.0 Off |                    0 |
| N/A   27C    P0             41W /  300W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           Off |   00000000:00:1C.0 Off |                    0 |
| N/A   27C    P0             40W /  300W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  Tesla V100-SXM2-32GB           Off |   00000000:02:0D.0 Off |                    0 |
| N/A   28C    P0             38W /  300W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  Tesla V100-SXM2-32GB           Off |   00000000:02:0E.0 Off |                    0 |
| N/A   28C    P0             39W /  300W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  Tesla V100-SXM2-32GB           Off |   00000000:02:0F.0 Off |                    0 |
| N/A   29C    P0             40W /  300W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  Tesla V100-SXM2-32GB           Off |   00000000:02:10.0 Off |                    0 |
| N/A   30C    P0             40W /  300W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

How can I use multiple GPUs to perform the conversion and avoid the OOM error?
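Two things I am considering, though I'm not sure how either interacts with the HQQ-quantized layers: doing the conversion entirely on CPU, or letting accelerate shard the base model across all eight cards. A rough sketch of both (untested; the max_memory values are guesses to leave headroom on each 32GB card):

```python
import torch
from safetensors.torch import load_file
from transformers import AutoModelForCausalLM

# Option 1: read the trained state dict into CPU RAM instead of GPU 0
state_dict = load_file("model_state_dict.safetensors", device="cpu")

# Option 2: shard the base model across all 8 GPUs via accelerate's device_map
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",                  # placeholder for the base model
    torch_dtype=torch.bfloat16,
    device_map="auto",                           # split layers across GPUs 0-7
    max_memory={i: "28GiB" for i in range(8)},   # guessed per-card budget
)
```

If the conversion only needs to read tensors out of the state dict, option 1 alone might be enough.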

ghsama commented Dec 13, 2024

Can you share the solution for this issue?
