CUDA out of memory. (RTX 4070 super) #195
Comments
Hi @KuShiro189 On the RAM & VRAM tab, have you checked the link to make sure the Nvidia Stable Diffusion memory settings aren't disabled? https://nvidia.custhelp.com/app/answers/detail/a_id/5490/~/system-memory-fallback-for-stable-diffusion This setting allows Windows machines to extend their VRAM into system RAM if needed. If it's been turned off, you can only use the 12GB of VRAM that you have. Thanks
Appreciated the quick response! I had not used Stable Diffusion before and have not changed anything about the memory settings, but the finetune script still gives the following error: On a side note, on my RAM & VRAM tab the GPU information and VRAM did not show up, only the system RAM, even though my GPU is still being utilized by the finetune script as far as I've seen in Task Manager. Perhaps the issue might be related to that? Sorry for taking so long to reply, my PC crashed during another attempt like the above.
Hi @KuShiro189 The article is called "Stable Diffusion memory fallback" by Nvidia, though the actual setting is "CUDA - Sysmem Fallback Policy" and it changes the way the Nvidia driver works with memory allocation. They should have called it something better and less confusing. If you haven't, I would 110% suggest you check if that setting has been changed by something else, as other applications can change it. It's the only setting that I know of for Windows that will have any effect on the VRAM memory allocation. I've just run through a finetuning process to confirm all is working. The default behaviour you should see when it runs is that as the 1st epoch comes to an end, with a 12GB GPU, the Shared memory should start to increase. This is the "CUDA - Sysmem Fallback Policy" in operation, allowing the GPU to use system RAM when it runs out of memory. It only uses that memory for maybe 30-60 seconds, as it stores 3 x 5GB copies of the AI model in there while it shifts things around before saving 2x of them off to disk. When it saves them off to disk, it releases that memory, so you see both your VRAM and Shared memory drop again. As you will note, those screenshots are both from an RTX 4070 with 12GB VRAM, and the only secret to making it run, that I know of, is to enable "CUDA - Sysmem Fallback Policy", as without that your Nvidia driver will limit CUDA operations to the 12GB VRAM built into your GPU. So can you confirm you have checked that setting to see if it's been disabled, OR indeed set it to "Prefer Sysmem Fallback" to see if that changes things for you? Thanks
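As a rough way to watch those numbers while training runs, here is a minimal sketch (assuming PyTorch is installed in the same environment the finetune script uses) that prints both the driver's view of GPU 0 and PyTorch's own allocator counters:

```python
# Minimal sketch: compare the driver's view of GPU 0 with PyTorch's allocator.
import torch

gib = 1024 ** 3
free_b, total_b = torch.cuda.mem_get_info(0)   # what the driver reports for GPU 0
print(f"VRAM total           : {total_b / gib:.2f} GiB")
print(f"VRAM free            : {free_b / gib:.2f} GiB")
print(f"Allocated by PyTorch : {torch.cuda.memory_allocated(0) / gib:.2f} GiB")
print(f"Reserved (cached)    : {torch.cuda.memory_reserved(0) / gib:.2f} GiB")
```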
Hi @KuShiro189 The only thing I can add as a thought is that I don't know how that value is activated/passed over to Python's PyTorch CUDA environment, meaning that, when you change the setting, you will probably have to open a new command prompt and load a fresh Python environment. Using an already open command prompt may not carry over the new setting, but I can't say for certain as I've not looked into its behaviour in that kind of detail.
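One rough way to confirm whether the setting has taken effect in a fresh prompt is to deliberately over-allocate from Python. This is only a sketch, and the 16 GiB target is an arbitrary figure above the card's 12GB: with the fallback enabled, the allocations should spill into Shared memory (visible in Task Manager) rather than raising an error.

```python
# Rough sanity check for the "CUDA - Sysmem Fallback Policy" setting: try to
# allocate more memory than the 12GB card physically has.
import torch

chunks = []
try:
    for _ in range(16):  # 16 x 1 GiB, deliberately more than 12GB of VRAM
        chunks.append(torch.empty(1024 ** 3, dtype=torch.uint8, device="cuda"))
        print(f"allocated {len(chunks)} GiB so far")
    print("Went past the physical VRAM - the driver is falling back to system RAM")
except torch.cuda.OutOfMemoryError:
    print("OOM - the driver is NOT falling back to system RAM")
finally:
    chunks.clear()
    torch.cuda.empty_cache()
```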
@KuShiro189 What I learned is that your System (C:) partition should have at least 20GB of space. If that runs out, the OOM error seems to occur as well.
Thank you both for the input! I did attempt to start a new CMD and Python environment after I set the Nvidia settings, and still no luck; my GPU seems to refuse to overflow the memory into RAM. Also, I have 121GB free on the SSD for now, so that shouldn't be the problem. My thought is that my GPU somehow refuses to overflow its memory to system RAM because either a factory setting, something in the BIOS, or something in the system prevents it. I'm going to check everything for a while. Once again, thank you both for your time!
No luck yet ;-; In case this keeps going on, perhaps there is another way to do this? Maybe someone with a good GPU can help me finetune the model with my dataset?
@KuShiro189 So as far as AllTalk's code goes, it just accesses the VRAM memory via Python. AllTalk sends requests to Python, and AllTalk has no concept of/access to control the memory fallback; Python itself doesn't have that level of control either, which is why it's back to the Nvidia driver to extend or not extend into System RAM. This setting is also ONLY available on Windows systems. The only suggestions I have at this point are:
I've had a general hunt of the internet and I can't think of or see any other routes to try to diagnose/resolve this. For various reasons I had to run about 8 finetuning sessions yesterday on the current finetune code from GitHub and I didn't encounter the out of memory issue once; everything behaved as expected. The only real difference between your system and mine was that you are on Windows 10 and I'm on 11, which shouldn't make the slightest bit of difference. And I was on an Nvidia driver two versions later than yours, but again, that shouldn't make a difference and there were no bug fixes relating to memory management between those driver versions. I can only suggest you try the above 3 things. Other than that, I am stumped for what else to try or further things to suggest. Thanks
Thank you so much for your time! Very much appreciated!
🔴 If you have installed AllTalk in a custom Python environment, I will only be able to provide limited assistance/support. AllTalk draws on a variety of scripts and libraries that are not written or managed by myself, and they may fail, error or give strange results in custom built python environments.
🔴 Please generate a diagnostics report and upload the "diagnostics.log" as this helps me understand your configuration.
diagnostics.log
Describe the bug
CUDA out of memory on any batch size even on batch size 1 (RTX 4070 super)
To Reproduce
Here are the parameters I attempted (every single one of them returned CUDA out of memory):
- the default settings
- 32 epochs, 16 batch size, 1 grad acc steps, 16 max permitted size of audio
- 24 epochs, 8 batch size, 1 grad acc steps, 8 max permitted size of audio
- 16 epochs, 2 batch size, 1 grad acc steps, 8 max permitted size of audio
- 8 epochs, 4 batch size, 2 grad acc steps, 8 max permitted size of audio
- 8 epochs, 4 batch size, 1 grad acc steps, 8 max permitted size of audio (the screenshot)
- 2 epochs, 1 batch size, 1 grad acc steps, 4 max permitted size of audio
Screenshots
Text/logs
Traceback (most recent call last):
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\trainer\trainer.py", line 1833, in fit
self._fit()
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\trainer\trainer.py", line 1785, in _fit
self.train_epoch()
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\trainer\trainer.py", line 1504, in train_epoch
outputs, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\trainer\trainer.py", line 1360, in train_step
outputs, loss_dict_new, step_time = self.optimize(
^^^^^^^^^^^^^^
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\trainer\trainer.py", line 1288, in optimize
optimizer.step()
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\torch\optim\lr_scheduler.py", line 75, in wrapper
return wrapped(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\torch\optim\optimizer.py", line 385, in wrapper
out = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\torch\optim\optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\torch\optim\adamw.py", line 187, in step
adamw(
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\torch\optim\adamw.py", line 339, in adamw
func(
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\torch\optim\adamw.py", line 608, in _multi_tensor_adamw
exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 11.99 GiB of which 1.57 GiB is free. Of the allocated memory 7.50 GiB is allocated by PyTorch, and 186.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\AI\text-generation-webui-main\extensions\alltalk_tts\finetune.py", line 1376, in train_model
config_path, original_xtts_checkpoint, vocab_file, exp_path, speaker_wav = train_gpt(language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path=str(output_path), max_audio_length=max_audio_length)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\text-generation-webui-main\extensions\alltalk_tts\finetune.py", line 617, in train_gpt
trainer.fit()
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\trainer\trainer.py", line 1860, in fit
remove_experiment_folder(self.output_path)
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\trainer\generic_utils.py", line 77, in remove_experiment_folder
fs.rm(experiment_path, recursive=True)
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\site-packages\fsspec\implementations\local.py", line 185, in rm
shutil.rmtree(p)
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\shutil.py", line 787, in rmtree
return _rmtree_unsafe(path, onerror)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\shutil.py", line 634, in _rmtree_unsafe
onerror(os.unlink, fullname, sys.exc_info())
File "C:\AI\text-generation-webui-main\installer_files\env\Lib\shutil.py", line 632, in _rmtree_unsafe
os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:/AI/text-generation-webui-main/extensions/alltalk_tts/finetune/tmp-trn/training/XTTS_FT-April-30-2024_10+11PM-ea551d3\trainer_0_log.txt'
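Worth noting, as a side experiment rather than a confirmed fix: the allocator hint at the end of the CUDA error above (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) only takes effect if the variable is set before CUDA is initialised. A minimal sketch of doing that from Python (running `set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` in the command prompt before launching works just as well):

```python
# Sketch only: the variable must be set before the first CUDA allocation,
# so set it before importing torch (or in the shell before launching Python).
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported only after the environment variable is in place
print(torch.cuda.is_available())
```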
Desktop (please complete the following information):
AllTalk was updated: 3/18/2024
Custom Python environment: text-generation-webui's Python environment, but I've also attempted it in my local Python environment and it returned the same error
Text-generation-webUI was updated: 3/11/2024
Additional context
Seems like regardless of what parameters I set, it will always try to utilize the entire 12GB of VRAM, ignoring the 0.5-1GB used by other programs.
I'd also like to know specifically how you pulled it off on your 4070, if possible. Thanks!
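One possible explanation for why the usage barely changes with batch size, offered as a rough back-of-the-envelope sketch with a purely hypothetical parameter count rather than measured XTTS numbers: the traceback fails inside AdamW's optimizer.step(), and AdamW keeps two extra fp32 state tensors (exp_avg and exp_avg_sq) per trainable parameter, so a large, fixed slice of VRAM scales with the model size rather than with the batch size or audio length.

```python
# Back-of-the-envelope sketch (illustrative numbers, not measured from XTTS):
# fp32 weights + gradients + two AdamW state tensors per trainable parameter.
def training_memory_gib(num_params: int, bytes_per_param: int = 4) -> float:
    weights = num_params * bytes_per_param
    grads = num_params * bytes_per_param
    adam_state = 2 * num_params * bytes_per_param  # exp_avg + exp_avg_sq
    return (weights + grads + adam_state) / 1024 ** 3

# A hypothetical 500M-parameter model already needs ~7.5 GiB before a single
# batch of activations is counted, regardless of the batch size chosen.
print(f"{training_memory_gib(500_000_000):.1f} GiB")
```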