Issue Description
While running test_time_training.py, some tasks train successfully, but others fail with errors.
Errors Encountered:
1. list index out of range during training for certain tasks.
2. CUDA out of memory for specific tasks. The error message reports that a large share of GPU memory is in use by PyTorch, while some memory reserved by PyTorch remains unallocated.
Example:
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
0%| | 0/125 [00:00<?, ?it/s]
list index out of range
Error training for 0a2355a6
0%| | 0/125 [00:00<?, ?it/s]
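The log above shows only the exception message, so the line that raises `list index out of range` is not visible. One possible way to narrow this down is to capture the full traceback per task. The following is a minimal sketch, not the actual code in test_time_training.py: `run_all`, `train_task`, and `fake_train_task` are hypothetical names, and the stand-in trainer only simulates a task that indexes into an empty list.

```python
import traceback


def run_all(task_ids, train_task):
    """Train each task in turn and collect full tracebacks for failures."""
    failures = {}
    for task_id in task_ids:
        try:
            train_task(task_id)
        except Exception:
            # Keep the whole traceback instead of only the exception message,
            # so the failing file and line number are visible.
            failures[task_id] = traceback.format_exc()
            print(f"Error training for {task_id}")
            print(failures[task_id])
    return failures


def fake_train_task(task_id):
    # Hypothetical stand-in for the real per-task training call:
    # one task has no examples, so indexing into it raises IndexError.
    examples = [] if task_id == "0a2355a6" else [1]
    return examples[0]


# "other_task" is a made-up ID used only for illustration.
failures = run_all(["0a2355a6", "other_task"], fake_train_task)
```

With the traceback in hand, it should be possible to tell whether the failure comes from data preparation for that task or from the training loop itself. The same wrapper would also record the full CUDA out of memory message, since that error is raised as an ordinary Python exception.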
Please advise on potential solutions or debugging approaches for these issues.