Errors in test_time_training.py: Index Out of Range and CUDA Out of Memory #5

deepmindby · 2024-12-20T17:21:11Z

Issue Description
While running test_time_training.py, some tasks train successfully, but others fail with errors.

Errors Encountered:
1. list index out of range during training for certain tasks.
2. CUDA out of memory for specific tasks. According to the error message, a significant amount of GPU memory is in use by PyTorch, while some memory reserved by PyTorch remains unallocated.

Example:
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
0%| | 0/125 [00:00<?, ?it/s]
list index out of range
Error training for 0a2355a6
0%| | 0/125 [00:00<?, ?it/s]

Please advise on potential solutions or debugging approaches for these issues.
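
One way to narrow down the first error might be to log the full traceback for each failing task instead of only the exception message. The sketch below assumes the script wraps each task's training in a try/except that prints just the message, which the log above suggests but which is not confirmed. `run_tasks`, `train_fn`, `fake_train`, and the task ID `another_task` are hypothetical names used for illustration, not functions or data from the repository.

```python
import traceback


def run_tasks(task_ids, train_fn):
    """Call train_fn on each task ID and keep the full traceback of any failure."""
    failures = {}
    for task_id in task_ids:
        try:
            train_fn(task_id)
        except Exception:
            # format_exc() includes the file, line number, and call stack,
            # which the bare message "list index out of range" does not.
            failures[task_id] = traceback.format_exc()
            print(f"Error training for {task_id}")
            print(failures[task_id])
    return failures


def fake_train(task_id):
    # Hypothetical stand-in for the real training call: one task has an
    # empty example list, so indexing it raises IndexError.
    examples = [] if task_id == "0a2355a6" else ["example"]
    return examples[0]


failures = run_tasks(["0a2355a6", "another_task"], fake_train)
```

With the traceback in hand, the line that raises the IndexError should be visible, which would show whether the failing tasks have empty or unusually short inputs.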
