pd: fix oom error (#4493)
Paddle raises `MemoryError` rather than the `RuntimeError` used in PyTorch. With this fix I can test DPA-1 and DPA-2 on a 16 GB V100...

![image](https://github.com/user-attachments/assets/42ead773-bf26-4195-8f67-404b151371de)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Bug Fixes**
  - Improved detection of out-of-memory (OOM) errors to enhance application stability.
  - Ensured cached memory is cleared upon OOM errors, preventing potential memory leaks.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
HydrogenSulfate authored Dec 23, 2024
1 parent cfe17a3 commit 242408d
Showing 1 changed file with 2 additions and 6 deletions.
8 changes: 2 additions & 6 deletions deepmd/pd/utils/auto_batch_size.py
```diff
@@ -49,12 +49,8 @@ def is_oom_error(self, e: Exception) -> bool:
         # several sources think CUSOLVER_STATUS_INTERNAL_ERROR is another out-of-memory error,
         # such as https://github.com/JuliaGPU/CUDA.jl/issues/1924
         # (the meaningless error message should be considered as a bug in cusolver)
-        if isinstance(e, RuntimeError) and (
-            "CUDA out of memory." in e.args[0]
-            or "CUDA driver error: out of memory" in e.args[0]
-            or "cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR" in e.args[0]
-        ):
+        if isinstance(e, MemoryError) and ("ResourceExhaustedError" in e.args[0]):
             # Release all unoccupied cached memory
-            # paddle.device.cuda.empty_cache()
+            paddle.device.cuda.empty_cache()
             return True
         return False
```
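For illustration, here is a minimal, self-contained sketch of the check this commit introduces: Paddle signals GPU out-of-memory as a `MemoryError` whose message contains `ResourceExhaustedError`, while PyTorch raises a `RuntimeError`. The standalone `is_oom_error` function below mirrors the patched method (the real code also calls `paddle.device.cuda.empty_cache()`, omitted here so the sketch runs without Paddle installed); the sample error messages are hypothetical.

```python
def is_oom_error(e: Exception) -> bool:
    """Return True if `e` looks like a Paddle GPU out-of-memory error."""
    if isinstance(e, MemoryError) and "ResourceExhaustedError" in e.args[0]:
        # The real implementation releases cached GPU memory here:
        # paddle.device.cuda.empty_cache()
        return True
    return False

# Paddle-style OOM vs. a PyTorch-style error (sample messages, not real logs)
paddle_oom = MemoryError("ResourceExhaustedError: out of memory on GPU 0")
torch_oom = RuntimeError("CUDA out of memory.")

print(is_oom_error(paddle_oom))  # True
print(is_oom_error(torch_oom))   # False — this is why the old check missed Paddle OOMs
```

Because the old code matched only `RuntimeError`, Paddle OOM exceptions propagated instead of triggering the automatic batch-size reduction; matching `MemoryError` restores that behavior.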
