pd: fix oom error (#4493)
Paddle raises `MemoryError` rather than the `RuntimeError` used in PyTorch. With this fix I can test DPA-1 and DPA-2 on a 16 GB V100...

![image](https://github.com/user-attachments/assets/42ead773-bf26-4195-8f67-404b151371de)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Bug Fixes**
  - Improved detection of out-of-memory (OOM) errors to enhance application stability.
  - Ensured cached memory is cleared upon OOM errors, preventing potential memory leaks.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
HydrogenSulfate authored Dec 23, 2024
1 parent cfe17a3 commit 242408d
Showing 1 changed file with 2 additions and 6 deletions.
8 changes: 2 additions & 6 deletions deepmd/pd/utils/auto_batch_size.py
```diff
@@ -49,12 +49,8 @@ def is_oom_error(self, e: Exception) -> bool:
         # several sources think CUSOLVER_STATUS_INTERNAL_ERROR is another out-of-memory error,
         # such as https://github.com/JuliaGPU/CUDA.jl/issues/1924
         # (the meaningless error message should be considered as a bug in cusolver)
-        if isinstance(e, RuntimeError) and (
-            "CUDA out of memory." in e.args[0]
-            or "CUDA driver error: out of memory" in e.args[0]
-            or "cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR" in e.args[0]
-        ):
+        if isinstance(e, MemoryError) and ("ResourceExhaustedError" in e.args[0]):
             # Release all unoccupied cached memory
-            # paddle.device.cuda.empty_cache()
+            paddle.device.cuda.empty_cache()
             return True
         return False
```
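For illustration, here is a minimal, self-contained sketch of the check this commit introduces: Paddle signals GPU out-of-memory as a `MemoryError` whose message contains `ResourceExhaustedError`, while PyTorch raises a `RuntimeError`. The standalone `is_oom_error` function below mirrors the patched method (the real code also calls `paddle.device.cuda.empty_cache()`, omitted here so the sketch runs without Paddle installed); the sample error messages are hypothetical.

```python
def is_oom_error(e: Exception) -> bool:
    """Return True if `e` looks like a Paddle GPU out-of-memory error."""
    if isinstance(e, MemoryError) and "ResourceExhaustedError" in e.args[0]:
        # The real implementation releases cached GPU memory here:
        # paddle.device.cuda.empty_cache()
        return True
    return False

# Paddle-style OOM vs. a PyTorch-style error (sample messages, not real logs)
paddle_oom = MemoryError("ResourceExhaustedError: out of memory on GPU 0")
torch_oom = RuntimeError("CUDA out of memory.")

print(is_oom_error(paddle_oom))  # True
print(is_oom_error(torch_oom))   # False — this is why the old check missed Paddle OOMs
```

Because the old code matched only `RuntimeError`, Paddle OOM exceptions propagated instead of triggering the automatic batch-size reduction; matching `MemoryError` restores that behavior.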
