The function deepmd.infer.DeepDipole.eval() cannot utilize multiple GPUs in parallel #2877
I guess that this error stems from the lack of support for multi-GPU parallelization of inference from the Python API. @njzjz Is that true? @Kehan-Cai-nanako Can you try to use LAMMPS's compute deeptensor/atom instead?
The error arises when I only input one frame, i.e., the initial
configuration of the simulation.
Best,
Kehan
On Thu, Sep 28, 2023 at 1:32 PM Jinzhe Zeng wrote:
Do you input one frame or multiple frames? Currently DeepTensor does not support automatic batch size, unlike DeepPot, so this may cause OOM. @Yi-FanLi You can try to support it. See #1173.
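For multi-frame inputs, a minimal sketch of what manually batching frames through DeepDipole.eval() might look like while DeepTensor lacks automatic batch size; the model file name, chunk size, and array shapes (nframes × natoms*3 coordinates, nframes × 9 cells) are assumptions for illustration, not taken from this issue:

```python
import numpy as np
from deepmd.infer import DeepDipole

dw = DeepDipole("dipole.pb")  # hypothetical model file

def eval_in_chunks(coords, cells, atom_types, chunk=1):
    """Evaluate a trajectory a few frames at a time to limit peak GPU
    memory, since DeepTensor does not batch frames automatically."""
    outputs = []
    for start in range(0, coords.shape[0], chunk):
        outputs.append(
            dw.eval(coords[start:start + chunk],
                    cells[start:start + chunk],
                    atom_types)
        )
    return np.concatenate(outputs, axis=0)
```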
By the way, the size of the system is 34560 atoms + 27548 ghost atoms, which may be relatively large for the memory capacity of the GPU. (The simulation used DPLR.)
Best,
It's a little bit weird because you only have 1 frame and ~30k atoms. With an energy model based on the se_e2_a descriptor, an 80 GB A100 can bear ~1000k atoms. @njzjz I think we need to do a more detailed analysis of the memory use in DeepTensor's inference.
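A quick back-of-envelope check of that estimate (the figures are the rough numbers quoted in this thread, not measurements):

```python
# ~80 GB A100 divided by the ~1000k atoms it can bear for an se_e2_a energy model
per_atom_bytes = 80e9 / 1_000_000            # roughly 80 kB per atom

system_atoms = 34560 + 27548                 # local + ghost atoms reported above
print(f"~{system_atoms * per_atom_bytes / 1e9:.0f} GB expected")  # about 5 GB, well under 80 GB
```

So raw system size alone should not exhaust an 80 GB card, which suggests something else (for example the neighbor-list allocation) is blowing up.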
We concluded that this issue is due to the system being too large for the GPU version of the Python interface. More specifically, the neighbor list is so large that it cannot be allocated on the GPU. The workaround is to use LAMMPS's rerun command together with the compute deeptensor/atom command in LAMMPS.
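A rough sketch of that workaround, driving LAMMPS through its Python module; the file names, data file, and the rest of the DPLR setup (masses, virtual atoms, etc.) are placeholders and omissions, not a complete input:

```python
from lammps import lammps  # LAMMPS Python interface

lmp = lammps()
lmp.command("units metal")
lmp.command("atom_style full")
lmp.command("read_data conf.lmp")               # placeholder data file
lmp.command("pair_style deepmd graph.pb")       # placeholder energy model
lmp.command("pair_coeff * *")
# Per-atom tensor (e.g. Wannier centroids) from the dipole model:
lmp.command("compute wc all deeptensor/atom dipole.pb")
lmp.command("dump 1 all custom 1 wc.dump id type c_wc[1] c_wc[2] c_wc[3]")
# Re-evaluate an existing trajectory instead of running new dynamics:
lmp.command("rerun traj.dump dump x y z")
```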
Thank you for the clarification.
Best,
Kehan
The space complexity of the current neighboring algorithm is …
Fix deepmodeling#2877
Fix #2877
With #3046, I can now run the script, so I think this issue has been resolved.
I note that ASE may not be the fastest implementation. One can use other implementations but convert them to the ASE interface.
Bug summary
When using the function deepmd.infer.DeepDipole.eval() to infer Wannier centroids, even though I requested multiple GPUs, only one of them is used in practice while the others sit idle. Namely, the function feeds all of the atomic-position information into a single GPU, which can trigger an out-of-memory error when the simulation system is large.
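For reference, a minimal sketch of the call pattern involved (the model file, box, and atom types below are placeholders, not the actual inputs attached to this issue):

```python
import numpy as np
from deepmd.infer import DeepDipole

dw = DeepDipole("dipole.pb")                        # hypothetical dipole model

nframes, natoms = 1, 34560
coords = np.random.rand(nframes, natoms * 3)        # flattened xyz, one frame
cells = np.diag([50.0, 50.0, 50.0]).reshape(1, 9)   # placeholder orthogonal box
atom_types = [0] * natoms                           # placeholder types

# eval() sends the whole system to a single GPU; the other requested GPUs stay idle.
wannier = dw.eval(coords, cells, atom_types).reshape(-1, 3)
```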
DeePMD-kit Version
2.2.4
TensorFlow Version
2.12.0
How did you download the software?
conda
Input Files, Running Commands, Error Log, etc.
2023-09-28 11:33:23.683264: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-09-28 11:33:25.043901: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-28 11:33:27.204238: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2023-09-28 11:33:33.162274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79067 MB memory: -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:65:00.0, compute capability: 8.0
2023-09-28 11:33:33.163249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79067 MB memory: -> device: 1, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:ca:00.0, compute capability: 8.0
2023-09-28 11:33:33.231080: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
2023-09-28 11:34:55.644022: W tensorflow/tsl/framework/bfc_allocator.cc:366] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
cuda assert: invalid argument /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/lib/src/cuda/neighbor_list.cu 194
2023-09-28 11:34:56.865522: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at custom_op.cc:18 : INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
2023-09-28 11:34:56.865588: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
2023-09-28 11:34:56.865609: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
[[load/o_dipole/_25]]
Traceback (most recent call last):
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1378, in _do_call
return fn(*args)
^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1361, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1454, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
[[load/o_dipole/_25]]
(1) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/scratch/gpfs/kehanc/StudyFold/ResearchFold/Relaxor_Ferroelectrics/preliminary/src/wannier3.py", line 131, in
compute_wannier_centroid_savenpz(read_conf_directory, read_traj_directory, DW, 'full') # MODIFY!! concern atom_style = 'full' or 'atomic'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/gpfs/kehanc/StudyFold/ResearchFold/Relaxor_Ferroelectrics/preliminary/src/wannier3.py", line 64, in compute_wannier_centroid_savenpz
wannier_ref = DW.eval(pos_ref, cell_ref, atom_types=atypes).reshape(-1,3)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/deepmd/infer/deep_tensor.py", line 229, in eval
v_out = self.sess.run(t_out, feed_dict=feed_dict_test)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 968, in run
result = self._run(None, fetches, feed_dict, options_ptr,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1191, in _run
results = self._do_run(handle, final_targets, final_fetches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1371, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1397, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node 'load/ProdEnvMatA' defined at (most recent call last):
Node: 'load/ProdEnvMatA'
Detected at node 'load/ProdEnvMatA' defined at (most recent call last):
Node: 'load/ProdEnvMatA'
2 root error(s) found.
(0) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
[[load/o_dipole/_25]]
(1) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'load/ProdEnvMatA':
give_yifan.zip
Steps to Reproduce
Run the command:
sbatch run_wc3.slurm
Further Information, Files, and Links
No response