The function deepmd.infer.DeepDipole.eval() cannot utilize multiple GPUs in parallel #2877

Closed
Kehan-Cai-nanako opened this issue Sep 28, 2023 · 9 comments · Fixed by #3046
@Kehan-Cai-nanako

Bug summary

When using the function deepmd.infer.DeepDipole.eval() to infer Wannier centroids, only one GPU is used in practice even though I requested several; the others sit idle. That is, the function feeds all atomic positions to a single GPU, which can trigger an out-of-memory error when the simulation system is large.

DeePMD-kit Version

2.2.4

TensorFlow Version

2.12.0

How did you download the software?

conda

Input Files, Running Commands, Error Log, etc.

2023-09-28 11:33:23.683264: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-09-28 11:33:25.043901: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-28 11:33:27.204238: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2023-09-28 11:33:33.162274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79067 MB memory: -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:65:00.0, compute capability: 8.0
2023-09-28 11:33:33.163249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79067 MB memory: -> device: 1, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:ca:00.0, compute capability: 8.0
2023-09-28 11:33:33.231080: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
2023-09-28 11:34:55.644022: W tensorflow/tsl/framework/bfc_allocator.cc:366] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
cuda assert: invalid argument /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/lib/src/cuda/neighbor_list.cu 194
2023-09-28 11:34:56.865522: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at custom_op.cc:18 : INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
2023-09-28 11:34:56.865588: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
2023-09-28 11:34:56.865609: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
[[load/o_dipole/_25]]
Traceback (most recent call last):
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1378, in _do_call
return fn(*args)
^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1361, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1454, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
[[load/o_dipole/_25]]
(1) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/scratch/gpfs/kehanc/StudyFold/ResearchFold/Relaxor_Ferroelectrics/preliminary/src/wannier3.py", line 131, in <module>
compute_wannier_centroid_savenpz(read_conf_directory, read_traj_directory, DW, 'full') # MODIFY!! concern atom_style = 'full' or 'atomic'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/gpfs/kehanc/StudyFold/ResearchFold/Relaxor_Ferroelectrics/preliminary/src/wannier3.py", line 64, in compute_wannier_centroid_savenpz
wannier_ref = DW.eval(pos_ref, cell_ref, atom_types=atypes).reshape(-1,3)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/deepmd/infer/deep_tensor.py", line 229, in eval
v_out = self.sess.run(t_out, feed_dict=feed_dict_test)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 968, in run
result = self._run(None, fetches, feed_dict, options_ptr,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1191, in _run
results = self._do_run(handle, final_targets, final_fetches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1371, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1397, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Detected at node 'load/ProdEnvMatA' defined at (most recent call last):
Node: 'load/ProdEnvMatA'
Detected at node 'load/ProdEnvMatA' defined at (most recent call last):
Node: 'load/ProdEnvMatA'
2 root error(s) found.
(0) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
[[load/o_dipole/_25]]
(1) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'load/ProdEnvMatA':

give_yifan.zip

Steps to Reproduce

Run the command:

sbatch run_wc3.slurm

Further Information, Files, and Links

No response

@Yi-FanLi Yi-FanLi self-assigned this Sep 28, 2023
@Yi-FanLi
Collaborator

I guess that this error stems from the lack of support for multi-GPU parallelization of inference in the Python API. @njzjz Is that true?

@Kehan-Cai-nanako Can you try LAMMPS's rerun feature with the compute deeptensor/atom command to do the inference?

@njzjz
Member

njzjz commented Sep 28, 2023

Do you input one frame or multiple frames? Currently, DeepTensor does not support automatic batch size, unlike DeepPot, so inputting multiple frames may cause OOM.
@Yi-FanLi You can try to support it. See #1173.
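
Until automatic batch sizing is available, multiple frames can be fed in fixed-size chunks by hand. The following is a minimal sketch, not DeePMD-kit code: `eval_in_chunks`, `eval_fn`, and `chunk_size` are hypothetical names, and `eval_fn` is assumed to be a thin wrapper around a call such as the `DW.eval(...)` shown in the traceback above. It only helps when several frames are passed at once; it does not reduce the memory needed for a single large frame.

```python
def eval_in_chunks(eval_fn, frames, chunk_size):
    """Evaluate `frames` a few at a time and collect the per-frame results.

    eval_fn    -- hypothetical callable: takes a slice of frames and returns
                  one result per frame (a list, or an array whose first axis
                  is the frame index).
    frames     -- any sliceable sequence of frames (list or array).
    chunk_size -- maximum number of frames passed to eval_fn per call.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    results = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        # Only `chunk_size` frames are handed to the model at once, so the
        # peak memory per call scales with the chunk, not the trajectory.
        results.extend(eval_fn(chunk))
    return results
```

A wrapper for the dipole model might look like `lambda c: DW.eval(c, cells_for(c), atom_types=atypes)`, where `cells_for` is again a hypothetical helper that returns the matching cells for the chunk.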

@Kehan-Cai-nanako
Author

Kehan-Cai-nanako commented Sep 28, 2023 via email

@Kehan-Cai-nanako
Author

Kehan-Cai-nanako commented Sep 28, 2023 via email

@Yi-FanLi
Collaborator

It's a little bit weird because you only have 1 frame and ~30k atoms. With an energy model based on the se_e2_a descriptor, an 80GB A100 can handle ~1000k atoms. @njzjz I think we need to do a more detailed analysis of the memory use in DeepTensor's inference.

@Yi-FanLi
Collaborator

Yi-FanLi commented Oct 1, 2023

We concluded that this issue is due to the system being too large for the GPU version of the Python interface. More specifically, the neighbor list is too large to be allocated on the GPU. The workaround is to use LAMMPS's rerun command together with the compute deeptensor/atom command.

@Kehan-Cai-nanako
Author

Kehan-Cai-nanako commented Oct 1, 2023 via email

@njzjz
Member

njzjz commented Oct 1, 2023

The space complexity of the current neighboring algorithm is $O(n^2)$. A better algorithm is required when the number of atoms is large.
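
To put the quadratic scaling in perspective: for the 34,560 atoms in this report, any n×n table has 34560² = 1,194,393,600 entries. One common way to avoid such a table is a cell list, which bins atoms into cells no smaller than the cutoff and only compares atoms in neighboring cells. The sketch below illustrates that general technique in plain Python; it is not the DeePMD-kit implementation and not necessarily what the linked pull request does. It assumes an orthorhombic periodic box with the cutoff at most half of the shortest box edge, and all names (`cell_list_pairs`, `min_image_dist2`) are hypothetical.

```python
from collections import defaultdict
from itertools import product


def min_image_dist2(a, b, box):
    """Squared distance between a and b under the minimum-image convention."""
    d2 = 0.0
    for d in range(3):
        dx = b[d] - a[d]
        dx -= box[d] * round(dx / box[d])  # wrap into [-box/2, box/2]
        d2 += dx * dx
    return d2


def cell_list_pairs(positions, box, rcut):
    """Return sorted (i, j) pairs, i < j, closer than rcut in a periodic box.

    Memory is proportional to the number of atoms plus the number of pairs
    found, instead of an n-by-n table.
    """
    # Each cell edge is box[d] / ncell[d] >= rcut, so every partner of an
    # atom lies in its own cell or one of the adjacent cells.
    ncell = [max(1, int(box[d] // rcut)) for d in range(3)]

    def cell_of(p):
        return tuple(
            int((p[d] % box[d]) / box[d] * ncell[d]) % ncell[d] for d in range(3)
        )

    cells = defaultdict(list)
    for i, p in enumerate(positions):
        cells[cell_of(p)].append(i)

    pairs = []
    shifts = list(product((-1, 0, 1), repeat=3))
    for c, members in cells.items():
        # A set removes duplicates when a dimension has fewer than 3 cells.
        neighbors = {
            tuple((c[d] + s[d]) % ncell[d] for d in range(3)) for s in shifts
        }
        for nc in neighbors:
            for i in members:
                for j in cells.get(nc, ()):
                    if i < j and min_image_dist2(
                        positions[i], positions[j], box
                    ) <= rcut * rcut:
                        pairs.append((i, j))
    return sorted(pairs)
```

For example, `cell_list_pairs([(0.1, 0, 0), (9.9, 0, 0), (5, 5, 5)], (10.0, 10.0, 10.0), 1.0)` finds the single pair that straddles the periodic boundary.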

@njzjz njzjz added enhancement and removed bug labels Oct 1, 2023
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Dec 8, 2023
@njzjz njzjz linked a pull request Dec 8, 2023 that will close this issue
@njzjz njzjz self-assigned this Dec 8, 2023
wanghan-iapcm pushed a commit that referenced this issue Dec 11, 2023
@njzjz
Member

njzjz commented Dec 11, 2023

With #3046, I can now run the script, so I think this issue has been resolved.

output:

read_conf_directory = ./T50/
read_traj_directory = ./T50/
atypes.shape = (34560,)
atypes = [3 1 2 2 2 3 1 2 2 2 3 1 2 2 2 3 0 2 2 2]
pos_ref.shape = (34560, 3)
cell_ref = Cell([67.88225099, 67.88225099, 96.0])
np.diag(cell_ref) = [67.88225099 67.88225099 96.        ]
type(wannier_ref) is <class 'numpy.ndarray'>
wannier_ref.shape = (27648, 3)
wannier_ref = [[ 6.76164581e-03  4.77885641e-11 -4.78120556e-03]
 [-4.82161764e-11  4.37749264e-11  4.10922896e-03]
 [ 7.36184699e-02 -7.36184692e-02 -5.14084738e-03]
 [ 5.39583441e-03 -1.28866391e-10  3.81543074e-03]
 [ 3.29759379e-10  6.76164553e-03  4.78120554e-03]
 [ 4.25239296e-10  3.73859745e-10  9.63019841e-02]
 [-5.18825737e-03  5.18825643e-03  3.24166692e-18]
 [ 6.80957854e-02  6.80957861e-02 -5.94455577e-18]
 [ 1.37889803e-02 -1.37889803e-02 -8.30501870e-18]
 [-5.00852858e-11 -5.39583395e-03 -3.81543078e-03]]

I note that ASE may not be the fastest implementation. One can use other implementations by converting them to the ASE interface.
