
Irregular behaviour when using different CPU combinations #5548

Open · 16 tasks
JTaozhang opened this issue Nov 21, 2024 · 7 comments
Labels: Large Systems (Issues related to large-size systems), Questions (Raise your question! We will answer it.)

Comments

@JTaozhang

JTaozhang commented Nov 21, 2024

Describe the bug

Hi there,

Currently I am working on a WTe2 bilayer system containing about 504 atoms, and I am trying to calculate the band structure with 11 k-points along the high-symmetry path. The software version is v3.8.2. With the same INPUT and KPT settings but different CPU combinations, one job works and the other behaves abnormally: it reports nothing, no error and no other useful information. The working job uses mpirun -np 8 -env OMP_NUM_THREADS=28 with 224 CPUs in total (8 nodes, 56 CPUs per node); the failing job uses mpirun -np 20 -env OMP_NUM_THREADS=28 with 560 CPUs in total (10 nodes, 56 CPUs per node).
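For reference, a minimal sketch of the two launch lines described above (the real submission scripts are in the attached archives; the bare mpirun form shown here, without the scheduler wrapper, is an assumption):

    # Working combination (as reported): 8 MPI ranks x 28 OpenMP threads = 224 cores
    export OMP_NUM_THREADS=28
    mpirun -np 8 -env OMP_NUM_THREADS=28 abacus

    # Failing combination (as reported): 20 MPI ranks x 28 OpenMP threads = 560 cores
    export OMP_NUM_THREADS=28
    mpirun -np 20 -env OMP_NUM_THREADS=28 abacus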

For the abnormal job, the entire output is shown below:

                              ABACUS v3.8.2

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: unknown

    Start Time is Wed Nov 20 18:43:57 2024
                                                                                     
 ------------------------------------------------------------------------------------

 READING GENERAL INFORMATION

                           global_out_dir = OUT.WTe2/
                           global_in_card = INPUT
                               pseudo_dir = /share/home/zhangtao/work/WTe2/abacus/pseudo/
                              orbital_dir = /share/home/zhangtao/work/WTe2/abacus/orbital/

I don't know what causes this abnormal behavior; could you test the code? I think the parallel-computation part may still have some stability problems. I also discussed this problem in the WeChat group, and somebody suggested that I open an issue here, so I did.

The related file is attached here:
WTe2.zip

Expected behavior

The second submission setting should work, and run faster than the first setting.

To Reproduce

  1. Calculate the charge density files according to INPUT_scf and KPT_scf.
  2. After obtaining the charge density files, do the band calculation according to INPUT and KPT (see the sketch after this list).
  3. Compare the results and check the code.
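A minimal sketch of this two-step workflow, assuming the SCF inputs are staged as the active INPUT/KPT files before the first run and that charge-density output is enabled in INPUT_scf; the backup file names are hypothetical:

    # Step 1: SCF run to generate the charge density
    # (ABACUS reads the files named INPUT and KPT in the working directory)
    cp INPUT INPUT_band && cp KPT KPT_band   # hypothetical backups of the band-structure inputs
    cp INPUT_scf INPUT && cp KPT_scf KPT
    export OMP_NUM_THREADS=28
    mpirun -np 8 -env OMP_NUM_THREADS=28 abacus

    # Step 2: non-SCF band calculation that reads the charge density from step 1
    cp INPUT_band INPUT && cp KPT_band KPT
    mpirun -np 8 -env OMP_NUM_THREADS=28 abacus

    # Step 3: repeat step 2 with the other CPU combination and compare OUT.WTe2/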

Environment

module load cmake/cmake-3.25 gnu/12.1.0

source /share/apps/intel2022/setvars.sh
source /share/home/zhangtao/software/abacus-develop-3.8.3/toolchain/install/setup

Additional Context

No more information is needed.

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).
@QuantumMisaka
Collaborator

Hi @JTaozhang,
In my experience, this kind of problem comes from the job submission scripts and the server settings, but the provided files do not contain any information about them; please provide them in detail.
I have run parallel calculations with the HSE functional using OMP_NUM_THREADS=16 mpirun -np 32 abacus and it works well.
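For illustration, a minimal sketch of a submission script with a consistent MPI x OpenMP layout (Slurm is assumed here, and the partition name and resource numbers are placeholders that must match the actual cluster):

    #!/bin/bash
    #SBATCH --partition=normal       # hypothetical partition name
    #SBATCH --nodes=10
    #SBATCH --ntasks-per-node=2      # 2 MPI ranks per node -> 20 ranks in total
    #SBATCH --cpus-per-task=28       # 28 OpenMP threads per rank; 2 x 28 = 56 cores per node

    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
    mpirun -np ${SLURM_NTASKS} abacus > abacus.log 2>&1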

@JTaozhang
Author

Hi,

Thanks for your reply. I have attached the submission scripts here. The combination "mpirun -np 8 -env OMP_NUM_THREADS=28, 224 CPUs in total (8 nodes, 56 CPUs per node)" works; however, "mpirun -np 20 -env OMP_NUM_THREADS=28, 560 CPUs in total (10 nodes, 56 CPUs per node)" fails.

I think different machines have different settings, so I am not sure you can reproduce my case on your machine. Maybe you can change your combination and check this problem with my atomic system.

One more question: fewer tasks on a node means the node's memory is shared by fewer tasks, right? And OMP_NUM_THREADS decides how the CPUs are distributed to one task, which governs the parallel computation (see the sketch below).
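A hedged illustration of that mapping, using the 56-core node and 28-thread numbers from above (the core ranges are only an example of how a pinned layout could look):

    # One 56-core node with 2 MPI ranks and OMP_NUM_THREADS=28:
    #   rank 0 -> cores  0-27 (28 OpenMP threads)
    #   rank 1 -> cores 28-55 (28 OpenMP threads)
    # Both ranks share the node's physical memory, so fewer ranks per node
    # leave more memory available to each rank.
    export OMP_NUM_THREADS=28
    mpirun -np 2 -env OMP_NUM_THREADS=28 abacus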

20-28.zip

Best,
Tao

@mohanchen added the Questions (Raise your question! We will answer it.) and Large Systems (Issues related to large-size systems) labels on Nov 23, 2024
@YuLiu98
Collaborator

YuLiu98 commented Nov 25, 2024

Could you supply your STRU and charge density files for nscf?

@JTaozhang
Author

ok. I attached the file here.

Best.

@YuLiu98
Collaborator

YuLiu98 commented Nov 25, 2024

> ok. I attached the file here.
> Best.

Thank you, but it seems that you failed to upload the files.


@JTaozhang
Author

I think the upload failed because of the large file size, so I will share it via Baidu cloud instead.

Link: https://pan.baidu.com/s/1lphoofZi1MhJh49etZ2weQ
Extraction code: 1230

Best
