
Irregular behaviour when using different CPU combinations #5548

Open · 16 tasks
JTaozhang opened this issue Nov 21, 2024 · 7 comments
Labels: Large Systems (Issues related to large-size systems), Questions (Raise your question! We will answer it.)

Comments

@JTaozhang

JTaozhang commented Nov 21, 2024

Describe the bug

Hi there,

Currently I am working on a WTe2 bilayer system containing about 504 atoms, and I am trying to calculate the band structure with 11 k-points along the high-symmetry path. The software version is v3.8.2. With the same INPUT and KPT settings but different CPU combinations, one job works and the other behaves abnormally: it reports nothing, no error and no other useful information. The working job uses mpirun -np 8 -env OMP_NUM_THREADS=28 with 224 CPUs in total (8 nodes, 56 CPUs per node); the failing job uses mpirun -np 20 -env OMP_NUM_THREADS=28 with 560 CPUs in total (10 nodes, 56 CPUs per node).
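For reference, a minimal sketch of the two launch lines described above (the real submission scripts are in the attached archives; the bare mpirun form shown here, without the scheduler wrapper, is an assumption):

    # Working combination (as reported): 8 MPI ranks x 28 OpenMP threads = 224 cores
    export OMP_NUM_THREADS=28
    mpirun -np 8 -env OMP_NUM_THREADS=28 abacus

    # Failing combination (as reported): 20 MPI ranks x 28 OpenMP threads = 560 cores
    export OMP_NUM_THREADS=28
    mpirun -np 20 -env OMP_NUM_THREADS=28 abacus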

For the abnormal job, the entire output is shown below:

                              ABACUS v3.8.2

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: unknown

    Start Time is Wed Nov 20 18:43:57 2024
                                                                                     
 ------------------------------------------------------------------------------------

 READING GENERAL INFORMATION

                           global_out_dir = OUT.WTe2/
                           global_in_card = INPUT
                               pseudo_dir = /share/home/zhangtao/work/WTe2/abacus/pseudo/
                              orbital_dir = /share/home/zhangtao/work/WTe2/abacus/orbital/

I don't know what causes this abnormal behavior; could you test the code? I think the parallel-computation part may still have some stability problems. I also discussed this problem in the WeChat group, and somebody suggested that I open an issue here, so I did.

The related file is attached here:
WTe2.zip

Expected behavior

The second submission setting should work, and run faster than the first setting.

To Reproduce

  1. Calculate the charge density files according to INPUT_scf and KPT_scf.
  2. After obtaining the charge density files, do the band calculation according to INPUT and KPT (see the sketch after this list).
  3. Compare the results and check the code.
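A minimal sketch of this two-step workflow, assuming the SCF inputs are staged as the active INPUT/KPT files before the first run and that charge-density output is enabled in INPUT_scf; the backup file names are hypothetical:

    # Step 1: SCF run to generate the charge density
    # (ABACUS reads the files named INPUT and KPT in the working directory)
    cp INPUT INPUT_band && cp KPT KPT_band   # hypothetical backups of the band-structure inputs
    cp INPUT_scf INPUT && cp KPT_scf KPT
    export OMP_NUM_THREADS=28
    mpirun -np 8 -env OMP_NUM_THREADS=28 abacus

    # Step 2: non-SCF band calculation that reads the charge density from step 1
    cp INPUT_band INPUT && cp KPT_band KPT
    mpirun -np 8 -env OMP_NUM_THREADS=28 abacus

    # Step 3: repeat step 2 with the other CPU combination and compare OUT.WTe2/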

Environment

module load cmake/cmake-3.25 gnu/12.1.0

source /share/apps/intel2022/setvars.sh
source /share/home/zhangtao/software/abacus-develop-3.8.3/toolchain/install/setup

Additional Context

No more information is needed.

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).
@QuantumMisaka
Collaborator

Hi @JTaozhang,
In my experience, this kind of problem comes from the job submission scripts and the server settings, but the provided files do not contain any information about them; please provide them in detail.
I have run parallel calculations with the HSE functional using OMP_NUM_THREADS=16 mpirun -np 32 abacus and it works well.
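For illustration, a minimal sketch of a submission script with a consistent MPI x OpenMP layout (Slurm is assumed here, and the partition name and resource numbers are placeholders that must match the actual cluster):

    #!/bin/bash
    #SBATCH --partition=normal       # hypothetical partition name
    #SBATCH --nodes=10
    #SBATCH --ntasks-per-node=2      # 2 MPI ranks per node -> 20 ranks in total
    #SBATCH --cpus-per-task=28       # 28 OpenMP threads per rank; 2 x 28 = 56 cores per node

    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
    mpirun -np ${SLURM_NTASKS} abacus > abacus.log 2>&1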

@JTaozhang
Author

Hi,

Thanks for your reply. I have attached the submission scripts here. The combination "mpirun -np 8 -env OMP_NUM_THREADS=28, 224 CPUs in total (8 nodes, 56 CPUs per node)" works; however, "mpirun -np 20 -env OMP_NUM_THREADS=28, 560 CPUs in total (10 nodes, 56 CPUs per node)" fails.

I think different machines have different settings, so I am not sure you can reproduce my case on your machine. Maybe you can change your combination and check this problem with my atomic system.

One more question: fewer tasks on a node means the node's memory is shared by fewer tasks, right? And OMP_NUM_THREADS decides how the CPUs are distributed to one task, which governs the parallel computation (see the sketch below).
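A hedged illustration of that mapping, using the 56-core node and 28-thread numbers from above (the core ranges are only an example of how a pinned layout could look):

    # One 56-core node with 2 MPI ranks and OMP_NUM_THREADS=28:
    #   rank 0 -> cores  0-27 (28 OpenMP threads)
    #   rank 1 -> cores 28-55 (28 OpenMP threads)
    # Both ranks share the node's physical memory, so fewer ranks per node
    # leave more memory available to each rank.
    export OMP_NUM_THREADS=28
    mpirun -np 2 -env OMP_NUM_THREADS=28 abacus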

20-28.zip

Best,
Tao

@mohanchen added the Questions (Raise your question! We will answer it.) and Large Systems (Issues related to large-size systems) labels on Nov 23, 2024
@YuLiu98
Collaborator

YuLiu98 commented Nov 25, 2024

Could you supply your STRU and charge density files for nscf?

@JTaozhang
Author

ok. I attached the file here.

Best.

@YuLiu98
Collaborator

YuLiu98 commented Nov 25, 2024

> ok. I attached the file here.
> Best.

Thank you, but it seems that you failed to upload the files.


@JTaozhang
Author

I think the upload failed because of the large file size, so I will share it via Baidu cloud instead.

Link: https://pan.baidu.com/s/1lphoofZi1MhJh49etZ2weQ
Extraction code: 1230

Best
