
Use mpirun on glogin/blogin (SLURM) #1208

Open · joakimkjellsson opened this issue Aug 14, 2024 · 6 comments

Comments

@joakimkjellsson
Contributor

Good afternoon all,

glogin (GWDG Emmy) has undergone some hardware and software upgrades recently. Since the upgrade, I find that jobs launched with srun are considerably slower than jobs launched with mpirun. The support team recommends mpirun, so I'd like to use it.

But I can't work out whether ESM-Tools can do this. There is an mpirun.py file with a function to write a hostfile for mpirun, but as far as I can see that function is never called. If we use SLURM, ESM-Tools seems to always build a hostfile_srun and then launch with srun.

My idea would be to have something like this in slurm.py. Line 65 is currently:

write_one_hostfile(current_hostfile, config)

but it should be

if launcher == 'srun':
    write_one_hostfile_srun(current_hostfile, config)
elif launcher == 'mpirun':
    write_one_hostfile_mpirun(current_hostfile, config)
else:
    print(' ESM-Tools does not recognise the launcher ', launcher)
    print(' The launchers supported are srun and mpirun')

and then the two functions would be slightly different.
One benefit with mpirun would be that heterogeneous parallelisation becomes very easy since we can do:

mpirun OMP_NUM_THREADS=4 -np 288 ./oifs -e ECE3 : -np 432 ./oceanx : -np 20 ./xios.x 

although I'm not sure and would have to double-check exactly how it should be done on glogin.
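
For what it's worth, with Open MPI the environment variable would normally be exported with -x rather than set inline, so the MPMD line might look more like this sketch (assuming Open MPI; Intel MPI would want -genv instead, and whether -x needs repeating per app context should be checked against the glogin docs):

mpirun -x OMP_NUM_THREADS=4 -np 288 ./oifs -e ECE3 : -np 432 ./oceanx : -np 20 ./xios.x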

Before I venture down this path though, I just want to check: Is it already possible to use mpirun but I'm just too dense to figure out how? If not, is someone else already working on a similar solution?

Cheers
Joakim

@pgierz
Member

pgierz commented Aug 14, 2024

Hi @joakimkjellsson,

did you try setting computer.launcher to mpirun? You can do that in your runscript; it will swap out your srun <OPTIONS> call for mpirun instead.
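
In the runscript that would look roughly like this (just a sketch, assuming the usual YAML runscript layout; launcher_flags may also need emptying or adjusting, since srun flags won't apply to mpirun):

computer:
    launcher: "mpirun"
    launcher_flags: ""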

I'd need to look more deeply into how to set the actual options. That would need a code change.

@joakimkjellsson
Contributor Author

Hi @pgierz
Sorry, I forgot to mention this. If I do that (launcher: mpirun and launcher_flags: "") my launch command becomes:

time mpirun  $(cat hostfile_srun) 2>&1 &

so it would use mpirun but pass the executables in the format expected by srun.
At the moment, hostfile_srun is:

0-287  ./oifs -e ECE3
288-719  ./oceanx
720-739  ./xios.x
740-740  ./rnfma

but I would need it to be

-np 288 ./oifs -e ECE3 : -np 432 ./oceanx : -np 20 ./xios.x : -np 1 ./rnfma

The function write_one_hostfile in mpirun.py seems to do that, but it never gets called. Almost as if someone started working on this but never finished ;-)
I would like to have two functions, write_one_hostfile_srun and write_one_hostfile_mpirun, and have some kind of if statement in slurm.py to choose which one to use.
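
Around line 65 of slurm.py the launcher name would presumably be looked up from the config first, something along the lines of the snippet below (my assumption; the exact key is whatever computer.launcher ends up as in the config dictionary), and then the if/elif from my first comment would dispatch on it:

launcher = config["computer"].get("launcher", "srun")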

/J

@pgierz
Member

pgierz commented Aug 14, 2024

@joakimkjellsson What branch are you on? I'll start from that one, should be quick enough to program.

@joakimkjellsson
Contributor Author

@pgierz no worries. I've already coded it in. My main question was whether someone had already done it or was planning to do it, in which case I would not do it :-)

I renamed the old write_one_hostfile to write_one_hostfile_srun and made a new write_one_hostfile:

def write_one_hostfile(self, hostfile, config):
    """
    Gathers previously prepared requirements
    (batch_system.calculate_requirements) and writes them to the given
    ``hostfile``. Suitable for the mpirun launcher.
    """
    # build up the mpirun MPMD options string one model component at a time
    mpirun_options = ""

    for model in config["general"]["valid_model_names"]:
        end_proc = config[model].get("end_proc", None)
        start_proc = config[model].get("start_proc", None)
        print(' model ', model)
        print(' start_proc ', start_proc)
        print(' end_proc ', end_proc)

        # a model component like oasis3mct does not need cores since it is
        # technically a library, so start_proc and end_proc will be None.
        # Skip it.
        if start_proc is None or end_proc is None:
            continue

        # number of cores needed by this component
        no_cpus = end_proc - start_proc + 1
        print(' no_cpus ', no_cpus)

        if "execution_command" in config[model]:
            command = "./" + config[model]["execution_command"]
        elif "executable" in config[model]:
            command = "./" + config[model]["executable"]
        else:
            continue

        # append this component to the mpirun command
        mpirun_options += " -np %d %s :" % (no_cpus, command)

    mpirun_options = mpirun_options[:-1]  # remove trailing ":"

    with open(hostfile, "w") as hostfile_handle:
        hostfile_handle.write(mpirun_options)

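As a quick sanity check, feeding it a toy config with the PE layout from above (the component names and config structure here are only assumed for illustration, not taken from a real ESM-Tools config) produces the expected one-liner:

config = {
    "general": {"valid_model_names": ["oifs", "oceanx", "xios", "rnfma"]},
    "oifs":   {"start_proc": 0,   "end_proc": 287, "execution_command": "oifs -e ECE3"},
    "oceanx": {"start_proc": 288, "end_proc": 719, "executable": "oceanx"},
    "xios":   {"start_proc": 720, "end_proc": 739, "executable": "xios.x"},
    "rnfma":  {"start_proc": 740, "end_proc": 740, "executable": "rnfma"},
}
# write_one_hostfile then writes:
# -np 288 ./oifs -e ECE3 : -np 432 ./oceanx : -np 20 ./xios.x : -np 1 ./rnfma
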
I've already made a few test runs and it seems to work. I'll do some more tests, and then it will end up in the feature/blogin-rockylinux9 branch, where I'm trying to get FOCI-OpenIFS running on glogin.

/J

@mandresm
Contributor

Perfect, thanks for figuring that out. Let us know when you are ready to merge, and we can see whether we can further generalize the write_one_hostfile function.

@joakimkjellsson
Contributor Author

I made the change to slurm.py: 058fcf9#diff-0c204676837e94ca027f7a61a71d27914ea3a6b8071d5d3dc4c7791dfa5eb15b

When Sebastian is back we might do some cleanup etc. and then merge this fix branch into geomar_dev. That can then be merged into release.

Cheers!
/J
