Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent launchers to launch incompatible Step objects #594

Open
al-rigazzi opened this issue May 21, 2024 · 0 comments
Open

Prevent launchers to launch incompatible Step objects #594

al-rigazzi opened this issue May 21, 2024 · 0 comments
Labels
area: launcher Issues related to any of the launchers within SmartSim repo: smartsim Issues related to SmartSim infrastructure library type: feature Issues that include feature request or feature idea

Comments

@al-rigazzi
Copy link
Collaborator

Description

Launchers should raise an error if users attempt to run an incompatible Step.

Justification

Launchers determine the correct way of launching a Step through its type. For example, the main branching of the SlurmLauncher.run() method looks like this:

        if isinstance(step, SbatchStep):
            # wait for batch step to submit successfully
            return_code, out, err = self.task_manager.start_and_wait(cmd_list, step.cwd)
            if return_code != 0:
                raise LauncherError(f"Sbatch submission failed\n {out}\n {err}")
            if out:
                step_id = out.strip()
                logger.debug(f"Gleaned batch job id: {step_id} for {step.name}")

        # Launch a in-allocation or on-allocation (if srun) command
        elif isinstance(step, SrunStep):
            task_id = self.task_manager.start_task(cmd_list, step.cwd)
        else:
            # MPI/local steps don't direct output like slurm steps
            out, err = step.get_output_files()

            # pylint: disable-next=consider-using-with
            output = open(out, "w+", encoding="utf-8")
            # pylint: disable-next=consider-using-with
            error = open(err, "w+", encoding="utf-8")
            task_id = self.task_manager.start_task(
                cmd_list, step.cwd, step.env, out=output.fileno(), err=error.fileno()
            )

The last catch-all else branch will try to run any Step which is not a SrunStep or an SbatchStep as some type of mpirun step. A stronger type checking should be enforced to avoid user-errors stemming from wrong script portings (e.g. running a JsrunStep through the SlurmLauncher.

Implementation Strategy

All if/elif branches should ensure a compatible step is being launched, all compatible steps should be covered by the branches (mpirun steps should be identified in an extensible way). The else branch should raise an error.

@al-rigazzi al-rigazzi added area: launcher Issues related to any of the launchers within SmartSim repo: smartsim Issues related to SmartSim infrastructure library type: feature Issues that include feature request or feature idea labels May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: launcher Issues related to any of the launchers within SmartSim repo: smartsim Issues related to SmartSim infrastructure library type: feature Issues that include feature request or feature idea
Projects
None yet
Development

No branches or pull requests

1 participant