Restart/Rerun/Control-flow UI, Requirements #440

jwhite242 · 2024-04-01T19:40:29Z

jwhite242
Apr 1, 2024
Maintainer

Kicking off a discussion on what we want the restart/control flow implementation to look like.

Functionality

Restart
- In place (Current restart behavior, but only for timeout)
- New workspace: run the restart in a separate workspace (maybe more interesting for rerun)
Rerun
- In place
- New workspace
Mixing/matching of the two, dependent upon state/return code from the initial step

Controls
How to expose more control of these modes in spec

State based conditionals, e.g. via 'FAILURE', 'TIMEOUT', ..., and combinations of them
Script based controls: post-steps, some more general 'check' step scripts for reusable checks
- This option will also be amenable to usage in iteration controls
- Multiple scripts would be ideal here, enabling calling/using many at once instead of requiring giant monolithic checkers that try and handle everything
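
As a rough illustration of the "many small checks" idea, here is a minimal Python sketch that runs several independent check commands and collects each return code separately. The `run_checks` helper and the check names are hypothetical, and the checks themselves are trivial stand-ins; nothing here is existing Maestro behavior.

```python
import subprocess
import sys


def run_checks(check_cmds):
    """Run each named check command and return {name: return code}.

    Each check runs as its own process, so one check's result never
    overwrites another's (or the work step's own return code).
    """
    results = {}
    for name, cmd in check_cmds.items():
        completed = subprocess.run(cmd, capture_output=True, text=True)
        results[name] = completed.returncode
    return results


# Stand-in checks: real ones would inspect logs or output files.
checks = {
    "outputs_exist": [sys.executable, "-c", "raise SystemExit(0)"],
    "converged": [sys.executable, "-c", "raise SystemExit(1)"],
}
results = run_checks(checks)
all_passed = all(code == 0 for code in results.values())
```

Keeping the results keyed by check name would let a conditional refer to individual checks rather than one combined verdict.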

Specification Usage

A few sample ideas on what this could look like to start iterating on that:

Restart specific:

step2:
  run:
    cmd:
  restart:
    condition: [
      step.run: any($(maestro.state.TIMEOUT), $(maestro.state.HWFAILURE), $(maestro.state.PREEMPTED)),
      step.poststep: $(maestro.state.CONTINUE) # post steps being a rare place of retcodes (or some string enum) being useful
    ]
    cmd:
      ...
    max_attempts: 3
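
To make the restart block above a bit more concrete, here is a minimal Python sketch of how such a condition might be evaluated. The `State` enum and `should_restart` function are hypothetical names for illustration, not existing Maestro API. It also assumes the entries in `condition` are alternatives (any one of them triggers a restart) and that `max_attempts` caps the number of attempts already made; the thread does not settle either point.

```python
from enum import Enum


class State(Enum):
    """Hypothetical step states mirroring the $(maestro.state.*) tokens."""
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"
    TIMEOUT = "TIMEOUT"
    HWFAILURE = "HWFAILURE"
    PREEMPTED = "PREEMPTED"
    CONTINUE = "CONTINUE"


# States of the main run command that should trigger a restart.
RUN_RESTART_STATES = {State.TIMEOUT, State.HWFAILURE, State.PREEMPTED}


def should_restart(run_state, poststep_state, attempts, max_attempts=3):
    """Return True if the step should be restarted.

    Assumption: the entries in `condition` are alternatives, so either
    the run state or the post-step state can trigger a restart.
    """
    if attempts >= max_attempts:
        return False
    return run_state in RUN_RESTART_STATES or poststep_state is State.CONTINUE
```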

Similarly, could extend this to the depends steps, making it easier to control child steps like funnels to run even with some parent failures

step2:
  depends: [
    step1: any($(maestro.state.SUCCESS), $(maestro.state.FAIL), $(maestro.state.CANCELLED), $(maestro.state.TIMEOUT))
  ]
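
A minimal Python sketch of the gating this implies, assuming a child step runs once every parent instance has finished in one of the listed states. `child_can_run` and the instance names are hypothetical, chosen only for illustration.

```python
def child_can_run(parent_states, allowed):
    """Return True when every parent instance finished in an allowed state.

    parent_states: mapping of parent instance name -> final state string,
                   or None if the instance has not finished yet.
    allowed:       set of state strings the child accepts.
    """
    return all(state in allowed for state in parent_states.values())


# The funnel example: tolerate failed, cancelled, and timed-out parents.
funnel_allowed = {"SUCCESS", "FAIL", "CANCELLED", "TIMEOUT"}
```

With `funnel_allowed`, a funnel step would run with a mix of successful and failed parents, while the default of only accepting `SUCCESS` would block it.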

Note the current behavior for restarts is limited to parent_step: $(maestro.state.TIMEOUT)

Additional more detailed script based checks would want another group in the step similar to 'run/cmd', say 'check_steps', being a list of scripts that could be called in these conditionals. And maybe want options to schedule those too?

jwhite242 · 2024-04-01T19:54:27Z

jwhite242
Apr 1, 2024
Maintainer Author

Additional details on what to do in the case of a manual restart/rerun after conductor finishes, building off of the above ideas:

Restart/rerun a step that has children
The in-place/new workspace options, and whether to blow away the workspace first to rerun from scratch, quickly become necessary: e.g. a hardware failure on one instance of a parent prevents a funnel step from running:

Need option to reset workspace and rerun, or simply trigger the restart block, or some other custom step: state based conditionals above make this easier to do
Child steps:
- Any that were never run can simply be executed after resetting the failed state that was propagated down the graph from the parent's failure
- If we allow a funnel to run even with failures as shown above, what to do about the workspace? Think we need options to run in new workspace and preserve the old data: new workspace may require regenerating things below the funnel if any of those were run, depending on workspace management choices (add .1, .2, etc for older variants so $(step.workspace) tokens don't need rebuilding?)

What about things run without a restart block? Add variants of the serialized/recorded spec with the included restart block?

What about mid-study fixes to things that might be a dependency such as tweaking an input for a step to say deal with instability in the executed code (optimizer/finite element code/whatever) that crops up for certain parameter combinations? Add variants to those inputs and provide mechanisms for registering/detecting changes in them? Can also do this without going through maestro at all as you can do now, but would be nice to provide some limited hooks/tools for tracking/logging such mid-study edits
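
One possible shape for the "registering/detecting changes" hook, sketched in Python under the assumption that content hashes of the input files are stored in a small JSON manifest. The function names and the manifest format are hypothetical, not anything Maestro provides today.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def detect_changes(inputs, manifest_path):
    """Compare input files against a stored manifest of digests.

    Returns the list of inputs whose contents changed (or are new)
    since the last call, and rewrites the manifest with current digests.
    """
    manifest_path = Path(manifest_path)
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = {str(p): file_digest(p) for p in inputs}
    changed = [p for p, digest in new.items() if old.get(p) != digest]
    manifest_path.write_text(json.dumps(new, indent=2))
    return changed
```

The returned list could then be appended to whatever log tracks mid-study edits.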

1 reply

bgunnar5 Apr 23, 2024

In regards to funnel steps that we allow to run after a parent failure, new workspaces wouldn't be a bad idea. Although, when you say "new workspaces" are you talking about a new step workspace or a new workspace for the entire workflow? In other words, would we have a directory structure like:

my_workflow_<timestamp>/
├── step_1.1/
│      └─── step_1.1.sh
└── step_1.2/
        └─── step_1.2.sh

or like this:

my_workflow_<timestamp>.1/
└── step_1/
        └─── step_1.sh

my_workflow_<timestamp>.2/
└── step_1/
        └─── step_1.sh

Would we allow users the choice on whether to create a new directory or overwrite the initial one? Or do we just automatically create a new one for them?

Mid-study edits get really tricky in my opinion. Do we just treat this like a new parameter combo and leave everything generated from the old, broken combo as is? Do we add a log file to the top of the step called something like <step_name>_edits.txt that tracks exactly what was modified and what new subdirectories were created because of the change? Is there a limit on what the user can edit here (i.e. just parameter values or can they modify nodes/procs/etc. too)?

jwhite242 · 2024-04-01T20:03:48Z

jwhite242
Apr 1, 2024
Maintainer Author

Finally, one more comment on the idea of pulling some of this out into separate check/post-step type scripts: this would hopefully make the control flow easier to implement while preserving the original step's return states. Playing games with return codes is another option for controlling things, as is done currently by embedding things that, say, read logs and busy-wait until interrupted in order to force a timeout-triggered restart. But this also loses the actual step script's return code/state unless the user takes care to store/log it somewhere before overriding it by running some other command in the step script.

0 replies

doutriaux1 · 2024-04-01T20:43:43Z

doutriaux1
Apr 1, 2024
Maintainer

@jwhite242 is rerun the same as restart except you get the command from the failed step? I like the idea of having the restart inside the step (for readability) but this limits what you can do with it. I also like the idea of being able to restart based on the exit code (e.g. timeout vs failure vs regular completion where some user script looks at it and says otherwise). I would also like the possibility to handle multiple restart commands.

Below is a wild draft that generalizes the step by giving it a type attribute and setting it to restart to let maestro know this is not a step. I linked step2 to it via the depends keyword, which allows us to link it to multiple steps for general restart steps. We could also add a restart key in the step to link it.

The cmd within the step would be the command to run to determine if the restart is triggered or not. If not present we could use values from the parent step (the one we might want to restart). When run we can link the trigger to the exit code of the cmd

Again lots to unpack here, but here is what it could look like:

step2:
  run:
    cmd: |
      echo "STEP2"
restart_on_failure:
  type: restart
  depends: ['step2',]
  condition: [
    parent.run: any($(maestro.state.TIMEOUT), $(maestro.state.HWFAILURE), $(maestro.state.PREEMPTED)),
    parent.poststep: $(maestro.state.CONTINUE) # post steps being a rare place of retcodes (or some string enum) being useful
  ]
  restart_cmd: |
    echo "STEP2 restarted due to timeout"
  max_attempts: 3

restart_on_user_script_404_403:
  type: restart
  depends: ['step2',]
  cmd: |
    python check_results.py
  condition: [
    cmd.exit_code: any(404, 403)
  ]
  restart_cmd: |
    echo "STEP2 restarted by user script"
  max_attempts: 3

restart_on_user_script_non_zero:
  type: restart
  depends: ['step2',]
  cmd: |
    python check_results.py
  condition: [
    cmd.exit_code: not(403, 404, 0)
  ]
  restart_cmd: |
    echo "STEP2 restarted by user script"
  max_attempts: 3
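
A minimal Python sketch of how the exit-code conditions in the two user-script blocks above could be matched against named triggers. All names here are hypothetical. One caveat worth keeping in mind: on POSIX systems a process exit status is limited to 0-255, so codes like 404 and 403 would not survive as written; the sketch uses 3 and 4 as stand-ins.

```python
def any_of(*codes):
    """Condition that holds when the exit code is one of `codes`."""
    return lambda exit_code: exit_code in codes


def not_in(*codes):
    """Condition that holds when the exit code is none of `codes`."""
    return lambda exit_code: exit_code not in codes


# Named restart triggers, mirroring the two user-script blocks above.
# Codes 3 and 4 are placeholders chosen to fit in the 0-255 range.
TRIGGERS = {
    "restart_on_user_script_3_4": any_of(3, 4),
    "restart_on_user_script_other_non_zero": not_in(3, 4, 0),
}


def fired_triggers(exit_code):
    """Return the names of all triggers whose condition matches."""
    return [name for name, cond in TRIGGERS.items() if cond(exit_code)]
```

Because the two conditions are disjoint, at most one trigger fires for any exit code, and a clean exit (0) fires none.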

0 replies

jwhite242 · 2024-04-02T20:10:21Z

jwhite242
Apr 2, 2024
Maintainer Author

@doutriaux1 rerun means the case where you have to run a step from scratch -> so say a hardware failure or some other interrupt happens before your step gets far enough for the restart block to even be functional. Think we'd need some sort of custom check/post step script to really be able to handle this as the state from the scheduler won't be enough to tell which one of restart or rerun should be invoked on the subsequent attempt.

Also do kinda like your notion of a named condition set, i.e. restart_on_user_script_non_zero groupings!

0 replies

jwhite242 · 2024-04-13T21:48:09Z

jwhite242
Apr 13, 2024
Maintainer Author

Ok, a second take on all this control flow business, exploring a few more things:

potential splits in topological ordering and control flow ordering -> see depends block. So separate the task order dependencies from the handling of execution results that are layered on top.
grouping user named custom steps, and reserving cmd/restart/rerun for the actual work processes?
- cmd/restart/rerun are intended to be where the 'real work' of the step takes place. user_cmds being smaller things used for control flow of those work sub steps
cues/triggers block to contain the intrastep control flow. Thinking consolidating these makes it easier to grok/plan the control flow of a step, especially when step cmd/restart/rerun scripts are much larger than the extra simple examples we have in this thread?
moving max_attempts (or iterations? naming suggestions welcome) to step level, to be an overarching limit on how many times to run the work steps (cmd/restart/rerun). Single step level setting I think makes for a clearer view/control on how many times this step will attempt the 'real work' sub steps, omitting the check type/custom steps that only get used for control flow from the count.
- Are there other keys needed here?
- Also, maybe we want further constraints on restart/rerun to avoid churn? Seems like something the check scripts could better handle, with some tokens to guide them:
  - $(step1.attempt): total number of work steps run, irrespective of cmd/restart/rerun
  - $(step1.attempt.restart): total number of times the restart's happened
  - $(step1.attempt.rerun): total number of times a rerun has been triggered. This last one seems most in need of a special handling to avoid max_attempts on something that's continually failing. Restart early bailout seems better handled in the post step trigger to just mark it failed based on whatever simulation metric (i.e. timestep in your finite elem sim is just too small and it's never going to make progress). But, does maestro need to be involved in either given users can record these things as they see fit in log files for each step (or db's, etc..)? -> less involvement = less coupling to the workflow tool, which is preferred i think.
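
To illustrate how the three counters and a step-level `max_attempts` might fit together, here is a minimal Python sketch. `AttemptCounter` is a hypothetical name, and the policy of raising once the limit is hit is an assumption, not a proposal for the actual implementation.

```python
from dataclasses import dataclass


@dataclass
class AttemptCounter:
    """Tracks work sub-step runs for one step instance."""
    max_attempts: int
    cmd: int = 0
    restart: int = 0
    rerun: int = 0

    @property
    def total(self):
        """Analogue of $(step1.attempt): all work runs combined."""
        return self.cmd + self.restart + self.rerun

    def record(self, subcmd):
        """Count one run of 'cmd', 'restart', or 'rerun'.

        Raises RuntimeError once max_attempts is exhausted. Check
        scripts are never passed here, so they do not count.
        """
        if subcmd not in ("cmd", "restart", "rerun"):
            raise ValueError(f"not a work sub-step: {subcmd!r}")
        if self.total >= self.max_attempts:
            raise RuntimeError("max_attempts reached")
        setattr(self, subcmd, getattr(self, subcmd) + 1)
```

The `restart` and `rerun` fields would back the `$(step1.attempt.restart)` and `$(step1.attempt.rerun)` tokens, so a check script could bail out early on, say, a second consecutive rerun.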

A note on what's not here: resource info. This is primarily about the control flow syntax and organization so far. Each substep likely needs its own resource spec and even scheduler (local vs scheduled, etc). A resource spec block at the top level of the step is likely a good candidate, defining some named resource specs that can be used in each sub step, in addition to direct override of sub keys (walltime, etc) on a per sub-step case. But that's a later discussion.

study:
  - name: step1
    description: Run some expensive simulation or some data generation task..
    run:
      cmd:     # cmd remains special/reserved
      restart: # same as above, keeping existing restart meaning/semantics
      rerun:   # this assumes rerun from scratch, blowing away everything (or new workspace), 
                   # unlike restart which preserves existing workspace
      
      # contain all user custom cmds here, avoiding  any ambiguity about what's reserved or not
      user_cmds:  
        my_post_check: 
           description: successful return code (0) if prior step has gone far enough (no more restart/iters), fail if not
           cmd: |
             check_outputs
        my_rerun_restart_check: 
           description: |
             Return success if this step needs a rerun, false if not.  Handle case where original/prior cmd didn't get far 
             enough to make a checkpoint to restart from, e.g. hardware failure.
           cmd: |
             check_outputs_rerun  # helper to determine if task got far enough to restart or if rerun is needed
            
      cues:  # or triggers?  -> this is all the control flow pieces

        - subcmd: rerun
          condition: [any( cmd.state.HWFAILURE, my_rerun_restart_check.state.SUCCESS)]
        - subcmd: restart
          # Alternate mapping syntax that can enable human readable descriptions with the condition?  Emphasizing
          # documentation being a maestro priority, having documentation options on these might be important in
          # complex workflows.  Not sure this is 'the way to do it' though..
            - name: base_restart
              description: default timeout triggering
              condition: any( cmd.state.TIMEOUT, restart.state.TIMEOUT)
            - name: sim_needs_more_time
              description: Simulation's running fine, but just needs more time after checking the logs
              condition: any( my_post_check.state.FAIL, my_rerun_restart_check.state.FAIL)
           
    # Outside specific substeps (cmd/restart/rerun/user_cmds..), we have step level controls and settings; what's a good name for these so it's not just a rando collection of keys down the road?
    max_attempts: 10  # max number of times to allow running any of cmd, restart, or rerun


  - name: data_extract
    description: data extraction from individual step1 instances
    run:
      cmd: 
                  
    # Depends: separate topological ordering from execution?
    depends:
      topology:
        - step1
      cues:
        # This is for interstep cues.  Trigger this larger step, with the cues in this step's run block only controlling
        # and referring to its own sub-steps.  NOTE: DO WE NEED TO BE ABLE TO REFERENCE PARENT SUBSTEPS?
        - any( step1.state.TIMEOUT, step1.state.SUCCESS, step1.state.FAILED)   # try and extract some data in most cases
       
       
  - name: summary_report
    description: Compile the results of the steps and data extraction, including failed instances
    run:
      cmd: 
      
    depends:
      topology:
         - all(data_extract)    # potentially more clear/less terse version of existing `data_extract_*`
      cues:
         - any(data_extract.state.*)  # enable shorthand for running this no matter the state of each data_extract instance

Some other thoughts on the ~operators used here and others we might want

all: an AND test of any list of bools
any: an OR test of any list of bools
not: simple negation
Combinations of the above should enable most control flow we might need? Let me know if there are cases we can't do with these that I'm overlooking; do we need things like step1.state in [FAILURE, SUCCESS, ...] when combinations of any/all/not can accomplish the same?
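
A small Python sketch of the three operators as composable predicates, to show that a membership test like `step1.state in [FAILURE, SUCCESS]` reduces to an `any` over single-state tests. The helper names (`state_is`, `any_`, `all_`, `not_`) are hypothetical, and states are plain strings here purely for brevity.

```python
def state_is(step, state):
    """Build a predicate that is true when `step` ended in `state`."""
    return lambda states: states.get(step) == state


def any_(*preds):
    """OR over predicates."""
    return lambda states: any(p(states) for p in preds)


def all_(*preds):
    """AND over predicates."""
    return lambda states: all(p(states) for p in preds)


def not_(pred):
    """Negation of a predicate."""
    return lambda states: not pred(states)


# "step1.state in [FAILURE, SUCCESS]" written with any:
in_list = any_(state_is("step1", "FAILURE"), state_is("step1", "SUCCESS"))

# "timed out but the post check did not fail" with all/not:
needs_time = all_(
    state_is("cmd", "TIMEOUT"),
    not_(state_is("my_post_check", "FAIL")),
)
```

Each predicate takes a mapping of sub-step name to final state, so the same building blocks could serve both intrastep and interstep cues.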

Ideally simpler boolean treatment of the return codes from scripts/cmds gives us all we need, combined with splitting off the checks to separate steps so states of the original work steps are preserved for use in these tests. Diving into custom return codes can certainly work, but would be nice to avoid it if we can to avoid conflicts with other tools that have their own opinions on those/overlap with ours (the overlap is particularly ripe for confusing control flow and logging via overriding the intent expressed in the spec).

2 replies

bgunnar5 Apr 23, 2024

I love the idea of these cues and think they'll be super useful for users.

So currently for Merlin, there are two different return codes that trigger a "restart": MERLIN_RETRY and MERLIN_RESTART. The difference between these two return codes is that MERLIN_RETRY will not look for the restart command, instead it will just look for the command in cmd and re-run that. On the other hand, MERLIN_RESTART will first look for a restart command and if it exists will execute it; if it does not exist, it will re-run the command in cmd. This section of Merlin's docs has info on them.
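
The selection rule described above can be restated as a short Python sketch. This is only a paraphrase of the behavior as described in this comment, not Merlin's code; `select_command` and the `"RETRY"`/`"RESTART"` strings are hypothetical stand-ins for the two return codes.

```python
def select_command(signal, cmd, restart=None):
    """Pick the script to execute after a retry/restart signal.

    signal:  "RETRY" or "RESTART" (stand-ins for the two return codes).
    cmd:     the step's main command.
    restart: the step's optional restart command.
    """
    if signal == "RETRY":
        return cmd  # always re-run the main command
    if signal == "RESTART":
        # Prefer the restart command, fall back to the main command.
        return restart if restart is not None else cmd
    raise ValueError(f"unexpected signal: {signal!r}")
```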

It seems like the rerun functionality that you described in your response to Charles is very similar to, if not identical to, Merlin's retry functionality. If so, do we even need a new entry for rerun or could we just re-execute cmd? Is this more of a provenance concern?

To your point about whether Maestro should be involved in handling the max number of reruns/restarts to allow: I think leaving the max_attempts option would suffice, then with the implementation of post-step checks and the $(step_name.attempts.restart/rerun) tokens, users could implement their own checks (one less thing for us to worry about breaking 😄 )

In regards to operators, I think all, any, and not should work just fine.

jwhite242 Apr 25, 2024
Maintainer Author

That's a great point, and I think it would be nice to be able to just have some shortcuts to have rerun be able to execute the existing cmd, with options to have the executor (maestro/merlin/etc) do a few different things:

rerun it in place in a dirty workspace, so merlin's retry option
rerun it in place in a clean workspace: blow away anything that wasn't generated by the executor, or blow it all away and regenerate the step scripts, and then rerun it
rerun it in a completely new workspace, preserving more of the data, possibly with some scheme like adding .1 or .iter_1 or something, ticking the number every time it happens. adding the suffixes that way would enable all the $(step.workspace) tokens to remain valid downstream, with a potential new token we could add to access the iterations/check for them.
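
A minimal Python sketch of the three options, assuming the "new workspace" case moves the old data aside to a `.N` suffix and keeps the original directory name for the fresh run, so existing $(step.workspace) paths stay valid. `prepare_rerun`, `next_versioned_path`, and the mode names are hypothetical, and the "clean" mode here simply deletes everything rather than sparing executor-generated files.

```python
import shutil
from pathlib import Path


def next_versioned_path(workspace):
    """Return the first unused sibling path `<name>.1`, `<name>.2`, ..."""
    n = 1
    while True:
        candidate = workspace.with_name(f"{workspace.name}.{n}")
        if not candidate.exists():
            return candidate
        n += 1


def prepare_rerun(workspace, mode):
    """Prepare `workspace` for a rerun and return the path to run in.

    mode "dirty": reuse the workspace untouched.
    mode "clean": delete the workspace contents, then reuse it.
    mode "new":   move the old data aside to `<name>.N` and start with
                  an empty directory under the original name.
    """
    workspace = Path(workspace)
    if mode == "dirty":
        return workspace
    if mode == "clean":
        shutil.rmtree(workspace)
        workspace.mkdir()
        return workspace
    if mode == "new":
        workspace.rename(next_versioned_path(workspace))
        workspace.mkdir()
        return workspace
    raise ValueError(f"unknown rerun mode: {mode!r}")
```

Calling `prepare_rerun(ws, "new")` twice on a workspace named `step_1` would leave `step_1.1` and `step_1.2` holding the older data, with `step_1` empty and ready for the next attempt.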

As for why keep rerun beyond these key/value options we could set without a script, I think there are use cases for having a rerun script be different from cmd such that users can do more targeted workspace/task management: i.e. check if some portion of the output was good and keep it rather than blowing everything away (maybe an expensive mesh generation task), change any parts of the cmd that are talking to a database as I've seen users do for real time logging/tracking at the workflow level (note, this is separate from any of the task related tracking merlin/maestro do to manage the orchestration). Being able to reuse/inline the cmd script in here would be nice as well though to minimize copy/paste and substep syncing errors when those cmds need updating; not sure about that implementation just yet, but ideally something inside the spec parsing engine rather than say using yaml's anchors where we have no visibility on it (less coupling to yaml features).
