Additional details on what to do for a manual restart/rerun after conductor finishes, building on the ideas above: Restart/rerun a step that has children
What about steps run without a restart block? Add variants of the serialized/recorded spec that include the restart block? What about mid-study fixes to things that might be a dependency, such as tweaking an input for a step to deal with instability in the executed code (optimizer, finite element code, whatever) that crops up for certain parameter combinations? Add variants of those inputs and provide mechanisms for registering/detecting changes in them? This can also be done without going through maestro at all, as you can now, but it would be nice to provide some limited hooks/tools for tracking/logging such mid-study edits.
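One possible way to register and detect mid-study input changes is to keep a manifest of content hashes. The sketch below is only an illustration of that idea and is not part of maestro; the function names, the `inputs.json` manifest, and the `solver_input.txt` file are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_inputs(paths, manifest: Path) -> None:
    """Write a manifest mapping each input path to its current digest."""
    digests = {str(p): file_digest(p) for p in paths}
    manifest.write_text(json.dumps(digests, indent=2))


def changed_inputs(manifest: Path) -> list:
    """Return recorded paths that are missing or whose contents changed."""
    recorded = json.loads(manifest.read_text())
    return [
        name
        for name, digest in recorded.items()
        if not Path(name).exists() or file_digest(Path(name)) != digest
    ]


def demo() -> list:
    """Record one input, edit it mid-study, and report what changed."""
    with tempfile.TemporaryDirectory() as tmp:
        deck = Path(tmp) / "solver_input.txt"  # hypothetical input deck
        deck.write_text("tolerance = 1e-6\n")
        manifest = Path(tmp) / "inputs.json"  # hypothetical manifest name
        record_inputs([deck], manifest)
        before = changed_inputs(manifest)
        deck.write_text("tolerance = 1e-4\n")  # the mid-study tweak
        after = [Path(p).name for p in changed_inputs(manifest)]
        return [before, after]
```

A restart/rerun hook could call something like `changed_inputs` before relaunching and log the result next to the step's workspace.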
-
Finally, one more comment on the idea of pulling some of this out into separate check/post-step scripts: this would hopefully make the control flow easier to express while preserving the original step's return state. Playing games with return codes is another option, as is done currently by embedding commands that, say, read logs and busy-wait until interrupted in order to force a timeout-triggered restart. But this loses the actual step script's return code/state unless the user takes care to store/log it somewhere before another command in the step script overrides it.
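To make the point about preserving return states concrete, here is a minimal sketch of running the step command and a separate check command as two processes and keeping both return codes. It does not use any maestro API; `StepResult` and `run_step_then_check` are hypothetical names, and the two commands in `demo` are stand-ins for a real step script and check script.

```python
import subprocess
import sys
from typing import NamedTuple, Sequence


class StepResult(NamedTuple):
    """Return codes of the work step and of its follow-up check."""

    step_returncode: int
    check_returncode: int


def run_step_then_check(step_cmd: Sequence[str], check_cmd: Sequence[str]) -> StepResult:
    """Run the step, then the check, without letting the check mask the step's code."""
    step = subprocess.run(step_cmd)
    check = subprocess.run(check_cmd)
    return StepResult(step.returncode, check.returncode)


def demo() -> StepResult:
    """Stand-in commands: the step exits with 3, the check exits with 0."""
    step_cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]
    check_cmd = [sys.executable, "-c", "import sys; sys.exit(0)"]
    return run_step_then_check(step_cmd, check_cmd)
```

Because the two codes are stored separately, control flow can key off the check while logging still reports what the step itself returned.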
-
@jwhite242 Below is a wild draft that generalizes the step by giving it a `type`. Again, lots to unpack here, but here is what it could look like:

step2:
    run:
        cmd: |
            echo "STEP2"
restart_on_failure:
    type: restart
    depends: ['step2',]
    condition: [
        parent.run: any($(maestro.state.TIMEOUT), $(maestro.state.HWFAILURE), $(maestro.state.PREEMPTED)),
        parent.poststep: $(maestro.state.CONTINUE)  # post steps being a rare place of retcodes (or some string enum) being useful
    ]
    restart_cmd: |
        echo "STEP2 restarted due to timeout"
    max_attempts: 3
restart_on_user_script_404_403:
    type: restart
    depends: ['step2',]
    cmd: |
        python check_results.py
    condition: [
        cmd.exit_code: any(404, 403)
    ]
    restart_cmd: |
        echo "STEP2 restarted by user script"
    max_attempts: 3
restart_on_user_script_non_zero:
    type: restart
    depends: ['step2',]
    cmd: |
        python check_results.py
    condition: [
        cmd.exit_code: not(403, 404, 0)
    ]
    restart_cmd: |
        echo "STEP2 restarted by user script"
    max_attempts: 3
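As a rough illustration of how the `any(...)` / `not(...)` exit-code conditions and `max_attempts` in the draft above might be evaluated, here is a small sketch. It is not maestro code: `RestartRule`, `any_of`, `not_in`, and `first_triggered` are hypothetical names. One caveat worth keeping in mind: on POSIX systems a process exit status is limited to 0-255, so values such as 403 or 404 would have to come from something other than the raw shell return code (for example a value the check script writes out).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


def any_of(*codes: int) -> Callable[[int], bool]:
    """Match when the exit code is one of the listed codes."""
    return lambda code: code in codes


def not_in(*codes: int) -> Callable[[int], bool]:
    """Match when the exit code is none of the listed codes."""
    return lambda code: code not in codes


@dataclass
class RestartRule:
    """One restart block: a named condition plus an attempt budget."""

    name: str
    matches: Callable[[int], bool]
    max_attempts: int = 3
    attempts: int = 0

    def should_restart(self, exit_code: int) -> bool:
        """Return True and consume one attempt if this rule fires."""
        if self.attempts >= self.max_attempts or not self.matches(exit_code):
            return False
        self.attempts += 1
        return True


def first_triggered(rules: List[RestartRule], exit_code: int) -> Optional[str]:
    """Return the name of the first rule that fires, or None."""
    for rule in rules:
        if rule.should_restart(exit_code):
            return rule.name
    return None


def make_rules() -> List[RestartRule]:
    """Rules mirroring the two user-script restart blocks in the draft."""
    return [
        RestartRule("restart_on_user_script_404_403", any_of(404, 403)),
        RestartRule("restart_on_user_script_non_zero", not_in(403, 404, 0)),
    ]
```

In this sketch each rule tracks its own attempt count, which is one reading of the per-block `max_attempts: 3`; a shared budget across all restart blocks would be an equally plausible design.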
-
@doutriaux1 Also do kinda like your notion of a named condition set, i.e.
-
Ok, a second take on all this control flow business, exploring a few more things:
A note on what's not here: resource info. This is primarily about the control flow syntax and organization so far. Each substep likely needs its own resource spec and even scheduler (local vs scheduled, etc). A resource spec block at the top level of the step is likely a good candidate, defining some named resource specs that can be used in each sub-step, in addition to direct override of sub-keys (walltime, etc) on a per-sub-step basis. But that's a later discussion.

study:
    - name: step1
      description: Run some expensive simulation or some data generation task..
      run:
          cmd:      # cmd remains special/reserved
          restart:  # same as above, keeping existing restart meaning/semantics
          rerun:    # this assumes rerun from scratch, blowing away everything (or new workspace),
                    # unlike restart which preserves existing workspace
          # contain all user custom cmds here, avoiding any ambiguity about what's reserved or not
          user_cmds:
              my_post_check:
                  description: successful return code (0) if prior step has gone far enough (no more restart/iters), fail if not
                  cmd: |
                      check_outputs
              my_rerun_restart_check:
                  description: |
                      Return success if this step needs a rerun, false if not. Handle case where original/prior cmd didn't get far
                      enough to make a checkpoint to restart from, e.g. hardware failure.
                  cmd: |
                      check_outputs_rerun  # helper to determine if task got far enough to restart or if rerun is needed
          cues:  # or triggers? -> this is all the control flow pieces
              - subcmd: rerun
                condition: [any( cmd.state.HWFAILURE, my_rerun_restart_check.state.SUCCESS)]
              - subcmd: restart
                # Alternate mapping syntax that can enable human readable descriptions with the condition? Emphasizing
                # documentation being a maestro priority, having documentation options on these might be important in
                # complex workflows. Not sure this is 'the way to do it' though..
                - name: base_restart
                  description: default timeout triggering
                  condition: any( cmd.state.TIMEOUT, restart.state.TIMEOUT)
                - name: sim_needs_more_time
                  description: Simulation's running fine, but just needs more time after checking the logs
                  condition: any( my_post_check.state.FAIL, my_rerun_restart_check.state.FAIL)
          # Outside specific substeps (cmd/restart/rerun/user_cmds..), we have step level controls and settings; what's a good name for these so it's not just a rando collection of keys down the road?
          max_attempts: 10  # max number of times to allow running any of cmd, restart, or rerun
    - name: data_extract
      description: data extraction from individual step1 instances
      run:
          cmd:
          # Depends: separate topological ordering from execution?
          depends:
              topology:
                  - step1
              cues:
                  # This is for interstep cues. Trigger this larger step, with the cues in this step's run block only controlling
                  # and referring to its own sub-steps. NOTE: DO WE NEED TO BE ABLE TO REFERENCE PARENT SUBSTEPS?
                  - any( step1.state.TIMEOUT, step1.state.SUCCESS, step1.state.FAILED)  # try and extract some data in most cases
    - name: summary_report
      description: Compile the results of the steps and data extraction, including failed instances
      run:
          cmd:
          depends:
              topology:
                  - all(data_extract)  # potentially more clear/less terse version of existing `data_extract_*`
              cues:
                  - any(data_extract.state.*)  # enable shorthand for running this no matter the state of each data_extract instance
Some other thoughts on the ~operators used here and others we might want
Ideally, a simple boolean treatment of the return codes from scripts/cmds gives us all we need, combined with splitting the checks off into separate steps so the states of the original work steps are preserved for use in these tests. Diving into custom return codes can certainly work, but it would be nice to avoid them if we can, since other tools have their own opinions on return codes that may overlap with ours. That overlap is especially likely to confuse control flow and logging by overriding the intent expressed in the spec.
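For what it's worth, the `cues` in the step1 draft above could be evaluated along these lines, using only a success/fail style state per sub-step rather than custom return codes. This is a sketch under assumptions, not a proposal for the actual implementation: the `State` enum, `is_state`, `any_of`, `CUES`, and `next_subcmd` are hypothetical names, and cues are assumed to be checked in the order they are listed.

```python
from enum import Enum, auto
from typing import Callable, Dict, List, Optional, Tuple


class State(Enum):
    """Hypothetical sub-step states, named after those used in the draft."""

    SUCCESS = auto()
    FAIL = auto()
    TIMEOUT = auto()
    HWFAILURE = auto()


States = Dict[str, State]
Predicate = Callable[[States], bool]


def is_state(substep: str, state: State) -> Predicate:
    """True when the named sub-step has run and ended in the given state."""
    return lambda states: states.get(substep) is state


def any_of(*predicates: Predicate) -> Predicate:
    """True when at least one of the predicates holds."""
    return lambda states: any(p(states) for p in predicates)


# Cues mirroring the draft: rerun on hardware failure or when the rerun
# check says so; restart on timeouts or when a check reports FAIL.
CUES: List[Tuple[str, Predicate]] = [
    ("rerun", any_of(
        is_state("cmd", State.HWFAILURE),
        is_state("my_rerun_restart_check", State.SUCCESS),
    )),
    ("restart", any_of(
        is_state("cmd", State.TIMEOUT),
        is_state("restart", State.TIMEOUT),
        is_state("my_post_check", State.FAIL),
        is_state("my_rerun_restart_check", State.FAIL),
    )),
]


def next_subcmd(states: States, attempts: int, max_attempts: int = 10) -> Optional[str]:
    """Return the first cued sub-command, or None if nothing fires or the budget is spent."""
    if attempts >= max_attempts:
        return None
    for subcmd, predicate in CUES:
        if predicate(states):
            return subcmd
    return None
```

Since every predicate reads from the recorded states, the original `cmd` state stays available even after the check sub-steps have run.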
-
Kicking off a discussion on what we want the restart/control flow implementation to look like.
Functionality
Controls
How to expose more control of these modes in spec
Specification Usage
A few sample ideas on what this could look like to start iterating on that:
Restart specific:
Similarly, this could be extended to the depends steps, making it easier to let child steps such as funnels run even with some parent failures.
Note the current behavior for restarts is limited to
parent_step: $(maestro.state.TIMEOUT)
More detailed script-based checks would need another group in the step, similar to 'run/cmd' (say 'check_steps'), holding a list of scripts that could be called in these conditionals. We may also want options to schedule those.