[auto run retry tags] 2/n - Add `dagster/will_retry` tag to runs that fail #25932
Conversation
python_modules/dagster/dagster/_daemon/auto_run_reexecution/auto_run_reexecution.py
Force-pushed from 6123ef9 to e2b5358.
`dagster/will_retry` tag to runs that will be retried via auto-reexecution
```diff
-        yield (run, retry_number)
+        if get_boolean_tag_value(run.tags.get(WILL_RETRY_TAG), default_value=False):
+            retry_number = int(run.tags.get(RETRY_NUMBER_TAG, "0")) + 1
+            yield (run, retry_number)
```
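As a rough sketch of what this filter does, with plain dicts standing in for run records and my own approximation of dagster's `get_boolean_tag_value` helper (the real helper and tag constants live in dagster; semantics here are assumed, not authoritative):

```python
WILL_RETRY_TAG = "dagster/will_retry"
RETRY_NUMBER_TAG = "dagster/retry_number"

def get_boolean_tag_value(tag_value, default_value=False):
    # Approximation of dagster's helper: a missing tag falls back to the
    # default; otherwise any value other than a false-y string counts as True.
    if tag_value is None:
        return default_value
    return tag_value.lower() not in ("false", "none", "0", "")

def filter_runs_to_should_retry(failed_runs):
    # Mirrors the hunk above: yield (run, retry_number) only for failed
    # runs whose will_retry tag is truthy.
    for run in failed_runs:
        tags = run["tags"]
        if get_boolean_tag_value(tags.get(WILL_RETRY_TAG), default_value=False):
            retry_number = int(tags.get(RETRY_NUMBER_TAG, "0")) + 1
            yield (run, retry_number)
```
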
@clairelin135 comment #25771 (comment)
We should delete that `dagster/will_retry` tag after the retry kicks off, right? So that a `get_runs` call filtering on that tag will only fetch runs that are about to retry, and not runs that have already retried.
A point sort of in favor of this approach is how the backfill daemon will use the `will_retry` tag.

When determining whether the backfill is over, we need to iterate over all failed runs to see if they will be retried. If we don't remove the `will_retry` tag and only look at its truthiness, the backfill will never complete, since a failed run that will retry always has that tag, even once the retry has been kicked off. To fix that, we would need to fetch the run group and check whether the retry has been kicked off for that run. This doesn't seem like a huge problem, but it is some additional logic to keep track of.
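The run-group workaround described here could look something like this sketch (plain dicts standing in for run records; all names are illustrative, not dagster APIs):

```python
def backfill_is_complete(runs, run_groups):
    # runs: run_id -> {"status": str, "tags": dict}
    # run_groups: root run_id -> list of run_ids in launch order
    # A failed run tagged will_retry=true only blocks completion while its
    # retry has not been launched, i.e. while it is still the newest run
    # in its group.
    for run_id, run in runs.items():
        if run["status"] != "FAILURE":
            continue
        if run["tags"].get("dagster/will_retry") != "true":
            continue  # failed, but will never be retried
        group = run_groups.get(run_id, [run_id])
        if group[-1] == run_id:
            return False  # retry not kicked off yet
    return True
```
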
If we remove the tag after the retry is submitted, then we don't need that additional logic. But with the current implementation that would leave us with:

- a run that was never planned to be retried has the tag `dagster/will_retry=false`
- a run that will be retried has the tag `dagster/will_retry=true`
- a run that was already retried has no tag

which is a bit strange.
Another option is to rename the tag to `dagster/waiting_to_retry`, which moves from `true` to `false` after the retry is submitted. I think that is a bit easier to follow. Definitely, having some way to mark that the retry has been kicked off will be very useful. Whether that's deleting the tag, updating its value, or adding a new tag could all work though.
Another option could be to add the run id of the retried run. So a run would start with `dagster/will_retry=true`, and we would know the retry has been launched once a `dagster/child_run=<run_id>` tag was added.
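Under that proposal, a consumer could classify a failed run from its tags alone (tag names as proposed in this thread, not shipped dagster tags):

```python
def retry_state(tags):
    # "no-retry":        the run will never be retried (or predates the tag)
    # "retry-pending":   a retry will be launched but hasn't been yet
    # "retry-launched":  the retry exists; its run id is in child_run
    if tags.get("dagster/will_retry") != "true":
        return "no-retry"
    if "dagster/child_run" in tags:
        return "retry-launched"
    return "retry-pending"
```
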
I am strongly in favor of adding some sort of `child_run` tag functionality. Any process waiting on the completion of the run plus its retries would be implemented most efficiently using a tag like that.
Perhaps you were making this point already, but just to make it explicit: you actually don't need to delete the `will_retry` tag if you have the `child_run` tag, right?
Yeah, if we added a `child_run` tag then the `will_retry` tag would not be deleted.
To close the loop on this based on our convo in Slack: we're gonna add a `did_retry` tag once the retry has been kicked off.
@gibsondan @dpeng817 @clairelin135 this PR isn't quite ready for full review (no unit tests, etc.) but I want to get a first pass of review to work out some of the details of this approach before updating existing tests, writing new tests, etc.
some initial comments, and I agree with your proposal to add the run id of the retried run. I think that will have a lot of utility.
python_modules/dagster/dagster/_core/storage/runs/sql_run_storage.py
```python
run = self.get_run_by_id(run_id)
if run and event.get_dagster_event().event_type == DagsterEventType.RUN_FAILURE:
    self.add_run_tags(
```
Is there a potential for an issue here if we can't access the DB? Like, if the run fails because of a DB connection issue, we won't be able to write this tag, and then the run will never get retried. Not sure if this is really something to worry about though.
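One possible mitigation for the transient-DB-failure concern (purely a sketch; `add_run_tags` here stands in for whatever storage call actually writes the tag):

```python
import time

def add_run_tags_with_retry(add_run_tags, run_id, tags, attempts=3, base_delay=0.1):
    # Retry the tag write a few times with exponential backoff so a brief
    # DB blip doesn't permanently strand the run without its will_retry tag.
    for attempt in range(attempts):
        try:
            add_run_tags(run_id, tags)
            return True
        except ConnectionError:
            if attempt == attempts - 1:
                return False
            time.sleep(base_delay * (2 ** attempt))
```
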
Note that in cloud, `handles_run_events_in_store_event` is `True`, so we will need to make sure that this change is propagated to cloud as well (which also helps with the case you mentioned).
for future traceability - see https://github.com/dagster-io/internal/pull/12765
One high-level thing: we are going to want to be sure to test this in cloud as well as OSS.
```diff
@@ -932,6 +937,10 @@ def run_retries_enabled(self) -> bool:
     def run_retries_max_retries(self) -> int:
         return self.get_settings("run_retries").get("max_retries", 0)

+    @property
+    def retry_on_asset_or_op_failure(self) -> bool:
```
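As a sketch of what this property presumably reads (the `True` default and exact settings key are assumptions, not confirmed by the diff; it mirrors how `run_retries_max_retries` reads `"max_retries"` above):

```python
def retry_on_asset_or_op_failure(get_settings):
    # get_settings is a stand-in for the instance's settings lookup;
    # reads the run_retries settings block with an assumed default of True.
    return get_settings("run_retries").get("retry_on_asset_or_op_failure", True)
```
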
If this is going to be a top-level instance field, I think it is worth clarifying in the name that it's about run retries.

Also, cloud accesses this field in a different way than `get_settings()` (it looks at deployment settings), so we will need to account for that / override it if needed.

We'll want to make sure that we test this change on cloud as well as OSS.
for future traceability - see https://github.com/dagster-io/internal/pull/12763 #26046
python_modules/dagster/dagster/_daemon/auto_run_reexecution/auto_run_reexecution.py
```python
run = self.get_run_by_id(run_id)
if run and event.get_dagster_event().event_type == DagsterEventType.RUN_FAILURE:
    self.add_run_tags(
        run_id, {WILL_RETRY_TAG: str(self._should_retry_run(run)).lower()}
```
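The tag value written here is the lowercase string form of the boolean, so downstream readers see `"true"`/`"false"`. A condensed, hypothetical version of the decision (the real logic lives in `_should_retry_run`; the tag names and default behavior here are assumptions):

```python
def will_retry_tag_value(run_tags, default_max_retries):
    # Hypothetical condensation: a failed run is retried while its retry
    # count is below the effective max retries (a per-run tag overrides
    # the instance default).
    max_retries = int(run_tags.get("dagster/max_retries", str(default_max_retries)))
    retry_number = int(run_tags.get("dagster/retry_number", "0"))
    return str(retry_number < max_retries).lower()
```
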
There's a theoretical race condition here that was not there before: two runs from the same run group fail at the exact same time and both have this set to `True`, where before, the daemon would have processed them sequentially and set the first one to `True` and the second one to `False`. I don't think two runs from the same run group is a particularly common thing in the first place, so this concern may be purely theoretical.
Force-pushed from e2b5358 to 05e24b1.
Changed the title: `dagster/will_retry` tag to runs that will be retried via auto-reexecution → `dagster/will_retry` tag to runs that fail.
Force-pushed from e3ce093 to 9ed5cc3.
Force-pushed from 05e24b1 to dba589e.
Force-pushed from 9ed5cc3 to 23d8e0a.
Force-pushed from dba589e to 2cba332.
```python
try:
    max_retries = int(raw_max_retries_tag)
except ValueError:
    warnings.warn(
```
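A completed version of that truncated hunk might look like this (variable names and the fallback value are assumptions, not the PR's final code):

```python
import warnings

def parse_max_retries(raw_max_retries_tag, run_id, default=0):
    # Fall back to the default and warn, rather than raising, when the
    # user-supplied max_retries tag isn't an integer.
    try:
        return int(raw_max_retries_tag)
    except ValueError:
        warnings.warn(
            f"Could not parse max_retries tag {raw_max_retries_tag!r} on run "
            f"{run_id}; defaulting to {default}."
        )
        return default
```
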
The retry daemon logs an engine event if the max retries tag is invalid. Since we shouldn't log engine events here, I swapped to a warning for now. I'm not sure what the best way to handle this is, though. @gibsondan
python_modules/dagster/dagster_tests/execution_tests/versioning_tests/test_data_versions.py
```diff
@@ -2394,6 +2399,71 @@ def get_handlers(self) -> Sequence[logging.Handler]:
     def store_event(self, event: "EventLogEntry") -> None:
         self._event_storage.store_event(event)

+    def should_retry_run(self, run: DagsterRun) -> bool:
```
I think maybe this name could be a little clearer about when and why it is called. Based on the name, I would think it is something you can call at any time on a run to tell you whether it can be retried, but what it actually is (as I understand it) is a method that you call exactly once, at the moment the run fails. The result of that function is then put in a tag on the run, and after that, the tag is what you should use to determine whether or not the run is going to retry (and is what the actual retry logic in the daemon will use).
Renamed to `auto_reexecution_should_retry_run`, but that still doesn't get to your point about only calling this fn once to determine the tag value.
```diff
@@ -2451,7 +2521,13 @@ def handle_new_event(
         and event.get_dagster_event().is_job_event
     ):
         self._run_storage.handle_run_event(run_id, event.get_dagster_event())

+    run = self.get_run_by_id(run_id)
+    if run and event.get_dagster_event().is_run_failure and self.run_retries_enabled:
```
If we are going to use this tag to know whether the run is going to retry from e.g. airlift, don't we need the tag to be set even if run retries are not enabled?
I think I'm missing something on the airlift side - is there a case where run_retries would be false, but we'd still be retrying the run with a different mechanism?
I was just unsure whether airlift would want the tag to always be there, to always know for sure that the run was not going to retry (even if it is only not retrying because retries are turned off entirely).

I guess the tag just not being there at all also means that it is not going to retry (or that it's on an older version from before we added the tag)? So maybe it's fine?
> I guess the tag just not being there at all also means that it is not going to retry

Yeah, that's right.

Probably worth documenting what the different values for the tag indicate. Where do you think is the right spot for that? `tags.py`?
added a little comment where we define the tag, but can add some more info in other places if it seems relevant
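The tag-value semantics agreed on in this thread could be documented roughly like this (my wording, not the final comment in `tags.py`):

```python
WILL_RETRY_TAG = "dagster/will_retry"
# Values of WILL_RETRY_TAG on a run:
#   "true"  -> the run failed and will be retried by auto-reexecution
#   "false" -> the run failed but will not be retried
#   absent  -> the run has not failed, run retries are disabled, or the
#              run predates this tag
```
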
Force-pushed from 3f6cfcd to 900b1eb.
Force-pushed from 5b28601 to 97dac07.
Force-pushed from 97dac07 to 7380737.
Force-pushed from 87972b3 to 9d50ea8.
- remove circular import
- write tag as a string
- account for None run
- even more defensive about None runs
- re-org
- some comment fixes
- add did_retry tag
- pr comments
- refactor engine event reporting to daemon
- rename method
- modify so it just writes the tag
Force-pushed from 9d50ea8 to 08e7927.
Summary & Motivation
Based on discussion in #25771 (comment)
We don't have a centralized way to determine if a run is going to be retried by the retry daemon. This results in different methods being used throughout the code base.
This PR adds a `dagster/will_retry` tag to any run that will be retried according to the retry maximums set by the user. `dagster/will_retry=false` is applied to any run that failed but will not be retried.

This PR does not change how the re-execution daemon decides if a run should be retried. That is in a stacked PR so that we have more control over how the changes are rolled out.

Associated internal PR: https://github.com/dagster-io/internal/pull/12765
How I Tested These Changes
Updated existing tests for auto reexecution to assert that the tag exists when we expect it to.