Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[auto run retry tags] 2/n - Add dagster/will_retry tag to runs that fail #25932

Merged
merged 15 commits into from
Nov 27, 2024

Conversation

jamiedemaria
Copy link
Contributor

@jamiedemaria jamiedemaria commented Nov 14, 2024

Summary & Motivation

Based on discussion in #25771 (comment)

We don't have a centralized way to determine if a run is going to be retried by the retry daemon. This results in different methods being used throughout the code base.

This PR adds a dagster/will_retry tag to any run that will be retried according to the retry maximums set by the user. dagster/will_retry=false is applied to any run that failed, but will not be retried

This PR does not change how the re-execution daemon decides if a run should be retried. That is in a stacked PR so that we have more control over how the changes are rolled out

associated internal pr https://github.com/dagster-io/internal/pull/12765

How I Tested These Changes

updated existing tests for auto reexecution to assert that the tag exists when we expect it to

@jamiedemaria jamiedemaria force-pushed the jamie/write-will-retry-tag branch from 6123ef9 to e2b5358 Compare November 18, 2024 15:39
@jamiedemaria jamiedemaria changed the title see if this works Add `` tag to runs that will be retried via auto-reexecution Nov 18, 2024
@jamiedemaria jamiedemaria changed the title Add `` tag to runs that will be retried via auto-reexecution Add dagster/will_retry tag to runs that will be retried via auto-reexecution Nov 18, 2024
yield (run, retry_number)
if get_boolean_tag_value(run.tags.get(WILL_RETRY_TAG), default_value=False):
retry_number = int(run.tags.get(RETRY_NUMBER_TAG, "0")) + 1
yield (run, retry_number)
Copy link
Contributor Author

@jamiedemaria jamiedemaria Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clairelin135 comment #25771 (comment)

We should delete that dagster/run_will_retry tag after the retry kicks off, right? So that a get_runs call filtering on that tag will only fetch runs that are about to retry, and not runs that have already retried

Copy link
Contributor Author

@jamiedemaria jamiedemaria Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A point sort of in favor of this approach is how the backfill daemon will use the will_retry tag.

When determining if the backfill is over we need to iterate over all failed runs to see if they will be retried. If we don't remove the will_retry tag and we only look for the truthiness of the will_retry tag the backfill will never complete since a failed run that will retry will always have that tag, even if the retry has been kicked off. So to fix that we would need to fetch the run group and see if the retry has been kicked off for that run. This doesn't seem like a huge problem, but is some additional logic to keep track of.

If we remove the tag after the retry is submitted, then we don't need to add that additional info. But with the current implementation that would leave us with:

  • run that was never planned to be retried has tag dagster/will_retry=false
  • run that will be retried has tag `dagster/will_retry=true
  • run that was retried has no tag
    which is a bit strange

Another option is to rename the tag dagster/waiting_to_retry tag that moves from True to False after the retry is submitted. I think that is a bit easier to follow. Definitely having some way to mark that the retry has been kicked off will be very useful. Whether thats deleting the tag, updating its value, or adding a new tag could all work though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other option could be to add the run id of the retried run. So a run would start with dagster/will_retry=true and we would know if the retry has been launched if a dagster/child_run=run_id tag was added

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am strongly in favor of adding some sort of child_run tag functionality. Any process waiting on the completion of the run + it's retries will be implemented most efficiently using a tag like that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you were making this point already, but just to make it explicit, you actually don't need to delete the will_retry tag if you have the child_run tag right

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah if we added a child_run tag then the will_retry tag would not be deleted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to close the loop on this based on our convo in slack - we're gonna add a did_retry tag once the retry has been kicked off

@jamiedemaria jamiedemaria marked this pull request as ready for review November 18, 2024 15:50
@jamiedemaria
Copy link
Contributor Author

@gibsondan @dpeng817 @clairelin135 this PR isn't quite ready for full review (no unit tests etc) but i want to get a first pass of review to work out some of the details for this approach before updating existing tests, writing new tests, etc

Copy link
Contributor

@dpeng817 dpeng817 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some initial comments, and I agree with your proposal to add the run id of the retried run. I think that will have a lot of utility.


run = self.get_run_by_id(run_id)
if run and event.get_dagster_event().event_type == DagsterEventType.RUN_FAILURE:
self.add_run_tags(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a potential for an issue here if we can't access the DB? like if the run fails because of a DB connection issue, we won't be able to write this tag and then the run will never get retried. Not sure if this is really something to worry about though

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that in cloud, handles_run_events_in_store_event is True, so we will need to make sure that this change is propagated to cloud as well (which helps with the case that you mentioned as well)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@gibsondan gibsondan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one high level thing is we are going to want to be sure to test this in cloud as well as OSS

@@ -932,6 +937,10 @@ def run_retries_enabled(self) -> bool:
def run_retries_max_retries(self) -> int:
return self.get_settings("run_retries").get("max_retries", 0)

@property
def retry_on_asset_or_op_failure(self) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is going to be a top-level instance field i think it is worth clarifying that it's about run retries in the name

Also cloud accesses this field in a different way than get_settings() (it looks at deployment settings) so we will need to account for that / override it if needed)

We'll want to make sure that we test this change on cloud as well as OSS

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


run = self.get_run_by_id(run_id)
if run and event.get_dagster_event().event_type == DagsterEventType.RUN_FAILURE:
self.add_run_tags(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that in cloud, handles_run_events_in_store_event is True, so we will need to make sure that this change is propagated to cloud as well (which helps with the case that you mentioned as well)

python_modules/dagster/dagster/_core/instance/__init__.py Outdated Show resolved Hide resolved
run = self.get_run_by_id(run_id)
if run and event.get_dagster_event().event_type == DagsterEventType.RUN_FAILURE:
self.add_run_tags(
run_id, {WILL_RETRY_TAG: str(self._should_retry_run(run)).lower()}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a theoretical race condition here that was not there before where two runs from the same run group fail at the exact same time and both have this set to True, where before the daemon would have processed them sequentially and set the first one to True and the second one to False - I don't think 'two runs from the same run group' is a particularly common thing in the first place so this concern may be theoretical

python_modules/dagster/dagster/_core/instance/__init__.py Outdated Show resolved Hide resolved
python_modules/dagster/dagster/_core/instance/__init__.py Outdated Show resolved Hide resolved
@jamiedemaria jamiedemaria force-pushed the jamie/write-will-retry-tag branch from e2b5358 to 05e24b1 Compare November 20, 2024 20:40
@jamiedemaria jamiedemaria changed the base branch from master to jamie/retry-on-asset-op-failure-setting November 20, 2024 20:40
@jamiedemaria jamiedemaria changed the title Add dagster/will_retry tag to runs that will be retried via auto-reexecution Add dagster/will_retry tag to runs that fail Nov 20, 2024
@jamiedemaria jamiedemaria force-pushed the jamie/retry-on-asset-op-failure-setting branch from e3ce093 to 9ed5cc3 Compare November 20, 2024 21:08
@jamiedemaria jamiedemaria force-pushed the jamie/write-will-retry-tag branch from 05e24b1 to dba589e Compare November 20, 2024 21:08
@jamiedemaria jamiedemaria force-pushed the jamie/retry-on-asset-op-failure-setting branch from 9ed5cc3 to 23d8e0a Compare November 21, 2024 15:31
@jamiedemaria jamiedemaria force-pushed the jamie/write-will-retry-tag branch from dba589e to 2cba332 Compare November 21, 2024 15:31
try:
max_retries = int(raw_max_retries_tag)
except ValueError:
warnings.warn(
Copy link
Contributor Author

@jamiedemaria jamiedemaria Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the retry daemon logs an engine event if the max retries tag is invalid. Since we shouldn't log engine events here, i swapped to a warning for now. I'm not sure what the best way to handle this is though @gibsondan

@jamiedemaria jamiedemaria changed the title Add dagster/will_retry tag to runs that fail [auto run retry tags] 2/n - Add dagster/will_retry tag to runs that fail Nov 21, 2024
python_modules/dagster/dagster/_core/instance/__init__.py Outdated Show resolved Hide resolved
python_modules/dagster/dagster/_core/instance/__init__.py Outdated Show resolved Hide resolved
@@ -2394,6 +2399,71 @@ def get_handlers(self) -> Sequence[logging.Handler]:
def store_event(self, event: "EventLogEntry") -> None:
self._event_storage.store_event(event)

def should_retry_run(self, run: DagsterRun) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe this name could be a little clearer about when and why it is called - I would think based on the name that it is something that you can call at any time on a run and it will tell for you whether it can be retried, but what it actually is (as I understand it) is a method that you call exactly once at the moment that the run fails. The result of that function is then put on a tag on the run, and after that, that tag is what you should use to determine whether or not the run is going to retry (and is what the actual retry logic in the daemon will use).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to auto_reexecution_should_retry_run, but that still doesn't get to your point about only calling this fn once to determine the tag value

@@ -2451,7 +2521,13 @@ def handle_new_event(
and event.get_dagster_event().is_job_event
):
self._run_storage.handle_run_event(run_id, event.get_dagster_event())

run = self.get_run_by_id(run_id)
if run and event.get_dagster_event().is_run_failure and self.run_retries_enabled:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to use this tag to know whether the run is going to retry from e.g. airflift, don't we need the tag to be set even if run retries are not enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i'm missing something on the airlift side - is there a case where run_retries would be false, but we'd still be retrying the run with a different mechanism?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just unsure whether airlift would want the tag to always be there to always know for sure that it was not going to retry (even if it is only not retrying because retries are turned off entirely)

I guess the tag just not being there at all also means that it is not going to retry (or that it's on an older version from before we added the tag)? So maybe its fine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the tag just not being there at all also means that it is not going to retry

yeah that's right

probably worth documenting what the different values for the tag indicate. Where do you think is the right spot for that? tags.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a little comment where we define the tag, but can add some more info in other places if it seems relevant

@jamiedemaria jamiedemaria force-pushed the jamie/write-will-retry-tag branch from 3f6cfcd to 900b1eb Compare November 25, 2024 21:55
Base automatically changed from jamie/retry-on-asset-op-failure-setting to master November 26, 2024 14:18
@jamiedemaria jamiedemaria force-pushed the jamie/write-will-retry-tag branch from 5b28601 to 97dac07 Compare November 26, 2024 14:19
@jamiedemaria jamiedemaria force-pushed the jamie/write-will-retry-tag branch from 97dac07 to 7380737 Compare November 27, 2024 15:06
@jamiedemaria jamiedemaria force-pushed the jamie/write-will-retry-tag branch from 87972b3 to 9d50ea8 Compare November 27, 2024 19:53
remove circular import

write tag a string

account for None run

int

even more defensive about None runs

re-org

some comment fixes

add did_retry tag

pr comments

refactor engine event reporting to daemon

rename method

modify so it just writes the tag
@jamiedemaria jamiedemaria force-pushed the jamie/write-will-retry-tag branch from 9d50ea8 to 08e7927 Compare November 27, 2024 20:18
@jamiedemaria jamiedemaria merged commit 5ab8f7f into master Nov 27, 2024
1 check passed
@jamiedemaria jamiedemaria deleted the jamie/write-will-retry-tag branch November 27, 2024 21:44
cmpadden pushed a commit that referenced this pull request Dec 5, 2024
… fail (#25932)

## Summary & Motivation
Based on discussion in
#25771 (comment)

We don't have a centralized way to determine if a run is going to be
retried by the retry daemon. This results in different methods being
used throughout the code base.

This PR adds a `dagster/will_retry` tag to any run that will be retried
according to the retry maximums set by the user.
`dagster/will_retry=false` is applied to any run that failed, but will
not be retried

This PR does not change how the re-execution daemon decides if a run
should be retried. That is in a stacked PR so that we have more control
over how the changes are rolled out

associated internal pr dagster-io/internal#12765

## How I Tested These Changes
updated existing tests for auto reexecution to assert that the tag
exists when we expect it to
pskinnerthyme pushed a commit to pskinnerthyme/dagster that referenced this pull request Dec 16, 2024
… fail (dagster-io#25932)

## Summary & Motivation
Based on discussion in
dagster-io#25771 (comment)

We don't have a centralized way to determine if a run is going to be
retried by the retry daemon. This results in different methods being
used throughout the code base.

This PR adds a `dagster/will_retry` tag to any run that will be retried
according to the retry maximums set by the user.
`dagster/will_retry=false` is applied to any run that failed, but will
not be retried

This PR does not change how the re-execution daemon decides if a run
should be retried. That is in a stacked PR so that we have more control
over how the changes are rolled out

associated internal pr https://github.com/dagster-io/internal/pull/12765

## How I Tested These Changes
updated existing tests for auto reexecution to assert that the tag
exists when we expect it to
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants