[auto run retry tags] 2/n - Add `dagster/will_retry` tag to runs that fail #25932

jamiedemaria · 2024-11-14T20:06:18Z

Summary & Motivation

We don't have a centralized way to determine if a run is going to be retried by the retry daemon. This results in different methods being used throughout the code base.

This PR adds a dagster/will_retry tag to any run that will be retried according to the retry maximums set by the user. dagster/will_retry=false is applied to any run that failed, but will not be retried

This PR does not change how the re-execution daemon decides if a run should be retried. That is in a stacked PR so that we have more control over how the changes are rolled out

associated internal pr https://github.com/dagster-io/internal/pull/12765

How I Tested These Changes

updated existing tests for auto reexecution to assert that the tag exists when we expect it to

jamiedemaria · 2024-11-14T20:06:36Z

[backfill daemon run retries 3/n] retries of runs in completed backfills should not be considered part of the backfill #25900
[backfill daemon run retries 2/n] backfill daemon incorporates retries runs when launching new runs #25853
[backfill daemon run retries 1/n] update how we determine backfill completion to account for retried runs #25771
[auto run retry tags] 3/n - when a run is retried, add theauto_retry_run_id tag #26054
[auto run retry tags] 2/n - Add dagster/will_retry tag to runs that fail #25932 👈 (View in Graphite)
[auto run retry tags] 1/n - make run_retries retry_on_asset_or_op_failure setting available on instance #26046
master

This stack of pull requests is managed by Graphite. Learn more about stacking.

python_modules/dagster/dagster/_daemon/auto_run_reexecution/auto_run_reexecution.py

jamiedemaria · 2024-11-18T15:43:23Z

python_modules/dagster/dagster/_daemon/auto_run_reexecution/auto_run_reexecution.py

-                yield (run, retry_number)
+        if get_boolean_tag_value(run.tags.get(WILL_RETRY_TAG), default_value=False):
+            retry_number = int(run.tags.get(RETRY_NUMBER_TAG, "0")) + 1
+            yield (run, retry_number)


@clairelin135 comment #25771 (comment)

We should delete that dagster/run_will_retry tag after the retry kicks off, right? So that a get_runs call filtering on that tag will only fetch runs that are about to retry, and not runs that have already retried

A point sort of in favor of this approach is how the backfill daemon will use the will_retry tag.

When determining if the backfill is over we need to iterate over all failed runs to see if they will be retried. If we don't remove the will_retry tag and we only look for the truthiness of the will_retry tag the backfill will never complete since a failed run that will retry will always have that tag, even if the retry has been kicked off. So to fix that we would need to fetch the run group and see if the retry has been kicked off for that run. This doesn't seem like a huge problem, but is some additional logic to keep track of.

If we remove the tag after the retry is submitted, then we don't need to add that additional info. But with the current implementation that would leave us with:

run that was never planned to be retried has tag dagster/will_retry=false

run that will be retried has tag `dagster/will_retry=true

run that was retried has no tag
which is a bit strange

Another option is to rename the tag dagster/waiting_to_retry tag that moves from True to False after the retry is submitted. I think that is a bit easier to follow. Definitely having some way to mark that the retry has been kicked off will be very useful. Whether thats deleting the tag, updating its value, or adding a new tag could all work though

Other option could be to add the run id of the retried run. So a run would start with dagster/will_retry=true and we would know if the retry has been launched if a dagster/child_run=run_id tag was added

I am strongly in favor of adding some sort of child_run tag functionality. Any process waiting on the completion of the run + it's retries will be implemented most efficiently using a tag like that.

Perhaps you were making this point already, but just to make it explicit, you actually don't need to delete the will_retry tag if you have the child_run tag right

yeah if we added a child_run tag then the will_retry tag would not be deleted

to close the loop on this based on our convo in slack - we're gonna add a did_retry tag once the retry has been kicked off

jamiedemaria · 2024-11-18T15:52:03Z

@gibsondan @dpeng817 @clairelin135 this PR isn't quite ready for full review (no unit tests etc) but i want to get a first pass of review to work out some of the details for this approach before updating existing tests, writing new tests, etc

dpeng817

some initial comments, and I agree with your proposal to add the run id of the retried run. I think that will have a lot of utility.

python_modules/dagster/dagster/_core/instance/__init__.py

python_modules/dagster/dagster/_core/storage/runs/sql_run_storage.py

jamiedemaria · 2024-11-18T17:23:20Z

python_modules/dagster/dagster/_core/instance/__init__.py

-
+                run = self.get_run_by_id(run_id)
+                if run and event.get_dagster_event().event_type == DagsterEventType.RUN_FAILURE:
+                    self.add_run_tags(


is there a potential for an issue here if we can't access the DB? like if the run fails because of a DB connection issue, we won't be able to write this tag and then the run will never get retried. Not sure if this is really something to worry about though

note that in cloud, handles_run_events_in_store_event is True, so we will need to make sure that this change is propagated to cloud as well (which helps with the case that you mentioned as well)

for future traceability - see https://github.com/dagster-io/internal/pull/12765

gibsondan

one high level thing is we are going to want to be sure to test this in cloud as well as OSS

gibsondan · 2024-11-19T14:07:34Z

python_modules/dagster/dagster/_core/instance/__init__.py

@@ -932,6 +937,10 @@ def run_retries_enabled(self) -> bool:
    def run_retries_max_retries(self) -> int:
        return self.get_settings("run_retries").get("max_retries", 0)

+    @property
+    def retry_on_asset_or_op_failure(self) -> bool:


if this is going to be a top-level instance field i think it is worth clarifying that it's about run retries in the name

Also cloud accesses this field in a different way than get_settings() (it looks at deployment settings) so we will need to account for that / override it if needed)

We'll want to make sure that we test this change on cloud as well as OSS

for future traceability - see https://github.com/dagster-io/internal/pull/12763 #26046

gibsondan · 2024-11-19T14:09:48Z

python_modules/dagster/dagster/_core/instance/__init__.py

-
+                run = self.get_run_by_id(run_id)
+                if run and event.get_dagster_event().event_type == DagsterEventType.RUN_FAILURE:
+                    self.add_run_tags(


note that in cloud, handles_run_events_in_store_event is True, so we will need to make sure that this change is propagated to cloud as well (which helps with the case that you mentioned as well)

python_modules/dagster/dagster/_core/instance/__init__.py

python_modules/dagster/dagster/_daemon/auto_run_reexecution/auto_run_reexecution.py

gibsondan · 2024-11-19T14:15:06Z

python_modules/dagster/dagster/_core/instance/__init__.py

+                run = self.get_run_by_id(run_id)
+                if run and event.get_dagster_event().event_type == DagsterEventType.RUN_FAILURE:
+                    self.add_run_tags(
+                        run_id, {WILL_RETRY_TAG: str(self._should_retry_run(run)).lower()}


there's a theoretical race condition here that was not there before where two runs from the same run group fail at the exact same time and both have this set to True, where before the daemon would have processed them sequentially and set the first one to True and the second one to False - I don't think 'two runs from the same run group' is a particularly common thing in the first place so this concern may be theoretical

python_modules/dagster/dagster/_core/instance/__init__.py

jamiedemaria · 2024-11-21T18:21:11Z

python_modules/dagster/dagster/_core/instance/__init__.py

+                try:
+                    max_retries = int(raw_max_retries_tag)
+                except ValueError:
+                    warnings.warn(


the retry daemon logs an engine event if the max retries tag is invalid. Since we shouldn't log engine events here, i swapped to a warning for now. I'm not sure what the best way to handle this is though @gibsondan

python_modules/dagster/dagster_tests/execution_tests/versioning_tests/test_data_versions.py

python_modules/dagster/dagster/_core/instance/__init__.py

gibsondan · 2024-11-25T19:36:35Z

python_modules/dagster/dagster/_core/instance/__init__.py

@@ -2394,6 +2399,71 @@ def get_handlers(self) -> Sequence[logging.Handler]:
    def store_event(self, event: "EventLogEntry") -> None:
        self._event_storage.store_event(event)

+    def should_retry_run(self, run: DagsterRun) -> bool:


I think maybe this name could be a little clearer about when and why it is called - I would think based on the name that it is something that you can call at any time on a run and it will tell for you whether it can be retried, but what it actually is (as I understand it) is a method that you call exactly once at the moment that the run fails. The result of that function is then put on a tag on the run, and after that, that tag is what you should use to determine whether or not the run is going to retry (and is what the actual retry logic in the daemon will use).

renamed to auto_reexecution_should_retry_run, but that still doesn't get to your point about only calling this fn once to determine the tag value

gibsondan · 2024-11-25T19:42:54Z

python_modules/dagster/dagster/_core/instance/__init__.py

@@ -2451,7 +2521,13 @@ def handle_new_event(
                and event.get_dagster_event().is_job_event
            ):
                self._run_storage.handle_run_event(run_id, event.get_dagster_event())
-
+                run = self.get_run_by_id(run_id)
+                if run and event.get_dagster_event().is_run_failure and self.run_retries_enabled:


If we are going to use this tag to know whether the run is going to retry from e.g. airflift, don't we need the tag to be set even if run retries are not enabled?

i think i'm missing something on the airlift side - is there a case where run_retries would be false, but we'd still be retrying the run with a different mechanism?

I was just unsure whether airlift would want the tag to always be there to always know for sure that it was not going to retry (even if it is only not retrying because retries are turned off entirely)

I guess the tag just not being there at all also means that it is not going to retry (or that it's on an older version from before we added the tag)? So maybe its fine?

I guess the tag just not being there at all also means that it is not going to retry

yeah that's right

probably worth documenting what the different values for the tag indicate. Where do you think is the right spot for that? tags.py?

added a little comment where we define the tag, but can add some more info in other places if it seems relevant

python_modules/dagster/dagster/_core/execution/retries.py

remove circular import write tag a string account for None run int even more defensive about None runs re-org some comment fixes add did_retry tag pr comments refactor engine event reporting to daemon rename method modify so it just writes the tag

… fail (#25932) ## Summary & Motivation Based on discussion in #25771 (comment) We don't have a centralized way to determine if a run is going to be retried by the retry daemon. This results in different methods being used throughout the code base. This PR adds a `dagster/will_retry` tag to any run that will be retried according to the retry maximums set by the user. `dagster/will_retry=false` is applied to any run that failed, but will not be retried This PR does not change how the re-execution daemon decides if a run should be retried. That is in a stacked PR so that we have more control over how the changes are rolled out associated internal pr dagster-io/internal#12765 ## How I Tested These Changes updated existing tests for auto reexecution to assert that the tag exists when we expect it to

… fail (dagster-io#25932) ## Summary & Motivation Based on discussion in dagster-io#25771 (comment) We don't have a centralized way to determine if a run is going to be retried by the retry daemon. This results in different methods being used throughout the code base. This PR adds a `dagster/will_retry` tag to any run that will be retried according to the retry maximums set by the user. `dagster/will_retry=false` is applied to any run that failed, but will not be retried This PR does not change how the re-execution daemon decides if a run should be retried. That is in a stacked PR so that we have more control over how the changes are rolled out associated internal pr https://github.com/dagster-io/internal/pull/12765 ## How I Tested These Changes updated existing tests for auto reexecution to assert that the tag exists when we expect it to

graphite-app bot reviewed Nov 14, 2024

View reviewed changes

python_modules/dagster/dagster/_daemon/auto_run_reexecution/auto_run_reexecution.py Outdated Show resolved Hide resolved

jamiedemaria force-pushed the jamie/write-will-retry-tag branch from 6123ef9 to e2b5358 Compare November 18, 2024 15:39

jamiedemaria changed the title ~~see if this works~~ Add `` tag to runs that will be retried via auto-reexecution Nov 18, 2024

jamiedemaria changed the title ~~Add `` tag to runs that will be retried via auto-reexecution~~ Add dagster/will_retry tag to runs that will be retried via auto-reexecution Nov 18, 2024

jamiedemaria mentioned this pull request Nov 18, 2024

[backfill daemon run retries 1/n] update how we determine backfill completion to account for retried runs #25771

Merged

jamiedemaria commented Nov 18, 2024

View reviewed changes

jamiedemaria marked this pull request as ready for review November 18, 2024 15:50

jamiedemaria requested review from gibsondan, dpeng817 and clairelin135 November 18, 2024 15:50

dpeng817 reviewed Nov 18, 2024

View reviewed changes

python_modules/dagster/dagster/_core/instance/__init__.py Outdated Show resolved Hide resolved

python_modules/dagster/dagster/_core/storage/runs/sql_run_storage.py Outdated Show resolved Hide resolved

jamiedemaria commented Nov 18, 2024

View reviewed changes

gibsondan requested changes Nov 19, 2024

View reviewed changes

jamiedemaria force-pushed the jamie/write-will-retry-tag branch from e2b5358 to 05e24b1 Compare November 20, 2024 20:40

jamiedemaria changed the base branch from master to jamie/retry-on-asset-op-failure-setting November 20, 2024 20:40

jamiedemaria changed the title ~~Add dagster/will_retry tag to runs that will be retried via auto-reexecution~~ Add dagster/will_retry tag to runs that fail Nov 20, 2024

jamiedemaria force-pushed the jamie/retry-on-asset-op-failure-setting branch from e3ce093 to 9ed5cc3 Compare November 20, 2024 21:08

jamiedemaria force-pushed the jamie/write-will-retry-tag branch from 05e24b1 to dba589e Compare November 20, 2024 21:08

jamiedemaria force-pushed the jamie/retry-on-asset-op-failure-setting branch from 9ed5cc3 to 23d8e0a Compare November 21, 2024 15:31

jamiedemaria force-pushed the jamie/write-will-retry-tag branch from dba589e to 2cba332 Compare November 21, 2024 15:31

jamiedemaria commented Nov 21, 2024

View reviewed changes

jamiedemaria changed the title ~~Add dagster/will_retry tag to runs that fail~~ [auto run retry tags] 2/n - Add dagster/will_retry tag to runs that fail Nov 21, 2024

jamiedemaria requested review from gibsondan and dpeng817 November 21, 2024 20:10

jamiedemaria commented Nov 21, 2024

View reviewed changes

python_modules/dagster/dagster_tests/execution_tests/versioning_tests/test_data_versions.py Outdated Show resolved Hide resolved

This was referenced Nov 25, 2024

[backfill daemon run retries 2/n] backfill daemon incorporates retries runs when launching new runs #25853

Merged

[backfill daemon run retries 3/n] retries of runs in completed backfills should not be considered part of the backfill #25900

Merged

gibsondan reviewed Nov 25, 2024

View reviewed changes

jamiedemaria force-pushed the jamie/write-will-retry-tag branch from 3f6cfcd to 900b1eb Compare November 25, 2024 21:55

Base automatically changed from jamie/retry-on-asset-op-failure-setting to master November 26, 2024 14:18

jamiedemaria force-pushed the jamie/write-will-retry-tag branch from 5b28601 to 97dac07 Compare November 26, 2024 14:19

jamiedemaria requested a review from gibsondan November 26, 2024 17:22

jamiedemaria force-pushed the jamie/write-will-retry-tag branch from 97dac07 to 7380737 Compare November 27, 2024 15:06

gibsondan approved these changes Nov 27, 2024

View reviewed changes

python_modules/dagster/dagster/_core/execution/retries.py Outdated Show resolved Hide resolved

python_modules/dagster/dagster/_core/execution/retries.py Outdated Show resolved Hide resolved

jamiedemaria force-pushed the jamie/write-will-retry-tag branch from 87972b3 to 9d50ea8 Compare November 27, 2024 19:53

jamiedemaria added 14 commits November 27, 2024 15:18

write the will_retry tag

4c8335a

remove circular import write tag a string account for None run int even more defensive about None runs re-org some comment fixes add did_retry tag pr comments refactor engine event reporting to daemon rename method modify so it just writes the tag

warn on invalid max retries tag

a9e6be1

add tests for new tag

e5940eb

reorg

16400f0

make tests better

9282442

move run retries enabled condition

21fa9f0

move run retries enabled condition

6578de6

setting

8d02702

add asserts to tests

92aea14

docstring

c94a3a6

note

2a309bd

re-org

9494503

comment on tag

daae492

pr comments

08e7927

jamiedemaria force-pushed the jamie/write-will-retry-tag branch from 9d50ea8 to 08e7927 Compare November 27, 2024 20:18

fix goof

08ecc66

jamiedemaria merged commit 5ab8f7f into master Nov 27, 2024
1 check passed

jamiedemaria deleted the jamie/write-will-retry-tag branch November 27, 2024 21:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[auto run retry tags] 2/n - Add `dagster/will_retry` tag to runs that fail #25932

[auto run retry tags] 2/n - Add `dagster/will_retry` tag to runs that fail #25932

jamiedemaria commented Nov 14, 2024 •

edited

Loading

jamiedemaria commented Nov 14, 2024 •

edited

Loading

jamiedemaria Nov 18, 2024 •

edited

Loading

jamiedemaria Nov 18, 2024 •

edited

Loading

jamiedemaria Nov 18, 2024

dpeng817 Nov 18, 2024

dpeng817 Nov 18, 2024

jamiedemaria Nov 18, 2024

jamiedemaria Nov 20, 2024

jamiedemaria commented Nov 18, 2024

dpeng817 left a comment

jamiedemaria Nov 18, 2024

gibsondan Nov 19, 2024

jamiedemaria Nov 26, 2024

gibsondan left a comment

gibsondan Nov 19, 2024

jamiedemaria Nov 26, 2024

gibsondan Nov 19, 2024

gibsondan Nov 19, 2024

jamiedemaria Nov 21, 2024 •

edited

Loading

gibsondan Nov 25, 2024

jamiedemaria Nov 25, 2024

gibsondan Nov 25, 2024

jamiedemaria Nov 25, 2024

gibsondan Nov 25, 2024

jamiedemaria Nov 25, 2024

jamiedemaria Nov 25, 2024

[auto run retry tags] 2/n - Add dagster/will_retry tag to runs that fail #25932

[auto run retry tags] 2/n - Add dagster/will_retry tag to runs that fail #25932

Conversation

jamiedemaria commented Nov 14, 2024 • edited Loading

Summary & Motivation

How I Tested These Changes

jamiedemaria commented Nov 14, 2024 • edited Loading

jamiedemaria Nov 18, 2024 • edited Loading

Choose a reason for hiding this comment

jamiedemaria Nov 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jamiedemaria commented Nov 18, 2024

dpeng817 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gibsondan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jamiedemaria Nov 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

[auto run retry tags] 2/n - Add `dagster/will_retry` tag to runs that fail #25932

[auto run retry tags] 2/n - Add `dagster/will_retry` tag to runs that fail #25932

jamiedemaria commented Nov 14, 2024 •

edited

Loading

jamiedemaria commented Nov 14, 2024 •

edited

Loading

jamiedemaria Nov 18, 2024 •

edited

Loading

jamiedemaria Nov 18, 2024 •

edited

Loading

jamiedemaria Nov 21, 2024 •

edited

Loading