[backfill daemon run retries 1/n] update how we determine backfill completion to account for retried runs #25771

jamiedemaria · 2024-11-06T19:57:48Z

Summary & Motivation

The backfill daemon doesn't account for run retries. See https://github.com/dagster-io/internal/discussions/12460 for more context. We've decided that we want the daemon to account for automatic and manual retries of runs that occur while the backfill is still in progress. This requires two changes: ensuring the backfill isn't marked completed if there is an in progress run or a failed run that will be automatically retried; and updating the daemon to take the results of retried runs into account when deciding what partitions to materialize in the next iteration.

This PR addresses the first point, ensuring the backfill isn't marked completed if there is an in progress run or a failed run that will be automatically retried.

Currently a backfill is marked complete when all targeted asset partitions are in a terminal state (successfully materialized, failed, or downstream of a failed partition). Since failed runs may be retried, there is a case where all asset partitions are in a terminal state, but there is a retry in progress that could change the state of some asset partitions. This means that if any of the backfill's runs are still in progress, we need to wait for them to complete before marking the backfill complete.

Additionally, we need to account for a race condition where a failed run may have a retry automatically launched for it, but the daemon marks the backfill complete before the retried run is queued. This PR adds an additional check to ensure that no failed runs are about to be retried.
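The completion rule described above can be sketched roughly as follows. This is a toy model rather than the daemon's code: `Run`, `RunStatus`, the `retry_pending` flag, and `backfill_is_complete` are hypothetical names chosen for illustration, and the real check queries run storage instead of taking a list.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class RunStatus(Enum):
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"


@dataclass
class Run:
    status: RunStatus
    # Hypothetical flag: True when a failed run is expected to be retried
    # automatically but the retry has not been launched yet.
    retry_pending: bool = False


def backfill_is_complete(all_partitions_terminal: bool, runs: List[Run]) -> bool:
    # Condition 1: every targeted asset partition is in a terminal state.
    if not all_partitions_terminal:
        return False
    # Condition 2: no run belonging to the backfill is still in progress.
    if any(run.status == RunStatus.STARTED for run in runs):
        return False
    # Condition 3: no failed run is about to be retried automatically.
    if any(run.status == RunStatus.FAILURE and run.retry_pending for run in runs):
        return False
    return True
```

With this shape, a backfill whose partitions are all terminal still reports incomplete while a retry is running or is about to be launched.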

How I Tested These Changes

new unit tests

manually ran a backfill with automatic run retries configured and saw that the backfill didn't complete until all automatic retries were complete

jamiedemaria · 2024-11-06T19:58:04Z

[backfill daemon run retries 3/n] retries of runs in completed backfills should not be considered part of the backfill #25900
[backfill daemon run retries 2/n] backfill daemon incorporates retries runs when launching new runs #25853
[backfill daemon run retries 1/n] update how we determine backfill completion to account for retried runs #25771 👈 (View in Graphite)
[auto run retry tags] 3/n - when a run is retried, add the auto_retry_run_id tag #26054
[auto run retry tags] 2/n - Add dagster/will_retry tag to runs that fail #25932
[auto run retry tags] 1/n - make run_retries retry_on_asset_or_op_failure setting available on instance #26046
master

This stack of pull requests is managed by Graphite. Learn more about stacking.

jamiedemaria · 2024-11-07T15:42:21Z

python_modules/dagster/dagster/_core/execution/asset_backfill.py

@@ -167,11 +171,12 @@ def with_latest_storage_id(self, latest_storage_id: Optional[int]) -> "AssetBack
    def with_requested_runs_for_target_roots(self, requested_runs_for_target_roots: bool):
        return self._replace(requested_runs_for_target_roots=requested_runs_for_target_roots)

-    def is_complete(self) -> bool:
+    def all_targeted_partitions_have_materialization_status(self) -> bool:


renaming this since it is no longer the only thing used to determine backfill completion

github-actions · 2024-11-11T20:55:05Z

Deploy preview for dagster-university ready!

✅ Preview
https://dagster-university-5hid137gc-elementl.vercel.app
https://jamie-backfill-daemon-termination-change.dagster-university.dagster-docs.io

Built with commit bf580e4.
This pull request is being automatically deployed with vercel-action

github-actions · 2024-11-11T20:56:46Z

Deploy preview for dagster-docs ready!

Preview available at https://dagster-docs-4a8ut7u30-elementl.vercel.app
https://jamie-backfill-daemon-termination-change.dagster.dagster-docs.io

github-actions · 2024-11-11T20:57:25Z

Deploy preview for dagit-storybook ready!

✅ Preview
https://dagit-storybook-kx2qczuyi-elementl.vercel.app
https://jamie-backfill-daemon-termination-change.components-storybook.dagster-docs.io

Built with commit bf580e4.
This pull request is being automatically deployed with vercel-action

github-actions · 2024-11-11T20:59:27Z

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-pw23urgfv-elementl.vercel.app
https://jamie-backfill-daemon-termination-change.core-storybook.dagster-docs.io

Built with commit bf580e4.
This pull request is being automatically deployed with vercel-action

jamiedemaria · 2024-11-12T20:30:27Z

python_modules/dagster/dagster/_core/execution/asset_backfill.py

+        # Condition 3 - there are no failed runs that will be retried
+        and len(
+            list(
+                filter_runs_to_should_retry(


I'm a little worried about the performance of this function call. filter_runs_to_should_retry involves fetching the run group for each run that is passed to the function. Is there a different/better way to find out if a run will be retried?

@dpeng817 this code solves a similar problem (calculating which runs are actually going to result in a retry) that you ran into on that airlift PR

just thinking this out a bit - one possibility here would be to delegate the work of determining whether a failed run should retry to the run retry daemon, and have it write the result of its determination to a tag that this daemon then checks?

So the run retry daemon would iterate over every failed run and decide whether it should retry or not (like it does now) but then writes the result of whether it will retry as a tag on the run (even if it does not retry, like dagster/run_will_retry: False) - then this daemon would wait for that tag to be present and make decisions based on whether it is set or not.

I'm not as worried about the perf angle here but I do like the idea of this complicated bit of logic for determining whether a run should retry having a single source of truth and place that it's determined, rather than needing to copy it across multiple places (run retry daemon, backfill daemon, airlift calculations, etc.)

The slightly tricky thing there is there would be some amount of time between the run failing and the retry getting checked, and we would need to account for that here - like only make the final determination of whet

We would also need to account for the case where run retries are not enabled, so the tag wouldn't be getting set.

wdyt?
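A rough sketch of how the consuming daemon could read such a tag, covering the two caveats raised above (the tag has not been written yet, and run retries are disabled so it never will be). The tag name follows the `dagster/run_will_retry` example in the comment; `RetryDecision`, `retry_decision`, and the `retries_enabled` flag are hypothetical names used only for illustration.

```python
from enum import Enum
from typing import Mapping

# Tag name as floated in the comment above; treated here as an assumption.
RUN_WILL_RETRY_TAG = "dagster/run_will_retry"


class RetryDecision(Enum):
    PENDING = "pending"  # the retry daemon has not looked at the run yet
    WILL_RETRY = "will_retry"
    WILL_NOT_RETRY = "will_not_retry"


def retry_decision(tags: Mapping[str, str], retries_enabled: bool) -> RetryDecision:
    # If run retries are disabled, nothing will ever write the tag,
    # so waiting for it would block forever.
    if not retries_enabled:
        return RetryDecision.WILL_NOT_RETRY
    value = tags.get(RUN_WILL_RETRY_TAG)
    if value is None:
        # Window between the run failing and the retry daemon processing it.
        return RetryDecision.PENDING
    if value.lower() == "true":
        return RetryDecision.WILL_RETRY
    return RetryDecision.WILL_NOT_RETRY
```

Under this scheme a failed run is treated as final only once it resolves to `WILL_NOT_RETRY`.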

This is the airlift PR I mentioned running into similar challenges with making decisions based on whether or not a run will actually retry: #25761

We should delete that dagster/run_will_retry tag after the retry kicks off, right? So that a get_runs call filtering on that tag will only fetch runs that are about to retry, and not runs that have already retried

I don't think there's a case where a run would have multiple automatic retries, but it could have multiple manual retries or one automatic and N manual.

With the scheme as is in #25932, the dagster/child_run tag would only be added for automatic retries, but that could also cause issues later on if we wanted to follow the chain of child runs down for any retry (manual or automatic).

Manual retries already don't count towards your retry limit though right? So I think it's kinda irrelevant?

> Manual retries already don't count towards your retry limit though right?

I just tested this and i think they can.

i have this asset

import dagster as dg

@dg.asset
def always_fail(context):
    raise Exception("Always failing")

and set the max_retries to 5 in my dagster.yaml.

If i launch a run of the asset, then quickly launch a re-execution after the asset fails I see

run 0 - original run
run 1 - my manual retry - parent is run 0 - no retry number
run 2 - automatic retry - parent is run 1 - retry number is 2
run 3 - automatic retry - parent is run 2 - retry number is 3
run 4 - automatic retry - parent is run 3 - retry number is 4
run 5 - automatic retry - parent is run 4 - retry number is 5

my hypothesis is that my manual retry is completing before the automatic retry system processes the original run failure. it sees run 1 as part of the run group and so sets the retry number from there. but the same thing would happen if a manual retry finished between any of the retried runs
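The hypothesis can be checked against the observed run list with a toy model. This is only a sketch: it assumes the retry number is derived from the size of the run group, counting manual re-executions, which is the hypothesis under discussion and not confirmed Dagster behavior. `Run` and `next_auto_retry` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    run_id: int
    parent: Optional[int]
    retry_number: Optional[int]  # None for the original run and manual retries


def next_auto_retry(group: List[Run], max_retries: int) -> Optional[Run]:
    # Assumed rule: the next retry number equals the number of runs already
    # in the group, so a manual re-execution uses up one slot.
    retry_number = len(group)
    if retry_number > max_retries:
        return None
    parent = group[-1]
    return Run(run_id=len(group), parent=parent.run_id, retry_number=retry_number)


# Run 0 is the original run; run 1 is the manual retry launched right after it.
group = [Run(0, None, None), Run(1, 0, None)]
while (retry := next_auto_retry(group, max_retries=5)) is not None:
    group.append(retry)

for run in group:
    print(run)
```

Under that assumption the model reproduces runs 2 through 5 with retry numbers 2 through 5, so only four automatic retries are launched even though max_retries is 5.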

to close the loop on this based on our convo in slack - we're gonna add a did_retry tag once the retry has been kicked off

graphite-app · 2024-11-12T20:42:44Z

python_modules/dagster/dagster/_core/execution/asset_backfill.py

+        and len(
+            instance.get_run_ids(
+                filters=RunsFilter(
+                    statuses=NOT_FINISHED_STATUSES,
+                    tags={BACKFILL_ID_TAG: backfill_id},
+                ),
+                limit=1,
+            )
+        )


The limit=1 parameter on get_run_ids could lead to incorrect behavior. If the first run returned happens to not be in NOT_FINISHED_STATUSES, other in-progress runs would be missed, causing the backfill to be marked complete prematurely. Consider either removing the limit parameter to check all runs, or using get_runs_count with the same filter instead.

Spotted by Graphite Reviewer
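Whether this concern applies depends on whether the limit is applied before or after the status filter. A minimal sketch with a hypothetical in-memory stand-in for the run storage, assuming the filter is applied first (the usual behavior of a filtered query with a limit), suggests that `limit=1` is then sufficient for an existence check. The `get_run_ids` function below is an illustration, not Dagster's implementation.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical stored runs as (run_id, status) pairs.
RUNS: List[Tuple[str, str]] = [
    ("run_a", "SUCCESS"),
    ("run_b", "FAILURE"),
    ("run_c", "STARTED"),
]

# Illustrative subset of statuses that count as "not finished".
NOT_FINISHED: Set[str] = {"QUEUED", "STARTING", "STARTED"}


def get_run_ids(
    runs: List[Tuple[str, str]], statuses: Set[str], limit: Optional[int] = None
) -> List[str]:
    # Filter first, then truncate, as a WHERE clause followed by LIMIT would.
    matching = [run_id for run_id, status in runs if status in statuses]
    return matching if limit is None else matching[:limit]


in_progress = get_run_ids(RUNS, NOT_FINISHED, limit=1)
print(len(in_progress) > 0)
```

If the storage instead truncated before filtering, the concern in the comment would hold and a count query would be the safer choice.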

… fail (#25932)

## Summary & Motivation

Based on discussion in #25771 (comment).

We don't have a centralized way to determine if a run is going to be retried by the retry daemon. This results in different methods being used throughout the code base. This PR adds a `dagster/will_retry` tag to any run that will be retried according to the retry maximums set by the user. `dagster/will_retry=false` is applied to any run that failed but will not be retried.

This PR does not change how the re-execution daemon decides if a run should be retried. That is in a stacked PR so that we have more control over how the changes are rolled out.

Associated internal PR: dagster-io/internal#12765

## How I Tested These Changes

Updated existing tests for auto reexecution to assert that the tag exists when we expect it to.

jamiedemaria · 2024-12-02T18:26:57Z

@gibsondan this is ready for review now that the retry tag changes have landed!

gibsondan · 2024-12-04T16:14:30Z

python_modules/dagster/dagster/_core/execution/asset_backfill.py

+            instance.get_run_ids(
+                filters=RunsFilter(
+                    statuses=NOT_FINISHED_STATUSES,
+                    tags={BACKFILL_ID_TAG: backfill_id},


this is one of those magical tags that the run storage can use to make the query always efficient right? I assume so since job backfills use it too (thinking of should_tag_be_used_for_indexing_filtering here)

yeah we special case the backfill_id tag for plus so that it doesn't use the tags table

gibsondan · 2024-12-04T16:16:33Z

python_modules/dagster/dagster/_core/execution/asset_backfill.py

+            get_boolean_tag_value(run.tags.get(WILL_RETRY_TAG), False)
+            and run.tags.get(AUTO_RETRY_RUN_ID_TAG) is None


this might make sense as a util function on the run? run_is_complete_and_will_not_automatically_retry?

i like that
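A possible shape for such a helper, written against a plain tag mapping so it runs on its own. The tag string values and the `get_boolean_tag_value` stand-in are assumptions for illustration, and `is_complete_and_waiting_to_retry` is used here as a free function rather than a method on the run.

```python
from typing import Mapping, Optional

# Assumed tag keys; the real constants live in Dagster's tag definitions.
WILL_RETRY_TAG = "dagster/will_retry"
AUTO_RETRY_RUN_ID_TAG = "dagster/auto_retry_run_id"


def get_boolean_tag_value(tag_value: Optional[str], default_value: bool = False) -> bool:
    # Stand-in for the helper used in the diff: parse "true"/"false" strings.
    if tag_value is None:
        return default_value
    return tag_value.lower() == "true"


def is_complete_and_waiting_to_retry(tags: Mapping[str, str]) -> bool:
    # The run is flagged for retry, but no retry run has been launched yet.
    return (
        get_boolean_tag_value(tags.get(WILL_RETRY_TAG), False)
        and tags.get(AUTO_RETRY_RUN_ID_TAG) is None
    )
```

A run with no retry tags at all evaluates to False, so runs from code versions that never set the tag do not hold the backfill open.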

gibsondan · 2024-12-04T16:18:37Z

python_modules/dagster/dagster/_core/execution/asset_backfill.py

+    ):
+        logger.info("Backfill has in progress runs. Backfill is still in progress.")
+        return False
+    # Condition 3 - if there are runs that will be retried, but have not yet been retried, the backfill is not complete


does this still behave reasonably on old versions of user code that are not necessarily setting WILL_RETRY_TAG? I think in that case we would just ignore this condition right? (and potentially finish the backfill 'early', like we were doing before)

maybe i made a bad assumption, but i figured that the version of the backfill daemon would be the same as the version of the auto-retry daemon. and the auto-retry daemon will add the will_retry tag if it wasn't added when the run failure event was handled, which made me think we could rely on this being set

but yes, in the case that the will_retry tag isn't getting added to runs, the runs will have is_complete_and_waiting_to_retry as False so that would result in the backfill being considered complete

graphite-app bot reviewed Nov 6, 2024

jamiedemaria closed this Nov 7, 2024

jamiedemaria reopened this Nov 7, 2024

jamiedemaria commented Nov 7, 2024

jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 0561ad1 to 0493541 Compare November 7, 2024 18:47

jamiedemaria changed the base branch from master to jamie/test-utils-for-run-termination November 7, 2024 18:47

jamiedemaria mentioned this pull request Nov 7, 2024

add test utils to mark a run successful or failed #25791

Closed

graphite-app bot reviewed Nov 7, 2024

graphite-app bot reviewed Nov 7, 2024

jamiedemaria mentioned this pull request Nov 11, 2024

[backfill daemon run retries 2/n] backfill daemon incorporates retries runs when launching new runs #25853

Merged

jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 0493541 to bf580e4 Compare November 11, 2024 20:52

jamiedemaria changed the base branch from jamie/test-utils-for-run-termination to master November 11, 2024 20:53

graphite-app bot reviewed Nov 11, 2024

graphite-app bot reviewed Nov 11, 2024

jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch 2 times, most recently from 1fb8fc2 to 436cffd Compare November 12, 2024 18:57

jamiedemaria commented Nov 12, 2024

jamiedemaria marked this pull request as ready for review November 12, 2024 20:31

jamiedemaria requested review from gibsondan and clairelin135 November 12, 2024 20:32

jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 436cffd to 6dbf15a Compare November 12, 2024 20:41

graphite-app bot reviewed Nov 12, 2024

jamiedemaria mentioned this pull request Nov 13, 2024

[backfill daemon run retries 3/n] retries of runs in completed backfills should not be considered part of the backfill #25900

Merged

jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 6dbf15a to 6f60763 Compare November 13, 2024 16:03

jamiedemaria changed the title ~~update how we determine backfill completion to account for retried runs~~ [backfill daemon run retries 1/n] update how we determine backfill completion to account for retried runs Nov 13, 2024

jamiedemaria force-pushed the jamie/add-did-retry-tag-when-run-is-retried branch from 03a1104 to 75a6c0e Compare November 27, 2024 20:58

jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from e7ea44d to f9c706b Compare November 27, 2024 20:58

jamiedemaria force-pushed the jamie/add-did-retry-tag-when-run-is-retried branch from 75a6c0e to a08529c Compare November 27, 2024 21:45

jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from f9c706b to 7c44e69 Compare November 27, 2024 21:45

Base automatically changed from jamie/add-did-retry-tag-when-run-is-retried to master November 27, 2024 22:34

jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 7c44e69 to 973062b Compare December 2, 2024 14:49

jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 973062b to f810d43 Compare December 3, 2024 17:34

gibsondan approved these changes Dec 4, 2024

jamiedemaria added 12 commits December 5, 2024 12:14

update how we determine backfill completion to account for retried runs

fa99a25

formatting fix

382de7e

test

d1f0cdb

test for auto retries

2709f89

small fixes

14d2db6

fix

932a447

fix

6017c7f

small

247e4f4

update to use tags and add logging

ef8336f

turn run retries on for backfill tests

e9ec931

tag update

edfa1a3

util fn

701562d

jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 3c089a8 to 701562d Compare December 5, 2024 17:15

jamiedemaria merged commit 513c1ab into master Dec 5, 2024
1 check passed

jamiedemaria deleted the jamie/backfill-daemon-termination-change branch December 5, 2024 18:06
