-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[backfill daemon run retries 1/n] update how we determine backfill completion to account for retried runs #25771
[backfill daemon run retries 1/n] update how we determine backfill completion to account for retried runs #25771
Conversation
python_modules/dagster/dagster/_core/execution/asset_backfill.py
Outdated
Show resolved
Hide resolved
@@ -167,11 +171,12 @@ def with_latest_storage_id(self, latest_storage_id: Optional[int]) -> "AssetBack | |||
def with_requested_runs_for_target_roots(self, requested_runs_for_target_roots: bool): | |||
return self._replace(requested_runs_for_target_roots=requested_runs_for_target_roots) | |||
|
|||
def is_complete(self) -> bool: | |||
def all_targeted_partitions_have_materialization_status(self) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renaming this since it is no longer the only thing used to determine backfill completion
0561ad1
to
0493541
Compare
python_modules/dagster/dagster/_core/execution/asset_backfill.py
Outdated
Show resolved
Hide resolved
python_modules/dagster/dagster/_core/execution/asset_backfill.py
Outdated
Show resolved
Hide resolved
python_modules/dagster/dagster/_core/execution/asset_backfill.py
Outdated
Show resolved
Hide resolved
0493541
to
bf580e4
Compare
python_modules/dagster/dagster_tests/daemon_tests/test_backfill.py
Outdated
Show resolved
Hide resolved
Deploy preview for dagster-university ready! ✅ Preview Built with commit bf580e4. |
Deploy preview for dagit-storybook ready! ✅ Preview Built with commit bf580e4. |
Deploy preview for dagit-core-storybook ready! ✅ Preview Built with commit bf580e4. |
1fb8fc2
to
436cffd
Compare
# Condition 3 - there are no failed runs that will be retried | ||
and len( | ||
list( | ||
filter_runs_to_should_retry( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little worried about the performance this function call. filter_runs_to_should_retry
involves fetching the run group for each run that is passed to the function. Is there a different/better way to find out if a run will be retried?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dpeng817 this code solves a similar problem (calculating which runs are actually going to result in a retry) that you ran into on that airlift PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just thinking this out a bit - one possibility here would be to delegate the work of determining whether a failed run should run to the run retry daemon, and have it write the result of its determination to a tag that this daemon then checks?
So the run retry daemon would iterate over every failed run and decide whether it should retry or not (like it does now) but then writes the result of whether it will retry as a tag on the run (even if it does not retry, like dagster/run_will_retry: False
) - then this daemon would wait for that tag to be present and make decisions based on whether it is set or not.
I'm not as worried about the perf angle here but I do like the idea of this complicated bit of logic for determining whether a run should retry having a single source of truth and place that it's determined, rather than needing to copy it across multiple places (run retry daemon, backfill daemon, airlift calculations, etc.)
The slightly tricky thing there is there would be some amount of time between the run failing and the retry getting checked, and we would need to account for that here - like only make the final determination of whet
We would also need to account for the case where run retries are not enabled, so the tag wouldn't be getting set.
wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the airlift PR I mentioned running into similar challenges with making decisions based on whether or not a run will actually retry: #25761
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should delete that dagster/run_will_retry
tag after the retry kicks off, right? So that a get_runs
call filtering on that tag will only fetch runs that are about to retry, and not runs that have already retried
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's a case where a run would have multiple automatic retries, but it could have multiple manual retries or one automatic N manual.
With the scheme as is in #25932 the dagster/child_run
tag would only be added for automatic retries. but that could also cause issues later on if we wanted to follow the chain of child runs down for any retry (manual or automatic)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manual retries already don't count towards your retry limit though right? So I think it's kinda irrelevant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manual retries already don't count towards your retry limit though right?
I just tested this and i think they can.
i have this asset
@dg.asset
def always_fail(context):
raise Exception("Always failing")
and set the max_retries
to 5 in my dagster.yaml
.
If i launch a run of the asset, then quickly launch a re-execution after the asset fails I see
run 0 - original run
run 1 - my manual retry - parent is run 0 - no retry number
run 2 - automatic retry - parent is run 1 - retry number is 2
run 3 - automatic retry - parent is run 2 - retry number is 3
run 4 - automatic retry - parent is run 3- retry number is 4
run 5 - automatic retry - parent is run 4 - retry number is 5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my hypothesis is that my manual retry is completing before the automatic retry system processes the original run failure. it sees run 1
as part of the run group and so sets the retry number from there. but the same thing would happen if a manual retry finished between any of the retried runs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to close the loop on this based on our convo in slack - we're gonna add a did_retry
tag once the retry has been kicked off
436cffd
to
6dbf15a
Compare
and len( | ||
instance.get_run_ids( | ||
filters=RunsFilter( | ||
statuses=NOT_FINISHED_STATUSES, | ||
tags={BACKFILL_ID_TAG: backfill_id}, | ||
), | ||
limit=1, | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The limit=1
parameter on get_run_ids
could lead to incorrect behavior. If the first run returned happens to not be in NOT_FINISHED_STATUSES
, other in-progress runs would be missed, causing the backfill to be marked complete prematurely. Consider either removing the limit
parameter to check all runs, or using get_runs_count
with the same filter instead.
Spotted by Graphite Reviewer
Is this helpful? React 👍 or 👎 to let us know.
6dbf15a
to
6f60763
Compare
03a1104
to
75a6c0e
Compare
e7ea44d
to
f9c706b
Compare
… fail (#25932) ## Summary & Motivation Based on discussion in #25771 (comment) We don't have a centralized way to determine if a run is going to be retried by the retry daemon. This results in different methods being used throughout the code base. This PR adds a `dagster/will_retry` tag to any run that will be retried according to the retry maximums set by the user. `dagster/will_retry=false` is applied to any run that failed, but will not be retried This PR does not change how the re-execution daemon decides if a run should be retried. That is in a stacked PR so that we have more control over how the changes are rolled out associated internal pr dagster-io/internal#12765 ## How I Tested These Changes updated existing tests for auto reexecution to assert that the tag exists when we expect it to
75a6c0e
to
a08529c
Compare
f9c706b
to
7c44e69
Compare
7c44e69
to
973062b
Compare
@gibsondan this is ready for review now that the retry tag changes have landed! |
973062b
to
f810d43
Compare
instance.get_run_ids( | ||
filters=RunsFilter( | ||
statuses=NOT_FINISHED_STATUSES, | ||
tags={BACKFILL_ID_TAG: backfill_id}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is one of those magical tags that the run storage can use to make the query always efficient right? I assume so since job backfills use it too (thinking of should_tag_be_used_for_indexing_filtering
here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah we special case the backfill_id tag for plus so that it doesn't use the tags table
get_boolean_tag_value(run.tags.get(WILL_RETRY_TAG), False) | ||
and run.tags.get(AUTO_RETRY_RUN_ID_TAG) is None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might make sense as a util function on the run? run_is_complete_and_will_not_automatically_retry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i like that
): | ||
logger.info("Backfill has in progress runs. Backfill is still in progress.") | ||
return False | ||
# Condition 3 - if there are runs that will be retried, but have not yet been retried, the backfill is not complete |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this still behave reasonably on old versions of user code that are not necessarily setting WILL_RETRY_TAG? I think in that case we would just ignore this condition right? (and potentially finish the backfill 'early', like we were doing before)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe i made a bad assumption, but i figured that the version of the backfill daemon would be the same as the version of the auto-retry daemon. and the auto-retry daemon will add the will_retry tag if it wasn't added when the run failure event was handled, which made me think we could rely on this being set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but yes, in the case that the will_retry tag isn't getting added to runs, the runs will have is_complete_and_waiting_to_retry
as False
so that would result in the backfill being considered complete
3c089a8
to
701562d
Compare
… fail (#25932) ## Summary & Motivation Based on discussion in #25771 (comment) We don't have a centralized way to determine if a run is going to be retried by the retry daemon. This results in different methods being used throughout the code base. This PR adds a `dagster/will_retry` tag to any run that will be retried according to the retry maximums set by the user. `dagster/will_retry=false` is applied to any run that failed, but will not be retried This PR does not change how the re-execution daemon decides if a run should be retried. That is in a stacked PR so that we have more control over how the changes are rolled out associated internal pr dagster-io/internal#12765 ## How I Tested These Changes updated existing tests for auto reexecution to assert that the tag exists when we expect it to
…mpletion to account for retried runs (#25771) ## Summary & Motivation The backfill daemon doesn't account for run retries. See dagster-io/internal#12460 for more context. We've decided that we want the daemon to account for automatic and manual retries of runs that occur while the backfill is still in progress. This requires two changes: ensuring the backfill isn't marked completed if there is an in progress run or a failed run that will be automatically retried; and updating the daemon to take the results of retried runs into account when deciding what partitions to materialize in the next iteration. This PR addresses the first point, ensuring the backfill isn't marked completed if there is an in progress run or a failed run that will be automatically retried. Currently a backfill is marked complete when all targeted asset partitions are in a terminal state (successfully materialized, failed, or downstream of a failed partition). Since failed runs may be retried, there is a case where all asset partitions are in a terminal state, but there is a retry in progress that could change the state of some asset partitions. This means that if there are any runs in progress for the partition we need to wait for them to complete before marking the backfill complete. Additionally, we need to account for a race condition where a failed run may have a retry automatically launched for it, but the daemon marks the backfill complete before the retried run is queued. This PR adds an additional check to ensure that no failed runs are about to be retried. ## How I Tested These Changes new unit tests manually ran a backfill with automatic run retries configured and saw that the backfill didn't complete until all automatic retries were complete
… fail (dagster-io#25932) ## Summary & Motivation Based on discussion in dagster-io#25771 (comment) We don't have a centralized way to determine if a run is going to be retried by the retry daemon. This results in different methods being used throughout the code base. This PR adds a `dagster/will_retry` tag to any run that will be retried according to the retry maximums set by the user. `dagster/will_retry=false` is applied to any run that failed, but will not be retried This PR does not change how the re-execution daemon decides if a run should be retried. That is in a stacked PR so that we have more control over how the changes are rolled out associated internal pr https://github.com/dagster-io/internal/pull/12765 ## How I Tested These Changes updated existing tests for auto reexecution to assert that the tag exists when we expect it to
…mpletion to account for retried runs (dagster-io#25771) ## Summary & Motivation The backfill daemon doesn't account for run retries. See https://github.com/dagster-io/internal/discussions/12460 for more context. We've decided that we want the daemon to account for automatic and manual retries of runs that occur while the backfill is still in progress. This requires two changes: ensuring the backfill isn't marked completed if there is an in progress run or a failed run that will be automatically retried; and updating the daemon to take the results of retried runs into account when deciding what partitions to materialize in the next iteration. This PR addresses the first point, ensuring the backfill isn't marked completed if there is an in progress run or a failed run that will be automatically retried. Currently a backfill is marked complete when all targeted asset partitions are in a terminal state (successfully materialized, failed, or downstream of a failed partition). Since failed runs may be retried, there is a case where all asset partitions are in a terminal state, but there is a retry in progress that could change the state of some asset partitions. This means that if there are any runs in progress for the partition we need to wait for them to complete before marking the backfill complete. Additionally, we need to account for a race condition where a failed run may have a retry automatically launched for it, but the daemon marks the backfill complete before the retried run is queued. This PR adds an additional check to ensure that no failed runs are about to be retried. ## How I Tested These Changes new unit tests manually ran a backfill with automatic run retries configured and saw that the backfill didn't complete until all automatic retries were complete
Summary & Motivation
The backfill daemon doesn't account for run retries. See https://github.com/dagster-io/internal/discussions/12460 for more context. We've decided that we want the daemon to account for automatic and manual retries of runs that occur while the backfill is still in progress. This requires two changes: ensuring the backfill isn't marked completed if there is an in progress run or a failed run that will be automatically retried; and updating the daemon to take the results of retried runs into account when deciding what partitions to materialize in the next iteration.
This PR addresses the first point, ensuring the backfill isn't marked completed if there is an in progress run or a failed run that will be automatically retried.
Currently a backfill is marked complete when all targeted asset partitions are in a terminal state (successfully materialized, failed, or downstream of a failed partition). Since failed runs may be retried, there is a case where all asset partitions are in a terminal state, but there is a retry in progress that could change the state of some asset partitions. This means that if there are any runs in progress for the partition we need to wait for them to complete before marking the backfill complete.
Additionally, we need to account for a race condition where a failed run may have a retry automatically launched for it, but the daemon marks the backfill complete before the retried run is queued. This PR adds an additional check to ensure that no failed runs are about to be retried.
How I Tested These Changes
new unit tests
manually ran a backfill with automatic run retries configured and saw that the backfill didn't complete until all automatic retries were complete