Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[backfill daemon run retries 1/n] update how we determine backfill completion to account for retried runs #25771

Merged
merged 12 commits into from
Dec 5, 2024

Conversation

jamiedemaria
Copy link
Contributor

@jamiedemaria jamiedemaria commented Nov 6, 2024

Summary & Motivation

The backfill daemon doesn't account for run retries. See https://github.com/dagster-io/internal/discussions/12460 for more context. We've decided that we want the daemon to account for automatic and manual retries of runs that occur while the backfill is still in progress. This requires two changes: ensuring the backfill isn't marked completed if there is an in progress run or a failed run that will be automatically retried; and updating the daemon to take the results of retried runs into account when deciding what partitions to materialize in the next iteration.

This PR addresses the first point, ensuring the backfill isn't marked completed if there is an in progress run or a failed run that will be automatically retried.

Currently a backfill is marked complete when all targeted asset partitions are in a terminal state (successfully materialized, failed, or downstream of a failed partition). Since failed runs may be retried, there is a case where all asset partitions are in a terminal state, but there is a retry in progress that could change the state of some asset partitions. This means that if there are any runs in progress for the partition we need to wait for them to complete before marking the backfill complete.

Additionally, we need to account for a race condition where a failed run may have a retry automatically launched for it, but the daemon marks the backfill complete before the retried run is queued. This PR adds an additional check to ensure that no failed runs are about to be retried.

How I Tested These Changes

new unit tests

manually ran a backfill with automatic run retries configured and saw that the backfill didn't complete until all automatic retries were complete

@jamiedemaria jamiedemaria reopened this Nov 7, 2024
@@ -167,11 +171,12 @@ def with_latest_storage_id(self, latest_storage_id: Optional[int]) -> "AssetBack
def with_requested_runs_for_target_roots(self, requested_runs_for_target_roots: bool):
return self._replace(requested_runs_for_target_roots=requested_runs_for_target_roots)

def is_complete(self) -> bool:
def all_targeted_partitions_have_materialization_status(self) -> bool:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renaming this since it is no longer the only thing used to determine backfill completion

@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 0561ad1 to 0493541 Compare November 7, 2024 18:47
@jamiedemaria jamiedemaria changed the base branch from master to jamie/test-utils-for-run-termination November 7, 2024 18:47
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 0493541 to bf580e4 Compare November 11, 2024 20:52
@jamiedemaria jamiedemaria changed the base branch from jamie/test-utils-for-run-termination to master November 11, 2024 20:53
Copy link

Deploy preview for dagster-university ready!

✅ Preview
https://dagster-university-5hid137gc-elementl.vercel.app
https://jamie-backfill-daemon-termination-change.dagster-university.dagster-docs.io

Built with commit bf580e4.
This pull request is being automatically deployed with vercel-action

Copy link

Deploy preview for dagster-docs ready!

Preview available at https://dagster-docs-4a8ut7u30-elementl.vercel.app
https://jamie-backfill-daemon-termination-change.dagster.dagster-docs.io

Direct link to changed pages:

Copy link

Deploy preview for dagit-storybook ready!

✅ Preview
https://dagit-storybook-kx2qczuyi-elementl.vercel.app
https://jamie-backfill-daemon-termination-change.components-storybook.dagster-docs.io

Built with commit bf580e4.
This pull request is being automatically deployed with vercel-action

Copy link

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-pw23urgfv-elementl.vercel.app
https://jamie-backfill-daemon-termination-change.core-storybook.dagster-docs.io

Built with commit bf580e4.
This pull request is being automatically deployed with vercel-action

@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch 2 times, most recently from 1fb8fc2 to 436cffd Compare November 12, 2024 18:57
# Condition 3 - there are no failed runs that will be retried
and len(
list(
filter_runs_to_should_retry(
Copy link
Contributor Author

@jamiedemaria jamiedemaria Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little worried about the performance this function call. filter_runs_to_should_retry involves fetching the run group for each run that is passed to the function. Is there a different/better way to find out if a run will be retried?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dpeng817 this code solves a similar problem (calculating which runs are actually going to result in a retry) that you ran into on that airlift PR

Copy link
Member

@gibsondan gibsondan Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just thinking this out a bit - one possibility here would be to delegate the work of determining whether a failed run should run to the run retry daemon, and have it write the result of its determination to a tag that this daemon then checks?

So the run retry daemon would iterate over every failed run and decide whether it should retry or not (like it does now) but then writes the result of whether it will retry as a tag on the run (even if it does not retry, like dagster/run_will_retry: False) - then this daemon would wait for that tag to be present and make decisions based on whether it is set or not.

I'm not as worried about the perf angle here but I do like the idea of this complicated bit of logic for determining whether a run should retry having a single source of truth and place that it's determined, rather than needing to copy it across multiple places (run retry daemon, backfill daemon, airlift calculations, etc.)

The slightly tricky thing there is there would be some amount of time between the run failing and the retry getting checked, and we would need to account for that here - like only make the final determination of whet

We would also need to account for the case where run retries are not enabled, so the tag wouldn't be getting set.

wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the airlift PR I mentioned running into similar challenges with making decisions based on whether or not a run will actually retry: #25761

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should delete that dagster/run_will_retry tag after the retry kicks off, right? So that a get_runs call filtering on that tag will only fetch runs that are about to retry, and not runs that have already retried

Copy link
Contributor Author

@jamiedemaria jamiedemaria Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's a case where a run would have multiple automatic retries, but it could have multiple manual retries or one automatic N manual.

With the scheme as is in #25932 the dagster/child_run tag would only be added for automatic retries. but that could also cause issues later on if we wanted to follow the chain of child runs down for any retry (manual or automatic)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manual retries already don't count towards your retry limit though right? So I think it's kinda irrelevant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manual retries already don't count towards your retry limit though right?

I just tested this and i think they can.

i have this asset

@dg.asset
def always_fail(context):
    raise Exception("Always failing")

and set the max_retries to 5 in my dagster.yaml.

If i launch a run of the asset, then quickly launch a re-execution after the asset fails I see

run 0 - original run
run 1 - my manual retry - parent is run 0 - no retry number
run 2 - automatic retry - parent is run 1 - retry number is 2
run 3 - automatic retry - parent is run 2 - retry number is 3
run 4 - automatic retry - parent is run 3- retry number is 4
run 5 - automatic retry - parent is run 4 - retry number is 5

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my hypothesis is that my manual retry is completing before the automatic retry system processes the original run failure. it sees run 1 as part of the run group and so sets the retry number from there. but the same thing would happen if a manual retry finished between any of the retried runs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to close the loop on this based on our convo in slack - we're gonna add a did_retry tag once the retry has been kicked off

@jamiedemaria jamiedemaria marked this pull request as ready for review November 12, 2024 20:31
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 436cffd to 6dbf15a Compare November 12, 2024 20:41
Comment on lines 958 to 969
and len(
instance.get_run_ids(
filters=RunsFilter(
statuses=NOT_FINISHED_STATUSES,
tags={BACKFILL_ID_TAG: backfill_id},
),
limit=1,
)
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The limit=1 parameter on get_run_ids could lead to incorrect behavior. If the first run returned happens to not be in NOT_FINISHED_STATUSES, other in-progress runs would be missed, causing the backfill to be marked complete prematurely. Consider either removing the limit parameter to check all runs, or using get_runs_count with the same filter instead.

Spotted by Graphite Reviewer

Is this helpful? React 👍 or 👎 to let us know.

@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 6dbf15a to 6f60763 Compare November 13, 2024 16:03
@jamiedemaria jamiedemaria changed the title update how we determine backfill completion to account for retried runs [backfill daemon run retries 1/n] update how we determine backfill completion to account for retried runs Nov 13, 2024
@jamiedemaria jamiedemaria force-pushed the jamie/add-did-retry-tag-when-run-is-retried branch from 03a1104 to 75a6c0e Compare November 27, 2024 20:58
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from e7ea44d to f9c706b Compare November 27, 2024 20:58
jamiedemaria added a commit that referenced this pull request Nov 27, 2024
… fail (#25932)

## Summary & Motivation
Based on discussion in
#25771 (comment)

We don't have a centralized way to determine if a run is going to be
retried by the retry daemon. This results in different methods being
used throughout the code base.

This PR adds a `dagster/will_retry` tag to any run that will be retried
according to the retry maximums set by the user.
`dagster/will_retry=false` is applied to any run that failed, but will
not be retried

This PR does not change how the re-execution daemon decides if a run
should be retried. That is in a stacked PR so that we have more control
over how the changes are rolled out

associated internal pr dagster-io/internal#12765

## How I Tested These Changes
updated existing tests for auto reexecution to assert that the tag
exists when we expect it to
@jamiedemaria jamiedemaria force-pushed the jamie/add-did-retry-tag-when-run-is-retried branch from 75a6c0e to a08529c Compare November 27, 2024 21:45
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from f9c706b to 7c44e69 Compare November 27, 2024 21:45
Base automatically changed from jamie/add-did-retry-tag-when-run-is-retried to master November 27, 2024 22:34
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 7c44e69 to 973062b Compare December 2, 2024 14:49
@jamiedemaria
Copy link
Contributor Author

@gibsondan this is ready for review now that the retry tag changes have landed!

@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 973062b to f810d43 Compare December 3, 2024 17:34
instance.get_run_ids(
filters=RunsFilter(
statuses=NOT_FINISHED_STATUSES,
tags={BACKFILL_ID_TAG: backfill_id},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is one of those magical tags that the run storage can use to make the query always efficient right? I assume so since job backfills use it too (thinking of should_tag_be_used_for_indexing_filtering here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah we special case the backfill_id tag for plus so that it doesn't use the tags table

Comment on lines 980 to 981
get_boolean_tag_value(run.tags.get(WILL_RETRY_TAG), False)
and run.tags.get(AUTO_RETRY_RUN_ID_TAG) is None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might make sense as a util function on the run? run_is_complete_and_will_not_automatically_retry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like that

):
logger.info("Backfill has in progress runs. Backfill is still in progress.")
return False
# Condition 3 - if there are runs that will be retried, but have not yet been retried, the backfill is not complete
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this still behave reasonably on old versions of user code that are not necessarily setting WILL_RETRY_TAG? I think in that case we would just ignore this condition right? (and potentially finish the backfill 'early', like we were doing before)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe i made a bad assumption, but i figured that the version of the backfill daemon would be the same as the version of the auto-retry daemon. and the auto-retry daemon will add the will_retry tag if it wasn't added when the run failure event was handled, which made me think we could rely on this being set

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but yes, in the case that the will_retry tag isn't getting added to runs, the runs will have is_complete_and_waiting_to_retry as False so that would result in the backfill being considered complete

@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 3c089a8 to 701562d Compare December 5, 2024 17:15
@jamiedemaria jamiedemaria merged commit 513c1ab into master Dec 5, 2024
1 check passed
@jamiedemaria jamiedemaria deleted the jamie/backfill-daemon-termination-change branch December 5, 2024 18:06
cmpadden pushed a commit that referenced this pull request Dec 5, 2024
… fail (#25932)

## Summary & Motivation
Based on discussion in
#25771 (comment)

We don't have a centralized way to determine if a run is going to be
retried by the retry daemon. This results in different methods being
used throughout the code base.

This PR adds a `dagster/will_retry` tag to any run that will be retried
according to the retry maximums set by the user.
`dagster/will_retry=false` is applied to any run that failed, but will
not be retried

This PR does not change how the re-execution daemon decides if a run
should be retried. That is in a stacked PR so that we have more control
over how the changes are rolled out

associated internal pr dagster-io/internal#12765

## How I Tested These Changes
updated existing tests for auto reexecution to assert that the tag
exists when we expect it to
cmpadden pushed a commit that referenced this pull request Dec 5, 2024
…mpletion to account for retried runs (#25771)

## Summary & Motivation
The backfill daemon doesn't account for run retries. See
dagster-io/internal#12460 for more
context. We've decided that we want the daemon to account for automatic
and manual retries of runs that occur while the backfill is still in
progress. This requires two changes: ensuring the backfill isn't marked
completed if there is an in progress run or a failed run that will be
automatically retried; and updating the daemon to take the results of
retried runs into account when deciding what partitions to materialize
in the next iteration.

This PR addresses the first point, ensuring the backfill isn't marked
completed if there is an in progress run or a failed run that will be
automatically retried.

Currently a backfill is marked complete when all targeted asset
partitions are in a terminal state (successfully materialized, failed,
or downstream of a failed partition). Since failed runs may be retried,
there is a case where all asset partitions are in a terminal state, but
there is a retry in progress that could change the state of some asset
partitions. This means that if there are any runs in progress for the
partition we need to wait for them to complete before marking the
backfill complete.

Additionally, we need to account for a race condition where a failed run
may have a retry automatically launched for it, but the daemon marks the
backfill complete before the retried run is queued. This PR adds an
additional check to ensure that no failed runs are about to be retried.

## How I Tested These Changes
new unit tests

manually ran a backfill with automatic run retries configured and saw
that the backfill didn't complete until all automatic retries were
complete
pskinnerthyme pushed a commit to pskinnerthyme/dagster that referenced this pull request Dec 16, 2024
… fail (dagster-io#25932)

## Summary & Motivation
Based on discussion in
dagster-io#25771 (comment)

We don't have a centralized way to determine if a run is going to be
retried by the retry daemon. This results in different methods being
used throughout the code base.

This PR adds a `dagster/will_retry` tag to any run that will be retried
according to the retry maximums set by the user.
`dagster/will_retry=false` is applied to any run that failed, but will
not be retried

This PR does not change how the re-execution daemon decides if a run
should be retried. That is in a stacked PR so that we have more control
over how the changes are rolled out

associated internal pr https://github.com/dagster-io/internal/pull/12765

## How I Tested These Changes
updated existing tests for auto reexecution to assert that the tag
exists when we expect it to
pskinnerthyme pushed a commit to pskinnerthyme/dagster that referenced this pull request Dec 16, 2024
…mpletion to account for retried runs (dagster-io#25771)

## Summary & Motivation
The backfill daemon doesn't account for run retries. See
https://github.com/dagster-io/internal/discussions/12460 for more
context. We've decided that we want the daemon to account for automatic
and manual retries of runs that occur while the backfill is still in
progress. This requires two changes: ensuring the backfill isn't marked
completed if there is an in progress run or a failed run that will be
automatically retried; and updating the daemon to take the results of
retried runs into account when deciding what partitions to materialize
in the next iteration.

This PR addresses the first point, ensuring the backfill isn't marked
completed if there is an in progress run or a failed run that will be
automatically retried.

Currently a backfill is marked complete when all targeted asset
partitions are in a terminal state (successfully materialized, failed,
or downstream of a failed partition). Since failed runs may be retried,
there is a case where all asset partitions are in a terminal state, but
there is a retry in progress that could change the state of some asset
partitions. This means that if there are any runs in progress for the
partition we need to wait for them to complete before marking the
backfill complete.

Additionally, we need to account for a race condition where a failed run
may have a retry automatically launched for it, but the daemon marks the
backfill complete before the retried run is queued. This PR adds an
additional check to ensure that no failed runs are about to be retried.

## How I Tested These Changes
new unit tests

manually ran a backfill with automatic run retries configured and saw
that the backfill didn't complete until all automatic retries were
complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants