RFC: Much simpler AssetExecutionContext #16417
Conversation
```python
# seems like there should be both "asset_keys" and "selected_asset_keys"
@property
def selected_asset_keys(self) -> AbstractSet[AssetKey]:
    return self._op_execution_context.selected_asset_keys
```
we also likely will be adding selected_asset_checks here. This refactor might be an opportunity to condense them to some selection object
I like this a lot and think we absolutely have to solve this problem. When I put this on the roadmap originally I was hoping to get this level of simplification, as I think this is one of the big usability issues in the product today. The specific things I like: […]

However, I do think that we lose the ability to access the […]. Here are a few options to solve these issues.

```python
partitions_def = DailyPartitionsDefinition(...)

@asset(partitions_def=partitions_def)
def my_asset(context: AssetExecutionContext):
    # option 1: give the user parsed partition keys instead of strings
    for partition_key in context.partition_keys():  # iterator version of partition_key_range()
        # partition_key is a datetime
        ...

    # option 2: give the user both parsed and unparsed partition keys
    for partition_key in context.partition_keys():  # iterator version of partition_key_range()
        # partition_key.raw is a string
        # partition_key.value is a datetime
        ...

    # option 3: give them the option to parse a string partition key as a datetime
    for partition_key in context.partition_keys():
        # partition_key.str contains the string representation
        # partition_key.datetime tries to parse the string as a datetime
        # partition_key.int tries to parse the string as an int
        # etc etc
        ...

    # option 4: don't mess with PartitionKeyRange and bundle it into the partitions def
    # least discoverable option, but has the best static typing
    for partition_key_datetime in partitions_def.partition_keys(context.partition_key_range()):
        ...
```

I favor option 1 or option 2.
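For what it's worth, option 2 could be sketched with a tiny wrapper object that carries both representations. This is purely illustrative: `PartitionKey`, `.raw`, and `.value` are hypothetical names from the comment above, not an existing Dagster API, and the date format assumes a daily partitions definition.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class PartitionKey:
    """Hypothetical dual-representation partition key (option 2 above)."""
    raw: str  # the unparsed string form, e.g. "2023-09-12"

    @property
    def value(self) -> datetime:
        # For a DailyPartitionsDefinition, keys look like "YYYY-MM-DD".
        return datetime.strptime(self.raw, "%Y-%m-%d")

keys = [PartitionKey("2023-09-11"), PartitionKey("2023-09-12")]
assert keys[0].raw == "2023-09-11"          # string form preserved
assert keys[1].value == datetime(2023, 9, 12)  # parsed form available on demand
```

The appeal of this shape is that user code that only needs strings keeps working, while code that wants a `datetime` gets one without manual parsing.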
```python
    return self._op_execution_context.asset_partition_key_range

@public
def partition_key_range_for_asset_key(self, asset_key: AssetKey) -> PartitionKeyRange:
```
Should this be `partition_key_range_for_dep()`?
Just leaving a note that making the partition methods agnostic to "input" or "output" led to some pretty complex errors. Some of the methods (like the `partition_key_range` one) have different code paths for "inputs" and "outputs" because of partition mapping. It's not worth trying to do all of that implementation detail in this RFC, but for a final implementation we should make sure this code path is really thoroughly tested.
@jamiedemaria @schrockn I haven't reviewed this in detail, but the Pyright errors are a result of passing […]. While it is possible there is some magical way to make Pyright understand, I doubt it, and a search for […]
Yeah, this doesn't fix users' code though, unfortunately.
```python
}


class AssetExecutionContext:
```
Thread for figuring out why `AssetExecutionContext` was not made a subclass of `OpExecutionContext` to begin with.

Background:
- `AssetExecutionContext` began as a type alias to align on naming and get a quick docs/examples improvement (here)
- `AssetExecutionContext` was made a subclass of `OpExecutionContext` here
- `AssetExecutionContext` was reverted back to a type alias here
In the revert PR, the reasoning for reverting was:
Conditions like:
* manually constructing AssetsDefinition with a manually written @op
* @ops that make up a graph backed AssetsDefinition
make having different context objects trickier for users than originally anticipated.
There is also a Slack thread mentioning this where alex says:

> the wall I hit with trying to split AssetExecutionContext and OpExecutionContext was resolving what the ops in a graph-backed-asset should receive.

Based on this, my interpretation is that the issue wasn't a technical one (a Python limitation, inability to pass the correct context through, etc.), but more a design question: "What is the correct context for an `@op` in a `@graph_backed_asset` to receive?"
Ok got it. Thanks for digging that up.
I think we can figure out a reasonable solution for the graph-backed asset case. We could alter what instance the user gets based on their typehint. We could also make an `asset_execution_context` property on `OpExecutionContext` so you can do the reverse in that case.
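One way to read "alter what instance the user gets based on their typehint" is to inspect the annotation on the compute function's context parameter at invocation time. A minimal sketch of that idea, assuming stand-in `OpExecutionContext`/`AssetExecutionContext` classes and a hypothetical `build_context_for` helper (none of this is existing Dagster machinery):

```python
import inspect

class OpExecutionContext:
    """Stand-in for the op-level context."""

class AssetExecutionContext:
    """Stand-in wrapper around an OpExecutionContext."""
    def __init__(self, op_execution_context):
        self._op_execution_context = op_execution_context

def build_context_for(fn, op_context):
    """Pick which context class to hand the user based on their typehint."""
    params = list(inspect.signature(fn).parameters.values())
    if params and params[0].annotation is AssetExecutionContext:
        return AssetExecutionContext(op_context)
    # Default: ops (including ops inside graph-backed assets) get the
    # plain OpExecutionContext.
    return op_context

def op_style(context: OpExecutionContext): ...
def asset_style(context: AssetExecutionContext): ...

op_ctx = OpExecutionContext()
assert build_context_for(op_style, op_ctx) is op_ctx
assert isinstance(build_context_for(asset_style, op_ctx), AssetExecutionContext)
```

The nice property is that an `@op` inside a graph-backed asset keeps receiving `OpExecutionContext` unless the author explicitly opts into the new type via the annotation.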
Closing in favor of #16487
Summary & Motivation
This PR is not meant to be committed, but instead to drive conversation. If we align on this, I will split it up into many smaller PRs.
The internal discussion (6704) around consolidating partition APIs, as well as #16378, which documents the surface area of our partitioning APIs, has convinced me that more radical change is required to get our core context API back under control. The current API is a rat's nest beyond saving, especially in light of our de-emphasis of I/O managers.
This PR proposes that we repurpose the `AssetExecutionContext` type alias to instead be an entirely new wrapper class with a clean-slate API.

On a temporary basis, we will make this context usable in all code paths that expect `OpExecutionContext`. We will do this by overriding `__getattr__` and also adding a metaclass to `OpExecutionContext` to make instances of `AssetExecutionContext` pass `isinstance(asset_execution_context, OpExecutionContext)`, which will make this object usable in old code paths.

The upside of this approach relative to subclassing is that it provides a much better in-editor experience. The surface area of the API is far lower (14 methods instead of the >50 methods/properties on `OpExecutionContext`). The experience with VS Code's typeahead speaks for itself.

Before:
After:
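The `__getattr__` + metaclass combination described above can be sketched as follows. This is a toy model of the technique, not the PR's actual implementation: the method names and the `_OpExecutionContextMeta` class are illustrative, and the real classes carry much more state.

```python
class _OpExecutionContextMeta(type):
    def __instancecheck__(cls, instance):
        # Treat AssetExecutionContext instances as OpExecutionContext too,
        # so existing isinstance() checks in old code paths keep passing.
        if isinstance(instance, AssetExecutionContext):
            return True
        return super().__instancecheck__(instance)

class OpExecutionContext(metaclass=_OpExecutionContextMeta):
    def op_only_method(self):
        return "op-level behavior"

class AssetExecutionContext:
    """New wrapper with a small explicit surface; everything else delegates."""
    def __init__(self, op_execution_context):
        self._op_execution_context = op_execution_context

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so the 14 methods
        # defined here shadow the wrapped context; the other >50 fall through.
        return getattr(self._op_execution_context, name)

op_ctx = OpExecutionContext()
asset_ctx = AssetExecutionContext(op_ctx)
assert isinstance(asset_ctx, OpExecutionContext)            # via the metaclass
assert asset_ctx.op_only_method() == "op-level behavior"    # via __getattr__
```

Because `__getattr__` is only a runtime fallback, editors and typeahead see just the small explicit API, which is exactly the in-editor win this section describes.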
An unknown for me here is whether we can make this work with Pyright. I suspect yes, but accomplishing that goal exceeds my current level of knowledge/ability re: Python witchcraft. I suspect @smackesey can help with this. If we cannot, we can fall back to making `AssetExecutionContext` a subclass.
The actual API changes
The easiest way to understand the delta here is to look at the chunk of code added in the compatibility layer, which details which methods and properties will be moved and communicates clear deprecation warnings to users. It is an explicit list of all the methods and properties that we will eventually eliminate from the context object, with an indication of the alternative mechanism for each.
Simple use cases don't need "input" and "output" anywhere. When paired with `deps` this ends up being much nicer. See this test case.

This also solves some other problems:
- `partition_key_range` gets the current range, and `partition_key_range_for_asset`, for getting information about upstreams, replaces all the "for_input" and "for_output" variants. That mapping layer introduces a ton of cognitive load for all users, and is completely bewildering if you are not using `AssetIn` or `AssetOut` at all.
- `add_output_metadata` and `get_output_metadata` refer to completely different notions of metadata (materialization metadata and output definition metadata, respectively). There are now clearer APIs for this.
- `has_tag` and `get_tag` got the tags of the run, but there are many different entities that have tags. Instead, the user gets the run first and then gets its tags, which is much clearer.
- […] `OpExecutionContext`.

Follow-up Work
- `context.partition_keys`
- `AssetExecutionContext` […] so that the user can just call `AssetExecutionContext.get()` anywhere and get it, without having to thread it through. Prototyped by @alangenfeld here: add indirect execution context access #14954
- `materialize_single_run_with_partition_key_range`
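The `AssetExecutionContext.get()` idea above is the classic ambient-context pattern, which `contextvars` supports directly. A minimal sketch, assuming a stand-in `AssetExecutionContext` and a hypothetical `enter_execution_context` helper (the real prototype is in #14954 and may differ):

```python
import contextvars
from contextlib import contextmanager

_current_context = contextvars.ContextVar("asset_execution_context", default=None)

class AssetExecutionContext:
    @classmethod
    def get(cls):
        # Fetch the ambient context set by the framework during execution.
        ctx = _current_context.get()
        if ctx is None:
            raise RuntimeError("No AssetExecutionContext is active")
        return ctx

@contextmanager
def enter_execution_context(ctx):
    # The framework would wrap each asset's compute function in this.
    token = _current_context.set(ctx)
    try:
        yield ctx
    finally:
        _current_context.reset(token)

def helper_deep_in_user_code():
    # No context parameter needed; fetch the ambient one instead of
    # threading it through every call site.
    return AssetExecutionContext.get()

with enter_execution_context(AssetExecutionContext()) as ctx:
    assert helper_deep_in_user_code() is ctx
```

Using `ContextVar` rather than a module-level global keeps the lookup correct under threads and asyncio tasks, which matters if multiple steps execute concurrently in one process.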
.How I Tested These Changes
Included tests.