Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make AssetExecutionContext a subclass of OpExecutionContext #16596

Closed

Conversation

jamiedemaria
Copy link
Contributor

@jamiedemaria jamiedemaria commented Sep 18, 2023

Summary & Motivation

Makes the AssetExecutionContext a subclass of OpExecutionContext. Since the plan is to split AssetExecutionContext into it's own class, we need to fire deprecation warnings for all methods on OpExecutionContext that we do not intend to keep on AssetExecutionContext longterm. We also fire a deprecation warning on isinstance(context, OpExecutionContext) when context is an AssetExecutionContext.

This class should implement the IContext from #16480, but I can't import from dagster-ext so need to figure that part out first...

How I Tested These Changes

stacked PR #16598

return self._op_execution_context.get_asset_provenance(asset_key)

@public
# TODO - method naming. this needs work
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternate names:

  • get_code_versions_for_assets

Alternately, we could delete this method entirely, and update the impls of code_version and code_version_by_asset_key to

    @property
    def code_version(self) -> Optional[str]:
        return self.code_version_by_asset_key[self.asset_key]

    @property
    def code_version_by_asset_key(self) -> Mapping[AssetKey, Optional[str]]:
        return self.op_execution_context.instance.get_latest_materialization_code_versions(
            self.asset_keys
        )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is important that access each in-memory properties and expensive accesses to the instance feel very different, and that it is clear when the user is doing something expensive like a db query to get information, so something like code_version_by_asset_key is not a good idea. Imagine accessing that in a tight loop.

Copy link
Contributor Author

@jamiedemaria jamiedemaria Sep 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense - the ExtContext also has a code_version_by_asset_key so whatever decisions we make here will also probably need to be applied there. Does making it a getter method seem like enough indication to the user that it's expensive?

    def get_code_version(self) -> Optional[str]:
        return self.code_version_by_asset_key[self.asset_key]

    def get_code_version_by_asset_key(self) -> Mapping[AssetKey, Optional[str]]:
        return self.op_execution_context.instance.get_latest_materialization_code_versions(
            self.asset_keys
        )

or something more dramatic like query_db_for_code_version or expecting the user to get the instance and call get_latest_materialization_code_versions themselves?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to be equivalent with the ext contract (code here:

def build_external_execution_context_data(
) only code versions on the asset_defs object would be accessible cheaply. Do we need to have a the variant that fetches from the instance at all?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc: @smackesey this convo could be of interest to you

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code_version, code_version_by_asset_key, provenance and provenace_by_asset_key have all been updated so that the dictionaries are pre-fetched during context init. additionally code_versions are now pulled from the AssetsDefinition.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does making it a getter method seem like enough indication to the user that it's expensive?

Maybe tangential to the immediate concern, but I like "fetch" as an indicator that a method is loading from elsewhere.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

additionally code_versions are now pulled from the AssetsDefinition.

Great, yeah it would be incorrect to fetch these from the instance since everywhere else "the asset's code version" means what is set on the definition, not what was used for a materialization at some undetermined time in the past.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok great, i didn't know that subtlety (thinking about it now it makes total sense) so thanks for clearing that up

@jamiedemaria jamiedemaria marked this pull request as ready for review September 18, 2023 18:33
Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bunch of stuff. There are enough independent threads of discussion here that we might want to consider breaking up the PR if nothing else than to manage discussion and decisions.

return self._op_execution_context.get_asset_provenance(asset_key)

@public
# TODO - method naming. this needs work
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is important that access each in-memory properties and expensive accesses to the instance feel very different, and that it is clear when the user is doing something expensive like a db query to get information, so something like code_version_by_asset_key is not a good idea. Imagine accessing that in a tight loop.

Comment on lines 1473 to 1475
@property
def asset_check_spec(self) -> AssetCheckSpec:
return self._op_execution_context.asset_check_spec
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are the semantics of this? Ops can have zero checks, or can have many.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused on this myself - OpExecutionContext has the method:

@property
    def asset_check_spec(self) -> AssetCheckSpec:
        asset_checks_def = check.not_none(
            self.job_def.asset_layer.asset_checks_def_for_node(self.node_handle),
            "This context does not correspond to an AssetChecksDefinition",
        )
        return asset_checks_def.spec

but AssetsDefinition has check_specs_by_output_name and check_specs which seem more useful.

I think a check_specs_by_asset_key property would be the most flexible here, so i'll update the implementation with that as a starting point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @johannkm for opinions and insights on the check spec related context methods

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the existing one errors on anything except @asset_check, which has one spec. The current method looks fine- you could also remove check specs property for now. I don't have a clear usecase for it, it's only used in tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok good to know!

Comment on lines 1370 to 1387
def provenance(self) -> Optional[DataProvenance]:
return self.get_asset_provenance(self.asset_key)

@property
def provenance_by_asset_key(self) -> Mapping[AssetKey, Optional[DataProvenance]]:
provenance_map = {}
for key in self.asset_keys:
provenance_map[key] = self.get_asset_provenance(key)

return provenance_map

@property
def code_version(self) -> Optional[str]:
return self.get_assets_code_version([self.asset_key])[self.asset_key]

@property
def code_version_by_asset_key(self) -> Mapping[AssetKey, Optional[str]]:
return self.get_assets_code_version(self.asset_keys)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these all go to the db? Super dangerous to make that this easy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved via call - code version is now fetched from the asset definition, and provenance for the selected assets is pre-fetched during init

Comment on lines 1370 to 1379
def provenance(self) -> Optional[DataProvenance]:
return self.get_asset_provenance(self.asset_key)

@property
def provenance_by_asset_key(self) -> Mapping[AssetKey, Optional[DataProvenance]]:
provenance_map = {}
for key in self.asset_keys:
provenance_map[key] = self.get_asset_provenance(key)

return provenance_map
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while I think it is defensible to have separate asset_key and asset_keys properties for convenience, for the remainder of these properties we should not have both variants.

My recommendation is to only allow lookup by key, and also to strongly consider a single object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fine w me - we'll also need to update ExtContext to match

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while I think it is defensible to have separate asset_key and asset_keys properties for convenience, for the remainder of these properties we should not have both variants.

My recommendation is to only allow lookup by key, and also to strongly consider a single object.

IMO this makes for annoying UX:

context.provenance_by_asset_key(context.asset_key)

# instead of

context.provenance

I think it should be:

def provenance(self, asset_key: Optional[AssetKey]):

Then if there is only a single asset you default to that, otherwise require the asset_key.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exploring a bit more on the single object thread. if i understand correctly, the object could be some kind of AssetInfo class that would hold the code_version, provenance, and maybe things like partition_keys and partition_time_window. Rather than having a set of code_version_by_asset_key, partition_time_window_by_asset_key etc methods, we'd just have a asset_info_by_asset_key that would return these objects.

it feels similar to AssetSpec but bound to a particular materialization, not a general definition

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels similar to AssetSpec but bound to a particular materialization, not a general definition

Great mental model.

How about a compromise between the two approaches, asI do find @smackesey's concern of divergence between single and multi asset case convincing:

def asset_info(self, asset_key: Optional[AssetKey] = None) -> AssetInfo
   ...

However I do not think we should name this AssetInfo. @jamiedemaria what is the complete set of properties that will exist on this object? That might help with naming.

partition_keys and partition_time_window are also interesting. They only vary on a per-asset-key basis when partition mappings are present, so if someone doesn't know about mappings they may find a "by_key" variant of this APi confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not fully complete, but the properties I can think to include are:

  • code_version
  • provenance
  • partition_time_window
  • partition_key_range
  • partition_keys
  • upstream_result_value - no good name for this right now, but I'm thinking if we allow this:
@asset 
def my_asset():
    return MaterializeResult(value="foo")

then this object would be a good way to give that value to the user downstream

@asset(
    deps=[my_asset]
)
def another_asset(context):
    context.asset_info_by_asset_key[my_asset].upstream_result_value == "foo"
  • metadata
  • potentially even things like the time the asset was materialized. There's a small collection of user questions in the "how to i get X info about the upstream asset?" category. we could look through those and see if there's anything worth including

Comment on lines 1394 to 1420
@property
def partition_key(self) -> str:
return self.op_execution_context.partition_key

@public
@property
def partition_key_range(self) -> PartitionKeyRange:
return self._op_execution_context.asset_partition_key_range

@property
def partition_time_window(self) -> TimeWindow:
return self.op_execution_context.partition_time_window
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jamie internal discussion 6704 very relevant here.

Comment on lines +1422 to +1461
@public
@property
def instance(self) -> DagsterInstance:
"""DagsterInstance: The current Dagster instance."""
return self._op_execution_context.instance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prha how are are you in thinking about subsetting the instance API? This would be the time to start to introduce an alternative interface.

@jamiedemaria jamiedemaria force-pushed the jamie/context-interface branch from 59fd5df to 7891ea4 Compare September 20, 2023 18:23
@jamiedemaria jamiedemaria force-pushed the jamie/subclass-asset-context branch from 9d2c497 to 92457f9 Compare September 20, 2023 18:23
def selected_asset_keys(self) -> AbstractSet[AssetKey]:
return self._selected_asset_keys

# TODO - both get_asset_provenance and get_assets_code_version query the instance - do we want to
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to keep get_asset_code_version and get_asset_provenance? they would allow users to get the code version/provenance for ay asset (not just those in the assets definition), but require instance queries.

I think i'm inclined to remove them and add back if we get requests, but not strongly tied to that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sean suggested switching to a fetch prefix instead of get #16596 (comment) so that is another option

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to investigate where get_asset_provenance is used but we should definitely not have get_asset_code_version that returns a value from the last materialization (the last materialization code version is on the provenance anyway).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OpExecutionContext.get_asset_provenance is used to create the ext context here, used in the init of the AssetExecutionContext in this PR, in an asset in the toys project, and in a test for data versioning.

@jamiedemaria jamiedemaria force-pushed the jamie/subclass-asset-context branch from e688144 to 27152b9 Compare September 22, 2023 13:08
@jamiedemaria jamiedemaria force-pushed the jamie/context-interface branch from 9ab63fd to 65b6c1e Compare September 22, 2023 14:27
@jamiedemaria jamiedemaria force-pushed the jamie/subclass-asset-context branch from 27152b9 to d01d5d2 Compare September 22, 2023 14:27
@jamiedemaria jamiedemaria force-pushed the jamie/context-interface branch from 65b6c1e to 587dde2 Compare September 22, 2023 17:46
@jamiedemaria jamiedemaria force-pushed the jamie/subclass-asset-context branch from 2e71fd1 to 49b0a60 Compare September 22, 2023 17:46
@jamiedemaria jamiedemaria force-pushed the jamie/context-interface branch from 587dde2 to 51620ff Compare September 26, 2023 14:06
@jamiedemaria jamiedemaria force-pushed the jamie/subclass-asset-context branch from 1448509 to 5ba005b Compare September 26, 2023 14:06
@jamiedemaria
Copy link
Contributor Author

closing in favor of #16761 - I'll open a new PR with method deprecations in stages as we align on what the new API should be

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants