[ext] Use MaterializeResult for ext_protocol #16624

smackesey · 2023-09-19T15:58:34Z

Summary & Motivation

Make ext rely on yielding MaterializeResult to register metadata and data version, as opposed to modifying OpExecutionContext. This allows results to stream back as they are reported rather than being bulk reported when computation completes.

This required adding a report_asset_materialization method that can be called on the ExtContext. Calling it queues a MaterializeResult on the orchestration side. The queue can be cleared from the ExtOrchestrationContext at any time by calling ExtOrchestrationContext.get_results. Errors are raised on any attempt to materialize an asset twice, or to report a data version or metadata after materialization.

Once the ext_protocol block exits, any as-yet-unmaterialized assets are queued on the MessageHandler, so that calling ExtOrchestrationContext.get_results after exit will yield all the remaining MaterializeResult objects. Note that yielding from this method after the ext_protocol block closes is required to guarantee that all buffered data is yielded, since there is no guarantee that all messages have been processed before ext_protocol completes its exit routine.

To head off the confusing scenario where a user forgets to yield outside the block and sees auto-created materializations that lack any reported metadata, we call set_require_typed_event_stream on the OpExecutionContext. This will cause an error during output processing if an expected output was not returned/yielded.
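The buffering rules described above can be illustrated with a toy model. This is a minimal sketch, not Dagster's implementation: `ToyHandler`, `ToyResult`, and `close` are hypothetical stand-ins, and only the method names `report_asset_metadata`, `report_asset_materialization`, and `get_results` echo names mentioned in this PR. In particular, the toy treats "as-yet-unmaterialized" as "has buffered metadata but no explicit materialization", which is an assumption made for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Set


@dataclass
class ToyResult:
    """Hypothetical stand-in for MaterializeResult."""

    asset_key: str
    metadata: Dict[str, object]


@dataclass
class ToyHandler:
    """Hypothetical buffer that mimics the rules in the PR description."""

    _metadata: Dict[str, Dict[str, object]] = field(default_factory=dict)
    _materialized: Set[str] = field(default_factory=set)
    _queue: List[ToyResult] = field(default_factory=list)

    def report_asset_metadata(self, asset_key: str, label: str, value: object) -> None:
        # Metadata may not be reported once the asset has been materialized.
        if asset_key in self._materialized:
            raise RuntimeError(f"{asset_key} is already materialized")
        self._metadata.setdefault(asset_key, {})[label] = value

    def report_asset_materialization(self, asset_key: str) -> None:
        # Materializing the same asset twice is an error.
        if asset_key in self._materialized:
            raise RuntimeError(f"{asset_key} was materialized twice")
        self._materialized.add(asset_key)
        self._queue.append(ToyResult(asset_key, self._metadata.pop(asset_key, {})))

    def close(self) -> None:
        # On exit, queue results for assets that were never explicitly materialized.
        for asset_key in list(self._metadata):
            self.report_asset_materialization(asset_key)

    def get_results(self) -> Iterator[ToyResult]:
        # Drain whatever is queued right now; may be called at any time.
        while self._queue:
            yield self._queue.pop(0)


handler = ToyHandler()
handler.report_asset_metadata("foo", "rows", 10)
handler.report_asset_materialization("foo")
handler.report_asset_metadata("bar", "rows", 3)
streamed = list(handler.get_results())   # "foo" is available before close
handler.close()
remaining = list(handler.get_results())  # "bar" is only flushed at close
```

The final drain after `close` is the step a user could forget, which is the scenario the `set_require_typed_event_stream` check is meant to catch.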

How I Tested These Changes

New unit tests.

smackesey · 2023-09-19T15:58:45Z

Current dependencies on/for this PR:

master
- PR Add require_typed_event_stream to compute contexts #16706
  - PR [ext] Use MaterializeResult for ext_protocol #16624 👈
    - PR [ext] asset check support #16466
      - PR [pipes] databricks unstructured log forwarding #16674

This comment was auto-generated by Graphite.

alangenfeld · 2023-09-19T16:42:51Z

python_modules/dagster/dagster/_core/ext/subprocess.py

@@ -88,6 +89,9 @@ def run(
                raise DagsterExternalExecutionError(
                    f"External execution process failed with code {process.returncode}"
                )
+            _ext_context = ext_context


Did you get some error that caused you to do this re-assign? From what I understand, the values from a with statement are not scoped to the opened block; you have access to them post exit.

I think we can just do ext_context.get_materialize_results() and have some explicit runtime error if it's called too early. Might need to have ExtOrchestrationContext(ContextManager) to do it cleanly.

> Did you get some error that caused you to do this re-assign? From what I understand, the values from a with statement are not scoped to the opened block; you have access to them post exit.

Hmm you're right. Not sure why I thought the yielded var became unbound.

> I think we can just do ext_context.get_materialize_results() and have some explicit runtime error if it's called too early. Might need to have ExtOrchestrationContext(ContextManager) to do it cleanly.

I will do a quick rejigger.

> Might need to have ExtOrchestrationContext(ContextManager) to do it cleanly

I take this back. I think what you really care about is that the message_reader is in a post-exit state. The context obj will exit before the reader unless you do extra goofy shit, so you really want to just check on that thing.

Did the rejigger. I just made ext_protocol set an is_task_finished on ExtOrchestrationContext when it exits. You are correct that we just care about the message reader, and we know the message reader has exited if ext_protocol has.
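Two points from this thread can be checked in plain Python: a name bound by `with ... as` stays bound after the block exits, and a flag set on exit lets a method raise an explicit error when it is called too early. The following is a rough sketch only; `ToyContext` and `toy_protocol` are hypothetical, and the real ext_protocol does considerably more.

```python
from contextlib import contextmanager
from typing import Iterator, List


class ToyContext:
    """Hypothetical stand-in for ExtOrchestrationContext."""

    def __init__(self) -> None:
        self.is_task_finished = False
        self._results: List[str] = ["result-a"]

    def get_materialize_results(self) -> List[str]:
        # Explicit runtime error if called before the block has exited.
        if not self.is_task_finished:
            raise RuntimeError("called before the protocol block exited")
        return list(self._results)


@contextmanager
def toy_protocol() -> Iterator[ToyContext]:
    ctx = ToyContext()
    yield ctx
    # Runs on normal exit of the `with` block.
    ctx.is_task_finished = True


with toy_protocol() as ext_context:
    try:
        ext_context.get_materialize_results()
        early_call_failed = False
    except RuntimeError:
        early_call_failed = True

# `ext_context` is still bound here: `with` does not introduce a new scope.
results = ext_context.get_materialize_results()
```

No re-assignment such as `_ext_context = ext_context` is needed for the post-exit call to work.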

alangenfeld · 2023-09-19T17:16:26Z

python_modules/dagster/dagster/_core/ext/utils.py

            context_data=context_data,
            message_handler=msg_handler,
            context_injector_params=ci_params,
            message_reader_params=mr_params,
        )
+        yield ext_context
+        ext_context.is_task_finished = True


May need finally block protection to ensure this is set on an exception-driven exit? Trying to think if there's a case where you exit the CM via exception and still have access to ExtOrchestrationContext such that it may be called.

This needs a test either way to confirm behavior
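The concern about exception-driven exit can be reproduced with a small, self-contained sketch. The names here (`Flag`, `without_finally`, `with_finally`, `exit_via_exception`) are hypothetical and exist only for illustration; the point is standard `contextlib.contextmanager` behavior, where code after a bare `yield` is skipped if the body of the `with` block raises.

```python
from contextlib import contextmanager
from typing import Callable, ContextManager, Iterator


class Flag:
    def __init__(self) -> None:
        self.finished = False


@contextmanager
def without_finally(flag: Flag) -> Iterator[Flag]:
    yield flag
    # Skipped when the `with` body raises: the exception is re-raised at `yield`.
    flag.finished = True


@contextmanager
def with_finally(flag: Flag) -> Iterator[Flag]:
    try:
        yield flag
    finally:
        # Runs on both normal and exception-driven exit.
        flag.finished = True


def exit_via_exception(cm_factory: Callable[[Flag], ContextManager[Flag]]) -> bool:
    """Leave the context manager through an exception and report the flag."""
    flag = Flag()
    try:
        with cm_factory(flag):
            raise ValueError("boom")
    except ValueError:
        pass
    return flag.finished


unprotected = exit_via_exception(without_finally)
protected = exit_via_exception(with_finally)
```

A test along these lines, written against the real context object, would confirm which behavior ext_protocol has.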

alangenfeld · 2023-09-19T17:16:54Z

python_modules/dagster-ext/dagster_ext_tests/test_external_execution.py

@@ -313,6 +313,9 @@ def subproc_run(context: AssetExecutionContext):
            extras=extras,
        ) as ext_context:
            subprocess.run(cmd, env=ext_context.get_external_process_env_vars(), check=False)
+            _ext_context = ext_context


nit: grab this one too

alangenfeld · 2023-09-19T17:23:05Z

python_modules/dagster/dagster/_core/ext/context.py

@@ -108,6 +119,7 @@ class ExtOrchestrationContext:
    message_handler: ExtMessageHandler
    context_injector_params: ExtParams
    message_reader_params: ExtParams
+    is_task_finished: bool = False


nit: consider a name that more explicitly maps to what we're tracking versus what we believe that corresponds to, i.e. has_exited_cm or something in that direction

Hmm-- something like has_exited_cm feels very opaque to me.

Feels like we need a term for what happens in the scope of an ext_protocol block-- maybe an "ext session"? Then it could be is_ext_session_closed, meaning that any comms with the external process are terminated, which is what happens on ext_protocol exit.

ext_session_closed sounds great to me

schrockn · 2023-09-19T17:39:42Z

What does the multi asset case look like here where we want to be able to stream materializations rather than wait for op completion?

schrockn · 2023-09-19T17:50:59Z

This makes me definitely want to do an approach more like @dbt_assets where we figure out how to stream these out using yield from

smackesey · 2023-09-19T18:54:03Z

> What does the multi asset case look like here where we want to be able to stream materializations rather than wait for op completion?

We would need to provide a method on the ExtContext called report_asset_materialized or something to signal that all metadata etc has been reported for an asset, then we could yield it. Then you could interleave yielding results with polling code in the ext_protocol block, but you would still need to do a final yield after exiting the block and the shutdown of the message reader (using subprocess here as an example):

with ext_protocol(...) as ext_context:

    # ...
    while True:
        yield from ext_context.message_handler.get_materialize_results()
        if process.poll() is not None:
            break
    # ...
yield from ext_context.get_materialize_results()
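The shape of that loop, including why the final drain after the block is needed, can be simulated without Dagster. This is a sketch under stated assumptions: a `queue.Queue` stands in for the message handler, a reader thread stands in for the message reader, and `CHILD_CODE`, `pump`, `drain`, and `run_and_stream` are hypothetical names. The child process simply prints one line per "asset".

```python
import queue
import subprocess
import sys
import threading
import time
from typing import IO, Iterator

# Hypothetical external process: reports three assets, one per line.
CHILD_CODE = (
    "import time\n"
    "for i in range(3):\n"
    "    print(f'asset_{i}', flush=True)\n"
    "    time.sleep(0.05)\n"
)


def pump(stream: IO[str], messages: "queue.Queue[str]") -> None:
    """Reader thread body: forward each line from the child into the queue."""
    for line in stream:
        messages.put(line.strip())


def drain(messages: "queue.Queue[str]") -> Iterator[str]:
    """Yield everything currently queued without blocking."""
    while True:
        try:
            yield messages.get_nowait()
        except queue.Empty:
            return


def run_and_stream() -> Iterator[str]:
    messages: "queue.Queue[str]" = queue.Queue()
    with subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE], stdout=subprocess.PIPE, text=True
    ) as process:
        assert process.stdout is not None
        reader = threading.Thread(target=pump, args=(process.stdout, messages))
        reader.start()
        while True:
            yield from drain(messages)  # stream results as they arrive
            if process.poll() is not None:
                break
            time.sleep(0.01)
        reader.join()  # the reader may still be forwarding lines at this point
    # Final drain: picks up anything queued after the last in-loop drain.
    yield from drain(messages)


streamed = list(run_and_stream())
```

Because the process can exit while the reader still has lines to forward, dropping the last `yield from` could lose results, which matches the requirement stated in the PR description.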

schrockn

Have to hop into a meeting but truly want to avoid a Union as the return type. We'll regret that. One way of doing things please.

schrockn · 2023-09-19T19:58:07Z

python_modules/dagster/dagster/_core/ext/client.py

@@ -21,7 +23,7 @@ def run(
        *,
        context: OpExecutionContext,
        extras: Optional[ExtExtras] = None,
-    ) -> None: ...
+    ) -> Union["MaterializeResult", Tuple["MaterializeResult", ...]]: ...


Would very much want this type signature to not be "dual state" like this. We shouldn't specialize the one versus many use case

Yeah this is really unfortunate and I want to avoid it. The impetus is that the framework complains if you return a 1-tuple when expecting a single result, and I wanted to be able to just return ExtClient.run from an @asset.

Possible solutions:

1. Go with the iterator approach described above

2. Adjust framework to accept a 1-tuple when expecting a single result

3. Always return a tuple and require the user to unpack to a single object for a 1-tuple

4. Provide a separate method get_materialize_result that returns a single MaterializeResult.

Atm I like solution (1).
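To make the appeal of the iterator approach concrete, here is a minimal sketch in which one return type covers both the single-asset and the multi-asset case, so no Union is needed. `ToyResult` and this `run` function are hypothetical and are not the ExtClient.run signature from the PR.

```python
from typing import Iterator, List, NamedTuple


class ToyResult(NamedTuple):
    """Hypothetical stand-in for MaterializeResult."""

    asset_key: str


def run(asset_keys: List[str]) -> Iterator[ToyResult]:
    # One signature for one or many results: always an iterator.
    for key in asset_keys:
        yield ToyResult(key)


def single_asset() -> Iterator[ToyResult]:
    # The caller does not unpack a 1-tuple; it just forwards the stream.
    yield from run(["a"])


def multi_asset() -> Iterator[ToyResult]:
    yield from run(["a", "b"])


single = list(single_asset())
multi = list(multi_asset())
```

The trade-off is that an asset body can no longer simply `return` the client call; it has to `yield from` it.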

schrockn · 2023-09-19T19:59:18Z

python_modules/dagster/dagster/_core/ext/context.py

        metadata_value = self._resolve_metadata_value(value, type)
-        self._context.add_output_metadata({label: metadata_value}, output_name)
+        self._metadata.setdefault(resolved_asset_key, {})[label] = metadata_value


prefer a non-mutating version of this

As in reassigning the whole of self._metadata whenever a new value is added?

self._metadata = {
    **self._metadata,
    resolved_asset_key: {
        **self._metadata.get(resolved_asset_key, {}),
        label: metadata_value,
    },
}

IMO this is a lot harder to parse and mutation is appropriate here since we are buffering values as they come in.

I would prefer more explicit and less clever code

if resolved_asset_key not in self._metadata:
    self._metadata[resolved_asset_key] = {}
self._metadata[resolved_asset_key][label] = metadata_value

rather than what is currently there:

self._metadata.setdefault(resolved_asset_key, {})[label] = metadata_value

which is less code but hurts my brain
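For comparison, the three variants discussed in this thread can be written as standalone functions and checked against each other. This is only an illustrative sketch: the function names are made up, and plain strings stand in for the asset key and metadata value.

```python
from typing import Dict

Metadata = Dict[str, Dict[str, object]]


def add_with_setdefault(metadata: Metadata, key: str, label: str, value: object) -> Metadata:
    # Compact form: create the inner dict if missing, then assign into it.
    metadata.setdefault(key, {})[label] = value
    return metadata


def add_explicit(metadata: Metadata, key: str, label: str, value: object) -> Metadata:
    # Explicit form: same mutation, spelled out in two steps.
    if key not in metadata:
        metadata[key] = {}
    metadata[key][label] = value
    return metadata


def add_non_mutating(metadata: Metadata, key: str, label: str, value: object) -> Metadata:
    # Non-mutating form: build a new outer and inner dict on every call.
    return {**metadata, key: {**metadata.get(key, {}), label: value}}


a = add_with_setdefault({"foo": {"x": 1}}, "foo", "y", 2)
b = add_explicit({"foo": {"x": 1}}, "foo", "y", 2)
original = {"foo": {"x": 1}}
c = add_non_mutating(original, "foo", "y", 2)
```

All three produce the same mapping; only the non-mutating form leaves its input untouched, at the cost of copying on each report.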

schrockn

Let's see what the iterator version looks like

github-actions · 2023-09-20T19:25:50Z

Deploy preview for dagit-storybook ready!

✅ Preview
https://dagit-storybook-bb18ij0s5-elementl.vercel.app
https://sean-ext-use-materialize-result.components-storybook.dagster-docs.io

Built with commit 8830b33.
This pull request is being automatically deployed with vercel-action

github-actions · 2023-09-20T19:26:43Z

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-gqs4791av-elementl.vercel.app
https://sean-ext-use-materialize-result.core-storybook.dagster-docs.io

Built with commit 903ce6f.
This pull request is being automatically deployed with vercel-action

smackesey · 2023-09-20T19:27:34Z

pyproject.toml

-# We use `id` in many places and almost never want to use the python builtin.
-builtins-ignorelist = ["id"]
+# Id and type are frequently helpful as local variable or parameter names.
+builtins-ignorelist = ["id", "type"]


We already exclude id, which is a Python builtin, and type is in the same category of common and useful variable name. IMO we should also change the keyword arg on report_asset_metadata to just type

Did not realize that we do this for id. I'm not convinced that is a good idea either. I would rather not make this policy decision coupled with this PR.

github-actions · 2023-09-20T19:28:47Z

Deploy preview for dagster-docs ready!

Preview available at https://dagster-docs-i0i5axror-elementl.vercel.app
https://sean-ext-use-materialize-result.dagster.dagster-docs.io

Direct link to changed pages:

https://dagster-docs-i0i5axror-elementl.vercel.app
https://sean-ext-use-materialize-result.dagster.dagster-docs.io/guides/limiting-concurrency-in-data-pipelines

smackesey · 2023-09-20T19:38:35Z

Let's see what the iterator version looks like

It's up.

schrockn

Sweet. I would like @rexledesma to take a look and approve as well to ensure that this is usable in the dbt integration.

schrockn · 2023-09-22T12:04:46Z

python_modules/dagster/dagster/_core/ext/utils.py

@@ -185,6 +185,12 @@ def extract_message_or_forward_to_stdout(handler: "ExtMessageHandler", log_line:
        sys.stdout.writelines((log_line, "\n"))


+_FAIL_TO_YIELD_ERROR_MESSAGE = (
+    "Did you forget to `yield from ext_context.get_results()`? This should be called once after the"


s/This/get_results/

rexledesma · 2023-09-22T21:18:14Z

Left a comment on the original PR that introduces the context taint: #16706 (comment). I'll have confidence in this once we have a test case for its usage against a subsetted multi asset.

smackesey mentioned this pull request Sep 19, 2023

[ext] asset check support #16466

Merged

smackesey force-pushed the sean/ext-use-materialize-result branch 2 times, most recently from a0ae406 to df0afcb Compare September 19, 2023 16:29

smackesey marked this pull request as ready for review September 19, 2023 16:29

smackesey requested review from alangenfeld and schrockn September 19, 2023 16:29

alangenfeld reviewed Sep 19, 2023


smackesey force-pushed the sean/ext-use-materialize-result branch from df0afcb to dd06840 Compare September 19, 2023 16:57

smackesey requested a review from alangenfeld September 19, 2023 17:22

alangenfeld reviewed Sep 19, 2023


smackesey force-pushed the sean/ext-use-materialize-result branch 2 times, most recently from 1d81248 to 01edc65 Compare September 19, 2023 18:29

smackesey force-pushed the sean/ext-use-materialize-result branch from 01edc65 to 5230330 Compare September 19, 2023 19:08

schrockn requested changes Sep 19, 2023


smackesey force-pushed the sean/ext-use-materialize-result branch from 5230330 to 8830b33 Compare September 20, 2023 19:20

smackesey commented Sep 20, 2023


smackesey force-pushed the sean/ext-use-materialize-result branch from 8830b33 to bf82121 Compare September 20, 2023 19:38

smackesey requested a review from schrockn September 20, 2023 19:38

smackesey force-pushed the sean/ext-use-materialize-result branch 2 times, most recently from 57b514a to 7ccfd18 Compare September 20, 2023 23:49

smackesey mentioned this pull request Sep 20, 2023

[pipes] databricks unstructured log forwarding #16674

Merged

smackesey force-pushed the sean/explicit-mode branch from c733061 to eafa0ea Compare September 21, 2023 23:27

smackesey force-pushed the sean/ext-use-materialize-result branch from 7661113 to 5ba294c Compare September 21, 2023 23:27

smackesey force-pushed the sean/explicit-mode branch 3 times, most recently from f5469be to e6198f2 Compare September 22, 2023 11:18

smackesey force-pushed the sean/ext-use-materialize-result branch from 5ba294c to 698d78b Compare September 22, 2023 11:18

smackesey force-pushed the sean/explicit-mode branch from e6198f2 to 7124a54 Compare September 22, 2023 11:28

smackesey force-pushed the sean/ext-use-materialize-result branch 3 times, most recently from 6d635e3 to 76fe8e0 Compare September 22, 2023 12:00

schrockn requested a review from rexledesma September 22, 2023 12:05

schrockn approved these changes Sep 22, 2023


smackesey force-pushed the sean/explicit-mode branch from 7124a54 to d18dea9 Compare September 22, 2023 12:33

smackesey force-pushed the sean/ext-use-materialize-result branch 3 times, most recently from 82e09c3 to 47acf2c Compare September 22, 2023 14:27

smackesey force-pushed the sean/explicit-mode branch from d18dea9 to e9ff404 Compare September 22, 2023 19:04

smackesey force-pushed the sean/ext-use-materialize-result branch from 47acf2c to 762811c Compare September 22, 2023 19:04

smackesey force-pushed the sean/explicit-mode branch from e9ff404 to 71b478f Compare September 22, 2023 19:19

smackesey force-pushed the sean/ext-use-materialize-result branch from 762811c to 9a7b448 Compare September 22, 2023 19:19

Base automatically changed from sean/explicit-mode to master September 22, 2023 20:09

[ext] Use MaterializeResult for ext_protocol

02af50e

smackesey force-pushed the sean/ext-use-materialize-result branch 3 times, most recently from 979cbc7 to d884d4b Compare September 22, 2023 20:33

yield MaterializeResult

f7f8147

smackesey force-pushed the sean/ext-use-materialize-result branch from d884d4b to f7f8147 Compare September 22, 2023 20:39

smackesey merged commit f7b3fee into master Sep 22, 2023

smackesey deleted the sean/ext-use-materialize-result branch September 22, 2023 21:06
