Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[brownfield-essentials] Refactor ExternalExecutionTask to better support multiple environments #15892

Merged
merged 2 commits into from
Aug 22, 2023

Conversation

smackesey
Copy link
Collaborator

@smackesey smackesey commented Aug 17, 2023

Summary & Motivation

This is another iteration of the external execution API, stacking on and motivated by @alangenfeld's DockerExecutionTask in #15820. The goal here is to move toward base external execution APIs that can easily be extended to different environments.

  • Split ExternalExecutionTask into:
    • ExternalExecutionTaskBase: an abstract base class that provides extension points at _input_context_manager, _output_context_manager, and _launch.
    • ExternalExecutionSubprocessTask: implements ExternalExecutionTaskBase for the subprocess case.
  • Refactor DockerExecutionTask:
    • Inherit from ExternalExecutionTaskBase
    • Instead of overriding run, override _launch, _input_context_manager, _output_context_manager

Detailed explanation of current design

Let's call the combination of the dagster-external library and associated set of orchestration-side abstractions the "Dagster External API". An "adapter" is an environment-specific implementation of the Dagster externals API. Currently there are two adapters in master: subprocess and Docker. In these PR, these are each defined with similar formatting in dedicated modules (dagster._core.external_execution.subprocess and dagster_docker.external_resource).

An adapter should define the following parts:

  • AdapterExecutionTask: A class inheriting from abstract base class ExternalExecutionTask (e.g. SubprocessExecutionTask)
  • AdapterExecutionResource: A resource inheriting from ExternalExecutionResource (e.g. SubprocessExecutionResource)
  • AdapterTaskParams: A dataclass inheriting from ExternalTaskParams (e.g. SubprocessTaskParams)
  • AdapterTaskIOParams An optional dataclass inheriting from ExternalTaskIOParams (e.g. SubprocessTaskIOParams)

The user-facing component of an adapter is the AdapterExecutionResource. This exposes a method run that is responsible for taking a flat list of user-provided parameters, instantiating an AdapterExternalTask, and calling run on this instance. run returns once the AdapterExternalTask has completed successfully and all associated resources have been cleaned up. run throws an error if AdapterExternalTask is not successful.

In order to provide common infrastructure for all adapters, the AdapterExecutionTask uses different channels for parameters standardized across the Dagster Externals API ("API parameters") and those that are specific to an adapter ("adapter parameters"). This distinction is immaterial to the user, who mixes these parameters together at the level of the configuration of AdapterResource and call to AdapterResource.run.

TheAdapterResource.run method makes the distinction, partitioning the params into the API parameters and adapter parameters. The API parameters are passed to the AdapterExecutionTask constructor. The adapter parameters are packaged into an instance of AdapterTaskParams and passed to AdapterExecutionTask.run.

The adapter task is responsible for launching the external task, supplying it with orchestration context info, and reading notifications streamed back during process execution. Most of the plumbing for this is provided by the base ExternalExecutionTask. Adapter tasks plug in by defining three required methods:

  • _input_context_manager: returns a context manager that wraps _launch and handles IPC input (i.e. writing context) to the external task. This context manager must yield an instance of AdapterIOTaskParams, which at a minimum contains environment variables that dagster-external uses to establish communication with the orchestration process.
  • _output_context_manager: returns a context manager that wraps _launch and handles IPC output from the external task. Just like the input context manager, this context manager also yields an instance of AdapterIOTaskParams.
  • _launch: launches the external process and monitors it for completion. It returns if execution was successful, throws an error if not (e.g. a subprocess with non-zero exit). It does not need to handle any I/O or associated resource cleanup-- this is done by the base-class provided plumbing and adapter input/output context managers. _launch receives the AdapterTaskParams instance provided by the calling AdapterExecutionResource as well as the two AdapterIOTaskParams instances yielded by the context managers. It may use arbitrary logic to combine these parameters to configure the external system.

NOTE: @alangenfeld has suggested composition as an alternative to inheritance, and I'm not opposed to that. I thought about trying to implement it but the path forward wasn't clear, whereas moving to an ABC was a straightforward improvement over the status quo. We can continue iterating.

How I Tested These Changes

Existing test suite.

@smackesey smackesey force-pushed the sean/dagster-external-refactor branch 2 times, most recently from 0618b7e to 22020ee Compare August 17, 2023 14:15
DAGSTER_EXTERNAL_ENV_KEYS["host"]: "host.docker.internal",
DAGSTER_EXTERNAL_ENV_KEYS["port"]: port,
}
yield DockerTaskIOParams(env=env, ports=ipc_ports)


class DockerExecutionResource(ExternalExecutionResource):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any thoughts on how to improve this layer ? Should ExternalExecutionIOMode be shared or should we have separate mode enums? Can you compose config nicely along ConfigurableResource inheritance or should we change tactics here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should ExternalExecutionIOMode be shared or should we have separate mode enums?

It should be shared because it is used by both the orchestration and transform process-- i.e. dagster-externals uses it and that is shared between all orch-side adapters (what are we calling these things?)

any thoughts on how to improve this layer? ... Can you compose config nicely along ConfigurableResource inheritance or should we change tactics here

Not clear on exactly what you mean by "compose config nicely along ConfigurableResource inheritance". If the question is just whether you can add additional fields with ConfigurableResource inheritance, the answer is yes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I was thinking about was that the valid input/output modes differ between docker and subprocess, and ideally thats reflected in the config. I guess one answer is to sort of mirror the TaskIOParams tree with these resources and only have the modes defined on the subprocess/docker resources

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I was thinking about was that the valid input/output modes differ between docker and subprocess, and ideally thats reflected in the config.

Yeah this is a good point.

I think one solution is to type the resources with Literal instead of the Enum class and just use the enum class inside our internal machinery:

class DockerExecutionResource(ExternalExecutionResource):
    input_mode: Literal["file", "socket"]  # overrides input mode from above

That currently doesn't work but I think it might be a minor addition to the config machinery to make it work, since that machinery does accept enums and this is effectively an inline enum.

@smackesey smackesey requested a review from schrockn August 17, 2023 15:43
@alangenfeld alangenfeld force-pushed the al/08-11-dagster-external_docker_asset branch 6 times, most recently from 44078a5 to 327506e Compare August 17, 2023 21:48
Base automatically changed from al/08-11-dagster-external_docker_asset to master August 17, 2023 22:22
@smackesey smackesey force-pushed the sean/dagster-external-refactor branch 2 times, most recently from 25fc082 to 7cac386 Compare August 18, 2023 14:16
@smackesey smackesey marked this pull request as ready for review August 18, 2023 14:39
@smackesey smackesey force-pushed the sean/dagster-external-refactor branch from 7cac386 to 647f687 Compare August 18, 2023 21:58
@github-actions
Copy link

Deploy preview for dagit-storybook ready!

✅ Preview
https://dagit-storybook-3aa4rbb7p-elementl.vercel.app
https://sean-dagster-external-refactor.components-storybook.dagster-docs.io

Built with commit 647f687.
This pull request is being automatically deployed with vercel-action

@github-actions
Copy link

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-e8hjkf7q5-elementl.vercel.app
https://sean-dagster-external-refactor.core-storybook.dagster-docs.io

Built with commit 647f687.
This pull request is being automatically deployed with vercel-action

@smackesey smackesey changed the base branch from master to sean/externals-no-orchestration August 21, 2023 15:00
@smackesey smackesey force-pushed the sean/dagster-external-refactor branch from 647f687 to 45cdf4c Compare August 21, 2023 15:00
@smackesey smackesey force-pushed the sean/externals-no-orchestration branch from 733612e to 4f3a22d Compare August 21, 2023 15:12
@smackesey smackesey force-pushed the sean/dagster-external-refactor branch from 45cdf4c to 068b9f1 Compare August 21, 2023 15:12
Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all feels prematurely abstracted to me.

Please push the "mode" stuff to be subprocess-specific only to contain and eventually eliminate it.

Push out a quick PR to renaming ExternalExecutionResource to SubprocessExecutionResource (or introduce a trivial base class). I'd like to start writing against that API asap.

Comment on lines 12 to 13
input_mode: ExternalExecutionIOMode = Field(default="stdio")
output_mode: ExternalExecutionIOMode = Field(default="stdio")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the inclusion of modes on this generic class is upping my priority of getting rid of the concept entirely

Comment on lines 27 to 36
@dataclass
class SubprocessTaskParams(ExternalTaskParams):
command: Sequence[str]
cwd: Optional[str] = None
env: Mapping[str, str] = field(default_factory=dict)


@dataclass
class SubprocessTaskIOParams(ExternalTaskIOParams):
stdio_fd: Optional[int] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strongly agree with alex that we should not do this "family inheritance" approach.

Base automatically changed from sean/externals-no-orchestration to master August 21, 2023 16:57
@smackesey smackesey changed the base branch from master to sean/external-tail-file August 21, 2023 18:30
@smackesey smackesey force-pushed the sean/dagster-external-refactor branch from 068b9f1 to d836ca6 Compare August 21, 2023 18:30
@smackesey smackesey force-pushed the sean/external-tail-file branch 2 times, most recently from fa108d2 to b6634ea Compare August 21, 2023 21:05
Base automatically changed from sean/external-tail-file to master August 21, 2023 21:28
@smackesey smackesey force-pushed the sean/dagster-external-refactor branch from d836ca6 to ff4643a Compare August 22, 2023 10:32
@smackesey smackesey force-pushed the sean/dagster-external-refactor branch from 42018fb to 00d6815 Compare August 22, 2023 11:24
@smackesey
Copy link
Collaborator Author

smackesey commented Aug 22, 2023

@schrockn OK this PR has been updated to confine IO modes on the orchestration side to only the subprocess adapter.

There is an upstack PR that removes IO modes entirely: #16008

Can build off that upstack PR to move away from "family inheritance".

@smackesey smackesey requested a review from schrockn August 22, 2023 11:45
Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments. I think we should halt this approach.

@smackesey
Copy link
Collaborator Author

See comments. I think we should halt this approach.

IO mode removal is stacked on this PR. The simplest way to proceed is just to merge this and then IO mode removal. I am experimenting with getting away from Task entirely on top of the IO mode removal one, and having per-adapter params extracted like this is an easier state to work with than the existing "override run with a bunch of ad hoc conditionals" logic.

@schrockn schrockn self-requested a review August 22, 2023 13:04
Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per slack conversation going to press forward with this to allow further refactors to continue.

Posting here for posterity:

Screenshot 2023-08-22 at 6 05 29 AM Screenshot 2023-08-22 at 6 05 34 AM

@smackesey smackesey merged commit 4236b60 into master Aug 22, 2023
@smackesey smackesey deleted the sean/dagster-external-refactor branch August 22, 2023 13:07
sirawats pushed a commit to sirawats/dagster that referenced this pull request Aug 24, 2023
…ort multiple environments (dagster-io#15892)

## Summary & Motivation

This is another iteration of the external execution API, stacking on and
motivated by @alangenfeld's `DockerExecutionTask` in dagster-io#15820. The goal
here is to move toward base external execution APIs that can easily be
extended to different environments.

- Split `ExternalExecutionTask` into:
- `ExternalExecutionTaskBase`: an abstract base class that provides
extension points at `_input_context_manager`, `_output_context_manager`,
and `_launch`.
- `ExternalExecutionSubprocessTask`: implements
`ExternalExecutionTaskBase` for the subprocess case.
- Refactor `DockerExecutionTask`:
    - Inherit from `ExternalExecutionTaskBase`
- Instead of overriding `run`, override `_launch`,
`_input_context_manager`, `_output_context_manager`

### Detailed explanation of current design

Let's call the combination of the `dagster-external` library and
associated set of orchestration-side abstractions the "Dagster External
API". An "adapter" is an environment-specific implementation of the
Dagster externals API. Currently there are two adapters in master:
subprocess and Docker. In these PR, these are each defined with similar
formatting in dedicated modules
(`dagster._core.external_execution.subprocess` and
`dagster_docker.external_resource`).

An adapter should define the following parts:

- `AdapterExecutionTask`: A class inheriting from abstract base class
`ExternalExecutionTask` (e.g. `SubprocessExecutionTask`)
- `AdapterExecutionResource:` A resource inheriting from
`ExternalExecutionResource` (e.g. `SubprocessExecutionResource`)
- `AdapterTaskParams`: A dataclass inheriting from `ExternalTaskParams`
(e.g. `SubprocessTaskParams`)
- `AdapterTaskIOParams` An optional dataclass inheriting from
`ExternalTaskIOParams` (e.g. `SubprocessTaskIOParams`)

The user-facing component of an adapter is the
`AdapterExecutionResource`. This exposes a method `run` that is
responsible for taking a flat list of user-provided parameters,
instantiating an `AdapterExternalTask`, and calling `run` on this
instance. `run` returns once the `AdapterExternalTask` has completed
successfully and all associated resources have been cleaned up. `run`
throws an error if `AdapterExternalTask` is not successful.

In order to provide common infrastructure for all adapters, the
`AdapterExecutionTask` uses different channels for parameters
standardized across the Dagster Externals API ("API parameters") and
those that are specific to an adapter ("adapter parameters"). This
distinction is immaterial to the user, who mixes these parameters
together at the level of the configuration of `AdapterResource` and call
to `AdapterResource.run`.

The`AdapterResource.run` method makes the distinction, partitioning the
params into the API parameters and adapter parameters. The API
parameters are passed to the `AdapterExecutionTask` constructor. The
adapter parameters are packaged into an instance of `AdapterTaskParams`
and passed to `AdapterExecutionTask.run`.

The adapter task is responsible for launching the external task,
supplying it with orchestration context info, and reading notifications
streamed back during process execution. Most of the plumbing for this is
provided by the base `ExternalExecutionTask`. Adapter tasks plug in by
defining three required methods:

- `_input_context_manager`: returns a context manager that wraps
`_launch` and handles IPC input (i.e. writing context) to the external
task. This context manager must yield an instance of
`AdapterIOTaskParams`, which at a minimum contains environment variables
that `dagster-external` uses to establish communication with the
orchestration process.
- `_output_context_manager`: returns a context manager that wraps
`_launch` and handles IPC output from the external task. Just like the
input context manager, this context manager also yields an instance of
`AdapterIOTaskParams`.
- `_launch`: launches the external process and monitors it for
completion. It returns if execution was successful, throws an error if
not (e.g. a subprocess with non-zero exit). It does not need to handle
any I/O or associated resource cleanup-- this is done by the base-class
provided plumbing and adapter input/output context managers. `_launch`
receives the `AdapterTaskParams` instance provided by the calling
`AdapterExecutionResource` as well as the two `AdapterIOTaskParams`
instances yielded by the context managers. It may use arbitrary logic to
combine these parameters to configure the external system.

**NOTE**: @alangenfeld has suggested composition as an alternative to
inheritance, and I'm not opposed to that. I thought about trying to
implement it but the path forward wasn't clear, whereas moving to an ABC
was a straightforward improvement over the status quo. We can continue
iterating.

## How I Tested These Changes

Existing test suite.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants