
[ext] databricks EXT integration #15955

Merged: 2 commits into master on Sep 15, 2023

Conversation

@smackesey (Collaborator) commented Aug 21, 2023

Summary & Motivation

  • Add dagster-ext integration for Databricks. This is designed to interfere as little as possible with the official Databricks SDK: you pass Databricks SDK data structures to `ExtDatabricks`, and the only modification it makes is injecting the environment variables needed for EXT communication. The rest of the cluster configuration is left to the user. It is separate from the rest of the Databricks integration. (See the sketch after this list.)
  • Add example usage to dagster_databricks/README.md
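
A minimal sketch of the intended usage, pieced together from the summary above and the README example discussed later in this thread. The `ExtDatabricks` constructor argument (`client=`) and the `context=` kwarg to `run` are assumptions; the Databricks SDK calls are its standard public API, and in practice `ExtDatabricks` would be wired up as a Dagster resource rather than constructed inline:

```python
# Sketch only: the ExtDatabricks constructor/kwarg names are assumptions.
from dagster import OpExecutionContext, asset
from dagster_databricks import ExtDatabricks
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs


@asset
def databricks_asset(context: OpExecutionContext):
    # A plain Databricks SDK task spec: cluster configuration stays entirely
    # with the user; ExtDatabricks only injects the EXT env vars before
    # submitting the task and polling it to completion.
    task = jobs.SubmitTask(
        task_key="dagster-ext-example",
        new_cluster=compute.ClusterSpec(
            spark_version="12.2.x-scala2.12",
            node_type_id="i3.xlarge",
            num_workers=1,
        ),
        spark_python_task=jobs.SparkPythonTask(python_file="dbfs:/my_script.py"),
    )
    ExtDatabricks(client=WorkspaceClient()).run(task=task, context=context)
```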

How I Tested These Changes

New unit tests (though they are currently skipped on Buildkite)


@github-actions bot commented Aug 22, 2023

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-5lfkrj9h8-elementl.vercel.app
https://sean-numbers-databricks.core-storybook.dagster-docs.io

Built with commit fba44c4.
This pull request is being automatically deployed with vercel-action

Base automatically changed from sean/dagster-databricks-types to master on August 22, 2023 15:03
@smackesey smackesey force-pushed the sean/numbers-databricks branch from 2f23d4b to 353da68 on August 22, 2023 15:05
@smackesey smackesey changed the title from "[externals] Update numbers example for databricks and dbfs" to "[externals] databricks adapter" on Aug 22, 2023
@smackesey smackesey changed the base branch from master to sean/json-schema on August 24, 2023 12:21
@smackesey smackesey force-pushed the sean/numbers-databricks branch from 353da68 to 6062d8f on August 24, 2023 12:21
@smackesey smackesey force-pushed the sean/numbers-databricks branch 2 times, most recently from a47c7ed to f1dde17 on August 24, 2023 18:59
@smackesey smackesey changed the base branch from sean/json-schema to sean/externals-io-refactor on August 25, 2023 13:52
@smackesey smackesey force-pushed the sean/numbers-databricks branch from f1dde17 to cbd192d on August 25, 2023 13:52
@smackesey smackesey force-pushed the sean/numbers-databricks branch 2 times, most recently from 6af3f7e to 59bff80 on September 13, 2023 22:08
Comment on lines +414 to +424
class ExtBufferedFilesystemMessageWriterChannel(ExtBlobStoreMessageWriterChannel):
    def __init__(self, path: str, *, interval: float = 10):
        super().__init__(interval=interval)
        self._path = path

    def upload_messages_chunk(self, payload: IO, index: int) -> None:
        message_path = os.path.join(self._path, f"{index}.json")
        with open(message_path, "w") as f:
            f.write(payload.read())


Member:

DBFS exposes a traditional I/O interface in Python?

smackesey (Collaborator, Author):

Yes, it is mounted at /dbfs.
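
For readers unfamiliar with this detail: DBFS is FUSE-mounted at /dbfs on Databricks cluster nodes, so ordinary Python file I/O works against it, which is what the writer channel above relies on. A quick illustration (the path is arbitrary):

```python
import os

# DBFS is FUSE-mounted at /dbfs on Databricks clusters, so plain file I/O works.
# The path below is arbitrary and only for illustration.
os.makedirs("/dbfs/tmp/ext-example", exist_ok=True)
with open("/dbfs/tmp/ext-example/0.json", "w") as f:
    f.write('{"hello": "world"}')

with open("/dbfs/tmp/ext-example/0.json") as f:
    print(f.read())
```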

python_modules/libraries/dagster-databricks/README.md (outdated review thread, resolved)
Comment on lines 84 to 118
while True:
    run = self.client.jobs.get_run(run_id)
    if run.state.life_cycle_state in (
        jobs.RunLifeCycleState.TERMINATED,
        jobs.RunLifeCycleState.SKIPPED,
    ):
        if run.state.result_state == jobs.RunResultState.SUCCESS:
            return
        else:
            raise DagsterExternalExecutionError(
                f"Error running Databricks job: {run.state.state_message}"
            )
    elif run.state.life_cycle_state == jobs.RunLifeCycleState.INTERNAL_ERROR:
        raise DagsterExternalExecutionError(
            f"Error running Databricks job: {run.state.state_message}"
        )
    time.sleep(5)
Member:

I think we should log on every tick so the user is confident something is happening

Comment on lines 6 to 21
## EXT Example

This package includes a prototype API for launching databricks jobs with
Dagster's EXT protocol. There are two ways to use the API:

### (1) `ExtDatabricks` resource

The `ExtDatabricks` resource provides a high-level API for launching
databricks jobs using Dagster's EXT protocol.

It takes a single `databricks.sdk.service.jobs.SubmitTask` specification. After
setting up EXT communications channels (which by default use DBFS), it injects
the information needed to connect to these channels from Databricks into the
task specification. It then launches a Databricks job by passing the
specification to `WorkspaceClient.jobs.submit`. It polls the job state and
exits gracefully on success or failure:
Member:

I think we should lowercase "ext". It's not principled, but it looks cooler.

Member:

I might be wrong/misguided on this.

smackesey (Collaborator, Author):

let's discuss in standup

smackesey (Collaborator, Author):

We didn't end up discussing it, but I changed it. Personally, I'm ambivalent.

python_modules/libraries/dagster-databricks/README.md (two outdated review threads, resolved)
Comment on lines 51 to 52
context_injector: Optional[ExtContextInjector] = None,
message_reader: Optional[ExtMessageReader] = None,
Member:

rebase on my PR

with dbfs_tempdir(self.dbfs_client) as tempdir:
    self.tempdir = tempdir
    yield

Member:

i thought we were doing this sort of mutable state business

smackesey (Collaborator, Author):

It was necessary given the existing setup of ExtBlobStoreMessageReader. Updated with some changes to make it unnecessary.

dbfs_client.delete(tempdir, recursive=True)


class ExtDbfsContextInjector(ExtContextInjector):
Member:

It's probably not the most consequential thing, but I'm curious what the motivations would be for using this instead of ExtEnvContextInjector. Is it anything beyond concerns over env var size limits? Does Databricks call those limits out explicitly?

smackesey (Collaborator, Author):

In our meeting with Enigma, their engineer said that they frequently run into size limits when passing their "context" (in our ontology, extras) over the CLI, expected the same would happen with env vars, and explicitly suggested a DBFS-based mechanism.

@smackesey smackesey force-pushed the sean/numbers-databricks branch 7 times, most recently from 066249e to 684284d on September 14, 2023 21:13
@schrockn (Member) left a comment:

  • I'm not a fan of the implications of a setup method on the message reader. Do you anticipate needing that to modify state?
  • What is the plan for automated testing?

Comment on lines 109 to 115
Internally, `ExtDatabricks` is using the `ext_protocol` context manager to set
up communications. If you have existing code to launch/poll the job you do not
want to change, or you just want more control than is permitted by
`ExtDatabricks`, you can use this lower level API directly. All that is
necessary is that (1) your Databricks job be launched within the scope of the
`ext_process` context manager; (2) your job is launched on a cluster containing
the environment variables available on the yielded `ext_context`.
Member:

I would drop "internally". It makes people think they shouldn't use it. This is a first-class supported API. Actually just drop the first line.

If you have existing code to launch/poll the job you do not
want to change, or you just want more control than is permitted by
ExtDatabricks, you can use ext_protocol.
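
A hedged sketch of the lower-level flow described in this passage. `ext_protocol` and the `context_injector=`/`message_reader=` parameters appear elsewhere in this PR; the import paths, the DBFS message reader class name, and the accessor for the env vars on the yielded context are guesses and may not match the final code:

```python
# Sketch only: import paths and several names below are assumptions, not the
# PR's actual API surface.
from dagster import OpExecutionContext, asset
from databricks.sdk import WorkspaceClient

# Hypothetical import locations, shown for completeness only:
# from dagster import ext_protocol
# from dagster_databricks import ExtDbfsContextInjector, ExtDbfsMessageReader


@asset
def existing_databricks_job(context: OpExecutionContext):
    client = WorkspaceClient()
    # (1) The Databricks job must be launched within this scope...
    with ext_protocol(
        context=context,
        context_injector=ExtDbfsContextInjector(client=client),
        message_reader=ExtDbfsMessageReader(client=client),  # hypothetical name
    ) as ext_context:
        # (2) ...on a cluster whose environment includes the env vars carried
        # by ext_context (the accessor name below is a guess).
        env_vars = ext_context.get_external_process_env_vars()
        launch_and_poll_my_job(client, extra_env=env_vars)  # user-defined helper
```

The point is only that existing launch/poll code stays untouched; `ext_protocol` supplies the context injection and message reading that `ExtDatabricks` would otherwise handle.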

Comment on lines 150 to 159
tempdir: Optional[str] = None

def __init__(self, *, interval: int = 10, client: WorkspaceClient):
    super().__init__(interval=interval)
    self.dbfs_client = files.DbfsAPI(client.api_client)

@contextmanager
def setup(self) -> Iterator[ExtParams]:
    with dbfs_tempdir(self.dbfs_client) as tempdir:
        self.tempdir = tempdir
        yield {"path": tempdir}
Member:

Why does this need to be a property? Can it not be a local variable in setup?

smackesey (Collaborator, Author):

Sorry, the property was leftover from before; it is now passed down through params. Removed the property.

@smackesey smackesey force-pushed the sean/numbers-databricks branch from 684284d to 2654a60 on September 14, 2023 21:51
@smackesey (Collaborator, Author) commented:

> I'm not a fan of the implications of a setup method on the message reader. Do you anticipate needing that to modify state?

The setup method pre-existed this PR; I use it here to set up a tempdir on Databricks. It only exists for BlobStoreMessageReader, where it's needed as a hook for subclass-specific setup, since read_messages is already occupied.

With the modifications in the current state of the PR, it shouldn't need to modify state on the class, instead it yields params which get passed down.

> What is the plan for automated testing?

Add our Databricks test account secrets to Buildkite and remove the skip-on-BK pytest marks.

@smackesey smackesey requested a review from schrockn on September 14, 2023 21:56
@smackesey smackesey force-pushed the sean/numbers-databricks branch from 2654a60 to 2e5ab2e on September 14, 2023 22:09
@schrockn (Member) left a comment:

Requesting changes based on the error swallowing.

@abstractmethod
def get_params(self) -> ExtParams:
@contextmanager
def setup(self) -> Iterator[ExtParams]:
Member:

Please name it get_params or with_params or something like that.

smackesey (Collaborator, Author):

Renamed to get_params


while True:
    run = self.client.jobs.get_run(run_id)
    context.log.info(f"Run state: {run.state.life_cycle_state}")
Member:

Let's make this a more informative message. Someone who is just viewing the run but has not read the code should be able to understand what is going on.

f"Current run state of databricks run {run_id}: {run.state.life_cycle_state}"

Would be cool to render a URL that points to Databricks as well, but that is probably context-specific?

smackesey (Collaborator, Author):

Changed the message along the suggested lines.

I'm not sure how to get the URL at present, and I'm trying to get this in for the release this morning.

Comment on lines +164 to +170
except IOError:
    return None
Member:

just silently swallow the error?

At minimum we should warn or something.

smackesey (Collaborator, Author):

The error is swallowed because it's not an unexpected result. This is the polling mechanism for the presence of the next message chunk. If the chunk doesn't exist, it throws an IOError. I've added a comment explaining this.
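
For readers following along, a sketch of the pattern being described; the class, method, and DBFS client call shown here are illustrative rather than the exact code in this PR. The key point is that a missing chunk is the normal "no new messages yet" case, not a failure:

```python
import base64
import os
from typing import Mapping, Optional

from databricks.sdk.service import files


class DbfsChunkPollerSketch:  # illustrative stand-in, not the PR's reader class
    def __init__(self, dbfs_client: files.DbfsAPI):
        self.dbfs_client = dbfs_client

    def download_messages_chunk(self, index: int, params: Mapping[str, str]) -> Optional[str]:
        message_path = os.path.join(params["path"], f"{index}.json")
        try:
            # DbfsAPI.read returns base64-encoded file contents.
            raw_message = self.dbfs_client.read(message_path)
            return base64.b64decode(raw_message.data).decode("utf-8")
        except IOError:
            # Expected while polling: the next chunk has not been written yet,
            # so swallow the error and signal "nothing new" with None.
            return None
```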

@smackesey smackesey force-pushed the sean/numbers-databricks branch from 2e5ab2e to fba44c4 on September 15, 2023 12:20
@smackesey smackesey requested a review from schrockn on September 15, 2023 12:21
@schrockn (Member) left a comment:

Great. It is critical that you follow up quickly with real automated tests here.

unmounted_path = _assert_env_param_type(params, "path", str, self.__class__)
path = os.path.join("/dbfs", unmounted_path.lstrip("/"))
with open(path, "r") as f:
    data = json.load(f)
Member:

nit: can just directly yield json.load(f)

@smackesey smackesey force-pushed the sean/numbers-databricks branch from fba44c4 to 096980a on September 15, 2023 13:18
@smackesey smackesey merged commit 59bcc1f into master on Sep 15, 2023
@smackesey smackesey deleted the sean/numbers-databricks branch on September 15, 2023 13:50
smackesey added a commit that referenced this pull request Sep 15, 2023
zyd14 pushed a commit to zyd14/dagster that referenced this pull request Sep 15, 2023
f"dagster-pyspark{pin}",
"databricks-cli~=0.17", # TODO: Remove this dependency in the next minor release.
"databricks_api", # TODO: Remove this dependency in the next minor release.
"databricks-sdk<0.7", # Breaking changes occur in minor versions.
"databricks-sdk<0.9", # Breaking changes occur in minor versions.
Member:

X


def run(
    self,
    task: jobs.SubmitTask,
Member:

Y
