
[ext] databricks EXT integration #15955

Merged: 2 commits into master on Sep 15, 2023

Conversation

@smackesey (Collaborator) commented Aug 21, 2023

Summary & Motivation

  • Add dagster-ext integration for Databricks. This is designed to interfere as little as possible with the official Databricks SDK: you pass Databricks SDK data structures to `ExtDatabricks`, and the only modification it makes is injecting the environment variables needed for EXT communication. The rest of the cluster configuration is left to the user. It is separate from the rest of the Databricks integration. (See the sketch after this list.)
  • Add example usage to dagster_databricks/README.md
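
A minimal sketch of the intended usage, pieced together from the summary above and the README example discussed later in this thread. The `ExtDatabricks` constructor argument (`client=`) and the `context=` kwarg to `run` are assumptions; the Databricks SDK calls are its standard public API, and in practice `ExtDatabricks` would be wired up as a Dagster resource rather than constructed inline:

```python
# Sketch only: the ExtDatabricks constructor/kwarg names are assumptions.
from dagster import OpExecutionContext, asset
from dagster_databricks import ExtDatabricks
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs


@asset
def databricks_asset(context: OpExecutionContext):
    # A plain Databricks SDK task spec: cluster configuration stays entirely
    # with the user; ExtDatabricks only injects the EXT env vars before
    # submitting the task and polling it to completion.
    task = jobs.SubmitTask(
        task_key="dagster-ext-example",
        new_cluster=compute.ClusterSpec(
            spark_version="12.2.x-scala2.12",
            node_type_id="i3.xlarge",
            num_workers=1,
        ),
        spark_python_task=jobs.SparkPythonTask(python_file="dbfs:/my_script.py"),
    )
    ExtDatabricks(client=WorkspaceClient()).run(task=task, context=context)
```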

How I Tested These Changes

New unit tests (though they are currently skipped on Buildkite)


@github-actions bot commented Aug 22, 2023

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-5lfkrj9h8-elementl.vercel.app
https://sean-numbers-databricks.core-storybook.dagster-docs.io

Built with commit fba44c4.
This pull request is being automatically deployed with vercel-action

Base automatically changed from sean/dagster-databricks-types to master on August 22, 2023 15:03
@smackesey smackesey force-pushed the sean/numbers-databricks branch from 2f23d4b to 353da68 on August 22, 2023 15:05
@smackesey smackesey changed the title from "[externals] Update numbers example for databricks and dbfs" to "[externals] databricks adapter" on Aug 22, 2023
@smackesey smackesey changed the base branch from master to sean/json-schema on August 24, 2023 12:21
@smackesey smackesey force-pushed the sean/numbers-databricks branch from 353da68 to 6062d8f on August 24, 2023 12:21
@smackesey smackesey force-pushed the sean/numbers-databricks branch 2 times, most recently from a47c7ed to f1dde17 on August 24, 2023 18:59
@smackesey smackesey changed the base branch from sean/json-schema to sean/externals-io-refactor on August 25, 2023 13:52
@smackesey smackesey force-pushed the sean/numbers-databricks branch from f1dde17 to cbd192d on August 25, 2023 13:52
@smackesey smackesey force-pushed the sean/numbers-databricks branch 2 times, most recently from 6af3f7e to 59bff80 on September 13, 2023 22:08
Comment on lines +414 to +424
class ExtBufferedFilesystemMessageWriterChannel(ExtBlobStoreMessageWriterChannel):
    def __init__(self, path: str, *, interval: float = 10):
        super().__init__(interval=interval)
        self._path = path

    def upload_messages_chunk(self, payload: IO, index: int) -> None:
        message_path = os.path.join(self._path, f"{index}.json")
        with open(message_path, "w") as f:
            f.write(payload.read())


Member:

DBFS exposes a traditional I/O interface in Python?

smackesey (Collaborator, Author):

Yes, it is mounted at /dbfs.
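
For readers unfamiliar with this detail: DBFS is FUSE-mounted at /dbfs on Databricks cluster nodes, so ordinary Python file I/O works against it, which is what the writer channel above relies on. A quick illustration (the path is arbitrary):

```python
import os

# DBFS is FUSE-mounted at /dbfs on Databricks clusters, so plain file I/O works.
# The path below is arbitrary and only for illustration.
os.makedirs("/dbfs/tmp/ext-example", exist_ok=True)
with open("/dbfs/tmp/ext-example/0.json", "w") as f:
    f.write('{"hello": "world"}')

with open("/dbfs/tmp/ext-example/0.json") as f:
    print(f.read())
```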

python_modules/libraries/dagster-databricks/README.md (outdated review thread, resolved)
Comment on lines 84 to 118
while True:
    run = self.client.jobs.get_run(run_id)
    if run.state.life_cycle_state in (
        jobs.RunLifeCycleState.TERMINATED,
        jobs.RunLifeCycleState.SKIPPED,
    ):
        if run.state.result_state == jobs.RunResultState.SUCCESS:
            return
        else:
            raise DagsterExternalExecutionError(
                f"Error running Databricks job: {run.state.state_message}"
            )
    elif run.state.life_cycle_state == jobs.RunLifeCycleState.INTERNAL_ERROR:
        raise DagsterExternalExecutionError(
            f"Error running Databricks job: {run.state.state_message}"
        )
    time.sleep(5)
Member:

I think we should log on every tick so the user is confident something is happening

Comment on lines 6 to 21
## EXT Example

This package includes a prototype API for launching databricks jobs with
Dagster's EXT protocol. There are two ways to use the API:

### (1) `ExtDatabricks` resource

The `ExtDatabricks` resource provides a high-level API for launching
databricks jobs using Dagster's EXT protocol.

It takes a single `databricks.sdk.service.jobs.SubmitTask` specification. After
setting up EXT communications channels (which by default use DBFS), it injects
the information needed to connect to these channels from Databricks into the
task specification. It then launches a Databricks job by passing the
specification to `WorkspaceClient.jobs.submit`. It polls the job state and
exits gracefully on success or failure:
Member:

I think we should lowercase "ext". It's not principled, but it looks cooler.

Member:

I might be wrong/misguided on this.

smackesey (Collaborator, Author):

let's discuss in standup

smackesey (Collaborator, Author):

We didn't end up discussing it, but I changed it. Personally, I'm ambivalent.

python_modules/libraries/dagster-databricks/README.md (two outdated review threads, resolved)
Comment on lines 51 to 52
context_injector: Optional[ExtContextInjector] = None,
message_reader: Optional[ExtMessageReader] = None,
Member:

rebase on my PR

with dbfs_tempdir(self.dbfs_client) as tempdir:
    self.tempdir = tempdir
    yield

Member:

i thought we were doing this sort of mutable state business

smackesey (Collaborator, Author):

It was necessary given the existing setup of ExtBlobStoreMessageReader. Updated with some changes to make it unnecessary.

dbfs_client.delete(tempdir, recursive=True)


class ExtDbfsContextInjector(ExtContextInjector):
Member:

It's probably not the most consequential thing, but I'm curious what the motivations would be for using this instead of ExtEnvContextInjector. Is it anything beyond concerns over env var size limits? Does Databricks call those limits out explicitly?

smackesey (Collaborator, Author):

In our meeting with Enigma, their engineer said that they frequently run into size limits when passing their "context" (in our ontology, extras) over the CLI, expected the same would happen with env vars, and explicitly suggested a DBFS-based mechanism.

@smackesey smackesey force-pushed the sean/numbers-databricks branch 7 times, most recently from 066249e to 684284d on September 14, 2023 21:13
@schrockn (Member) left a comment:

  • I'm not a fan of the implications of a setup method on the message reader. Do you anticipate needing that to modify state?
  • What is the plan for automated testing?

Comment on lines 109 to 115
Internally, `ExtDatabricks` is using the `ext_protocol` context manager to set
up communications. If you have existing code to launch/poll the job you do not
want to change, or you just want more control than is permitted by
`ExtDatabricks`, you can use this lower level API directly. All that is
necessary is that (1) your Databricks job be launched within the scope of the
`ext_process` context manager; (2) your job is launched on a cluster containing
the environment variables available on the yielded `ext_context`.
Member:

I would drop "internally". It makes people think they shouldn't use it. This is a first-class supported API. Actually just drop the first line.

If you have existing code to launch/poll the job you do not
want to change, or you just want more control than is permitted by
ExtDatabricks, you can use ext_protocol.
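
A hedged sketch of the lower-level flow described in this passage. `ext_protocol` and the `context_injector=`/`message_reader=` parameters appear elsewhere in this PR; the import paths, the DBFS message reader class name, and the accessor for the env vars on the yielded context are guesses and may not match the final code:

```python
# Sketch only: import paths and several names below are assumptions, not the
# PR's actual API surface.
from dagster import OpExecutionContext, asset
from databricks.sdk import WorkspaceClient

# Hypothetical import locations, shown for completeness only:
# from dagster import ext_protocol
# from dagster_databricks import ExtDbfsContextInjector, ExtDbfsMessageReader


@asset
def existing_databricks_job(context: OpExecutionContext):
    client = WorkspaceClient()
    # (1) The Databricks job must be launched within this scope...
    with ext_protocol(
        context=context,
        context_injector=ExtDbfsContextInjector(client=client),
        message_reader=ExtDbfsMessageReader(client=client),  # hypothetical name
    ) as ext_context:
        # (2) ...on a cluster whose environment includes the env vars carried
        # by ext_context (the accessor name below is a guess).
        env_vars = ext_context.get_external_process_env_vars()
        launch_and_poll_my_job(client, extra_env=env_vars)  # user-defined helper
```

The point is only that existing launch/poll code stays untouched; `ext_protocol` supplies the context injection and message reading that `ExtDatabricks` would otherwise handle.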

Comment on lines 150 to 159
tempdir: Optional[str] = None

def __init__(self, *, interval: int = 10, client: WorkspaceClient):
    super().__init__(interval=interval)
    self.dbfs_client = files.DbfsAPI(client.api_client)

@contextmanager
def setup(self) -> Iterator[ExtParams]:
    with dbfs_tempdir(self.dbfs_client) as tempdir:
        self.tempdir = tempdir
        yield {"path": tempdir}
Member:

Why does this need to be a property? Can it not be a local variable in setup?

smackesey (Collaborator, Author):

Sorry, the property was leftover from before; it is now passed down through params. Removed the property.

@smackesey smackesey force-pushed the sean/numbers-databricks branch from 684284d to 2654a60 on September 14, 2023 21:51
@smackesey (Collaborator, Author) commented:

> I'm not a fan of the implications of a setup method on the message reader. Do you anticipate needing that to modify state?

The setup method pre-existed this PR; I use it here to set up a tempdir on Databricks. It only exists for BlobStoreMessageReader, where it's needed as a hook for subclass-specific setup, since read_messages is already occupied.

With the modifications in the current state of the PR, it shouldn't need to modify state on the class, instead it yields params which get passed down.

> What is the plan for automated testing?

Add our Databricks test account secrets to Buildkite and remove the skip-on-BK pytest marks.

@smackesey smackesey requested a review from schrockn on September 14, 2023 21:56
@smackesey smackesey force-pushed the sean/numbers-databricks branch from 2654a60 to 2e5ab2e on September 14, 2023 22:09
@schrockn (Member) left a comment:

Requesting changes based on the error swallowing.

@abstractmethod
def get_params(self) -> ExtParams:
@contextmanager
def setup(self) -> Iterator[ExtParams]:
Member:

Please name it get_params or with_params or something like that.

smackesey (Collaborator, Author):

Renamed to get_params


while True:
    run = self.client.jobs.get_run(run_id)
    context.log.info(f"Run state: {run.state.life_cycle_state}")
Member:

Let's make this a more informative message. Someone who is just viewing the run but has not read the code should be able to understand what is going on.

f"Current run state of databricks run {run_id}: {run.state.life_cycle_state}"

Would be cool to render a URL that points to Databricks as well, but that is probably context-specific?

smackesey (Collaborator, Author):

Changed the message along the suggested lines.

I'm not sure how to get the URL at present, and I'm trying to get this in for the release this morning.

Comment on lines +164 to +170
except IOError:
    return None
Member:

just silently swallow the error?

At minimum we should warn or something.

smackesey (Collaborator, Author):

The error is swallowed because it's not an unexpected result. This is the polling mechanism for the presence of the next message chunk. If the chunk doesn't exist, it throws an IOError. I've added a comment explaining this.
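
For readers following along, a sketch of the pattern being described; the class, method, and DBFS client call shown here are illustrative rather than the exact code in this PR. The key point is that a missing chunk is the normal "no new messages yet" case, not a failure:

```python
import base64
import os
from typing import Mapping, Optional

from databricks.sdk.service import files


class DbfsChunkPollerSketch:  # illustrative stand-in, not the PR's reader class
    def __init__(self, dbfs_client: files.DbfsAPI):
        self.dbfs_client = dbfs_client

    def download_messages_chunk(self, index: int, params: Mapping[str, str]) -> Optional[str]:
        message_path = os.path.join(params["path"], f"{index}.json")
        try:
            # DbfsAPI.read returns base64-encoded file contents.
            raw_message = self.dbfs_client.read(message_path)
            return base64.b64decode(raw_message.data).decode("utf-8")
        except IOError:
            # Expected while polling: the next chunk has not been written yet,
            # so swallow the error and signal "nothing new" with None.
            return None
```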

@smackesey smackesey force-pushed the sean/numbers-databricks branch from 2e5ab2e to fba44c4 on September 15, 2023 12:20
@smackesey smackesey requested a review from schrockn on September 15, 2023 12:21
@schrockn (Member) left a comment:

Great. It is critical that you follow up quickly with real automated tests here.

unmounted_path = _assert_env_param_type(params, "path", str, self.__class__)
path = os.path.join("/dbfs", unmounted_path.lstrip("/"))
with open(path, "r") as f:
    data = json.load(f)
Member:

nit: can just directly yield json.load(f)

@smackesey smackesey force-pushed the sean/numbers-databricks branch from fba44c4 to 096980a on September 15, 2023 13:18
@smackesey smackesey merged commit 59bcc1f into master on Sep 15, 2023
@smackesey smackesey deleted the sean/numbers-databricks branch on September 15, 2023 13:50
smackesey added a commit that referenced this pull request Sep 15, 2023
zyd14 pushed a commit to zyd14/dagster that referenced this pull request Sep 15, 2023
f"dagster-pyspark{pin}",
"databricks-cli~=0.17", # TODO: Remove this dependency in the next minor release.
"databricks_api", # TODO: Remove this dependency in the next minor release.
"databricks-sdk<0.7", # Breaking changes occur in minor versions.
"databricks-sdk<0.9", # Breaking changes occur in minor versions.
Member:

X


def run(
    self,
    task: jobs.SubmitTask,
Member:

Y
