# [pipes] databricks unstructured log forwarding (#16674)
## Summary & Motivation

This adds stdout/stderr forwarding to the dagster-databricks pipes integration. It was a long road getting here, with several dead ends.

The approach in this PR is to extend `PipesBlobStoreMessageReader` with `forward_{stdout,stderr}` boolean params and corresponding hooks for downloading stdout/stderr chunks. If `forward_{stdout,stderr}` is enabled, a thread is launched per stream (alongside the message chunk thread) that periodically downloads stdout/stderr chunks and writes them to the corresponding streams of the orchestration process.

In `PipesDbfsMessageReader`, instead of using an incrementing counter (as is used for messages), the stdout/stderr chunk downloaders track a string offset into the file backing each stream. We repeatedly download the full file, and on each download we forward only the content starting at that offset (sketched in the first code block below). Repeatedly downloading the full file and applying the offset only on the orchestration end could surely be improved by downloading just the range past the offset, but I have not implemented that yet (there are some concerns about getting the indexing right, given that the files are stored as base64, I think with padding).

While other integrations may need changes on the pipes (external process) end, none were needed for Databricks: a DBFS location for `stdout`/`stderr` is configured when launching the job, so the orchestration process can do everything itself. This introduces a potential asymmetry between `PipesMessageReader` and `PipesMessageWriter`; we will probably end up with just a `PipesReader`.

There are definitely some other rough patches here:

- Databricks does not let you directly configure the directory where stdout/stderr are written. Instead you set a root directory, and logs are stored in `<root>/<cluster-id>/driver/{stdout,stderr}`. This is a problem because the cluster id does not exist until the job is launched (and you can't set it manually); since the message reader is set up before launch, it doesn't know where to look.
- I got around this by setting the log root to a temporary directory and polling that directory for the first child to appear, which is where the logs will be stored (see the second sketch below). This is not ideal, because users may want to retain the logs in DBFS.
- Another approach would be to send the cluster id back in the new `opened` message, but threading it into the message reader would require additional plumbing work.

For those who want to play with this, the workflow is to repeatedly run `dagster_databricks_tests/test_pipes.py::test_pipes_client`. This requires `DATABRICKS_HOST` and `DATABRICKS_TOKEN` to be set in your env. `DATABRICKS_HOST` should be `https://dbc-07902917-6487.cloud.databricks.com`; `DATABRICKS_TOKEN` should be set to a value you generate under User Settings > Developer > Access Tokens in the Databricks UI.

## How I Tested These Changes

Tested via `capsys` to make sure logs are forwarded.
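To make the chunk-forwarding mechanics concrete, here is a minimal sketch of the per-stream forwarder loop described above. This is not the actual dagster code: `download_full_log` is a hypothetical stand-in for the DBFS download (it reads a local file so the sketch runs anywhere), and the polling interval is an arbitrary choice.

```python
import sys
import threading
import time


def download_full_log(path: str) -> str:
    # Hypothetical stand-in for the DBFS read. The real reader downloads the
    # base64-encoded file through the DBFS API; a local file keeps the sketch
    # self-contained.
    try:
        with open(path, "r") as f:
            return f.read()
    except FileNotFoundError:
        return ""


def forward_stream(path, target, session_done, interval=10.0):
    # Track a string offset into the log file. Each tick re-downloads the
    # whole file and forwards only the tail past the offset -- the "download
    # everything, apply the offset on the orchestration end" strategy from
    # the summary above.
    offset = 0
    while True:
        contents = download_full_log(path)
        if len(contents) > offset:
            target.write(contents[offset:])
            target.flush()
            offset = len(contents)
        if session_done.is_set():
            break  # the final drain already happened just above
        time.sleep(interval)


if __name__ == "__main__":
    done = threading.Event()
    threads = [
        threading.Thread(target=forward_stream, args=(path, target, done), daemon=True)
        for path, target in [("driver/stdout", sys.stdout), ("driver/stderr", sys.stderr)]
    ]
    for t in threads:
        t.start()
    # ... launch and wait for the Databricks job here ...
    done.set()  # each thread drains one last time, then exits
    for t in threads:
        t.join()
```

Checking `session_done` only after a download guarantees one final drain after the job finishes, so trailing output is not dropped.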
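And here is a sketch of the temp-log-root workaround from the bullet list above. Local filesystem calls stand in for the DBFS list API, and the timeout and interval values are assumptions, not what the integration uses.

```python
import os
import time


def wait_for_log_dir(log_root: str, timeout: float = 120.0, interval: float = 5.0) -> str:
    # Databricks writes driver logs under <root>/<cluster-id>/driver/, and the
    # cluster id only exists once the job has launched. Pointing <root> at a
    # fresh temporary directory means its first child must be this cluster's
    # log directory, so we can simply wait for it to appear.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        children = os.listdir(log_root) if os.path.isdir(log_root) else []
        if children:
            return os.path.join(log_root, children[0], "driver")
        time.sleep(interval)
    raise TimeoutError(f"no cluster log directory appeared under {log_root}")
```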