
[Feature request] parse SQL to generate list of columns #152

Open
olhmr opened this issue May 29, 2024 · 3 comments

@olhmr

olhmr commented May 29, 2024

In a similar vein to #57, I'm looking for ways to run this without needing to connect to the data warehouse.

My understanding of the current implementation is that we either pull columns from the warehouse, or rely on them being manually generated ahead of time. What I would like to see, additionally, is an option to parse the compiled SQL, derive the names of the columns output by the model, and scaffold those into the yaml files - with appropriate upstream descriptions if applicable.

Presumably this would be added somewhere here, and controlled through a CLI flag. We can require a manifest, similar to the catalog option, to avoid parsing dbt during runtime.

This would enable fully self-contained generation and propagation of documentation, without the need for large manual data entry or connections to external systems.
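As a rough illustration of the scaffolding step described above (hypothetical function and field names - this is not dbt-osmosis's actual API), deriving column names and shaping them into the structure a schema yml file serializes might look like:

```python
# Hypothetical sketch of the scaffolding step: given column names derived
# from a model's compiled SQL, build the dict a schema yml file would
# serialize to. The structure follows dbt's schema file format; the
# function name and parameters are illustrative, not dbt-osmosis's API.
def scaffold_model_entry(model_name, columns, upstream_descriptions=None):
    upstream_descriptions = upstream_descriptions or {}
    return {
        "name": model_name,
        "columns": [
            {
                "name": col,
                # Propagate an upstream description when one exists.
                "description": upstream_descriptions.get(col, ""),
            }
            for col in columns
        ],
    }

entry = scaffold_model_entry(
    "stg_orders",
    ["order_id", "order_amount"],
    upstream_descriptions={"order_id": "Primary key of orders."},
)
print(entry["columns"][0])
# expected: {'name': 'order_id', 'description': 'Primary key of orders.'}
```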

I'm happy to take a stab at this if you think it makes sense to include?

@z3z1ma
Owner

z3z1ma commented Jun 12, 2024

That's a cool idea. Implementation-wise, though, it's not an easy task: it's fraught with edge cases - star expansions, lexical scoping, etc. Not to mention that dbt model SQL is not valid SQL on its own, since it contains Jinja markup, so we cannot generate an AST from the raw source. The best bet for making this work would be to use the compiled SQL. Compilation can be done without a connection to the DWH, I think? Unless there are introspective queries, of course. So maybe we start there and do a best effort?

@kokorin
Contributor

kokorin commented Jun 15, 2024

There is a built-in dbt macro, get_columns_in_query.

But I think the main problem would be dealing with model dependencies. Say we have two models, A and B: A is select 1 as id and B is select * from {{ ref('A') }}. A has to be run before the columns and types of B can be deduced using get_columns_in_query, so after every change to A we have to run it to update model B's columns.

The only alternative to that is tweaking the materialization of every upstream model to ephemeral and then using get_columns_in_query.

I think the only reliable approach is to use the new dbt 1.8 --empty flag to build changed models and all their downstream models, and then run dbt-osmosis.

@z3z1ma
Owner

z3z1ma commented Jun 18, 2024

> In a similar vein to #57, I'm looking for ways to run this without needing to connect to the data warehouse.

get_columns_in_query defeats the purpose, since it needs to connect to the DWH - and we basically use the same mechanism under the hood anyway.
