
[Feature request] parse SQL to generate list of columns #152

Open
olhmr opened this issue May 29, 2024 · 3 comments

@olhmr

olhmr commented May 29, 2024

In a similar vein to #57, I'm looking for ways to run this without needing to connect to the data warehouse.

My understanding of the current implementation is that we either pull columns from the warehouse, or rely on them being manually generated ahead of time. What I would like to see, additionally, is an option to parse the compiled SQL, derive the names of the columns output by the model, and scaffold those into the yaml files - with appropriate upstream descriptions if applicable.

Presumably this would be added somewhere here, and controlled through a CLI flag. We can require a manifest, similar to the catalog option, to avoid parsing dbt during runtime.

This would enable fully self-contained generation and propagation of documentation, without the need for large manual data entry or connections to external systems.
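As a rough illustration of the scaffolding step described above (hypothetical function and field names - this is not dbt-osmosis's actual API), deriving column names and shaping them into the structure a schema yml file serializes might look like:

```python
# Hypothetical sketch of the scaffolding step: given column names derived
# from a model's compiled SQL, build the dict a schema yml file would
# serialize to. The structure follows dbt's schema file format; the
# function name and parameters are illustrative, not dbt-osmosis's API.
def scaffold_model_entry(model_name, columns, upstream_descriptions=None):
    upstream_descriptions = upstream_descriptions or {}
    return {
        "name": model_name,
        "columns": [
            {
                "name": col,
                # Propagate an upstream description when one exists.
                "description": upstream_descriptions.get(col, ""),
            }
            for col in columns
        ],
    }

entry = scaffold_model_entry(
    "stg_orders",
    ["order_id", "order_amount"],
    upstream_descriptions={"order_id": "Primary key of orders."},
)
print(entry["columns"][0])
# expected: {'name': 'order_id', 'description': 'Primary key of orders.'}
```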

I'm happy to take a stab at this if you think it makes sense to include?

@z3z1ma
Owner

z3z1ma commented Jun 12, 2024

That's a cool idea. Implementation-wise, though, it's not an easy task: it's fraught with edge cases - star expansions, lexical scoping, etc. Not to mention that dbt model SQL is not valid SQL on its own, since it contains Jinja markup, so we cannot generate an AST from the raw source. The best bet for making this work would be to use the compiled SQL. Compilation can be done without a connection to the DWH, I think? Unless there are introspective queries, of course. So maybe we start there and do a best effort?

@kokorin
Contributor

kokorin commented Jun 15, 2024

There is a built-in dbt macro, get_columns_in_query.

But I think the main problem would be dealing with model dependencies. Say we have two models, A and B: A is select 1 as id and B is select * from {{ ref('A') }}. A has to be run before the columns and types of B can be deduced using get_columns_in_query, so after every change to A we have to run it to update model B's columns.

The only alternative to that is tweaking the materialization of every upstream model to ephemeral and then using get_columns_in_query.

I think the only reliable approach is to use the new dbt 1.8 --empty flag to build changed models and all their downstream models, and then run dbt-osmosis.

@z3z1ma
Owner

z3z1ma commented Jun 18, 2024

> In a similar vein to #57, I'm looking for ways to run this without needing to connect to the data warehouse.

get_columns_in_query defeats the purpose, since it needs to connect to the DWH - and we basically use the same mechanism under the hood anyway.
