Bug: BigQuery Repeated field should be ARRAY[<data_type>] in model yml #181

Open
Thrasi opened this issue Nov 8, 2024 · 0 comments
Labels
bug Something isn't working

Thrasi commented Nov 8, 2024

Given the model testModel, defined by the following yml and SQL:

version: 2
models:
  - name: testModel
    config:
      materialized: table
      contract:
        enforced: false

SELECT
  [ STRUCT( 1 AS nested_field_1,
    "a" AS nested_field_2),
  STRUCT( 2 AS nested_field_1,
    "b" AS nested_field_2),
  STRUCT( 3 AS nested_field_1,
    "c" AS nested_field_2) ] AS repeated_record,
    [1,2,3] AS repeated_int

Running

dbt-osmosis yaml refactor testModel

produces the configuration:

version: 2
models:
  - name: testModel
    config:
      materialized: table
      contract:
        enforced: false
    columns:
      - name: repeated_record
        description: ''
        data_type: RECORD
      - name: repeated_record.nested_field_1
        description: ''
        data_type: INT64
      - name: repeated_record.nested_field_2
        description: ''
        data_type: STRING
      - name: repeated_int
        description: ''
        data_type: INT64

Setting config.contract.enforced: true and rerunning dbt gives us a contract mismatch:

Column 'repeated_record' has type STRUCT<nested_field_1 INT64, nested_field_2 STRING> which cannot be coerced from query output type ARRAY<STRUCT<nested_field_1 INT64, nested_field_2 STRING>> at [9:5]

This does not match the BigQuery schema. The repeated_record column should have data_type: ARRAY instead of data_type: RECORD.
For BigQuery columns, we need to take the column's mode (REQUIRED, NULLABLE, or REPEATED) into account when specifying the data_type in the yml.

For basic data types such as repeated_int, the correct value is data_type: ARRAY<INT64>.
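
To illustrate the expected behavior, here is a minimal sketch of how a column's mode could be folded into its data_type. The `Field` class and `render_data_type` function are hypothetical names made up for this example; they are not part of dbt-osmosis or any BigQuery client library, and the actual fix may look quite different.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """Hypothetical stand-in for a BigQuery schema field (not a real library class)."""
    name: str
    field_type: str          # e.g. "INT64", "STRING", "RECORD"
    mode: str = "NULLABLE"   # "REQUIRED", "NULLABLE", or "REPEATED"
    fields: List["Field"] = field(default_factory=list)


def render_data_type(f: Field) -> str:
    """Build the data_type string for a column, taking its mode into account."""
    if f.field_type in ("RECORD", "STRUCT"):
        # Records are rendered with their nested fields spelled out.
        inner = ", ".join(f"{sub.name} {render_data_type(sub)}" for sub in f.fields)
        base = f"STRUCT<{inner}>"
    else:
        base = f.field_type
    # REPEATED columns are arrays of the base type; REQUIRED and NULLABLE
    # leave the type itself unchanged.
    return f"ARRAY<{base}>" if f.mode == "REPEATED" else base


# The two columns from the example model above.
repeated_int = Field("repeated_int", "INT64", mode="REPEATED")
repeated_record = Field(
    "repeated_record",
    "RECORD",
    mode="REPEATED",
    fields=[Field("nested_field_1", "INT64"), Field("nested_field_2", "STRING")],
)

print(render_data_type(repeated_int))     # ARRAY<INT64>
print(render_data_type(repeated_record))  # ARRAY<STRUCT<nested_field_1 INT64, nested_field_2 STRING>>
```

With this approach the rendered type for repeated_record matches the query output type reported in the contract error above.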

z3z1ma self-assigned this Nov 15, 2024
z3z1ma added the bug and enhancement labels and removed the enhancement label Nov 15, 2024