[Feature Request] Determine osmosis_progenitor accurately at the same distance #172

Open
yamamoto-yuta opened this issue Jul 15, 2024 · 1 comment

Comments

@yamamoto-yuta

Overview

When using the --add-progenitor-to-meta option, the resulting osmosis_progenitor: values can sometimes be incorrect.

Reproduction Steps

I am still investigating exactly which scenarios trigger this, but one example is provided below.

Consider the following set of models in a lineage graph (the letters A to C next to fct_item_shops denote the order of JOINs).

classDiagram
    class raw_shops {
        shop_id
        item_key
    }

    class raw_items {
        item_key
        item_code
    }

    class raw_item_shops {
        item_code
        shop_id
    }

    class stg_shops {
        shop_id
        item_key
    }

    class stg_items {
        item_key
        item_code
    }

    class stg_item_shops {
        item_code
        shop_id
    }

    class fct_item_shops {
        C.item_key
        A.item_code
        A.shop_id
    }

    %% 

    raw_shops --> stg_shops
    raw_items --> stg_items
    raw_item_shops --> stg_item_shops
    stg_shops --> fct_item_shops: LEFT JOIN shops B <br /> USING (shop_id) 
    stg_items --> fct_item_shops: LEFT JOIN items C  <br /> USING (item_code)
    stg_item_shops --> fct_item_shops: shop_items A

You can view the actual code in the following repository:

https://github.com/yamamoto-yuta/dbt-osmosis-inheritance-check

In this case, the propagation sources for each column in fct_item_shops should be as follows:

  • item_key ... raw_items
  • item_code ... raw_item_shops
  • shop_id ... raw_item_shops

However, the actual result is as follows, where the source of item_code is incorrectly identified as raw_items instead of raw_item_shops.

  • item_key ... raw_items
  • item_code ... raw_items ← Should be raw_item_shops
  • shop_id ... raw_item_shops
dbt-osmosis Execution Result
version: 2
models:
  - name: stg_shops
    columns:
      - name: shop_id
        description: ''
        data_type: INT64
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_shops
      - name: item_key
        description: ''
        data_type: STRING
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_shops
  - name: stg_item_shops
    columns:
      - name: item_code
        description: ''
        data_type: STRING
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_item_shops
      - name: shop_id
        description: ''
        data_type: INT64
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_item_shops
  - name: stg_items
    columns:
      - name: item_key
        description: ''
        data_type: STRING
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_items
      - name: item_code
        description: ''
        data_type: STRING
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_items
  - name: fct_item_shops
    columns:
      - name: item_key
        description: ''
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_items
        data_type: STRING
      - name: item_code
        description: ''
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_items
        data_type: STRING
      - name: shop_id
        description: ''
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_shops
        data_type: INT64

Execution Environment

❯ dbt-osmosis --version
dbt-osmosis, version 0.13.2
❯ dbt --version
Core:
  - installed: 1.8.3
  - latest:    1.8.3 - Up-to-date!

Plugins:
  - bigquery: 1.8.2 - Up-to-date!
@yamamoto-yuta
Author

I was informed by @syou6162 that this is not a bug but rather an intended behavior.

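dbt-osmosis determines the propagation source based on the distance between nodes in the lineage graph. When two candidate ancestors sit at the same distance, it cannot tell which one a column actually came from. The ancestor tree is built by the following function: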

def _build_node_ancestor_tree(
    manifest: ManifestNode,
    node: ManifestNode,
    family_tree: Optional[Dict[str, List[str]]] = None,
    members_found: Optional[List[str]] = None,
    depth: int = 0,
) -> Dict[str, List[str]]:
    """Recursively build dictionary of parents in generational order"""
    if family_tree is None:
        family_tree = {}
    if members_found is None:
        members_found = []
    if not hasattr(node, "depends_on"):
        return family_tree
    for parent in getattr(node.depends_on, "nodes", []):
        member = manifest.nodes.get(parent, manifest.sources.get(parent))
        if member and parent not in members_found:
            family_tree.setdefault(f"generation_{depth}", []).append(parent)
            members_found.append(parent)
            # Recursion
            family_tree = _build_node_ancestor_tree(
                manifest, member, family_tree, members_found, depth + 1
            )
    return family_tree
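
To make the tie concrete, here is a minimal sketch that applies the same generation-by-depth grouping to the example lineage. It is not dbt-osmosis code: the `PARENTS` and `COLUMNS` dictionaries are hypothetical stand-ins for the manifest, the order of parents is an assumption, and `candidates_by_generation` is a helper invented for illustration. It only shows which ancestors carry a given column name in each generation, not how dbt-osmosis breaks the tie.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the dbt manifest: parent edges only.
# The order of parents here is an assumption, not taken from a real manifest.
PARENTS: Dict[str, List[str]] = {
    "fct_item_shops": ["stg_item_shops", "stg_shops", "stg_items"],
    "stg_shops": ["raw_shops"],
    "stg_items": ["raw_items"],
    "stg_item_shops": ["raw_item_shops"],
}

# Column names per node, copied from the class diagram in this issue.
COLUMNS: Dict[str, List[str]] = {
    "raw_shops": ["shop_id", "item_key"],
    "raw_items": ["item_key", "item_code"],
    "raw_item_shops": ["item_code", "shop_id"],
    "stg_shops": ["shop_id", "item_key"],
    "stg_items": ["item_key", "item_code"],
    "stg_item_shops": ["item_code", "shop_id"],
}


def build_ancestor_tree(
    node: str,
    family_tree: Optional[Dict[str, List[str]]] = None,
    members_found: Optional[List[str]] = None,
    depth: int = 0,
) -> Dict[str, List[str]]:
    """Group ancestors by distance from `node`, mirroring the function above."""
    if family_tree is None:
        family_tree = {}
    if members_found is None:
        members_found = []
    for parent in PARENTS.get(node, []):
        if parent not in members_found:
            family_tree.setdefault(f"generation_{depth}", []).append(parent)
            members_found.append(parent)
            build_ancestor_tree(parent, family_tree, members_found, depth + 1)
    return family_tree


def candidates_by_generation(node: str, column: str) -> Dict[str, List[str]]:
    """Hypothetical helper: ancestors in each generation that have `column`."""
    tree = build_ancestor_tree(node)
    return {
        generation: sorted(m for m in members if column in COLUMNS[m])
        for generation, members in tree.items()
    }


tree = build_ancestor_tree("fct_item_shops")
ties = candidates_by_generation("fct_item_shops", "item_code")
print(tree)
print(ties)
```

In this sketch, both raw_items and raw_item_shops appear in generation_1 and both have an item_code column, so distance alone gives no basis for preferring one over the other. Resolving the tie correctly would need information beyond the graph depth, such as which upstream relation the column is actually selected from in the model's SQL.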

Based on that, I have changed the issue title from a bug to a feature request.

@yamamoto-yuta yamamoto-yuta changed the title Bug: Incorrect osmosis_progenitor: Output When Using --add-progenitor-to-meta Option [Feature Request] Determine osmosis_progenitor accurately at the same distance Jul 17, 2024