Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: allow dispatching items to nested tables when specified parent_table_name equals resource.table_name #2106

Open
wants to merge 3 commits into
base: devel
Choose a base branch
from

Conversation

joscha
Copy link
Contributor

@joscha joscha commented Nov 28, 2024

Description

I am trying to dispatch items from one resource to different tables, but connect them via a _dlt_parent_id reference.

Expected

Two tables.

my_table:

_dlt_id id
xxx 1

other_table:

_dlt_id _dlt_parent_id id
yyy xxx 10

however instead the test case in this PR yields an exception:

E           dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1732801491.314115 with exception:
E
E           <class 'dlt.common.schema.exceptions.TablePropertiesConflictException'>
E           In schema: pipe_90aea10afd1f0c9f22f9154408b787ba: Cannot merge partial tables into table `my_other_table` due to property `resource` with different values: "my_table" != "my_table"

Where the error gets thrown, there is a comment:

# this should not really happen

:-D

And as you can see from the error message, the assertion is not quite right:

with different values: "my_table" != "my_table"

A few ideas:

  1. Am I holding it wrong? If yes, how can we prevent other people from doing the same?
  2. Should the error only be raised if the specified parent_table_name is ACTUALLY different from the table_name of the enclosing resource?

My feeling is 2., but I am not 100% sure. The test passes with my proposed change.

You can run the test via:

poetry run pytest -s tests/pipeline/test_pipeline.py -k test_mark_parent_table

There is a commented out chunk of code that does something similar actually and raises the same exception:

# def compare_tables(tab_a: TTableSchema, tab_b: TTableSchema) -> bool:
# try:
# table_name = tab_a["name"]
# if table_name != tab_b["name"]:
# raise TablePropertiesConflictException(table_name, "name", table_name, tab_b["name"])
# diff_table = diff_tables(tab_a, tab_b, ignore_table_name=False)
# # columns cannot differ
# return len(diff_table["columns"]) == 0
# except SchemaException:
# return False


On a side note, I did just notice that there is no _dlt_parent_id in my_other_table with this test case, it looks like this:

 {'my_other_table': [{'id': 10, '_dlt_load_id': '1732802504.14759', '_dlt_id': 'IUDf2Ch1OievNg'}]}

unsure where to pass the hint to create the parent child relationship.

Copy link

netlify bot commented Nov 28, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit 6fa2e06
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/675726100dc0aa00086c9bab
😎 Deploy Preview https://deploy-preview-2106--dlt-hub-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@joscha joscha force-pushed the joscha/mark-parent-table branch from 88d8192 to b8b5ecd Compare November 28, 2024 13:56
@joscha joscha changed the title test: add test with hint for parent table fix: allow dispatching items to nested tables when specified parent_table_name equals resource.table_name Nov 28, 2024
@joscha
Copy link
Contributor Author

joscha commented Dec 11, 2024

@sh-rp or @burnash can I get your 👀 on this, please? I think it might potentially be a simple fix to accept. Tests are green, no changes to current behavior.

@sh-rp
Copy link
Collaborator

sh-rp commented Dec 16, 2024

Hey @joscha, nested tables are currently completely managed by dlt which ensures that the references from child tables to parent tables always are set correctly. This relationship is needed for merges and replaces downstream. If we allow to manually override this, there would need to be some kind of consistency checks after the extraction is done I think. I'm not 100% sure, but my sense is that we will not allow / merge anything like this for the time being, but rather try to understand why you are doing this and maybe improve any other interfaces we have to allow for what your actual goal here is. Maybe you can explain why you want to set these relationships manually so I can understand the use case better.

@joscha
Copy link
Contributor Author

joscha commented Dec 16, 2024

Hey @joscha, nested tables are currently completely managed by dlt which ensures that the references from child tables to parent tables always are set correctly. This relationship is needed for merges and replaces downstream. If we allow to manually override this, there would need to be some kind of consistency checks after the extraction is done I think. I'm not 100% sure, but my sense is that we will not allow / merge anything like this for the time being, but rather try to understand why you are doing this and maybe improve any other interfaces we have to allow for what your actual goal here is. Maybe you can explain why you want to set these relationships manually so I can understand the use case better.

This pull request doesn't change behavior as far as I can tell, it only fixes a bug where when the parent table name is specified explicitly but matches the one that was determined anyway, it doesn't throw an error.

So basically when running the code:

    def my_table():
        yield dlt.mark.with_hints(
            {"id": 10},
            dlt.mark.make_hints(table_name="my_other_table"),
        )

it currently does this:

  1. dispatch from table my_table
  2. parent_table_name="my_table" is set for the new table my_other_table.

When running the code above with:

    def my_table():
        yield dlt.mark.with_hints(
            {"id": 10},
            dlt.mark.make_hints(table_name="my_other_table",parent_table_name="my_table"),
        )

it errors, even though parent_table_name is my_table as well. When the explicit value matches the implicitly calculated value, it shouldn't error. That's the fix in this PR.

Do I misunderstand?

@sh-rp
Copy link
Collaborator

sh-rp commented Dec 16, 2024

@joscha this bit of code prevents you from merging hints that link to different parent tables. You are first yielding items from your iterator that do not have a parent table and then items that do have a parent table. @rudolfix can give his opinion on this maybe also, but imho this is to prevent you from creating your own child tables because the behavior is possibly not defined.

@joscha
Copy link
Contributor Author

joscha commented Dec 16, 2024

this bit of code prevents you from merging hints that link to different parent tables.

I understand that generically we don't want to allow any parent definitions, the change in this PR, however, only allows a parent table that is the original table anyway.

You are first yielding items from your iterator that do not have a parent table and then items that do have a parent table.

Yes. Maybe I am missing some background info here, but the yield order shouldn't make a difference to the behavior though, should it? Given we're in an async context we can't guarantee different yields being processed in the order they were yielded anyway?

prevent you from creating your own child tables

if that's the case, does the parent_table_name parameter need to be removed from

parent_table_name: TTableHintTemplate[str] = None,
altogether?

@sh-rp
Copy link
Collaborator

sh-rp commented Dec 16, 2024

@joscha the change in this PR allows you to mess with the parent_table_relationships which may lead to undefined behavior in dlt, like I was saying, for now we assume the parent child relationships to be handled by dlt internally. We may change this, but this will require many more tests and probably also better documentation. But this again is for @rudolfix to decide probably. We could add a second make_hints function for external use without this argument, you are right about that. But anyway I'd still be interested in your usecase.

@joscha
Copy link
Contributor Author

joscha commented Dec 16, 2024

for now we assume the parent child relationships to be handled by dlt internally

Sorry, I need to back pedal for a second, I am not following completely.

  1. parent_table_name is a public API parameter. It's documented here: https://dlthub.com/docs/api_reference/extract/hints - is this correct, or incorrect?
  2. When I make hints from a resource that usually yields to a, but I yield to a new table b, then dlt implicitly sets parent_table_name to a. If I explicitly set parent_table_name in exactly this case to a, the code errors. Is this expected and if yes, why?

@sh-rp
Copy link
Collaborator

sh-rp commented Dec 16, 2024

@joscha if you yield an item with a different table_name, then there will not be a parent_table_name set implicitely. parent_table is only used for nested tables that are normalized in the normalizer stage. If you have a parent_table set, you probably still have schema hints from a previous run of the same pipeline name or something like that.

    @dlt.resource()
    def my_table():
        yield {"id": 4}
        yield dlt.mark.with_hints(
            {"id": 10},
            dlt.mark.make_hints(table_name="my_other_table"),
        )
        
    p = dlt.pipeline("my_pipeline", destination="duckdb", dev_mode=True)
    p.run(my_table)

    print(p.default_schema.to_pretty_yaml())

Also make_hints is used internally and can be used externally. That's why I said above we should probably use a different signature for the one we expose as long as we do not support manual linking of child and parents tables.

@joscha
Copy link
Contributor Author

joscha commented Dec 16, 2024

But anyway I'd still be interested in your usecase.

I have a source that multi-nested data structures with relations to each other; Example:

company -> founders -> List[person]
                 -> c-level -> List[person]
meeting -> attendees (List[person])
company -> -> meetings -> List[meeting]
companies -> List[company] 

notes -> List[note] (belonging to one or more persons, companies, meetings)
persons -> List[person]
field -> *  (dynamic) each field has a data type, one one of them is a dropdown type for example, which can have 1 to many options of varous types.
lists -> List[field] referencing various entities, These options do belong to the respective `list` table.

For the fields, the data type is dynamic, e.g. could be List[company] or List[person] or List[str], see definition here (file is collapsed, v2.py)

The API returns a multitude of different entities at each level. For a person object, no matter how deeply it is nested, I want to always reference the persons table (every person has a unique identifier), so no matter where the entity is found, the parent table is always persons.
For the notes for example, the parent table depends on the origin of the note. If I were to load all notes for each entity, then the load would become too slow. E.g. if I have 100,000 notes, they are distributed across companies and persons for example. If I have 100,000 companies and 100,000 persons I'd need to query the API 200,000 times to get the respective results with the correct parent table. Therefore I need to use the /notes endpoint that gives me all notes (independent of the parent). When I yield these notes, I then would love to say, parent table of this note is X, but of this other note it is Y (because I know after I fetched the note, but I can't get to the data in a batched way via the persons or companies resource easily.

@joscha
Copy link
Contributor Author

joscha commented Dec 16, 2024

If you have a parent_table set, you probably still have schema hints from a previous run of the same pipeline name or something like that.

That rings a bell - possibly, most likely due to #2109 then.

Also make_hints is used internally and can be used externally. That's why I said above we should probably use a different signature for the one we expose as long as we do not support manual linking of child and parents tables.

OK. Would love to fix this up then. It is really confusing to read a public API but not being able to use the options. From https://dlthub.com/docs/api_reference/extract/hints is parent_table_name the only internal one or are there more we don't expect people to use?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants