-
Notifications
You must be signed in to change notification settings - Fork 184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: allow dispatching items to nested tables when specified parent_table_name
equals resource.table_name
#2106
base: devel
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for dlt-hub-docs ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
88d8192
to
b8b5ecd
Compare
parent_table_name
equals resource.table_name
Hey @joscha, nested tables are currently completely managed by dlt which ensures that the references from child tables to parent tables always are set correctly. This relationship is needed for merges and replaces downstream. If we allow to manually override this, there would need to be some kind of consistency checks after the extraction is done I think. I'm not 100% sure, but my sense is that we will not allow / merge anything like this for the time being, but rather try to understand why you are doing this and maybe improve any other interfaces we have to allow for what your actual goal here is. Maybe you can explain why you want to set these relationships manually so I can understand the use case better. |
This pull request doesn't change behavior as far as I can tell, it only fixes a bug where when the parent table name is specified explicitly but matches the one that was determined anyway, it doesn't throw an error. So basically when running the code: def my_table():
yield dlt.mark.with_hints(
{"id": 10},
dlt.mark.make_hints(table_name="my_other_table"),
) it currently does this:
When running the code above with: def my_table():
yield dlt.mark.with_hints(
{"id": 10},
dlt.mark.make_hints(table_name="my_other_table",parent_table_name="my_table"),
) it errors, even though Do I misunderstand? |
@joscha this bit of code prevents you from merging hints that link to different parent tables. You are first yielding items from your iterator that do not have a parent table and then items that do have a parent table. @rudolfix can give his opinion on this maybe also, but imho this is to prevent you from creating your own child tables because the behavior is possibly not defined. |
I understand that generically we don't want to allow any parent definitions, the change in this PR, however, only allows a parent table that is the original table anyway.
Yes. Maybe I am missing some background info here, but the yield order shouldn't make a difference to the behavior though, should it? Given we're in an async context we can't guarantee different yields being processed in the order they were yielded anyway?
if that's the case, does the Line 79 in b8bac75
|
@joscha the change in this PR allows you to mess with the parent_table_relationships which may lead to undefined behavior in dlt, like I was saying, for now we assume the parent child relationships to be handled by dlt internally. We may change this, but this will require many more tests and probably also better documentation. But this again is for @rudolfix to decide probably. We could add a second make_hints function for external use without this argument, you are right about that. But anyway I'd still be interested in your usecase. |
Sorry, I need to back pedal for a second, I am not following completely.
|
@joscha if you yield an item with a different table_name, then there will not be a parent_table_name set implicitely. parent_table is only used for nested tables that are normalized in the normalizer stage. If you have a parent_table set, you probably still have schema hints from a previous run of the same pipeline name or something like that.
Also make_hints is used internally and can be used externally. That's why I said above we should probably use a different signature for the one we expose as long as we do not support manual linking of child and parents tables. |
I have a source that multi-nested data structures with relations to each other; Example:
For the The API returns a multitude of different entities at each level. For a |
That rings a bell - possibly, most likely due to #2109 then.
OK. Would love to fix this up then. It is really confusing to read a public API but not being able to use the options. From https://dlthub.com/docs/api_reference/extract/hints is |
Description
I am trying to dispatch items from one resource to different tables, but connect them via a
_dlt_parent_id
reference.Expected
Two tables.
my_table
:other_table
:however instead the test case in this PR yields an exception:
Where the error gets thrown, there is a comment:
dlt/dlt/common/schema/utils.py
Line 513 in eefe77b
:-D
And as you can see from the error message, the assertion is not quite right:
A few ideas:
parent_table_name
is ACTUALLY different from thetable_name
of the enclosing resource?My feeling is 2., but I am not 100% sure. The test passes with my proposed change.
You can run the test via:
There is a commented out chunk of code that does something similar actually and raises the same exception:
dlt/dlt/common/schema/utils.py
Lines 522 to 531 in eefe77b
On a side note, I did just notice that there is no
_dlt_parent_id
inmy_other_table
with this test case, it looks like this:unsure where to pass the hint to create the parent child relationship.