resource_state serialize/deserialize types #2156

Open
zilto opened this issue Dec 16, 2024 · 1 comment
Labels
bug Something isn't working question Further information is requested

Comments

@zilto
Collaborator

zilto commented Dec 16, 2024

dlt version

1.4.1

Describe the problem

I am using dlt.current.resource_state() to store a dictionary of processed entities. On the first pipeline execution, I store a dict[int, int] mapping. On the second execution, the deserialized state comes back as a dict[str, int] mapping. This breaks key lookups on the second execution, because an int key such as 1 no longer matches the stored key "1". The problem seems specific to keys, since the value types are preserved.

Reading the documentation section about supported types, this seems to be unintended behavior. It could be listed as a caveat in the meantime.

Expected behavior

The serialization / deserialization of the pipeline state should properly preserve types.

Steps to reproduce

Running the following resource twice shows the keys of seen going from int to str after deserialization.

import dlt

@dlt.resource(standalone=True)
def mock_resource():
    seen = dlt.current.resource_state().setdefault("seen", {})
    print("before: ", seen)
    for i in range(2):
        if i not in seen:
            seen[i] = i  # dict[int, int]
        yield {"id": i, "name": f"name_{i}"}

    print("after: ", seen)

Execution 1

pipe = dlt.pipeline(
    pipeline_name="mock", destination="duckdb", dev_mode=True
)
pipe.drop()

dlt.run(mock_resource())
# before:  {}
# after:  {0: 0, 1: 1}

Execution 2: notice that both "1" and 1 now appear as keys

dlt.run(mock_resource())
# before:  {'0': 0, '1': 1}
# after:  {'0': 0, '1': 1, 0: 0, 1: 1}
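
The same key drift can be reproduced without dlt, assuming the state is persisted as JSON between runs (an assumption here; the report above does not confirm the storage format). This minimal sketch uses only the standard-library `json` module, and `run_once` is a hypothetical stand-in for the resource body:

```python
import json


def run_once(seen: dict) -> dict:
    """Hypothetical stand-in for the resource body: records ids 0 and 1."""
    for i in range(2):
        if i not in seen:
            seen[i] = i
    return seen


# Execution 1: start from an empty state, then persist it as JSON.
state_after_run_1 = run_once({})
persisted = json.dumps(state_after_run_1)

# Execution 2: the restored keys are strings, so the int lookups miss
# and the int keys are added a second time.
state_after_run_2 = run_once(json.loads(persisted))

print(state_after_run_1)  # {0: 0, 1: 1}
print(state_after_run_2)  # {'0': 0, '1': 1, 0: 0, 1: 1}
```

The values survive the round trip as int; only the keys change type, which matches the output shown above.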

Operating system

Linux

Runtime environment

Local

Python version

3.11

dlt data source

No response

dlt destination

DuckDB

Other deployment details

No response

Additional information

This could be backend-dependent.

@zilto zilto added the bug Something isn't working label Dec 16, 2024
@rudolfix
Collaborator

rudolfix commented Dec 17, 2024

@zilto AFAIK this is how the JSON standard defines objects: keys are always strings, so they come back as strings after deserialization. TL;DR: orjson does that. You could try switching to simplejson to see what happens:
https://dlthub.com/docs/reference/performance#use-built-in-json-parser
or hack dlt code and set this:
https://github.com/ijl/orjson?tab=readme-ov-file#migrating

also feel free to add a note on this behavior to docs :)
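
One possible workaround in the meantime, sketched under the assumption that the state round-trips through JSON: use string keys consistently, so lookups behave the same before and after deserialization. `mark_seen` is a hypothetical helper, not part of dlt, and the `json` round trip below only simulates the state being saved and restored.

```python
import json


def mark_seen(seen: dict, entity_id: int) -> bool:
    """Record entity_id under a string key; return True if it was new."""
    key = str(entity_id)  # JSON object keys are strings, so store them that way
    if key in seen:
        return False
    seen[key] = entity_id
    return True


seen: dict = {}
first_run = [mark_seen(seen, i) for i in range(2)]

# Simulate persisting and restoring the state between runs.
seen = json.loads(json.dumps(seen))
second_run = [mark_seen(seen, i) for i in range(2)]

print(first_run)   # [True, True]
print(second_run)  # [False, False]
print(seen)        # {'0': 0, '1': 1}
```

Inside a resource, the same idea would amount to checking `str(i) in seen` and writing `seen[str(i)] = i`.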

@rudolfix rudolfix moved this from Todo to Planned in dlt core library Dec 17, 2024
@rudolfix rudolfix added the question Further information is requested label Dec 17, 2024
Projects
Status: Planned