`resource_state` serialize/deserialize types #2156

zilto · 2024-12-16T23:41:36Z

dlt version

1.4.1

Describe the problem

I am using `dlt.current.resource_state()` to store a dictionary of processed entities. On the first pipeline execution, I store a dict[int, int] mapping. On the second pipeline execution, the deserialized state comes back as a dict[str, int] mapping. This breaks key membership checks on the second execution. The problem seems to be specific to the keys, since the type of the values is preserved.

Judging by the documentation section on supported types, this seems to be unintended behavior. It could be listed as a caveat in the meantime.

Expected behavior

The serialization / deserialization of the pipeline state should properly preserve types.

Steps to reproduce

Running the following resource twice shows the keys of `seen` changing from int to str after deserialization.

import dlt

@dlt.resource(standalone=True)
def mock_resource():
    # state persisted between pipeline runs; "seen" maps entity id -> entity id
    seen = dlt.current.resource_state().setdefault("seen", {})
    print("before: ", seen)
    for i in range(2):
        if i not in seen:
            seen[i] = i  # dict[int, int]
        yield {"id": i, "name": f"name_{i}"}

    print("after: ", seen)

Execution 1

pipe = dlt.pipeline(
    pipeline_name="mock", destination="duckdb", dev_mode=True
)
pipe.drop()

dlt.run(mock_resource())
# before:  {}
# after:  {0: 0, 1: 1}

Execution 2 (notice that both the keys "1" and 1 are present)

dlt.run(mock_resource())
# before:  {'0': 0, '1': 1}
# after:  {'0': 0, '1': 1, 0: 0, 1: 1}
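
The same pattern can be reproduced without dlt. The sketch below assumes the state is persisted as JSON between runs, which this report does not confirm; `run_once` is a hypothetical stand-in for `mock_resource`, and the stdlib `json` module stands in for whatever serializer dlt actually uses.

```python
import json


def run_once(state):
    """Mimic one execution of mock_resource against a plain dict (hypothetical helper)."""
    seen = state.setdefault("seen", {})
    before = dict(seen)
    for i in range(2):
        if i not in seen:
            seen[i] = i  # int key, as in the original resource
    return before, dict(seen)


state = {}
before_1, after_1 = run_once(state)

# Assumption: state is stored as JSON between runs, simulated with a round trip.
state = json.loads(json.dumps(state))

before_2, after_2 = run_once(state)
print("before:", before_2)
print("after: ", after_2)
```

If the assumption holds, the second run starts with string keys and ends with both string and int keys, matching the output above.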

Operating system

Linux

Runtime environment

Local

Python version

3.11

dlt data source

No response

dlt destination

DuckDB

Other deployment details

No response

Additional information

This could be backend-dependent.

rudolfix · 2024-12-17T13:49:18Z

@zilto AFAIK this is how the JSON "standard" defines dictionaries: object keys are always strings, so they are deserialized as strings. tl;dr: orjson does that. You could try switching to simplejson to see what happens:
https://dlthub.com/docs/reference/performance#use-built-in-json-parser
or hack dlt code and set this:
https://github.com/ijl/orjson?tab=readme-ov-file#migrating

also feel free to add a note on this behavior to docs :)
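
Until this is documented, one possible workaround is to use string keys consistently, so that the keys look the same before and after a JSON round trip. This is a minimal sketch, not dlt API: `mark_seen` is a hypothetical helper, and the stdlib `json` round trip only simulates the state being saved and reloaded.

```python
import json


def mark_seen(seen, entity_id):
    """Record entity_id under a string key; return True if it was not seen before."""
    key = str(entity_id)  # normalize so the key survives JSON serialization unchanged
    is_new = key not in seen
    if is_new:
        seen[key] = entity_id
    return is_new


seen = {}
first_run = [mark_seen(seen, i) for i in range(2)]

# Simulate the state being saved and reloaded between pipeline runs.
seen = json.loads(json.dumps(seen))
second_run = [mark_seen(seen, i) for i in range(2)]

print(first_run, second_run, seen)
```

Inside a resource, `seen` would be the dict returned by `dlt.current.resource_state().setdefault("seen", {})`, as in the original report.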

github-project-automation bot moved this to Todo in dlt core library Dec 16, 2024

github-project-automation bot added this to dlt core library Dec 16, 2024

zilto added the bug Something isn't working label Dec 16, 2024

rudolfix moved this from Todo to Planned in dlt core library Dec 17, 2024

rudolfix added the question Further information is requested label Dec 17, 2024
