Running a pipeline mutates the original catalog object #4340

Open
FlorianGD opened this issue Nov 19, 2024 · 2 comments
Labels
Community Issue/PR opened by the open-source community

Comments

@FlorianGD
Contributor

Description

When a pipeline is run, a shallow_copy of the catalog is made with the runner's extra_dataset_patterns, and the original object is mutated: a catch-all extra_dataset_patterns entry is added to it. This prevents, for example, running a pipeline twice with the same catalog, because the second run no longer has any free outputs.

Context

I have a pipeline that exists under two namespaces. I wanted to test it for both, so I parametrized the tests. The tests used the same catalog as a fixture, scoped to the module so it would not be re-created every time. After updating to Kedro 0.19, the test fails for the second parameter because my test pipeline no longer has any free outputs. After investigating, I found that the pipeline has no free outputs because, the second time, the catalog has catalog._extra_dataset_patterns={'{default}': {'type': 'MemoryDataset'}}, which matches all the datasets.
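
For illustration, a minimal sketch of that test setup, assuming the module-scoped fixture described above (the fixture and test names are hypothetical, and the namespaced pipeline is reduced to the single node used in the repro below):

import pytest
from kedro.io import DataCatalog
from kedro.io.memory_dataset import MemoryDataset
from kedro.pipeline import node, pipeline
from kedro.runner import SequentialRunner


@pytest.fixture(scope="module")
def catalog():
    # Module scope: both parametrized cases share (and mutate) the same object.
    catalog = DataCatalog()
    catalog.add("foo", MemoryDataset())
    catalog.save("foo", 1)
    return catalog


@pytest.mark.parametrize("namespace", ["first", "second"])
def test_pipeline_has_free_outputs(catalog, namespace):
    # The real pipeline depends on `namespace`; it is simplified here.
    test_pipeline = pipeline([node(lambda x: x, inputs="foo", outputs="bar")])
    output = SequentialRunner().run(test_pipeline, catalog)
    assert output  # passes for "first", fails for "second" on Kedro 0.19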

Steps to Reproduce

To reproduce the symptom (running the pipeline twice produces no free outputs the second time):

>>> from kedro.io import DataCatalog
>>> from kedro.io.memory_dataset import MemoryDataset
>>> from kedro.pipeline import node, pipeline
>>> from kedro.runner import SequentialRunner
>>> my_pipeline = pipeline([node(lambda x: x, inputs=["foo"], outputs="bar")])
>>> catalog = DataCatalog()
>>> catalog.add("foo", MemoryDataset())
>>> catalog.save("foo", 1)
>>> runner = SequentialRunner()
>>> vars(catalog._config_resolver)
{
    '_runtime_patterns': {},
    '_dataset_patterns': {},
    '_default_pattern': {},
    '_resolved_configs': {}
}
>>> runner.run(my_pipeline, catalog)
{'bar': 1}
>>> vars(catalog._config_resolver)
{
    '_runtime_patterns': {'{default}': {'type': 'MemoryDataset'}},
    '_dataset_patterns': {},
    '_default_pattern': {},
    '_resolved_configs': {}
}
>>> runner.run(my_pipeline, catalog)
{}

I think the problem comes from this behavior:

>>> catalog = DataCatalog()
>>> runner = SequentialRunner()
>>> catalog.config_resolver.list_patterns()
[]
>>> catalog.shallow_copy(extra_dataset_patterns=runner._extra_dataset_patterns)
None
>>> catalog.config_resolver.list_patterns()
['{default}']
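
As a stop-gap, one possible workaround is sketched below (untested, and assuming DataCatalog and its in-memory datasets survive copy.deepcopy): hand the runner a copy so the original catalog's resolver is never mutated.

import copy

# Workaround sketch (not verified): run against a deep copy so the shared
# catalog keeps an empty _runtime_patterns dict.
runner.run(my_pipeline, copy.deepcopy(catalog))
runner.run(my_pipeline, copy.deepcopy(catalog))  # "bar" should still be a free output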

Expected Result

I expect bar to be kept as a free output when the pipeline is run a second time.

Actual Result

The second time the pipeline runs, the output is an empty dictionary.

Your Environment

  • Kedro version used (pip show kedro or kedro -V): v0.19.9
  • Python version used (python -V): Python 3.11.8
  • Operating system and version: Ubuntu 20.04
@merelcht added the Community (Issue/PR opened by the open-source community) label on Nov 19, 2024
@datajoely
Contributor

@ElenaKhaustova would this still be the case in the new KedroDataCatalog object?

@ElenaKhaustova
Contributor

ElenaKhaustova commented Nov 19, 2024

@ElenaKhaustova would this still be the case in the new KedroDataCatalog object?

Yes, this will be the case for the new catalog as well. shallow_copy will be removed, since there is now a dedicated add_runtime_patterns method on CatalogConfigResolver, but patterns added at runtime are currently not removed afterwards.

Related to #4235

There was a suggestion to remove patterns after run: 5813d01

But we might need to think of something better to address this.
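
For illustration only, the "remove patterns after run" idea could look roughly like the snapshot-and-restore sketch below, built on the private attributes visible in the repro above (config_resolver and _runtime_patterns); this is not the actual change in 5813d01.

# Hypothetical sketch: snapshot and restore the resolver's runtime patterns so
# a run cannot leak the catch-all {default} pattern into the caller's catalog.
resolver = catalog.config_resolver
saved_runtime_patterns = dict(resolver._runtime_patterns)
try:
    result = runner.run(my_pipeline, catalog)
finally:
    resolver._runtime_patterns = saved_runtime_patterns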
