Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ingestion): fix stateful ingestion for GCS source #11879

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

josges
Copy link

@josges josges commented Nov 18, 2024

Remove pipeline name before passing context to equivalent s3 source to avoid error "Checkpointing provider DatahubIngestionCheckpointingProvider already registered." Fixes this issue

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Nov 18, 2024
@hsheth2 hsheth2 requested a review from treff7es November 19, 2024 01:03
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Nov 20, 2024
@@ -88,7 +89,10 @@ def __init__(self, config: GCSSourceConfig, ctx: PipelineContext):
super().__init__(config, ctx)
self.config = config
self.report = GCSSourceReport()
self.s3_source = self.create_equivalent_s3_source(ctx)
self.platform: str = "gcs"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this constant is available

Suggested change
self.platform: str = "gcs"
self.platform: str = PLATFORM_GCS

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed, thank you!

Comment on lines 93 to 94
s3_ctx = copy.deepcopy(ctx)
s3_ctx.pipeline_name = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may move these two lines inside create_equivalent_s3_source

Also, how is s3_ctx.pipeline_name = None is required?
Note pipeline_name is required when stateful_ingestion is enabled.

Copy link
Author

@josges josges Dec 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your feedback!

Without this, running stateful ingestion leads to IndexError: Checkpointing provider DatahubIngestionCheckpointingProvider already registered.. This is because StatefulIngestionSourceBase is initialized twice, first for GCSSource and then in create_equivalent_s3_source with S3Source. This can also be verified by the change I did in the unit test.
I think, stateful ingestion was always meant to be switched off for the equivalent S3 source object, because in create_equivalent_s3_config, DataLakeSourceConfig is initialized without stateful_ingestion, which implicitly means stateful_ingestion=None. Since pipeline_name is set, the logic in StateProviderWrapper leads to stateful ingestion being switched on, which leads to the error.

I put the lines in the function, as suggested.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 24, 2024
Remove pipeline name before passing context to equivalent s3 source to avoid error "Checkpointing provider DatahubIngestionCheckpointingProvider already registered."
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata pending-submitter-response Issue/request has been reviewed but requires a response from the submitter
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ingestion for GCS ingest fails with stateful ingestion
3 participants