Support Open Source ClickHouse Deployments #1496

Merged
merged 42 commits into from
Jul 15, 2024

Conversation

Pipboyguy
Collaborator

@Pipboyguy Pipboyguy commented Jun 19, 2024

Description

This PR improves support for self-managed, open source ClickHouse deployments while maintaining compatibility with ClickHouse Cloud deployments.

  • Allow explicitly specifying the desired engine via table_engine_type in clickhouse_adapter. Valid types are merge_tree, replicated_merge_tree, shared_merge_tree, stripe_log, tiny_log
  • Default to MergeTree if no engine is specified, which now works for both Cloud and self-managed deployments
  • Update tests to check for the appropriate engine based on annotation
  • ClickHouse Cloud has "date_time_input_format" set to "best_effort", while open source deployments don't. This caused some tests to fail on open source deployments, so we override the setting for clickhouse_connect sessions.
  • Clarify in the docs some networking details for making dlt work with open source deployments
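
The engine selection described in the list above can be sketched roughly as follows. This is an illustration only, not dlt's implementation: `resolve_engine` and `DEFAULT_ENGINE_TYPE` are hypothetical names, and the mapping contents are an assumption that pairs the hint values listed above with the matching ClickHouse engine names.

```python
from typing import Dict, Optional

# Assumed mapping from the hint values named in the description to
# ClickHouse engine names; the actual dictionary in dlt may differ.
TABLE_ENGINE_TYPE_TO_CLICKHOUSE_ATTR: Dict[str, str] = {
    "merge_tree": "MergeTree",
    "replicated_merge_tree": "ReplicatedMergeTree",
    "shared_merge_tree": "SharedMergeTree",
    "stripe_log": "StripeLog",
    "tiny_log": "TinyLog",
}

# The description states that MergeTree is used when no engine is specified.
DEFAULT_ENGINE_TYPE = "merge_tree"


def resolve_engine(table_hint: Optional[str] = None) -> str:
    """Hypothetical helper: return the ClickHouse engine name for a table hint."""
    engine_type = table_hint or DEFAULT_ENGINE_TYPE
    try:
        return TABLE_ENGINE_TYPE_TO_CLICKHOUSE_ATTR[engine_type]
    except KeyError:
        valid = ", ".join(sorted(TABLE_ENGINE_TYPE_TO_CLICKHOUSE_ATTR))
        raise ValueError(
            f"Unknown table engine type {engine_type!r}; valid types: {valid}"
        ) from None
```

Calling `resolve_engine()` with no hint falls back to the default, while an unrecognized value raises instead of silently producing an invalid engine clause.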

Related Issues

Additional Context

Note that this change does not include support for specifying replication, ZooKeeper or shard details for the ReplicatedMergeTree engine. Users requiring those customizations can continue to specify the full engine definition in their configuration.

@Pipboyguy Pipboyguy linked an issue Jun 19, 2024 that may be closed by this pull request

netlify bot commented Jun 19, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit 477e815
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/669175d24f3c6800089ff0d4
😎 Deploy Preview https://deploy-preview-1496--dlt-hub-docs.netlify.app

@Pipboyguy Pipboyguy self-assigned this Jun 19, 2024
@Pipboyguy Pipboyguy added the enhancement (New feature or request) and tech-debt (Leftovers from previous sprint that should be fixed over time) labels Jun 19, 2024
Signed-off-by: Marcel Coetzee <[email protected]>
@Pipboyguy Pipboyguy requested review from rudolfix and sh-rp and removed request for rudolfix June 20, 2024 12:35
@Pipboyguy Pipboyguy changed the title Add support for MergeTree engine in ClickHouse destination Support Open Source ClickHouse Deployments Jun 20, 2024
Signed-off-by: Marcel Coetzee <[email protected]>
Collaborator

@jorritsandbrink jorritsandbrink left a comment

Code itself looks good to me. Just two comments regarding test coverage and the use of sentinel tables.

@rudolfix rudolfix added the sprint (Marks group of tasks with core team focus at this moment) label Jun 26, 2024
Signed-off-by: Marcel Coetzee <[email protected]>
# Conflicts:
#	dlt/destinations/impl/clickhouse/clickhouse.py
#	dlt/destinations/impl/clickhouse/sql_client.py
#	tests/load/clickhouse/test_clickhouse_adapter.py
@Pipboyguy Pipboyguy requested a review from jorritsandbrink July 1, 2024 14:13
Collaborator

@jorritsandbrink jorritsandbrink left a comment

Thanks for addressing the comments, LGTM!

Just this comment regarding state interaction is still open: #1496 (comment)

@rudolfix rudolfix removed the sprint (Marks group of tasks with core team focus at this moment) label Jul 3, 2024
Collaborator

@rudolfix rudolfix left a comment

@Pipboyguy @jorritsandbrink this is really good. thanks for working on it and all reviews. from my side

  • pls keep sentinel table. I elaborated quite a bit why
  • consider adding the table engine in the clickhouse configuration to have an easy global switch

Signed-off-by: Marcel Coetzee <[email protected]>
rudolfix previously approved these changes Jul 5, 2024
Collaborator

@rudolfix rudolfix left a comment

LGTM! I have just one small suggestion which may not make sense. let's wait for @jorritsandbrink review and we can merge it

self.execute_sql(f"""
CREATE TABLE {sentinel_table_name}
(_dlt_id String NOT NULL PRIMARY KEY)
ENGINE=MergeTree
Collaborator

I'm probably nitpicking but should it use config.table_engine_type here?

Collaborator Author

@rudolfix Thanks for the spot!

Changed to config table engine:

sentinel_table_type = cast(TTableEngineType, self.credentials.table_engine_type)
self.execute_sql(f"""
CREATE TABLE {sentinel_table_name}
(_dlt_id String NOT NULL PRIMARY KEY)
ENGINE={TABLE_ENGINE_TYPE_TO_CLICKHOUSE_ATTR.get(sentinel_table_type)}
COMMENT 'internal dlt sentinel table'""")
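
One detail of the snippet above: `dict.get` returns `None` for an unknown key, and an f-string would render that as `ENGINE=None`. Whether configuration validation already rules this out upstream is not shown in this conversation. A minimal sketch of a guarded variant, assuming a hypothetical helper `sentinel_table_ddl` and a two-entry stand-in for the real mapping:

```python
from typing import Dict

# Assumed subset of the engine mapping; the real dictionary lives in dlt.
TABLE_ENGINE_TYPE_TO_CLICKHOUSE_ATTR: Dict[str, str] = {
    "merge_tree": "MergeTree",
    "replicated_merge_tree": "ReplicatedMergeTree",
}


def sentinel_table_ddl(sentinel_table_name: str, table_engine_type: str) -> str:
    """Hypothetical helper: build the sentinel table DDL for a configured engine."""
    engine = TABLE_ENGINE_TYPE_TO_CLICKHOUSE_ATTR.get(table_engine_type)
    if engine is None:
        # Fail early instead of rendering "ENGINE=None" into the statement.
        raise ValueError(f"Unsupported table engine type: {table_engine_type!r}")
    return (
        f"CREATE TABLE {sentinel_table_name}\n"
        "(_dlt_id String NOT NULL PRIMARY KEY)\n"
        f"ENGINE={engine}\n"
        "COMMENT 'internal dlt sentinel table'"
    )
```

The returned string could then be passed to `execute_sql` in place of the inline f-string.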

Collaborator

@jorritsandbrink jorritsandbrink left a comment

Overall it looks great, just a comment regarding organization of config attributes.

@@ -36,6 +40,8 @@ class ClickHouseCredentials(ConnectionStringCredentials):
"""Timeout for sending and receiving data. Defaults to 300 seconds."""
dataset_table_separator: str = "___"
"""Separator for dataset table names, defaults to '___', i.e. 'database.dataset___table'."""
table_engine_type: Optional[TTableEngineType] = "merge_tree"
Collaborator

Why do you define this attribute under ClickHouseCredentials?

I think it makes more sense to put it under ClickHouseClientConfiguration. Same for dataset_table_separator and dataset_sentinel_table_name. They aren't credentials.

As a reference: synapse has a default_table_index_type that's similar to table_engine_type, which is defined in the destination client class:

default_table_index_type: Optional[TTableIndexType] = "heap"

Collaborator

@Pipboyguy yes I agree with @jorritsandbrink
this is probably due to sql_client not seeing config but credentials - but you can pass additional arguments to it
so my take is that we move it and if it is hard then we'll need to change how sql_client is instantiated. I can help in that case

Collaborator Author

Totally agree. These don't belong in credentials.

Please find latest changes in both code base and docs.
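
For readers following the thread, the reorganization discussed here might look roughly like the sketch below. It uses plain dataclasses as stand-ins for dlt's configuration classes, so it is not the merged code: the credential fields, the `SqlClientSketch` class, and the sentinel table default are assumptions made for illustration. Only the `"___"` separator and the `"merge_tree"` default come from the diff quoted above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ClickHouseCredentials:
    """Stand-in that holds connection secrets only (fields are illustrative)."""

    host: str = "localhost"
    username: str = "default"
    password: str = ""


@dataclass
class ClickHouseClientConfiguration:
    """Stand-in for destination-level settings that are not credentials."""

    credentials: ClickHouseCredentials = field(default_factory=ClickHouseCredentials)
    dataset_table_separator: str = "___"
    table_engine_type: Optional[str] = "merge_tree"
    # Placeholder default; the real value is not shown in this conversation.
    dataset_sentinel_table_name: str = "sentinel_table"


class SqlClientSketch:
    """Hypothetical SQL client that receives config values as extra arguments."""

    def __init__(
        self,
        credentials: ClickHouseCredentials,
        dataset_name: str,
        dataset_table_separator: str,
        table_engine_type: Optional[str],
    ) -> None:
        self.credentials = credentials
        self.dataset_name = dataset_name
        self.dataset_table_separator = dataset_table_separator
        self.table_engine_type = table_engine_type

    def qualified_table_name(self, database: str, table: str) -> str:
        # Follows the 'database.dataset___table' pattern from the docstring.
        return f"{database}.{self.dataset_name}{self.dataset_table_separator}{table}"
```

With this split, the client is built from `config.credentials` plus the individual settings it needs, so the credentials object no longer carries non-secret options.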

Collaborator

@jorritsandbrink jorritsandbrink left a comment

LGTM!

I like the typing module.

@rudolfix rudolfix merged commit 9156f44 into devel Jul 15, 2024
52 checks passed
@rudolfix rudolfix deleted the 1387-clickhouse-mergetree-support branch July 15, 2024 19:45
Labels
enhancement (New feature or request)
tech-debt (Leftovers from previous sprint that should be fixed over time)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

  • Open Source ClickHouse fails on Timestamp with Timezone
  • ClickHouse MergeTree Support
3 participants