[schema] Support database path for specified database in catalog #2494

Merged
merged 1 commit into from
Dec 22, 2023

Conversation

FangYongs
Contributor

Purpose

Linked issue: close #2493

Currently, a single FileIO is created when a catalog is created. We would like to support a path per database, so Paimon needs to obtain a FileIO for each database when the database is created or dropped and when tables are created or dropped in it. This PR aims to introduce DatabaseFileIOProvider in the catalog to get the FileIO for a given database name.

Tests

Existing tests cover this feature.

API and Format

no

Documentation

no

@FangYongs FangYongs force-pushed the fileio_provider_for_database branch from e4a2619 to 50ff181 Compare December 12, 2023 11:21
Contributor

@JingsongLi JingsongLi left a comment

Hi @FangYongs
Does this come from real business needs? In my previous understanding, this situation is usually solved by the FileSystem of HDFS. (Or Flink FileSystem)

@FangYongs
Contributor Author

> Hi @FangYongs Does this come from real business needs? In my previous understanding, this situation is usually solved by the FileSystem of HDFS. (Or Flink FileSystem)

@JingsongLi Yes. In our production environment we always provide one catalog for each of Hive, Hudi, Iceberg, and Paimon, so users can easily read and write tables as paimon.databaseName.tableName. Our company has a meta service that manages all databases and tables for each catalog. When users create a database, they must give it an HDFS or S3 path and then create tables in that database. So we need to support configuring the warehouse path at the database level.

@JingsongLi
Contributor

> > Hi @FangYongs Does this come from real business needs? In my previous understanding, this situation is usually solved by the FileSystem of HDFS. (Or Flink FileSystem)
>
> @JingsongLi Yes. In our production environment we always provide one catalog for each of Hive, Hudi, Iceberg, and Paimon, so users can easily read and write tables as paimon.databaseName.tableName. Our company has a meta service that manages all databases and tables for each catalog. When users create a database, they must give it an HDFS or S3 path and then create tables in that database. So we need to support configuring the warehouse path at the database level.

Can Hadoop FileSystem solve this? #2504

@FangYongs
Contributor Author

@JingsongLi It seems not; HadoopFileIO is per catalog too. As mentioned above, we have a meta service for all catalogs, so we built a Paimon meta catalog on top of it, and the catalog stores each database and its path once users create it. When users create a table, the catalog needs to find the database and its path in the Paimon meta catalog and create a FileIO for the table to store schema and data. That's why I would like to introduce the interface DatabaseFileIOProvider in AbstractCatalog to get the FileIO for a specified database.

What do you think?

@JingsongLi
Contributor

> @JingsongLi It seems not; HadoopFileIO is per catalog too. As mentioned above, we have a meta service for all catalogs, so we built a Paimon meta catalog on top of it, and the catalog stores each database and its path once users create it. When users create a table, the catalog needs to find the database and its path in the Paimon meta catalog and create a FileIO for the table to store schema and data. That's why I would like to introduce the interface DatabaseFileIOProvider in AbstractCatalog to get the FileIO for a specified database.
>
> What do you think?

Can you list what FileSystems need to be supported?

@JingsongLi
Contributor

For example, Hadoop FileSystem supports viewfs, so there can be multiple underlying filesystems.

@FangYongs
Contributor Author

@JingsongLi Local file system, HDFS, or S3, which are already supported in Paimon.

@FangYongs
Contributor Author

> For example, Hadoop FileSystem supports viewfs, so there can be multiple underlying filesystems.

We may configure a different HDFS cluster for each database in the Paimon catalog. Do you mean we can get different filesystems from one FileIO?

@JingsongLi
Contributor

> > For example, Hadoop FileSystem supports viewfs, so there can be multiple underlying filesystems.
>
> We may configure a different HDFS cluster for each database in the Paimon catalog. Do you mean we can get different filesystems from one FileIO?

Yes

@JingsongLi
Contributor

You can do this at the Paimon level too. The current HadoopFileIO already holds a Map<Pair<String, String>, FileSystem> fsMap, which maps scheme + authority => FileSystem.
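
The scheme + authority lookup described here can be sketched roughly as follows. This is only an illustration of the caching idea, not Paimon's HadoopFileIO: `SchemeAuthorityCache` and `FakeFileSystem` are hypothetical names, and a plain `Map.Entry` stands in for Paimon's `Pair`.

```java
import java.net.URI;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: one cached handle per (scheme, authority) behind a single entry point. */
final class SchemeAuthorityCache {

    /** Hypothetical stand-in for a real file system client. */
    static final class FakeFileSystem {
        final String scheme;
        final String authority;

        FakeFileSystem(String scheme, String authority) {
            this.scheme = scheme;
            this.authority = authority;
        }
    }

    // Key is (scheme, authority); a missing component is normalized to "".
    private final Map<Map.Entry<String, String>, FakeFileSystem> fsMap =
            new ConcurrentHashMap<>();

    /** Returns the cached handle for the path's scheme + authority, creating it once. */
    FakeFileSystem getFileSystem(URI path) {
        String scheme = path.getScheme() == null ? "" : path.getScheme();
        String authority = path.getAuthority() == null ? "" : path.getAuthority();
        Map.Entry<String, String> key = new SimpleImmutableEntry<>(scheme, authority);
        return fsMap.computeIfAbsent(key, k -> new FakeFileSystem(k.getKey(), k.getValue()));
    }

    int cachedFileSystems() {
        return fsMap.size();
    }

    public static void main(String[] args) {
        SchemeAuthorityCache cache = new SchemeAuthorityCache();
        cache.getFileSystem(URI.create("hdfs://cluster-a/warehouse/db1.db"));
        cache.getFileSystem(URI.create("s3://bucket-b/warehouse/db2.db"));
        cache.getFileSystem(URI.create("hdfs://cluster-a/warehouse/db3.db"));
        // Two distinct (scheme, authority) pairs were seen, so two handles are cached.
        System.out.println(cache.cachedFileSystems());
    }
}
```

With a cache like this, databases whose paths point at different clusters or buckets can share one entry point, since each path resolves to the handle that matches its own scheme and authority.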

@FangYongs
Contributor Author

@JingsongLi Got it. Then I think we can make minor updates in Paimon to support different database paths in a catalog. I will update this PR.

@FangYongs FangYongs force-pushed the fileio_provider_for_database branch from 50ff181 to ee6f079 Compare December 18, 2023 10:44
@FangYongs FangYongs changed the title [schema] Introduce DatabaseFileIOProvider to get fileio for database [schema] Support database path for specified database in catalog Dec 18, 2023
@FangYongs
Contributor Author

@JingsongLi What do you think of this PR now? We can implement a customized catalog and get the warehouse path for a specified database name.
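
As a rough illustration of resolving the warehouse path by database name, a customized catalog might look up a registered path and fall back to the catalog warehouse otherwise. Everything in this sketch is an assumption made for illustration: `DatabasePathResolver`, its methods, and the in-memory map standing in for the meta service are hypothetical and are not the API added by this PR. The `<warehouse>/<database>.db` fallback layout is likewise assumed.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: resolve a per-database path, falling back to the catalog warehouse. */
final class DatabasePathResolver {

    private final String defaultWarehouse;
    // Hypothetical stand-in for the meta service that stores database -> path.
    private final Map<String, String> registeredPaths;

    DatabasePathResolver(String defaultWarehouse, Map<String, String> registeredPaths) {
        this.defaultWarehouse = stripTrailingSlash(defaultWarehouse);
        this.registeredPaths = new HashMap<>(registeredPaths);
    }

    /** Path of the database: the registered path if present, else a warehouse default. */
    String databasePath(String database) {
        String registered = registeredPaths.get(database);
        if (registered != null) {
            return stripTrailingSlash(registered);
        }
        // Assumed fallback layout: <warehouse>/<database>.db
        return defaultWarehouse + "/" + database + ".db";
    }

    /** Path of a table, placed directly under its database path. */
    String tablePath(String database, String table) {
        return databasePath(database) + "/" + table;
    }

    private static String stripTrailingSlash(String path) {
        return path.endsWith("/") ? path.substring(0, path.length() - 1) : path;
    }

    public static void main(String[] args) {
        Map<String, String> registered = new HashMap<>();
        registered.put("sales", "s3://bucket-b/sales/");
        DatabasePathResolver resolver =
                new DatabasePathResolver("hdfs://cluster-a/warehouse", registered);
        System.out.println(resolver.tablePath("sales", "orders"));
        System.out.println(resolver.tablePath("logs", "events"));
    }
}
```

Because the resolved paths carry their own scheme and authority, a single FileIO that caches file systems by scheme + authority can serve every database without a separate provider per database.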

@FangYongs FangYongs force-pushed the fileio_provider_for_database branch from ee6f079 to 2316916 Compare December 18, 2023 11:09
@FangYongs FangYongs force-pushed the fileio_provider_for_database branch 2 times, most recently from eaa20e1 to b49f672 Compare December 18, 2023 11:58
@FangYongs FangYongs force-pushed the fileio_provider_for_database branch from b49f672 to ce99efd Compare December 20, 2023 17:17
@FangYongs
Contributor Author

@JingsongLi Can you help review this PR again? Thanks.

@FangYongs FangYongs requested a review from JingsongLi December 22, 2023 02:48
Contributor

@JingsongLi JingsongLi left a comment

+1

@JingsongLi JingsongLi merged commit 9f151ab into apache:master Dec 22, 2023
9 checks passed
Successfully merging this pull request may close these issues.

[Feature][schema] Support database path for specified database in catalog