Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Support jdbc catalog #2866

Merged
merged 6 commits into from
Mar 6, 2024

Conversation

sunxiaojian
Copy link
Contributor

@sunxiaojian sunxiaojian commented Feb 7, 2024

Support jdbc catalog

Linked issue : #841

@sunxiaojian sunxiaojian force-pushed the support-jdbc-catalog branch 3 times, most recently from 7cc4f4c to 442d343 Compare February 8, 2024 15:59
@sunxiaojian
Copy link
Contributor Author

@JingsongLi PTAL

@sunxiaojian sunxiaojian force-pushed the support-jdbc-catalog branch 6 times, most recently from 4fd21ac to 443dd36 Compare February 22, 2024 06:18
@sunxiaojian
Copy link
Contributor Author

sunxiaojian commented Feb 27, 2024

@FangYongs @yuzelin @JingsongLi If you have time, could you please help review it ? Thx

@JingsongLi
Copy link
Contributor

Thanks @sunxiaojian for your contribution.

This PR looks good! I will take a look this week~

@@ -175,3 +176,185 @@ Using the table option facilitates the convenient definition of Hive table param
Parameters prefixed with `hive.` will be automatically defined in the `TBLPROPERTIES` of the Hive table.
For instance, using the option `hive.table.owner=Jon` will automatically add the parameter `table.owner=Jon` to the table properties during the creation process.



## Creating a Catalog with Filesystem Metastore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why copy this?

-- 'jdbc.user' = '...',
-- 'jdbc.password' = '...',
-- 'initialize-catalog-tables'='true'
-- 'warehouse' = 'hdfs:///path/to/warehouse',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warehouse should be required.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should remove --

-- 'uri' = 'jdbc:mysql://<host>:<port>/<databaseName>',
-- 'jdbc.user' = '...',
-- 'jdbc.password' = '...',
-- 'initialize-catalog-tables'='true'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this one? What is the scene?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this one? What is the scene?

@JingsongLi When creating a JdbcCatalog, it will be determined based on this parameter whether to automatically create a table or not

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does it not need to be initialized? You added a public API. Please explain why, how to use.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or you can remove this option.


You can define any default table options with the prefix `table-default.` for tables created in the catalog.

Also, you can create [FlinkGenericCatalog]({{< ref "engines/flink" >}}).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, FlinkGenericCatalog must use HiveCatalog.

+ ")"
+ ")";

static final String DISTRIBUTED_LOCK_ACQUIRE_SQL =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this? How to lock?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this? How to lock?

@JingsongLi First, create a table for distributed locks and write a data entry to the database table using database.table as the lockId. If successful, the lock acquisition is considered successful. If the primary key conflicts, it is considered a failure

@coolderli
Copy link

@sunxiaojian Thanks for your work. This is really what we want!

/** Util for jdbc catalog. */
public class JdbcUtils {
private static final Logger LOG = LoggerFactory.getLogger(JdbcUtils.class);
public static final String METADATA_LOCATION_PROP = "metadata_location";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, Paimon considers the commit successful once the snapshot file is successfully committed to the file system. Storing the current metadata location in the catalog metastore will cause consistency problems because the two operations of committing the snapshot file to the file system and committing it to the metastore are not atomic. So there seems to be no point in storing this metadata location? @JingsongLi Can you please review this? Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhongyujiang You understanding is very correct! Yes, Paimon only consider Snapshot files as source of truth.
metadata_location is come from Iceberg, we don't need this.


private static int tryClearExpireLock(JdbcClientPool connections, String lockName, long timeout)
throws SQLException, InterruptedException {
long expirationTimeMillis = System.currentTimeMillis() - timeout * 1000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this, could you please explain?

connection.prepareStatement(DISTRIBUTED_LOCK_ACQUIRE_SQL)) {
preparedStatement.setString(1, lockName);
preparedStatement.setTimestamp(
2, new Timestamp(System.currentTimeMillis()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we are using timestamp of clients as acquired_at? I think this makes no sense since clients can be distributed on different machines, which have no consistent system time. Since acquired_at is the timestamp of acquiring the lock, I think we should use the system time of the metastore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we are using timestamp of clients as acquired_at? I think this makes no sense since clients can be distributed on different machines, which have no consistent system time. Since acquired_at is the timestamp of acquiring the lock, I think we should use the system time of the metastore.

@zhongyujiang There may indeed be the issue you mentioned, but the database used for the meta storage system is different, and the generation system time function used is also different. May require separate compatibility

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initial time setting was for metastore, but it was not compatible with all databases, so this version of the modification was made. In fact, specific database specific time functions can be set for each database

public static boolean acquire(JdbcClientPool connections, String lockName, long timeout)
throws SQLException, InterruptedException {
// Check and clear expire lock
int affectedRows = tryClearExpireLock(connections, lockName, timeout);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think timeout is a property that should be set when acquiring the lock, not when the lock expires. Because different clients may configure different timeouts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think timeout is a property that should be set when acquiring the lock, not when the lock expires. Because different clients may configure different timeouts.

In fact, none of them are the most suitable. They all belong to the timeout set by the client, but your solution should be better

@sunxiaojian sunxiaojian force-pushed the support-jdbc-catalog branch 7 times, most recently from a4fd960 to 516b5cf Compare February 29, 2024 12:40
@sunxiaojian
Copy link
Contributor Author

@JingsongLi @zhongyujiang Thank you for the review, I have fixed the issues raised above.

@sunxiaojian sunxiaojian force-pushed the support-jdbc-catalog branch 3 times, most recently from 558025f to 9d41884 Compare March 1, 2024 04:08
@sunxiaojian sunxiaojian force-pushed the support-jdbc-catalog branch from 9d41884 to 6a70411 Compare March 1, 2024 04:09
@JingsongLi
Copy link
Contributor

Hi @sunxiaojian , you should make sure that every comment is resolved.

@sunxiaojian
Copy link
Contributor Author

Hi @sunxiaojian , you should make sure that every comment is resolved.

@JingsongLi All have been resolved

```
You can define any connection parameters for a database with the prefix "jdbc.".

You can define the "initialize-catalog-tables" configuration to automatically create tables required for the initial JdbcCatalog. If it is true, tables will be automatically created when initializing JdbcCatalog. If it is false, tables will not be automatically created and you must manually create them.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is false, tables will not be automatically created and you must manually create them.

Why is there such a need? I don't understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there such a need? I don't understand.

Some online business databases do not allow table creation through jdbc and must be submitted for approval in advance

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The final decision is not to keep this configuration. It is recommended that users enable automatic table creation permissions for independent databases, which would be better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some online business databases do not allow table creation through jdbc and must be submitted for approval in advance

You can document this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some online business databases do not allow table creation through jdbc and must be submitted for approval in advance

You can document this.

Most users may not need it, I will remove this parameter and add it back if there is a strong requirement.

'uri' = 'jdbc:mysql://<host>:<port>/<databaseName>',
'jdbc.user' = '...',
'jdbc.password' = '...',
'catalog-name'='jdbc'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we already have the concept of a catalog name, which is my_jdbc.

Maybe this option can be renamed to store-key or something else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least, jdbc.catalog-name, you need documentation to explain that it is designed to store multiple catalogs, and it is different from the catalog name you use in Flink SQL or Spark SQL.

'jdbc.user' = '...',
'jdbc.password' = '...',
'catalog-name'='jdbc'
'initialize-catalog-tables'='true'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you forgot ,

import static org.apache.paimon.utils.Preconditions.checkState;

/** Source: [core/src/main/java/org/apache/iceberg/ClientPoolImpl.java]. */
public abstract class ClientPoolImpl<C, E extends Exception>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move them to paimon-common.

@sunxiaojian sunxiaojian force-pushed the support-jdbc-catalog branch from 332fea0 to 8c2e10d Compare March 4, 2024 06:06
@sunxiaojian
Copy link
Contributor Author

@JingsongLi All have been resolved

String lockUniqueName = String.format("%s.%s.%s", catalogName, database, table);
lock(lockUniqueName);
try {
return callable.call();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the lock expires before the commit is completed? Will there be conflicts? Does this break the atomicity of commits?

For example, let's consider the first column as logical timestamp, and the second column is the events that occurred at that time:
1 client-1 successfully aquired a lock of table a
2 client-1 starts committing but gets stuck for some reason
5 the lock of table a timeouts
6 client-2 cleaned the expired lock of table a and acquired a new lock of table a
7 client-2 starts committing
8 client-2 made a successful committment
9 client-1 can start committing

In this case, client-1's commit should fail, but here client-1 will continue to commit. If the underlying file system allows overwriting of files with the same name, then client-1's commit will overwrite client-2's commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can start a new thread renewal timeout to solve the problem, but you can also set a reasonable timeout to solve it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the current strategy, it is difficult for this situation to occur because currently using lock is only renaming the file, and this operation should be completed quickly.

@JingsongLi
Copy link
Contributor

@zhongyujiang Hi, do you have other comments or concerns?

Copy link
Contributor

@zhongyujiang zhongyujiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left two minor comments, I have no other worries except the above one, Thanks @sunxiaojian

@@ -175,3 +176,40 @@ Using the table option facilitates the convenient definition of Hive table param
Parameters prefixed with `hive.` will be automatically defined in the `TBLPROPERTIES` of the Hive table.
For instance, using the option `hive.table.owner=Jon` will automatically add the parameter `table.owner=Jon` to the table properties during the creation process.

## Creating a Catalog with JDBC Metastore

By using the Paimon JDBC catalog, changes to the catalog will be directly stored in relational databases such as MySQL, postgres, etc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

postgres is not supported for now, should we remove it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

postgres is not supported for now, should we remove it?

it's just that locking is not supported

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document this? only SQLITE and MYSQL supports catalog lock

@sunxiaojian sunxiaojian force-pushed the support-jdbc-catalog branch from 736b0e2 to 0b7b739 Compare March 5, 2024 11:57
'uri' = 'jdbc:mysql://<host>:<port>/<databaseName>',
'jdbc.user' = '...',
'jdbc.password' = '...',
'store-key'='jdbc',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you modify this to jdbc.catalog-key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you modify this to jdbc.catalog-key?

The prefix "jdbc." is used to connect to the database configuration and can be directly parsed and placed in the connection configuration. If it is necessary to replace it, it can be replaced with "catalog-key"?

@sunxiaojian sunxiaojian force-pushed the support-jdbc-catalog branch 2 times, most recently from 8809531 to 41fb584 Compare March 6, 2024 02:33
@sunxiaojian sunxiaojian force-pushed the support-jdbc-catalog branch from d6b8d88 to bd81c01 Compare March 6, 2024 02:53
@sunxiaojian
Copy link
Contributor Author

@JingsongLi The above issues have been resolved, please review again

Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Thanks @sunxiaojian and @zhongyujiang

@JingsongLi JingsongLi merged commit 9d29737 into apache:master Mar 6, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants