[590] Add RunCatalogSync utility for synchronizing tables across catalogs #591

Open
wants to merge 11 commits into
base: main
Conversation

vinishjail97
Contributor

@vinishjail97 vinishjail97 commented Dec 6, 2024

Important Read

  • Please ensure the GitHub issue is mentioned at the beginning of the PR

What is the purpose of the pull request

Introduces the RunCatalogSync utility, which unblocks the ability to sync tables from a source catalog to multiple target catalogs. For a sample configuration, see xtable-utilities/src/test/resources/catalogConfig.yaml

See the RFC for more details on the design and the flow chart -> #605

Brief change log

(for example:)

  • Add the ability to synchronize tables between source and target catalogs

Verify this pull request

(Please pick either of the following options)

This change added tests and can be verified as follows:

(example:)

  • xtable-utilities/src/test/java/org/apache/xtable/utilities/TestRunCatalogSync.java

@vinishjail97 vinishjail97 marked this pull request as draft December 6, 2024 22:41
@vinishjail97 vinishjail97 changed the title [590] Add interface for CatalogSyncClient and CatalogSyncOperations [590] Add interface for CatalogSyncClient and CatalogSync Dec 10, 2024
@vinishjail97 vinishjail97 marked this pull request as ready for review December 10, 2024 02:16
@vinishjail97
Contributor Author

vinishjail97 commented Dec 10, 2024

I'm pushing another PR for the client side changes for CatalogSync.
https://github.com/apache/incubator-xtable/pull/597/files

@vinishjail97
Contributor Author

vinishjail97 commented Dec 10, 2024

After putting more thought into it, I think we can keep the cross catalog sync as a separate function and not integrate it with TableFormatSync.

  1. Start off with SourceCatalog. (This can be StorageCatalog as well).
  2. User can choose multiple TargetCatalog and in each catalog there's an option to sync multiple table formats.
  3. XTable generates the table format metadata in storage first.
  4. Syncs the table format metadata to TargetCatalog.

This will be a separate utility option in RunSync to configure a catalogConfig.yaml file.

/** This class represents the unique identifier for a table in a catalog. */
@Value
@Builder
public class CatalogTableIdentifier {
Contributor

  1. What do you think of calling this class TableIdentifier?
  2. Also, we should consider making the naming more generic, since it should represent all types of table identifiers. For instance, while databaseName is a popular namespace, there's also schema. In some scenarios, databaseName is synonymous with catalogName. The current two-part naming based on table and databaseName seems a bit restrictive to me.

Would you mind sharing the use cases this naming caters to?

Contributor Author

1 is okay. I included the prefix Catalog because TableIdentifier is overloaded in the dependent projects (Iceberg, Hudi, Delta, etc.) and I didn't want to add another identifier with the same name.

For 2, agreed that each catalog or system has a different term for a "logical grouping of tables".
I looked up this name in different systems and found that databaseName, namespace, and schema are the popular ones. A more generic name that comes to mind is tableCollection or tableGroup? If we can't find a generic name, choosing databaseName may be okay? Regardless of the name we choose, the conversion interfaces for each catalog are responsible for translating the catalog table definition to this representation.

@ashvina
Contributor

ashvina commented Dec 18, 2024

Hi @vinishjail97 , this PR seems to cover two distinct features: refactoring to add CatalogSyncClient and enabling syncing to multiple catalogs. The combination of these features might be contributing to the PR's size. What do you think about splitting the PR along these feature lines?

@@ -34,14 +35,21 @@ public class ConversionConfig {
@NonNull SourceTable sourceTable;
// One or more targets to sync the table metadata to
List<TargetTable> targetTables;
// Each target table can be synced to multiple target catalogs, this is map from
// targetTableIdentifier to target catalogs.
Map<String, List<TargetCatalogConfig>> targetCatalogs;
Contributor

I am curious about the expected behavior if XTable fails to update a subset of the TargetCatalogs and if that impacts the way incremental sync works?
This may have been covered elsewhere in the PR that I might have missed.

Contributor Author

It doesn't change the behavior of the existing incremental sync; a failure is returned as part of the field below. Existing XTable users can still use incremental sync without configuring source/target catalogs: the existing RunSync class in utilities and the sync function in ConversionController are not changing.

// The sync status for each catalog.
List<CatalogSyncStatus> catalogSyncStatusList; 
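
To make the idea of one status entry per catalog concrete, here is a minimal sketch. It is not the code in this PR: `CatalogStatusSketch`, `CatalogUpdater`, `syncToAll`, and the catalog names are hypothetical, and the choice to keep going after one catalog fails is an assumption of the sketch rather than confirmed behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-catalog result; field names are illustrative, not the PR's CatalogSyncStatus.
final class CatalogStatusSketch {
  final String catalogId;
  final boolean succeeded;
  final String errorMessage; // null when the sync succeeded

  CatalogStatusSketch(String catalogId, boolean succeeded, String errorMessage) {
    this.catalogId = catalogId;
    this.succeeded = succeeded;
    this.errorMessage = errorMessage;
  }
}

final class PerCatalogStatusSketch {
  // Hypothetical stand-in for a client that registers a table in one target catalog.
  interface CatalogUpdater {
    void update(String tableName) throws Exception;
  }

  // Attempts every target catalog and records one status per catalog.
  // Continuing after a failure is an assumption of this sketch.
  static List<CatalogStatusSketch> syncToAll(
      String tableName, Map<String, CatalogUpdater> targets) {
    List<CatalogStatusSketch> statuses = new ArrayList<>();
    for (Map.Entry<String, CatalogUpdater> target : targets.entrySet()) {
      try {
        target.getValue().update(tableName);
        statuses.add(new CatalogStatusSketch(target.getKey(), true, null));
      } catch (Exception e) {
        statuses.add(new CatalogStatusSketch(target.getKey(), false, e.getMessage()));
      }
    }
    return statuses;
  }

  // Builds two made-up targets: one that succeeds and one that always fails.
  static Map<String, CatalogUpdater> sampleTargets() {
    Map<String, CatalogUpdater> targets = new LinkedHashMap<>();
    targets.put("catalog-a", tableName -> {});
    targets.put(
        "catalog-b",
        tableName -> {
          throw new IllegalStateException("catalog unavailable");
        });
    return targets;
  }

  public static void main(String[] args) {
    for (CatalogStatusSketch s : syncToAll("orders", sampleTargets())) {
      System.out.println(s.catalogId + " succeeded=" + s.succeeded + " error=" + s.errorMessage);
    }
  }
}
```

With a list like this, a caller can see which catalogs need a retry without the table format sync itself being affected.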

Contributor Author

public class CatalogTableIdentifier {
/**
* Catalogs have the ability to group tables logically, databaseName is the identifier for such
* logical classification. The alternate names for this field include namespace, schemaName etc.
Contributor

I agree that the nesting varies depending on the database. However, three-part naming is quite common: the convention includes the name of the database, schema, and table or view. I am wondering if it would be feasible to keep the table ID as a string instead of an object whose structure can vary a lot.

Contributor Author

Following the three-level naming convention, I kept catalogName as optional.
c227b76

Contributor Author

This is an internal representation for the table identifier in a catalog. For catalogs following a different convention, we will interoperate using this model.
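
As an illustration of the three-level naming discussed in this thread, the sketch below models an identifier whose catalog level is optional and shows how it could round-trip with a plain string ID. It is not the class from this PR: `ThreePartTableIdentifierSketch`, its `parse` method, and the dot-separated string form are all assumptions made for the example.

```java
import java.util.Optional;

// Hypothetical sketch, not the PR's CatalogTableIdentifier: a three-level identifier
// (catalog, database, table) in which only the catalog level is optional.
final class ThreePartTableIdentifierSketch {
  private final String catalogName; // null when the catalog level is omitted
  private final String databaseName; // called namespace or schema in some systems
  private final String tableName;

  ThreePartTableIdentifierSketch(String catalogName, String databaseName, String tableName) {
    if (databaseName == null || databaseName.isEmpty()) {
      throw new IllegalArgumentException("databaseName is required");
    }
    if (tableName == null || tableName.isEmpty()) {
      throw new IllegalArgumentException("tableName is required");
    }
    this.catalogName = catalogName;
    this.databaseName = databaseName;
    this.tableName = tableName;
  }

  // Parses "db.table" or "catalog.db.table"; the dot-separated form is an assumption.
  static ThreePartTableIdentifierSketch parse(String id) {
    String[] parts = id.split("\\.", -1); // -1 keeps empty trailing parts so they get rejected
    for (String part : parts) {
      if (part.isEmpty()) {
        throw new IllegalArgumentException("Empty name part in: " + id);
      }
    }
    if (parts.length == 2) {
      return new ThreePartTableIdentifierSketch(null, parts[0], parts[1]);
    }
    if (parts.length == 3) {
      return new ThreePartTableIdentifierSketch(parts[0], parts[1], parts[2]);
    }
    throw new IllegalArgumentException("Expected db.table or catalog.db.table: " + id);
  }

  Optional<String> getCatalogName() {
    return Optional.ofNullable(catalogName);
  }

  String getDatabaseName() {
    return databaseName;
  }

  String getTableName() {
    return tableName;
  }

  @Override
  public String toString() {
    return (catalogName == null ? "" : catalogName + ".") + databaseName + "." + tableName;
  }

  public static void main(String[] args) {
    System.out.println(parse("sales.orders"));
    System.out.println(parse("prod.sales.orders").getCatalogName().isPresent());
  }
}
```

A structured object with a string form like this keeps the string-ID option open for catalogs that only expose a flat name, while the per-catalog conversion code still gets typed fields.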

@ashvina
Contributor

ashvina commented Dec 18, 2024

Hi @vinishjail97, This change significantly impacts XTable usage. The current description doesn't provide enough context for me. Could we set up a call or create a document to discuss this change in detail?

@vinishjail97
Contributor Author

vinishjail97 commented Dec 18, 2024

@ashvina I will create two separate PRs to avoid the confusion.

  1. Interfaces for CatalogSyncClient and CatalogSync.
  2. Integration of catalog syncs in XTable conversion controller using RunCatalogSync.

We are not changing the way incremental sync works for table formats; the sync method in ConversionController still exists. The purpose of syncTableAcrossCatalogs is to synchronize a table from the source catalog to the target catalogs. If a table format sync is needed, we synchronize the table format first; otherwise we skip it. After that, the targetTable's metadata is synchronized to the target catalogs using the catalogTableIdentifier provided. I will also add a small RFC document in the second PR for clarity.
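
The ordering described above can be sketched as a small planning function. This is only an illustration, not the logic of syncTableAcrossCatalogs: `CrossCatalogSyncFlowSketch`, `plan`, the step labels, and the catalog names are hypothetical, and treating "a table format sync is needed" as "source and target formats differ" is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the ordering: optional table format sync first,
// then one catalog sync per target catalog.
final class CrossCatalogSyncFlowSketch {

  // Returns the ordered steps for one target table.
  // Assumption: a format sync is needed only when the formats differ.
  static List<String> plan(String sourceFormat, String targetFormat, List<String> targetCatalogs) {
    List<String> steps = new ArrayList<>();
    if (!sourceFormat.equalsIgnoreCase(targetFormat)) {
      // Generate the target table format metadata in storage before touching any catalog.
      steps.add("format-sync:" + sourceFormat + "->" + targetFormat);
    }
    for (String catalog : targetCatalogs) {
      // Register the (possibly freshly generated) metadata in each target catalog.
      steps.add("catalog-sync:" + targetFormat + "->" + catalog);
    }
    return steps;
  }

  public static void main(String[] args) {
    System.out.println(plan("HUDI", "ICEBERG", Arrays.asList("catalog-a", "catalog-b")));
    System.out.println(plan("ICEBERG", "ICEBERG", Arrays.asList("catalog-a")));
  }
}
```

Keeping the format step ahead of every catalog step means a catalog never points at metadata that has not been written yet.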

@vinishjail97 vinishjail97 changed the base branch from main to 590-CatalogSync-API December 18, 2024 10:07
@vinishjail97 vinishjail97 changed the title [590] Add interface for CatalogSyncClient and CatalogSync [590] Add RunCatalogSync utility for synchronizing tables across catalogs Dec 18, 2024
@vinishjail97
Contributor Author

@ashvina I have split this change into two PRs for clarity.

  1. [590] Add interfaces for CatalogSyncClient and CatalogSync #603
  2. [590] Add RunCatalogSync utility for synchronizing tables across catalogs #591

I will push the PR for an RFC doc for this tomorrow, but I have replied to most of your comments. Regarding the class CatalogTableIdentifier, let's discuss it on the RFC. If we document a compatibility matrix between CatalogTableIdentifier and the naming conventions of the major catalogs, users can easily understand it.

@vinishjail97
Contributor Author

vinishjail97 commented Dec 19, 2024

@ashvina I have followed a three-level naming convention for CatalogTableIdentifier and split the change into two PRs for clarity. There's also an RFC PR for the design; we can discuss more face to face in our scheduled meeting tomorrow.

#605

@vinishjail97 vinishjail97 changed the base branch from 590-CatalogSync-API to main December 26, 2024 19:37
import org.apache.xtable.utilities.RunCatalogSync.DatasetConfig.TargetTableIdentifier;

/**
* Provides standalone process for reading tables from a source catalog and synchronizing their
Contributor

Since this will also do table format syncing, do we want to just combine this functionality with the existing tool?

Contributor Author

I have kept RunCatalogSync separate because there will be future functionality for synchronizing permissions as well, and RunSync would get overloaded with options.

try {
conversionController.syncTableAcrossCatalogs(
conversionConfig,
getConversionSourceProviders(tableFormats, tableFormatConverters, hadoopConf));
Contributor

I found this a bit confusing when reading through the code since, at first glance, it is using targets to create sources. Would the SourceProvider be clearer if it were a TableStateProvider or something that does not include the source/target language we use elsewhere for the table format sync?

Contributor Author

I will think of a better name. For now, I am keeping this refactoring in a separate PR.
#613

@the-other-tim-brown
Contributor

Looks good to me overall but let's wait to merge until the RFC is finalized.

3 participants