add support for converting traditional hive tables to iceberg/delta/hudi #550

Open
1 of 2 tasks
djouallah opened this issue Sep 30, 2024 · 13 comments
Labels
enhancement New feature or request

Comments

@djouallah

djouallah commented Sep 30, 2024

Feature Request / Improvement

There are a lot of systems that produce only parquet files. It would be useful if xtable could convert those parquet files to modern table formats without rewriting data, just by continuously adding metadata.

Delta already does that, but it is a one-off operation and can't accept new files.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@JDLongPLMR

This seems like a pretty easy lift. There are a number of use cases where simply adding parquet files to the table would be handy.

@vinishjail97 added the enhancement (New feature or request) label Oct 1, 2024
@vinishjail97
Contributor

vinishjail97 commented Oct 1, 2024

Yes, this can be done. We need to implement a parquet source class that does two things: retrieve the snapshot, and retrieve the change log since lastSyncTime.

Using List files

  1. List all parquet files in ObjectStorage or HDFS root path to retrieve the snapshot. This would be a simple list call.
  2. Fetch the parquet files that have been added since the last syncInstant to retrieve the change log. We can do this via the same list call; filtering files by their creationTime is the simplest way, but it's expensive.
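
A minimal sketch of the listing approach, under stated assumptions: a local directory stands in for the ObjectStorage or HDFS root path, the file modification time stands in for creationTime, and the names `DataFile`, `list_snapshot` and `list_changes` are hypothetical, not part of XTable.

```python
from pathlib import Path
from typing import List, NamedTuple


class DataFile(NamedTuple):
    path: str
    size_bytes: int
    modified_at: float  # seconds since epoch; stand-in for creationTime


def list_snapshot(root: str) -> List[DataFile]:
    """Return every .parquet file under root, i.e. the full snapshot."""
    files = []
    for p in Path(root).rglob("*.parquet"):
        if p.is_file():
            st = p.stat()
            files.append(DataFile(str(p), st.st_size, st.st_mtime))
    return sorted(files, key=lambda f: f.path)


def list_changes(root: str, last_sync_time: float) -> List[DataFile]:
    """Return the files added after last_sync_time, i.e. the change log.

    This still lists everything and filters afterwards, which is why
    the approach gets expensive as the number of files grows.
    """
    return [f for f in list_snapshot(root) if f.modified_at > last_sync_time]
```

A sync run would call `list_changes` with the timestamp of the previous sync and then record the current time as the new lastSyncTime.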

Using cloud notifications queue

  1. The efficient way of doing this for object stores would be to set up a notifications queue for the bucket, then consume it and insert the file location, creationTime, etc. into a key-value store or a hudi/delta/iceberg table (let's call this the events table). To handle duplicate notifications in the queue, the events table would have the versioned parquet file location as the primary key. If you are using hudi for the events table, we don't need to write new code here; this step would be setting up the cloud queue via terraform and starting a job which consumes from the queue. Ref: S3EventsSource, GcsEventsSource, SQS, PubSub
  2. The XTable parquet source class would trigger an incremental query on the events table to get the new files that have been added since the lastSyncTime and generate hudi, iceberg and delta metadata for them.
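
A rough sketch of the events-table idea, assuming an in-memory dict in place of a real key-value store or hudi/delta/iceberg table. `FileEvent`, `EventsTable`, the `version_id` field and the example bucket name are all hypothetical. It only illustrates deduplication on the versioned file location and the "added since lastSyncTime" query, not a real queue consumer.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class FileEvent:
    location: str         # e.g. "s3://example-bucket/table/a.parquet" (made up)
    version_id: str       # object version; hypothetical field
    creation_time: float  # seconds since epoch


class EventsTable:
    """In-memory stand-in for the events table."""

    def __init__(self) -> None:
        self._rows: Dict[str, FileEvent] = {}

    @staticmethod
    def key(event: FileEvent) -> str:
        # Primary key: the versioned file location.
        return f"{event.location}@{event.version_id}"

    def consume(self, events: Iterable[FileEvent]) -> None:
        # A duplicate notification maps to an existing key and is ignored.
        for e in events:
            self._rows.setdefault(self.key(e), e)

    def added_since(self, last_sync_time: float) -> List[FileEvent]:
        # Simplified incremental query: files created after last_sync_time.
        rows = [e for e in self._rows.values() if e.creation_time > last_sync_time]
        return sorted(rows, key=lambda e: e.creation_time)
```

Unlike the listing approach, the sync only touches rows newer than lastSyncTime instead of re-listing the whole bucket.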

The design is similar to what hudi does for ingesting a large number of files; steps 7 and 8 in the architecture would become the XTable sync.
https://hudi.apache.org/blog/2021/08/23/s3-events-source/

If you are using HDFS or an object store that doesn't support a queue-based system for file notifications, we need to build or reuse an existing queue implementation for file notifications.

@vinishjail97
Contributor

@djouallah @JDLongPLMR Let me know what you think of the two approaches. We can write this as a utility tool in xtable-utilities, similar to RunSync.

https://github.com/apache/incubator-xtable/blob/main/xtable-utilities/src/main/java/org/apache/xtable/utilities/RunSync.java

@alberttwong
Contributor

I believe you can convert parquet to hudi files via hudi bootstrapping (https://hudi.apache.org/docs/migration_guide). Once it's in hudi, you can use Apache XTable to convert it to other formats. Onehouse can do this automatically.

@djouallah
Author

> @djouallah @JDLongPLMR Let me know what you think of the two approaches, we can write this as utility tool in xtable-utilities similar to RunSync
>
> https://github.com/apache/incubator-xtable/blob/main/xtable-utilities/src/main/java/org/apache/xtable/utilities/RunSync.java

Listing files seems good for my use case.

@vinishjail97
Contributor

@djouallah Yes, listing will be sufficient for a small number of files. Do you plan to submit a PR for this? Let me know if you need any help with the PR.

@djouallah
Author

@vinishjail97 nah, I am just an end user of xtable :)

@vinishjail97
Contributor

Okay, I will start a thread on the dev mailing list to see if someone is interested in working on this feature.

@JDLongPLMR

thanks all. Seems like a helpful addition

@sudharshanraja-db

Hi @vinishjail97, if this is not assigned to anyone yet, I would like to explore and take up this feature.

@vinishjail97
Contributor

Yes @sudharshanraja-db, you can pick up the first sub-task, the file listing utility, if you are interested; let me know what you think. The second sub-task is more open-ended, and we can discuss the design on the dev mailing list before finalizing the approach.

@sudharshanraja-db

Thanks @vinishjail97. As you suggested, I will pick up this feature and start on it so that I can understand the project and its structure better, then look into and discuss the design of the second task.

@vinishjail97
Contributor

For the first task, you can look at the RunSync class in xtable-utilities and explore other modules.
https://github.com/apache/incubator-xtable/blob/main/xtable-utilities/src/main/java/org/apache/xtable/utilities/RunSync.java
