Implement a utility tool for syncing parquet files to all three formats delta, hudi, iceberg using file listing #553

Open
1 of 2 tasks
vinishjail97 opened this issue Oct 4, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@vinishjail97
Contributor

Feature Request / Improvement

Sub-task of main feature request #550

See the "Using List files" approach in the comment linked below. It is only a high-level approach; the PR author is free to change and improve it.
#550 (comment)
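
The linked comment is not reproduced in this thread, so the following is only a rough illustration of what a file-listing step could look like, not the approach from #550 or the project's actual code. It assumes Hive-style partition directories (`key=value`) and uses hypothetical names (`ParquetFile`, `list_parquet_files`). It collects only file-level metadata; column stats would require reading the Parquet footers, which is not shown.

```python
from pathlib import Path
from typing import Dict, List, NamedTuple


class ParquetFile(NamedTuple):
    """Hypothetical record for one listed data file."""
    path: str
    size_bytes: int
    modified_ms: int
    partition_values: Dict[str, str]


def list_parquet_files(root: str) -> List[ParquetFile]:
    """Recursively list *.parquet files under root, oldest first."""
    root_path = Path(root)
    files = []
    for p in root_path.rglob("*.parquet"):
        if not p.is_file():
            continue
        stat = p.stat()
        # Assumption: partition directories are Hive-style, e.g. year=2024/month=10.
        partitions = dict(
            part.split("=", 1)
            for part in p.relative_to(root_path).parent.parts
            if "=" in part
        )
        files.append(
            ParquetFile(
                path=str(p),
                size_bytes=stat.st_size,
                modified_ms=int(stat.st_mtime * 1000),
                partition_values=partitions,
            )
        )
    # Sort by modification time so files can be turned into commits in order.
    files.sort(key=lambda f: (f.modified_ms, f.path))
    return files
```

A listing like this could then be passed to whatever writes the Delta, Hudi, or Iceberg metadata; that part depends on the target format and is left out here.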

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@vinishjail97 vinishjail97 added the enhancement New feature or request label Oct 4, 2024
@sudharshanraja-db

Hi @vinishjail97, I am working on this task (last week was on and off) and should be able to submit a PR by next week. Just wanted to share the update here.

@vinishjail97
Contributor Author

vinishjail97 commented Oct 13, 2024

Okay, you can also submit a draft PR if you are blocked on anything.

@sudharshanraja-db

Sure, will do that.

@soumilshah1995

+1

@lalit2001

Is there any update on this?

@sudhar91

sudhar91 commented Dec 7, 2024

Hi @vinishjail97,
I made the initial commit and raised a draft PR to get feedback on the approach while I continue to refactor the code. Requesting your review here.
I was primarily validating against Delta because I am most familiar with it, but I will continue to test the other formats as well.
I intentionally didn't check in the tests because I started seeing some odd jar conflicts; I will push them once that is sorted out. In the meantime I have been building and running on my local system to test.

TODOs on my list:

  1. Improve exception handling
  2. Improve docstrings and update documentation
  3. Simplify certain logic
  4. Make column stats population consistent
  5. Test different Parquet scenarios (multiple partitions, etc.)
  6. Add more test coverage

Your review comments and suggestions would be helpful as I make further changes.


5 participants