[RFC-15][HUDI-1325] Merge updates of unsynced instants to metadata table #6
base: rfc-15
@rmpifer I was looking for this in apache/hudi :) and totally missed that it's here. Can we retarget this to apache/hudi/rfc-15?
@@ -309,6 +306,13 @@ public HoodieBackedTableMetadata(Configuration conf, String datasetBasePath, Str
    }
  }

  // Retrieve record from unsynced timeline instants
Looks like a good amount of code was reusable. I am wondering if we can do this in a separate TimelineMergedTableMetadata implementation that wraps HoodieBackedTableMetadata?
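A minimal sketch of what such a wrapper could look like, under loud assumptions: the interface, the class names ending in `Sketch`, and the map-based inputs are all hypothetical and are not the Hudi API. It only illustrates the delegation shape, where the wrapper asks the backed reader first and then overlays file adds and deletes taken from unsynced instants.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical read interface; the real HoodieTableMetadata API differs.
interface TableMetadataSketch {
    Set<String> getAllFilesInPartition(String partition);
}

// Stand-in for the reader that only knows what was synced to the metadata table.
final class BackedTableMetadataSketch implements TableMetadataSketch {
    private final Map<String, Set<String>> syncedFiles;

    BackedTableMetadataSketch(Map<String, Set<String>> syncedFiles) {
        this.syncedFiles = syncedFiles;
    }

    @Override
    public Set<String> getAllFilesInPartition(String partition) {
        return new TreeSet<>(syncedFiles.getOrDefault(partition, Collections.<String>emptySet()));
    }
}

// Wrapper that overlays unsynced adds and deletes on top of the delegate's answer.
final class TimelineMergedTableMetadataSketch implements TableMetadataSketch {
    private final TableMetadataSketch delegate;
    private final Map<String, Set<String>> unsyncedAdds;
    private final Map<String, Set<String>> unsyncedDeletes;

    TimelineMergedTableMetadataSketch(TableMetadataSketch delegate,
                                      Map<String, Set<String>> unsyncedAdds,
                                      Map<String, Set<String>> unsyncedDeletes) {
        this.delegate = delegate;
        this.unsyncedAdds = unsyncedAdds;
        this.unsyncedDeletes = unsyncedDeletes;
    }

    @Override
    public Set<String> getAllFilesInPartition(String partition) {
        // Start from the synced view, then apply what the metadata table has not seen yet.
        Set<String> files = new TreeSet<>(delegate.getAllFilesInPartition(partition));
        files.addAll(unsyncedAdds.getOrDefault(partition, Collections.<String>emptySet()));
        files.removeAll(unsyncedDeletes.getOrDefault(partition, Collections.<String>emptySet()));
        return files;
    }
}
```

Keeping the merge in a wrapper would leave the backed reader unaware of the data timeline, which seems to be the point of the suggestion.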
Codecov Report
@@ Coverage Diff @@
## rfc-15 #6 +/- ##
============================================
- Coverage 43.79% 43.73% -0.07%
Complexity 3379 3379
============================================
Files 573 575 +2
Lines 24400 24438 +38
Branches 2445 2449 +4
============================================
Hits 10687 10687
- Misses 12692 12730 +38
Partials 1021 1021
7f84b12 to 862d501
- Changes the syncing model to only move over completed instants on the data timeline
- Syncing happens postCommit and on writeClient initialization
- Latest delta commit on the metadata table is sufficient as the watermark for data timeline archival
- Cleaning/Compaction use a suffix to the last instant written to the metadata table, such that we keep the 1-1 mapping between data and metadata timelines
- Got rid of a lot of the complexity around checking for valid commits during open of base/log files
- Tests now use local FS, to simulate more failure scenarios
- Some failure scenarios exposed HUDI-1434, which is needed for MOR to work correctly
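One way to read the syncing model in this commit message, as a hedged sketch rather than the actual implementation: only completed data-timeline instants later than the latest metadata delta commit (the watermark) are moved over, and a data instant is a candidate for archival only once it is at or before that watermark. The class and method names are hypothetical, instant times are assumed to be fixed-width timestamp strings that order lexicographically, and edge cases such as an earlier instant still being inflight are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified view of a data timeline instant.
final class InstantSketch {
    final String time;
    final boolean completed;

    InstantSketch(String time, boolean completed) {
        this.time = time;
        this.completed = completed;
    }
}

final class MetadataSyncSketch {
    // Returns the completed instants that are strictly later than the watermark,
    // in the order they appear on the data timeline.
    static List<String> instantsToSync(List<InstantSketch> dataTimeline, String watermark) {
        List<String> result = new ArrayList<>();
        for (InstantSketch instant : dataTimeline) {
            if (instant.completed && instant.time.compareTo(watermark) > 0) {
                result.add(instant.time);
            }
        }
        return result;
    }

    // Assumption: an instant is safe to archive only if it is not later than the watermark.
    static boolean safeToArchive(String instantTime, String watermark) {
        return instantTime.compareTo(watermark) <= 0;
    }
}
```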
56717d0 to de31e8c
…d listing. (apache#2343)
* [HUDI-1469] Faster initialization of metadata table using parallelized listing which finds partitions and files in a single scan.
* MINOR fixes
Co-authored-by: Vinoth Chandar <[email protected]>
What is the purpose of the pull request
The dataset timeline and the metadata table timeline can become out of sync. Reading from the metadata table while the timelines are out of sync returns incorrect values for getAllFilesInPartition and getAllPartitionPaths. This change overcomes that scenario by reading the unsynced timeline instants and merging them with the existing metadata table records to get the most up-to-date state of the file system.
JIRA: https://issues.apache.org/jira/browse/HUDI-1325
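The merge described above can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: `UnsyncedInstantSketch` and `MergedListingSketch` are hypothetical names, each instant is reduced to the files it added and deleted in one partition, and instant times are assumed to sort lexicographically. It is not the code from this pull request.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of one unsynced instant: files added and deleted in a single partition.
final class UnsyncedInstantSketch {
    final String instantTime;
    final String partition;
    final List<String> addedFiles;
    final List<String> deletedFiles;

    UnsyncedInstantSketch(String instantTime, String partition,
                          List<String> addedFiles, List<String> deletedFiles) {
        this.instantTime = instantTime;
        this.partition = partition;
        this.addedFiles = addedFiles;
        this.deletedFiles = deletedFiles;
    }
}

final class MergedListingSketch {
    // Applies unsynced instants, oldest first, on top of the listing read from the metadata table.
    static Set<String> mergedFilesInPartition(String partition,
                                              Set<String> syncedFiles,
                                              List<UnsyncedInstantSketch> unsynced) {
        Set<String> result = new TreeSet<>(syncedFiles);
        List<UnsyncedInstantSketch> ordered = new ArrayList<>(unsynced);
        ordered.sort(Comparator.comparing((UnsyncedInstantSketch i) -> i.instantTime));
        for (UnsyncedInstantSketch instant : ordered) {
            if (!instant.partition.equals(partition)) {
                continue; // instants touching other partitions do not affect this listing
            }
            result.addAll(instant.addedFiles);
            result.removeAll(instant.deletedFiles);
        }
        return result;
    }
}
```

Applying instants in timeline order matters: a file added by one unsynced instant and removed by a later one (for example by a clean) must not appear in the merged listing.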
Brief change log
- FSBackedMetadataWriter. Refactored this logic to a utility class, HoodieTableMetadataTimelineUtil.
- HoodieMetadataMergedInstantRecordScanner, which handles conversion of timeline instants to metadata records and merges the results.
- FSBackedTableMetadata.getMergedRecordByKey, which uses the new scanner to fetch the HoodieRecord associated with the desired key from the unsynced timeline instants and merges it with the record from the metadata table.
- This doesn't make sense, since all instants are processed in serial order, so there would never be a case where a rollback was being written before an instant earlier on the timeline was already synced. Removed this logic because it created a circular dependency when implementing timeline merging.
- FSBackedTableMetadata. By default, when the metadata writer FSBackedTableMetadataWriter is initialized, it syncs all instants to the metadata table. By using the reader we can simulate the metadata table being out of sync.
- initMetaClient in the test base class, to allow the table type to be passed in, since the table type is always set to COPY_ON_WRITE if this method is used to initialize the meta client.
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.