
[RFC-15][HUDI-1325] Merge updates of unsynced instants to metadata table #6

Open · rmpifer wants to merge 5 commits into base: rfc-15
Conversation


@rmpifer commented Dec 2, 2020


What is the purpose of the pull request

It is possible for the dataset timeline and the metadata table timeline to become out of sync. Reading from the metadata table while the timelines are out of sync returns incorrect results from getAllFilesInPartition and getAllPartitionPaths.

This change handles that scenario by reading the unsynced timeline instants and merging them with the existing metadata table records to obtain the most up-to-date state of the file system.

JIRA: https://issues.apache.org/jira/browse/HUDI-1325
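To make the merge idea concrete, here is a minimal, hypothetical sketch (not the actual Hudi implementation): file listings recovered from unsynced instants are overlaid on the stale state read from the metadata table. The map-based representation and the negative-size-means-deleted convention are illustrative assumptions only.

```java
import java.util.HashMap;
import java.util.Map;

public class UnsyncedListingMergeSketch {

  // Stand-in for a partition's file listing: file name -> file size in bytes.
  // A negative size marks a file deleted by an unsynced clean/rollback (illustrative convention).
  static Map<String, Long> merge(Map<String, Long> fromMetadataTable,
                                 Map<String, Long> fromUnsyncedInstants) {
    Map<String, Long> merged = new HashMap<>(fromMetadataTable);
    fromUnsyncedInstants.forEach((file, size) -> {
      if (size < 0) {
        merged.remove(file);      // file removed after the last metadata sync
      } else {
        merged.put(file, size);   // file added or rewritten after the last metadata sync
      }
    });
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Long> table = new HashMap<>();
    table.put("f1.parquet", 100L);                 // known to the metadata table
    Map<String, Long> unsynced = new HashMap<>();
    unsynced.put("f2.parquet", 200L);              // written by an unsynced commit
    unsynced.put("f1.parquet", -1L);               // removed by an unsynced rollback
    System.out.println(merge(table, unsynced));    // prints {f2.parquet=200}
  }
}
```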

Brief change log

  • The logic for converting timeline metadata to metadata table records was directly tied to the commit phase in FSBackedMetadataWriter. Refactored this logic into a utility class, HoodieTableMetadataTimelineUtil.
  • Created a scanner, HoodieMetadataMergedInstantRecordScanner, which converts timeline instants to metadata records and merges the results.
  • Added a third step to FSBackedTableMetadata.getMergedRecordByKey which uses the new scanner to fetch the HoodieRecord associated with the desired key from the unsynced timeline instants and merge it with the record from the metadata table (see the sketch after this list).
  • When converting a rollback operation to metadata table records, there was logic that re-read from the metadata table to ensure that any files being deleted as part of the rollback actually existed:
// Rollbacks deletes instants from timeline. The instant being rolled-back may not have been synced to the
// metadata table. Hence, the deleted filed need to be checked against the metadata.

This check is unnecessary: instants are processed in serial order, so a rollback is never synced to the metadata table before the earlier instants it affects have already been synced. Removed this logic because it created a circular dependency when implementing timeline merging.

  • Changed the metadata-validation step in tests to use the metadata reader FSBackedTableMetadata. By default, when the metadata writer FSBackedTableMetadataWriter is initialized it syncs all instants to the metadata table; using the reader lets us simulate the metadata table being out of sync.
  • Modified initMetaClient in the test base class to allow the table type to be passed in, since the table type was always set to COPY_ON_WRITE when using this method to initialize the meta client.
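For illustration only, the three-step lookup described above might look roughly like the following sketch. The RecordSource interface and the string payloads are stand-ins for FSBackedTableMetadata, HoodieMetadataMergedInstantRecordScanner, and Hudi's payload-merge logic; none of this is the actual implementation.

```java
import java.util.Optional;

public class MergedRecordLookupSketch {

  // Toy stand-in for anything that can look up a metadata record payload by key.
  interface RecordSource {
    Optional<String> getRecordByKey(String key);
  }

  static Optional<String> getMergedRecordByKey(String key,
                                               RecordSource metadataTable,
                                               RecordSource unsyncedInstantScanner) {
    Optional<String> stored = metadataTable.getRecordByKey(key);           // step 1: metadata table
    Optional<String> pending = unsyncedInstantScanner.getRecordByKey(key); // step 2: unsynced instants
    if (!pending.isPresent()) {
      return stored;                                                       // nothing newer to merge in
    }
    // step 3: combine the two; real code would apply the record payload's merge semantics
    return Optional.of(stored.map(s -> s + " + " + pending.get()).orElseGet(pending::get));
  }

  public static void main(String[] args) {
    RecordSource table = key -> Optional.of("files-from-table");
    RecordSource scanner = key -> Optional.of("files-from-unsynced-instants");
    // prints Optional[files-from-table + files-from-unsynced-instants]
    System.out.println(getMergedRecordByKey("partition-p1", table, scanner));
  }
}
```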

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@vinothchandar (Owner)

@rmpifer I was looking for this in apache/hudi :) and totally missed that it's here.

Can we retarget this to apache/hudi/rfc-15?

@@ -309,6 +306,13 @@ public HoodieBackedTableMetadata(Configuration conf, String datasetBasePath, Str
}
}

// Retrieve record from unsynced timeline instants
@vinothchandar (Owner) commented on the diff:

Looks like a good amount of code was reusable. I am wondering if we can do this in a separate TimelineMergedTableMetadata implementation that wraps HoodieBackedTableMetadata?
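A decorator along the lines being suggested could look roughly like the sketch below; TableMetadata and the overlay sets are simplified stand-ins (not Hudi's HoodieTableMetadata API), and partition scoping of the overlays is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TimelineMergedTableMetadataSketch {

  // Simplified stand-in for the metadata reader interface.
  interface TableMetadata {
    List<String> getAllFilesInPartition(String partitionPath);
  }

  // Decorator that wraps the backing (possibly stale) metadata reader and overlays
  // the file adds/deletes recovered from unsynced timeline instants.
  static class TimelineMergedTableMetadata implements TableMetadata {
    private final TableMetadata wrapped;              // e.g. the table-backed reader
    private final Set<String> addedByUnsyncedInstants;
    private final Set<String> deletedByUnsyncedInstants;

    TimelineMergedTableMetadata(TableMetadata wrapped, Set<String> added, Set<String> deleted) {
      this.wrapped = wrapped;
      this.addedByUnsyncedInstants = added;
      this.deletedByUnsyncedInstants = deleted;
    }

    @Override
    public List<String> getAllFilesInPartition(String partitionPath) {
      List<String> files = new ArrayList<>(wrapped.getAllFilesInPartition(partitionPath));
      files.removeAll(deletedByUnsyncedInstants);     // drop files removed since the last sync
      files.addAll(addedByUnsyncedInstants);          // overlay files written since the last sync
      return files;
    }
  }

  public static void main(String[] args) {
    TableMetadata stale = partition -> new ArrayList<>(Arrays.asList("f1.parquet", "f2.parquet"));
    TableMetadata merged = new TimelineMergedTableMetadata(
        stale,
        new HashSet<>(Arrays.asList("f3.parquet")),   // added by unsynced instants
        new HashSet<>(Arrays.asList("f1.parquet")));  // deleted by unsynced instants
    System.out.println(merged.getAllFilesInPartition("p1")); // [f2.parquet, f3.parquet]
  }
}
```

The appeal of a wrapper like this is that the timeline-merging concern stays out of the base table-backed reader.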

@codecov-io

Codecov Report

Merging #6 (75a3352) into rfc-15 (7f84b12) will decrease coverage by 0.06%.
The diff coverage is 0.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             rfc-15       #6      +/-   ##
============================================
- Coverage     43.79%   43.73%   -0.07%     
  Complexity     3379     3379              
============================================
  Files           573      575       +2     
  Lines         24400    24438      +38     
  Branches       2445     2449       +4     
============================================
  Hits          10687    10687              
- Misses        12692    12730      +38     
  Partials       1021     1021              
| Flag | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| hudicli | 27.48% <0.00%> (+0.35%) | 0.00 <0.00> (ø) |
| hudiclient | 24.74% <0.00%> (+0.41%) | 0.00 <0.00> (ø) |
| hudicommon | 51.64% <0.00%> (-0.95%) | 0.00 <0.00> (ø) |
| hudihadoopmr | 33.05% <ø> (ø) | 0.00 <ø> (ø) |
| hudispark | 67.19% <ø> (ø) | 0.00 <ø> (ø) |
| huditimelineservice | 64.43% <ø> (ø) | 0.00 <ø> (ø) |
| hudiutilities | 69.38% <ø> (ø) | 0.00 <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| ...hudi/metadata/HoodieBackedTableMetadataWriter.java | 4.01% <0.00%> (+1.34%) | 4.00 <0.00> (ø) |
| ...pache/hudi/metadata/HoodieBackedTableMetadata.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (ø) |
| ...data/HoodieMetadataMergedInstantRecordScanner.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (?) |
| ...hudi/metadata/HoodieTableMetadataTimelineUtil.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (?) |

- Changes the syncing model to only move over completed instants on data timeline
 - Syncing happens postCommit and on writeClient initialization
 - Latest delta commit on the metadata table is sufficient as the watermark for data timeline archival
 - Cleaning/Compaction use a suffix to the last instant written to metadata table, such that we keep the 1-1 mapping between data and metadata timelines.
 - Got rid of a lot of the complexity around checking for valid commits during open of base/log files
 - Tests now use local FS, to simulate more failure scenarios
 - Some failure scenarios exposed HUDI-1434, which is needed for MOR to work correctly
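The commit notes above describe syncing only completed data-timeline instants past the metadata table's latest delta commit; purely as an illustration under toy assumptions (not Hudi's HoodieTimeline API), that selection could look like:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CompletedInstantSyncSketch {

  enum State { REQUESTED, INFLIGHT, COMPLETED }

  // Toy instant: a timestamp plus a state.
  static class Instant {
    final String timestamp;
    final State state;
    Instant(String timestamp, State state) { this.timestamp = timestamp; this.state = state; }
    @Override public String toString() { return timestamp + "(" + state + ")"; }
  }

  // Only completed data-timeline instants newer than the last synced timestamp are moved over.
  static List<Instant> instantsToSync(List<Instant> dataTimeline, String lastSyncedTimestamp) {
    return dataTimeline.stream()
        .filter(i -> i.state == State.COMPLETED)
        .filter(i -> i.timestamp.compareTo(lastSyncedTimestamp) > 0)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Instant> timeline = Arrays.asList(
        new Instant("20201218000001", State.COMPLETED),
        new Instant("20201218000002", State.COMPLETED),
        new Instant("20201218000003", State.INFLIGHT));
    // prints [20201218000002(COMPLETED)]
    System.out.println(instantsToSync(timeline, "20201218000001"));
  }
}
```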
@rmpifer force-pushed the rfc-15 branch 2 times, most recently from 56717d0 to de31e8c on December 18, 2020 00:20
umehrot2 and others added 3 commits December 20, 2020 11:29
…d listing. (apache#2343)

* [HUDI-1469] Faster initialization of metadata table using parallelized listing which finds partitions and files in a single scan.
* MINOR fixes

Co-authored-by: Vinoth Chandar <[email protected]>