Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] support decouple the delta files lifecycle #3178

Closed
wants to merge 2 commits into from

Conversation

Aitozi
Copy link
Contributor

@Aitozi Aitozi commented Apr 9, 2024

Purpose

This PR is meant to support decouple the delta files lifecycle #2899

The basic idea behind this is that:

  • Add fileSource in DatafileMeta to indicate whether this file is generated as an APPEND or COMPACT file
  • When expire a snapshot, if changelog is decoupled, and it's a none changelog producer, we do not clean the APPEND files in data file
  • When expire a changelog, we do the final clean for these files
  • Due to the manifest file could still be used by delta files, so the base and delta manifest file for the none producer are also postpone to delete

About why we need FileSource in DataFileMeta

For none changelog producer, only APPEND commits are required for stream read. In a COMPACT commit, some files from the compact or append could be marked as delete. We should delete the files from the compact commit and keep the files from the append commit for further stream read. So we need a flag to distinguish the file source (compact or append).

Linked issue: close #xxx

Tests

API and Format

Introduce FileSource in DataFileMeta

Documentation

@Aitozi Aitozi changed the title [poc][core] support decouple the delta files lifecycle [core] support decouple the delta files lifecycle Apr 10, 2024
@Aitozi Aitozi requested a review from JingsongLi April 12, 2024 10:08
@JingsongLi
Copy link
Contributor

Perhaps it's best for us to have an abstract mechanism to ensure that we don't keep both the changelog and delta files at the same time, so that we can better understand this set of things.

@Aitozi Aitozi force-pushed the delta-lists branch 3 times, most recently from 078b05a to fa73cb7 Compare April 17, 2024 12:28
@Aitozi
Copy link
Contributor Author

Aitozi commented Apr 17, 2024

Perhaps it's best for us to have an abstract mechanism to ensure that we don't keep both the changelog and delta files at the same time, so that we can better understand this set of things.

I extract this logic to SnapshotDeletion and ChangelogDeletion to handle the manifestList and data file clean.

In this PR, I also refactor the ExpireChangelogImpl and ExpireSnapshotImpl to immutable, So that, we could know whether the changelog decouple is enabled in one entry point in ExpireSnapshots.Expire

@Aitozi Aitozi force-pushed the delta-lists branch 2 times, most recently from 9c06546 to 2ae1028 Compare April 18, 2024 02:02
@@ -71,7 +71,7 @@ class ExpireSnapshotsProcedureTest extends PaimonSparkTestBase with StreamTest {

// expire
checkAnswer(
spark.sql("CALL paimon.sys.expire_snapshots(table => 'test.T', retain_max => 2)"),
spark.sql("CALL paimon.sys.expire_snapshots(table => 'test.T', retain_max => 2, retain_min => 1)"),
Copy link
Contributor Author

@Aitozi Aitozi Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before, if user not specify the retain_min, the default value is 1 in ExpireSnapshotsImpl, now the default value is fallback to the CoreOptions.SNAPSHOT_RETAIN_MIN = 10, so if max is 2, we should manually specify the retain_min => 1. I think the current behavior is more consistent, I'm not sure whether this will break the compatibility. Please also help check this cc @JingsongLi

@JingsongLi
Copy link
Contributor

Already merged to three PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants