Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Support statistic with time travel #4251

Merged
merged 17 commits into from
Oct 8, 2024

Conversation

xuzifu666
Copy link
Member

@xuzifu666 xuzifu666 commented Sep 24, 2024

Purpose

Scenario:user query statistic table by snapshot-id and analyzed table multiple times. Such as user can do the analyzed at regular time and collect the statistic metric to monitor.
Currently paimon query statistic system table with snapshot_id would error out due to PaimonAnalyzeTableColumnCommand##commitStatistics not keep the real snapshot id the same with statistic snapshot_id,the pr is aimed to fix it.
before fix:
1727191085648.png

after fix:
1727191088377.png

Linked issue: close #xxx

Tests

AnalyzeTableTestBase "Paimon analyze: test statistic system table with predicate"

API and Format

Documentation

@xuzifu666 xuzifu666 closed this Sep 24, 2024
@xuzifu666 xuzifu666 reopened this Sep 24, 2024
(Long)
snapshotIdPredicate
.visit(LeafPredicateExtractor.INSTANCE)
.get("snapshot_id")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of "snapshot_id" in this class, could you add:
private static final String SNAPSHOT_ID = "snapshot_id";

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And TABLE_TYPE in this class should be private.

public Optional<Statistics> statistics(Long snapshotId) {
if (!snapshotManager().snapshotExists(snapshotId)) {
throw new SnapshotNotExistException(
String.format("snapshot id: %s is not exisit", snapshotId.toString()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change "snapshotId.toString()" to "snapshotId".

String.format("snapshot id: %s is not exisit", snapshotId.toString()));
}

Long latestSnapshotId = snapshotManager().latestSnapshotId();
Copy link
Contributor

@wwj6591812 wwj6591812 Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

latestSnapshotId may be null, you should add a check.

public Optional<Statistics> statistics(Long snapshotId) {
if (!snapshotManager().snapshotExists(snapshotId)) {
throw new SnapshotNotExistException(
String.format("snapshot id: %s is not exisit", snapshotId.toString()));
Copy link
Contributor

@wwj6591812 wwj6591812 Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change "is not exisit" to "is not existed";

private StatisticSplit(Path location) {
private final @Nullable LeafPredicate snapshotIdPredicate;

private StatisticSplit(Path location, LeafPredicate snapshotIdPredicate) {
Copy link
Contributor

@wwj6591812 wwj6591812 Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In line 159, add @nullable before "LeafPredicate snapshotIdPredicate".

public static final RowType TABLE_TYPE =
private static final String SNAPSHOT_ID = "snapshot_id";

private static final RowType TABLE_TYPE =
new RowType(
Arrays.asList(
new DataField(0, "snapshot_id", new BigIntType(false)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SNAPSHOT_ID


Map<String, LeafPredicate> leafPredicates =
predicate.visit(LeafPredicateExtractor.INSTANCE);
snapshotIdPredicate = leafPredicates.get("snapshot_id");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SNAPSHOT_ID

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also in line 110

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TABLE_TYPE maybe not need set with constant value,keep the style with other table seems better,other position would change to SNAPSHOT_ID.

@wwj6591812
Copy link
Contributor

Good Job!
But when I review your code, I found there is no doc for statistic system table.
Could you add?

@xuzifu666
Copy link
Member Author

Good Job! But when I review your code, I found there is no doc for statistic system table. Could you add?

Addressed, Thanks for @wwj6591812 reivew, I would add a doc for it.

@wwj6591812
Copy link
Contributor

+1

@@ -74,6 +74,9 @@ default String fullName() {
@Experimental
Optional<Statistics> statistics();

@Experimental
Optional<Statistics> statistics(Long snapshotId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be a long instead of Long

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Snapshot latestSnapshot = snapshotManager().latestSnapshot();
if (latestSnapshot != null) {
return store().newStatsFileHandler().readStats(latestSnapshot);
}
return Optional.empty();
}

@Override
public Optional<Statistics> statistics(Long snapshotId) {
if (!snapshotManager().snapshotExists(snapshotId)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just return store().newStatsFileHandler().readStats(latestSnapshot);?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should return store().newStatsFileHandler().readStats(latestSnapshot) better. Thanks~ @JingsongLi

@@ -166,14 +166,43 @@ public Identifier identifier() {

@Override
public Optional<Statistics> statistics() {
// todo: support time travel
Snapshot latestSnapshot = snapshotManager().latestSnapshot();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just here to respect time travel options.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -74,6 +74,9 @@ default String fullName() {
@Experimental
Optional<Statistics> statistics();

@Experimental
Optional<Statistics> statistics(long snapshotId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid introducing this method?

Maybe we can Table.copy('scan.snapshot-id', 'xxx').statistics to get stats for snapshot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice reminder,from this way can implement time travel for statistic. Had change it.

return statistics;
}
}
latestSnapshotId--;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to find the snapshot with ANALYZE commit. The snapshot will inherit its parent snapshot.

So I said, just return the stats of this snapshot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1727404052461.png

Copy link
Member Author

@xuzifu666 xuzifu666 Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According the logic maybe we need traversal. Because StatsFileHandler##readStats from a snapshotid,can only get the latest analyzed snapshot,but from the logic statistic file snapshot_id may less than the anaylzed snapshot about it. I added a case to the ut at end which may refer it.

@xuzifu666 xuzifu666 changed the title [core] Support statistic system table query with snapshot_id [core] Support statistic with time travel Sep 27, 2024
@xuzifu666
Copy link
Member Author

xuzifu666 commented Sep 27, 2024

traversal logic do a optimize, before: need from latest snapshot to the target analyzed snapshot, after: change from the nearest snapshot to the target analyzed snapshot. @JingsongLi
1727419450979.png

@xuzifu666
Copy link
Member Author

After discuss, change to query by real snapshot id. @JingsongLi

Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@JingsongLi JingsongLi merged commit 7a08c89 into apache:master Oct 8, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants