[core] Support statistic with time travel #4251

Merged
merged 17 commits into from
Oct 8, 2024
16 changes: 16 additions & 0 deletions docs/content/maintenance/system-tables.md
Original file line number Diff line number Diff line change
@@ -369,3 +369,19 @@ SELECT * FROM sys.catalog_options;
*/
```

### Statistic Table
You can query a table's statistics information through the statistics table.

```sql
SELECT * FROM T$statistics;

/*
+-------------+-----------+-------------------+------------------+---------+
| snapshot_id | schema_id | mergedRecordCount | mergedRecordSize | colstat |
+-------------+-----------+-------------------+------------------+---------+
|           2 |         0 |                 2 |                2 |      {} |
+-------------+-----------+-------------------+------------------+---------+
1 rows in set
*/
```

@@ -166,14 +166,43 @@ public Identifier identifier() {

@Override
public Optional<Statistics> statistics() {
// todo: support time travel
Snapshot latestSnapshot = snapshotManager().latestSnapshot();
Contributor

Maybe just respect the time travel options here.

Member Author

Done

if (latestSnapshot != null) {
return store().newStatsFileHandler().readStats(latestSnapshot);
}
return Optional.empty();
}

@Override
public Optional<Statistics> statistics(Long snapshotId) {
if (!snapshotManager().snapshotExists(snapshotId)) {
Contributor

Just return store().newStatsFileHandler().readStats(latestSnapshot);?

Member Author

Yes, returning store().newStatsFileHandler().readStats(latestSnapshot) is better. Thanks, @JingsongLi.

throw new SnapshotNotExistException(
String.format("snapshot id: %s does not exist", snapshotId));
}

Long latestSnapshotId = snapshotManager().latestSnapshotId();
Contributor @wwj6591812, Sep 25, 2024

latestSnapshotId may be null; you should add a check.

if (latestSnapshotId == null) {
return Optional.empty();
}

while (latestSnapshotId > 0) {
Snapshot latestSnapshot = snapshotManager().snapshot(latestSnapshotId);
// stop early to avoid unnecessary iterations
if (latestSnapshot.id() < snapshotId) {
break;
}
if (latestSnapshot.commitKind() == Snapshot.CommitKind.ANALYZE) {
Optional<Statistics> statistics =
store().newStatsFileHandler().readStats(latestSnapshot);
if (statistics.isPresent() && statistics.get().snapshotId() == snapshotId) {
return statistics;
}
}
latestSnapshotId--;
Contributor

We don't need to search for a snapshot with an ANALYZE commit. A snapshot inherits the statistics of its parent snapshot.

As I said, just return the stats of this snapshot.

Member Author

[Image attachment: 1727404052461.png]

Member Author @xuzifu666, Sep 27, 2024

According to the logic, we may still need the traversal. Given a snapshot id, StatsFileHandler#readStats can only return the statistics of the latest analyzed snapshot, but the snapshot_id recorded in the statistics file may be less than the id of the ANALYZE snapshot that wrote it. I added a case at the end of the unit test that illustrates this.

}
return Optional.empty();
}
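
The backward traversal discussed in this thread can be modeled outside Paimon. The sketch below is only an illustration: `Snap`, `Stats`, `Kind` and `sampleTimeline` are hypothetical stand-ins, not Paimon classes, and the timeline assumes that every INSERT and every ANALYZE in the Spark test of this PR produces exactly one snapshot and that later snapshots inherit their parent's statistics.

```java
import java.util.Optional;
import java.util.TreeMap;

/** Toy model of the lookup; Snap and Stats are hypothetical stand-ins, not Paimon classes. */
final class StatsTimeTravelSketch {

    enum Kind { APPEND, ANALYZE }

    /** Statistics remember the id of the snapshot they were computed on. */
    static final class Stats {
        final long snapshotId;
        final long mergedRecordCount;

        Stats(long snapshotId, long mergedRecordCount) {
            this.snapshotId = snapshotId;
            this.mergedRecordCount = mergedRecordCount;
        }
    }

    /** A snapshot with its commit kind and the statistics it points to (null if none yet). */
    static final class Snap {
        final long id;
        final Kind kind;
        final Stats stats;

        Snap(long id, Kind kind, Stats stats) {
            this.id = id;
            this.kind = kind;
            this.stats = stats;
        }
    }

    /** Walks back from the latest snapshot to find statistics recorded for snapshotId. */
    static Optional<Stats> statistics(TreeMap<Long, Snap> snapshots, long snapshotId) {
        if (!snapshots.containsKey(snapshotId)) {
            throw new IllegalArgumentException("snapshot " + snapshotId + " does not exist");
        }
        for (Snap snap : snapshots.descendingMap().values()) {
            if (snap.id < snapshotId) {
                break; // statistics for snapshotId are never attached to an older snapshot
            }
            if (snap.kind == Kind.ANALYZE
                    && snap.stats != null
                    && snap.stats.snapshotId == snapshotId) {
                return Optional.of(snap.stats);
            }
        }
        return Optional.empty();
    }

    /** Assumed timeline of the test: two inserts, ANALYZE, two inserts, ANALYZE. */
    static TreeMap<Long, Snap> sampleTimeline() {
        Stats onSnapshot2 = new Stats(2, 2);
        Stats onSnapshot5 = new Stats(5, 4);
        TreeMap<Long, Snap> snapshots = new TreeMap<>();
        snapshots.put(1L, new Snap(1, Kind.APPEND, null));
        snapshots.put(2L, new Snap(2, Kind.APPEND, null));
        snapshots.put(3L, new Snap(3, Kind.ANALYZE, onSnapshot2));
        snapshots.put(4L, new Snap(4, Kind.APPEND, onSnapshot2)); // inherited from parent
        snapshots.put(5L, new Snap(5, Kind.APPEND, onSnapshot2)); // inherited from parent
        snapshots.put(6L, new Snap(6, Kind.ANALYZE, onSnapshot5));
        return snapshots;
    }
}
```

Under these assumptions, a lookup for snapshot 3 finds nothing, because the ANALYZE commit at snapshot 3 recorded statistics for snapshot 2. Lookups for snapshots 2 and 5 return the statistics written by the ANALYZE commits at snapshots 3 and 6.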

@Override
public Optional<WriteSelector> newWriteSelector() {
switch (bucketMode()) {
@@ -128,6 +128,11 @@ public Optional<Statistics> statistics() {
return wrapped.statistics();
}

@Override
public Optional<Statistics> statistics(Long snapshotId) {
return wrapped.statistics(snapshotId);
}

@Override
public OptionalLong latestSnapshotId() {
return wrapped.latestSnapshotId();
@@ -216,6 +216,11 @@ default Optional<Statistics> statistics() {
return Optional.empty();
}

@Override
default Optional<Statistics> statistics(Long snapshotId) {
return Optional.empty();
}

@Override
default OptionalLong latestSnapshotId() {
throw new UnsupportedOperationException();
@@ -61,6 +61,11 @@ default Optional<Statistics> statistics() {
return Optional.empty();
}

@Override
default Optional<Statistics> statistics(Long snapshotId) {
return Optional.empty();
}

@Override
default BatchWriteBuilder newBatchWriteBuilder() {
throw new UnsupportedOperationException(
3 changes: 3 additions & 0 deletions paimon-core/src/main/java/org/apache/paimon/table/Table.java
@@ -74,6 +74,9 @@ default String fullName() {
@Experimental
Optional<Statistics> statistics();

@Experimental
Optional<Statistics> statistics(Long snapshotId);
Contributor

Should be a long instead of Long

Member Author

Done
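
The reviewer's preference for a primitive parameter can be illustrated with plain Java. Nothing below is Paimon code, and `matchesBoxed` and `matchesPrimitive` are hypothetical names. The sketch shows that a boxed `Long` admits `null`, which only fails at the point where the value is unboxed for a comparison such as `statistics.get().snapshotId() == snapshotId`.

```java
/** Plain-Java illustration of boxed versus primitive id parameters. */
final class BoxedLongPitfalls {

    /** Boxed parameter: callers may pass null, which fails only when the value is unboxed. */
    static boolean matchesBoxed(Long wanted, long actual) {
        // Comparing long with Long unboxes 'wanted'; a null value throws NullPointerException.
        return actual == wanted;
    }

    /** Primitive parameter: passing null is rejected by the compiler. */
    static boolean matchesPrimitive(long wanted, long actual) {
        return actual == wanted;
    }

    public static void main(String[] args) {
        System.out.println(matchesPrimitive(5L, 5L));
        System.out.println(matchesBoxed(5L, 5L));
        try {
            matchesBoxed(null, 5L);
        } catch (NullPointerException e) {
            System.out.println("null snapshot id fails at comparison time");
        }
    }
}
```

With a `long` parameter, the null case cannot occur inside the method at all, so the check moves to the caller and is enforced at compile time.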


// ================= Table Operations ====================

/** Copy this table with adding dynamic options. */
@@ -24,6 +24,8 @@
import org.apache.paimon.disk.IOManager;
import org.apache.paimon.fs.FileIO;
import org.apache.paimon.fs.Path;
import org.apache.paimon.predicate.LeafPredicate;
import org.apache.paimon.predicate.LeafPredicateExtractor;
import org.apache.paimon.predicate.Predicate;
import org.apache.paimon.reader.EmptyRecordReader;
import org.apache.paimon.reader.RecordReader;
@@ -47,6 +49,8 @@

import org.apache.paimon.shade.guava30.com.google.common.collect.Iterators;

import javax.annotation.Nullable;

import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
@@ -65,7 +69,9 @@ public class StatisticTable implements ReadonlyTable {

public static final String STATISTICS = "statistics";

- public static final RowType TABLE_TYPE =
+ private static final String SNAPSHOT_ID = "snapshot_id";
+
+ private static final RowType TABLE_TYPE =
new RowType(
Arrays.asList(
new DataField(0, "snapshot_id", new BigIntType(false)),
Contributor

SNAPSHOT_ID

@@ -101,7 +107,7 @@ public RowType rowType() {

@Override
public List<String> primaryKeys() {
- return Collections.singletonList("snapshot_id");
+ return Collections.singletonList(SNAPSHOT_ID);
}

@Override
@@ -121,15 +127,26 @@ public Table copy(Map<String, String> dynamicOptions) {

private class StatisticScan extends ReadOnceTableScan {

private @Nullable LeafPredicate snapshotIdPredicate;

@Override
public InnerTableScan withFilter(Predicate predicate) {
// TODO
if (predicate == null) {
return this;
}

Map<String, LeafPredicate> leafPredicates =
predicate.visit(LeafPredicateExtractor.INSTANCE);
snapshotIdPredicate = leafPredicates.get("snapshot_id");
Contributor

SNAPSHOT_ID

Contributor

also in line 110

Member Author

TABLE_TYPE may not need to use the constant; keeping its style consistent with the other tables seems better. The other positions will be changed to SNAPSHOT_ID.


return this;
}

@Override
public Plan innerPlan() {
- return () -> Collections.singletonList(new StatisticTable.StatisticSplit(location));
+ return () ->
+ Collections.singletonList(
+ new StatisticTable.StatisticSplit(location, snapshotIdPredicate));
}
}
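
The scan above only records the leaf predicate on snapshot_id. The reader later takes that predicate's first literal as the snapshot to look up. As a rough sketch of that extraction step, the code below uses a hypothetical `LeafFilter` stand-in rather than Paimon's `LeafPredicate`. It also assumes, which the PR does not state, that only an equality filter with a `Long` literal should trigger time travel and that anything else falls back to the latest statistics.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

/** Sketch of picking a snapshot id out of pushed-down filters; all types are hypothetical. */
final class SnapshotIdFilterSketch {

    /** Hypothetical stand-in for a leaf predicate: field name, operator, literal values. */
    static final class LeafFilter {
        final String field;
        final String operator;
        final List<Object> literals;

        LeafFilter(String field, String operator, List<Object> literals) {
            this.field = field;
            this.operator = operator;
            this.literals = literals;
        }
    }

    /** Returns the requested snapshot id, or empty to fall back to the latest statistics. */
    static OptionalLong requestedSnapshotId(Map<String, LeafFilter> leavesByField) {
        LeafFilter filter = leavesByField.get("snapshot_id");
        // Assumption: only "snapshot_id = <literal>" selects a specific snapshot.
        if (filter == null || !"=".equals(filter.operator) || filter.literals.isEmpty()) {
            return OptionalLong.empty();
        }
        Object literal = filter.literals.get(0);
        return literal instanceof Long
                ? OptionalLong.of((Long) literal)
                : OptionalLong.empty();
    }
}
```

Returning `OptionalLong` keeps the "no usable filter" case explicit, so the caller can choose between a lookup by snapshot id and the latest statistics without a null check.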

@@ -139,8 +156,11 @@ private static class StatisticSplit extends SingletonSplit {

private final Path location;

- private StatisticSplit(Path location) {
+ private final @Nullable LeafPredicate snapshotIdPredicate;
+
+ private StatisticSplit(Path location, @Nullable LeafPredicate snapshotIdPredicate) {
this.location = location;
+ this.snapshotIdPredicate = snapshotIdPredicate;
}

@Override
@@ -152,7 +172,8 @@ public boolean equals(Object o) {
return false;
}
StatisticTable.StatisticSplit that = (StatisticTable.StatisticSplit) o;
- return Objects.equals(location, that.location);
+ return Objects.equals(location, that.location)
+ && Objects.equals(snapshotIdPredicate, that.snapshotIdPredicate);
}

@Override
@@ -195,8 +216,22 @@ public RecordReader<InternalRow> createReader(Split split) throws IOException {
if (!(split instanceof StatisticTable.StatisticSplit)) {
throw new IllegalArgumentException("Unsupported split: " + split.getClass());
}
+ StatisticSplit statisticSplit = (StatisticSplit) split;
+ LeafPredicate snapshotIdPredicate = statisticSplit.snapshotIdPredicate;
+ Optional<Statistics> statisticsOptional;
+ if (snapshotIdPredicate != null) {
+ Long snapshotId =
+ (Long)
+ snapshotIdPredicate
+ .visit(LeafPredicateExtractor.INSTANCE)
+ .get(SNAPSHOT_ID)
+ .literals()
+ .get(0);
+ statisticsOptional = dataTable.statistics(snapshotId);
+ } else {
+ statisticsOptional = dataTable.statistics();
+ }

- Optional<Statistics> statisticsOptional = dataTable.statistics();
if (statisticsOptional.isPresent()) {
Statistics statistics = statisticsOptional.get();
Iterator<Statistics> statisticsIterator =
@@ -73,6 +73,40 @@ abstract class AnalyzeTableTestBase extends PaimonSparkTestBase {
Row(2, 0, 2, "{ }"))
}

test("Paimon analyze: test statistic system table with predicate") {
spark.sql(s"""
|CREATE TABLE T (id STRING, name STRING, i INT, l LONG)
|USING PAIMON
|TBLPROPERTIES ('primary-key'='id')
|""".stripMargin)

spark.sql(s"INSERT INTO T VALUES ('1', 'a', 1, 1)")
spark.sql(s"INSERT INTO T VALUES ('2', 'aaa', 1, 2)")
Assertions.assertEquals(0, spark.sql("select * from `T$statistics`").count())

spark.sql(s"ANALYZE TABLE T COMPUTE STATISTICS")

spark.sql(s"INSERT INTO T VALUES ('3', 'b', 2, 1)")
spark.sql(s"INSERT INTO T VALUES ('4', 'bbb', 3, 2)")

spark.sql(s"ANALYZE TABLE T COMPUTE STATISTICS")

checkAnswer(
spark.sql(
"SELECT snapshot_id, schema_id, mergedRecordCount, colstat from `T$statistics` where snapshot_id=3"),
Nil)

checkAnswer(
spark.sql(
"SELECT snapshot_id, schema_id, mergedRecordCount, colstat from `T$statistics` where snapshot_id=2"),
Row(2, 0, 2, "{ }"))

checkAnswer(
spark.sql(
"SELECT snapshot_id, schema_id, mergedRecordCount, colstat from `T$statistics` where snapshot_id=5"),
Row(5, 0, 4, "{ }"))
}

test("Paimon analyze: analyze table without snapshot") {
spark.sql(s"CREATE TABLE T (id STRING, name STRING)")
spark.sql(s"ANALYZE TABLE T COMPUTE STATISTICS")