[core] Introduce stats in snapshot #2677

Zouxxyy · 2024-01-11T03:43:09Z

Purpose

snapshot-x

{
  "version" : 3,
  "id" : 2,
  "schemaId" : 3,
  "baseManifestList" : "manifest-list-485c5847-cb20-4324-b2ea-fd73959c5857-0",
  "deltaManifestList" : "manifest-list-485c5847-cb20-4324-b2ea-fd73959c5857-1",
  "changelogManifestList" : null,
  "commitUser" : "e30257ac-a617-4571-a57b-60c3dd608754",
  "commitIdentifier" : 9223372036854775807,
  "commitKind" : "ANALYZE",
  "timeMillis" : 1704944283486,
  "logOffsets" : { },
  "totalRecordCount" : 6,
  "deltaRecordCount" : 0,
  "changelogRecordCount" : 0,
  "watermark" : null,
  "statistics" : "stats-87effd5d-48fd-4aab-81fe-4222b847d247-0"
}

statistics/stats-87effd5d-48fd-4aab-81fe-4222b847d247-0

{
  "snapshotId" : 2,
  "mergedRecordCount" : 10,
  "mergedRecordSize" : 1000,
  "colStats" : {
    "orderId" : {
      "distinctCount" : 10,
      "min" : "1",
      "max" : "10",
      "nullCount" : 0,
      "avgLen" : 8,
      "maxLen" : 8
    }
  }
}
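
As a rough illustration of how the two files relate: the snapshot's "statistics" field holds only a file name, and the stats file itself is listed under a statistics/ directory. The sketch below resolves that location with plain java.nio.file. StatsPathSketch, statsPath, and the table root passed in are hypothetical names chosen for illustration, not Paimon's API, and the directory name is inferred from the listing above.

```java
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical helper, not part of Paimon: maps a snapshot's "statistics" field to a path. */
final class StatsPathSketch {

    /** Directory name inferred from the "statistics/stats-..." listing above. */
    static final String STATS_DIR = "statistics";

    /**
     * @param tableRoot assumed root directory of the table
     * @param statisticsField value of the snapshot's "statistics" field, or null if absent
     * @return location of the stats file, or empty if the snapshot carries no stats
     */
    static Optional<Path> statsPath(Path tableRoot, String statisticsField) {
        if (statisticsField == null) {
            // A snapshot written before any ANALYZE commit has no stats pointer.
            return Optional.empty();
        }
        return Optional.of(tableRoot.resolve(STATS_DIR).resolve(statisticsField));
    }
}
```

Returning an Optional here mirrors the nullable "statistics" field: callers have to handle the "never analyzed" case explicitly.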

Tests

API and Format

write

FileStoreCommitImpl commit = store.newCommit();
// Long.MAX_VALUE is the same value as the commitIdentifier in the ANALYZE snapshot above.
commit.writeStats(stats, Long.MAX_VALUE);

read

StatsFileHandler statsFileHandler = store.newStatsFileHandler();
Optional<Stats> stats = statsFileHandler.readStats();
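
To make the read contract concrete: because "statistics" is nullable in the snapshot, a reader has nothing to return until stats have been written. The following is a minimal in-memory sketch of that behaviour. InMemoryStatsStore and its Map backing are hypothetical stand-ins, stats are kept as plain strings, and only the method names writeStats and readStats are borrowed from the snippets above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory stand-in for the commit and the stats handler; not Paimon code. */
final class InMemoryStatsStore {

    // File name -> serialized stats, standing in for the statistics directory.
    private final Map<String, String> files = new HashMap<>();

    // Stats file name referenced by the latest snapshot; null until stats are written.
    private String latestStatistics;

    /**
     * Stores the stats under a new file name and points the latest snapshot at it.
     * The commitIdentifier parameter only mirrors the call shape above; this sketch ignores it.
     */
    void writeStats(String statsJson, long commitIdentifier) {
        String fileName = "stats-" + files.size();
        files.put(fileName, statsJson);
        latestStatistics = fileName;
    }

    /** Returns the stats referenced by the latest snapshot, or empty if there are none. */
    Optional<String> readStats() {
        return Optional.ofNullable(latestStatistics).map(files::get);
    }
}
```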

@@ -184,7 +192,8 @@ public Snapshot(
            @Nullable Long totalRecordCount,
            @Nullable Long deltaRecordCount,
            @Nullable Long changelogRecordCount,
-            @Nullable Long watermark) {
+            @Nullable Long watermark,
+            @Nullable String statistics) {


If partition-level statistics are supported later, how would they be retrieved?

The plan is to add a new field named partitionStatistics (a link to a manifest list) once partition-level statistics are supported.

Both statistics and partitionStatistics relate to statistics, so why split them into two fields?

paimon-core/src/main/java/org/apache/paimon/stats/ColStats.java

paimon-core/src/main/java/org/apache/paimon/stats/Stats.java

YannByron · 2024-01-12T10:09:33Z

paimon-core/src/main/java/org/apache/paimon/stats/StatsFile.java

+import java.io.UncheckedIOException;
+
+/** Stats file contains stats. */
+public class StatsFile {


From an OOP perspective, these methods (read, write, delete, and exists) should not be members of StatsFile.

This just follows the implementation of indexFile and indexFileHandle.

TaoZex · 2024-01-12T13:45:19Z

In PIP-14, we included statistics for maxLen and avgLen. Would you consider adding minLen?

Zouxxyy · 2024-01-12T16:18:28Z

The primary purpose of PIP-14 is to enable Paimon to preserve stats for query engine CBO (especially for Spark). Below are the stats collected by various engines; the ones they have in common are the stats introduced in this PIP.

Hive: https://cwiki.apache.org/confluence/display/Hive/StatsDev
Flink: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/analyze/
Spark: https://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html

Besides, as far as I know, aside from histograms (which may be added to Paimon in the future), distinctCount, min, max, avgLen, and nullCount are useful in cost estimation, while maxLen is hardly used in the CBO of open-source Spark. Nevertheless, we decided to include it to stay consistent with Spark. Of course, if we find that other stats are very useful, we can add them as well.

Cost estimation example: https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive.
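
To illustrate why these particular stats matter for cost estimation, here is a sketch of two textbook selectivity estimates under a uniform-distribution assumption. It is not taken from Spark, Hive, or Paimon. ColStatsSketch and its methods are hypothetical, it only handles numeric columns, and using mergedRecordCount as the row count is an assumption.

```java
/** Hypothetical stats holder for a numeric column; not Paimon's ColStats. */
final class ColStatsSketch {
    final long rowCount;
    final long distinctCount;
    final long nullCount;
    final double min;
    final double max;

    ColStatsSketch(long rowCount, long distinctCount, long nullCount, double min, double max) {
        this.rowCount = rowCount;
        this.distinctCount = distinctCount;
        this.nullCount = nullCount;
        this.min = min;
        this.max = max;
    }

    /** Fraction of rows whose value is not null. */
    double nonNullFraction() {
        return rowCount == 0 ? 0.0 : (double) (rowCount - nullCount) / rowCount;
    }

    /** Estimated selectivity of "col = literal", assuming each distinct value is equally common. */
    double equalsSelectivity(double literal) {
        if (distinctCount == 0 || literal < min || literal > max) {
            return 0.0; // the literal lies outside the observed value range
        }
        return nonNullFraction() / distinctCount;
    }

    /** Estimated selectivity of "col <= literal" by linear interpolation between min and max. */
    double lessOrEqualSelectivity(double literal) {
        if (literal < min) {
            return 0.0;
        }
        if (literal >= max) {
            return nonNullFraction(); // every non-null row qualifies
        }
        return nonNullFraction() * (literal - min) / (max - min);
    }
}
```

With the orderId example from the PR description (10 rows, distinctCount 10, nullCount 0, min 1, max 10), "orderId = 5" is estimated at a selectivity of 0.1, about one row, and "orderId <= 5.5" at 0.5. Neither estimate needs avgLen or maxLen; length stats are typically relevant for estimating data size rather than row counts.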

TaoZex · 2024-01-13T17:14:39Z

Thanks for your reply, which helped me understand the design idea of this part.

paimon-core/src/main/java/org/apache/paimon/stats/Stats.java

Zouxxyy · 2024-01-18T07:23:24Z

Two updates:

add schemaId in stats: a snapshot may expire, but a schema never does
add colId for each col in colStats: for schema evolution in the future
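
One way to see why a colId helps with schema evolution: if stats can be found through a stable column id rather than only by name, they remain reachable after a column is renamed. The sketch below assumes a hypothetical name-to-id mapping for the current schema. ColIdLookupSketch, the ids, and the idea of keying the lookup map by id are illustrative and not taken from the PR.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical lookup, not Paimon code: resolves column stats through a stable column id. */
final class ColIdLookupSketch {

    /**
     * @param currentSchemaNameToId column name -> column id in the schema being queried
     * @param distinctCountByColId distinct count recorded per column id when stats were written
     * @param columnName column name as it appears in the current schema
     * @return the recorded distinct count, or empty if the column or its stats are unknown
     */
    static Optional<Long> distinctCount(
            Map<String, Integer> currentSchemaNameToId,
            Map<Integer, Long> distinctCountByColId,
            String columnName) {
        // Optional.map yields empty when the second lookup returns null.
        return Optional.ofNullable(currentSchemaNameToId.get(columnName))
                .map(distinctCountByColId::get);
    }
}
```

For example, if orderId had id 0 when stats were collected and is later renamed to order_id with the same id, a lookup by the new name still finds the old stats, while a lookup by the old name returns empty.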

YannByron · 2024-01-18T07:38:47Z

+1

JingsongLi

+1

Zouxxyy · 2024-01-26T14:06:23Z

#2799

YannByron · 2024-02-27T08:49:10Z

#2404

version1

233d140

YannByron self-assigned this Jan 11, 2024

update

ce5cda4

Zouxxyy changed the title ~~[core] Introdue stats in snapshot~~ [core] Introduce stats in snapshot Jan 11, 2024

Zouxxyy added 2 commits January 12, 2024 13:20

update

50c0bd2

update

710ae56