[core] Introduce stats in snapshot #2677

Merged
merged 10 commits into from
Jan 20, 2024

Contributor

@Zouxxyy Zouxxyy commented Jan 11, 2024

Purpose

First step of PIP-14: Paimon statistics

snapshot-x

{
  "version" : 3,
  "id" : 2,
  "schemaId" : 3,
  "baseManifestList" : "manifest-list-485c5847-cb20-4324-b2ea-fd73959c5857-0",
  "deltaManifestList" : "manifest-list-485c5847-cb20-4324-b2ea-fd73959c5857-1",
  "changelogManifestList" : null,
  "commitUser" : "e30257ac-a617-4571-a57b-60c3dd608754",
  "commitIdentifier" : 9223372036854775807,
  "commitKind" : "ANALYZE",
  "timeMillis" : 1704944283486,
  "logOffsets" : { },
  "totalRecordCount" : 6,
  "deltaRecordCount" : 0,
  "changelogRecordCount" : 0,
  "watermark" : null,
  "statistics" : "stats-87effd5d-48fd-4aab-81fe-4222b847d247-0"
}

statistics/stats-87effd5d-48fd-4aab-81fe-4222b847d247-0

{
  "snapshotId": 2,
  "mergedRecordCount" : 10,
  "mergedRecordSize" : 1000,
  "colStats" : {
    "orderId" : {
      "distinctCount" : 10,
      "min" : "1",
      "max" : "10",
      "nullCount" : 0,
      "avgLen" : 8,
      "maxLen" : 8
    }
  }
}
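The two JSON examples above suggest how a reader could get from a snapshot to its stats file. The following is only a sketch under stated assumptions: `StatsFileSketch`, `statsPath`, and `avgRecordSize` are hypothetical names rather than Paimon APIs, the `statistics` directory name is taken from the example path, and `mergedRecordSize` is assumed to be a total size in bytes.

```java
import java.util.Optional;

// Hypothetical helper, not Paimon code: names and layout are assumptions
// inferred from the JSON examples in this description.
final class StatsFileSketch {
    // Directory name taken from the example path "statistics/stats-...".
    static final String STATS_DIR = "statistics";

    /** Resolves the snapshot's "statistics" field to a path under the table root. */
    static Optional<String> statsPath(String tableRoot, String statisticsField) {
        if (statisticsField == null) {
            return Optional.empty(); // the snapshot carries no stats
        }
        return Optional.of(tableRoot + "/" + STATS_DIR + "/" + statisticsField);
    }

    /** Average record size, assuming mergedRecordSize is a total byte count. */
    static double avgRecordSize(long mergedRecordSize, long mergedRecordCount) {
        return mergedRecordCount == 0 ? 0.0 : (double) mergedRecordSize / mergedRecordCount;
    }
}
```

With the example values (`mergedRecordSize` 1000, `mergedRecordCount` 10), `avgRecordSize` would return 100.0 under that byte-count assumption.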

Tests

API and Format

write

FileStoreCommitImpl commit = store.newCommit();
commit.writeStats(stats, Long.MAX_VALUE);

read

StatsFileHandler statsFileHandler = store.newStatsFileHandler();
Optional<Stats> stats = statsFileHandler.readStats();

Next

Future work

  • expiration of stats
  • reading stats through a system table
  • Spark ANALYZE support

@YannByron YannByron self-assigned this Jan 11, 2024
@Zouxxyy Zouxxyy changed the title from "[core] Introdue stats in snapshot" to "[core] Introduce stats in snapshot" Jan 11, 2024
@@ -184,7 +192,8 @@ public Snapshot(
             @Nullable Long totalRecordCount,
             @Nullable Long deltaRecordCount,
             @Nullable Long changelogRecordCount,
-            @Nullable Long watermark) {
+            @Nullable Long watermark,
+            @Nullable String statistics) {
Contributor

If partition-level statistics are supported later, how would they be retrieved?

Contributor Author

The plan is to add a new field named partitionStatistics (a link to a manifest list) once partition-level statistics are supported.

Contributor

Both statistics and partitionStatistics relate to statistics; why split them into two fields?

import java.io.UncheckedIOException;

/** Stats file contains stats. */
public class StatsFile {
Contributor

From an OOP standpoint, these methods (read, write, delete, and exists) should not be members of StatsFile.

Contributor Author

This simply follows the implementation of IndexFile and IndexFileHandler.

Contributor

TaoZex commented Jan 12, 2024

PIP-14 includes statistics for maxLen and avgLen; would you consider adding minLen as well?

Contributor Author

Zouxxyy commented Jan 12, 2024

The primary purpose of PIP-14 is to let Paimon preserve stats for query-engine CBO (especially Spark's). Below are the stats collected by various engines; the ones they have in common are the stats introduced in this PIP.

Hive: https://cwiki.apache.org/confluence/display/Hive/StatsDev
Flink: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/analyze/
Spark: https://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html

Besides, as far as I know, aside from histograms (which may be added to Paimon in the future), distinctCount, min, max, avgLen, and nullCount are the stats that are useful in cost estimation, while maxLen is hardly used in open-source Spark's CBO. Nevertheless, we decided to include it to stay consistent with Spark. Of course, if we find that other stats are very useful, we can add them as well.

Cost estimation example: https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive.
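To make the cost-estimation point concrete, here is a minimal sketch of how distinctCount, min, max, and nullCount can feed a selectivity estimate. It uses the textbook uniform-distribution formulas and is not Spark's or Hive's actual implementation; `SelectivitySketch` and its methods are hypothetical names, and min/max are assumed to have been parsed into numbers already.

```java
// Simplified, hypothetical cost-estimation helper. It assumes values are
// uniformly distributed between min and max, which real optimizers refine
// with histograms.
final class SelectivitySketch {

    /** Estimated fraction of rows matching "col = value". */
    static double equality(
            long value, long min, long max, long distinctCount, long nullCount, long rowCount) {
        if (rowCount == 0 || distinctCount == 0 || value < min || value > max) {
            return 0.0; // empty table, or value outside the known range
        }
        double nonNullFraction = 1.0 - (double) nullCount / rowCount;
        return nonNullFraction / distinctCount;
    }

    /** Estimated fraction of rows matching "col <= value". */
    static double lessOrEqual(long value, long min, long max, long nullCount, long rowCount) {
        if (rowCount == 0 || value < min) {
            return 0.0;
        }
        double nonNullFraction = 1.0 - (double) nullCount / rowCount;
        if (value >= max) {
            return nonNullFraction; // every non-null row qualifies
        }
        // Here min <= value < max, so max - min is strictly positive.
        return nonNullFraction * (double) (value - min) / (max - min);
    }
}
```

With the orderId stats from the PR description (distinctCount 10, min 1, max 10, nullCount 0) and 10 rows, the sketch estimates a selectivity of 0.1 for `orderId = 5` and 4/9 for `orderId <= 5`.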

Contributor

TaoZex commented Jan 13, 2024

Thanks for your reply; it helped me understand the design rationale for this part.

Contributor Author

Zouxxyy commented Jan 18, 2024

Two updates:

  • add schemaId to the stats: a snapshot may expire, but a schema never does
  • add colId for each column in colStats, to support schema evolution in the future
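One way to read the colId change is that stats can be matched to columns by id rather than by name, so a later rename does not orphan them. The sketch below illustrates that idea only; `ColIdLookupSketch`, `ColStats`, and `findByColId` are hypothetical names, and the actual lookup in Paimon may differ.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration of id-based lookup; not Paimon code.
final class ColIdLookupSketch {

    /** Hypothetical per-column entry: the column id recorded alongside the stats. */
    static final class ColStats {
        final int colId;
        final long distinctCount;

        ColStats(int colId, long distinctCount) {
            this.colId = colId;
            this.distinctCount = distinctCount;
        }
    }

    /**
     * Finds stats by column id. The map is keyed by the column name at the
     * time the stats were written, which may no longer match the current name.
     */
    static Optional<ColStats> findByColId(Map<String, ColStats> colStatsByName, int colId) {
        return colStatsByName.values().stream()
                .filter(stats -> stats.colId == colId)
                .findFirst();
    }
}
```

If orderId were renamed after the stats were written, a name-based lookup with the new name would miss, while `findByColId` with the unchanged id would still return the entry.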

@YannByron
Contributor

+1

Contributor

@JingsongLi JingsongLi left a comment

+1

@JingsongLi JingsongLi merged commit 52d0aa2 into apache:master Jan 20, 2024
Contributor Author

Zouxxyy commented Jan 26, 2024

#2799

@YannByron
Contributor

#2404
