Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Introduce metadata.stats-dense-store to reduce meta size for multiple columns table #4322

Merged
merged 5 commits into from
Oct 15, 2024

Conversation

JingsongLi
Copy link
Contributor

@JingsongLi JingsongLi commented Oct 14, 2024

Purpose

At present, Paimon uses BinaryRow to store statistical information. Generally, it is not a problem, but some businesses have fields with over 3000 columns.

The BinaryRow structure has a characteristic that each field occupies a fixed 8 bytes, and for a 3000 column table, one BinaryRow has 25kb of storage. SimpleStats has 3 BinaryRow, then it will be 100 kb storage. 100000 files have GB level storage. This is unacceptable.

This PR:

  1. Introduce metadata.stats-dense-store to a dense mode to store SimpleStats and valueStatsCols in DataFileMeta.
  2. You can set metadata.stats-mode = none, then valueStatsCols will be empty list, and SimpleStats is empty.
  3. You can also set fields.b.stats-mode = full to enable stats for specific columns to enable data skipping, the meta storage will only contain b column.

Tests

  • AppendOnlyFileStoreTableTest
  • PrimaryKeyFileStoreTableTest

API and Format

Documentation

@wwj6591812
Copy link
Contributor

Good,we just encountered a similar problem today!

+ " none statistic mode is set.")
.linebreak()
.text(
"Note, When this mode is enabled, the sdk in reading engine requires at least"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1、Change the "When" to "when".
2、the Paimon sdk in reading engine requires at least version 0.9.1 or 1.0.0 or higher?

@@ -55,4 +55,8 @@ public class SystemFields {
public static boolean isSystemField(int fieldId) {
return fieldId >= SYSTEM_FIELD_ID_START;
}

public static boolean isSystemField(String field) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add KEY_FIELD_PREFIX to SYSTEM_FIELD_NAMES?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What to solve?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If KEY_FIELD_PREFIX is not in SYSTEM_FIELD_NAMES, then the funtion name "isSystemField" is inappropriate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So add KEY_FIELD_PREFIX to SYSTEM_FIELD_NAMES, what problem can be solved? Can you write a ut demo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A solution is using a SYSTEM_FIELD_PREFIXS, and always using starWith. But it is not good for performance.

Let it go now.

*/
public class ProjectedArray implements InternalArray {

protected final int[] indexMapping;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indexMapping、array and ProjectedArray should be private?

@@ -101,8 +103,8 @@ public KeyValueDataFileWriter(
this.schemaId = schemaId;
this.level = level;

this.keyStatsConverter = new SimpleStatsConverter(keyType);
this.valueStatsConverter = new SimpleStatsConverter(valueType);
this.keyStatsConverter = new SimpleStatsConverter(keyType, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.keyStatsConverter = new SimpleStatsConverter(keyType);

private final InternalArray array;
private final long notFoundValue;

protected NullCountsEvoArray(int[] indexMapping, InternalArray array, long notFoundValue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why protected?

private final InternalArray array;
private final long notFoundValue;

protected NullCountsEvoArray(int[] indexMapping, InternalArray array, long notFoundValue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why protected?

dataFileSerializer.deserializeList(view),
dataFileSerializer.deserializeList(view),
dataFileSerializer.deserializeList(view)),
fileDeserializer.get(), fileDeserializer.get(), fileDeserializer.get()),
new IndexIncrement(
indexEntrySerializer.deserializeList(view),
indexEntrySerializer.deserializeList(view)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Condition 'version <= 2' is always 'true'

@@ -316,6 +317,9 @@ private static FunctionWithIOException<DataInputView, DataFileMeta> getFileMetaS
DataFileMeta08Serializer serializer = new DataFileMeta08Serializer();
return serializer::deserialize;
} else if (version == 2) {
DataFileMeta09Serializer serializer = new DataFileMeta09Serializer();
return serializer::deserialize;
} else if (version == 3) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public static final int DATA_FILE_META_VERSION_1 = 1;
public static final int DATA_FILE_META_VERSION_2= 2;
public static final int DATA_FILE_META_VERSION_3= 3;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to do this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@@ -1026,7 +1026,7 @@ public class CoreOptions implements Serializable {
public static final String STATS_MODE_SUFFIX = "stats-mode";

public static final ConfigOption<String> METADATA_STATS_MODE =
key("metadata." + STATS_MODE_SUFFIX)
key("metadata.stats-mode")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can become more intuitive in the code.

null);
List<DataFileMeta> dataFiles = Collections.singletonList(dataFile);

LinkedHashMap<String, Pair<Integer, Integer>> dvRanges = new LinkedHashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Map<String, Pair<Integer, Integer>> dvRanges = new LinkedHashMap<>();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IndexFileMeta requires a LinkedHashMap

@wwj6591812
Copy link
Contributor

Please modify this doc : https://paimon.apache.org/docs/master/flink/sql-ddl/#specify-statistics-mode

@wwj6591812
Copy link
Contributor

A question :
If I have a paimon table with 1000 columns, and I just want field B to use dense stat, other columns close stats.
Is my configuration correct?

metadata.stats-dense-store=true
metadata.stats-mode=none
fields.b.stats-mode=truncate(16)

public static final ConfigOption<Boolean> METADATA_STATS_DENSE_STORE =
key("metadata.stats-dense-store")
.booleanType()
.defaultValue(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the default value is not true?
You are worry about that many users are using old versions of Paimon sdk in their reading engine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is true

@JingsongLi
Copy link
Contributor Author

A question : If I have a paimon table with 1000 columns, and I just want field B to use dense stat, other columns close stats. Is my configuration correct?

metadata.stats-dense-store=true metadata.stats-mode=none fields.b.stats-mode=truncate(16)

Correct!

@wwj6591812
Copy link
Contributor

Looks good to me!
I can't wait to use this feature quickly to solve our online produce problems!

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants