-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Introduce metadata.stats-dense-store to reduce meta size for multiple columns table #4322
Conversation
Good,we just encountered a similar problem today! |
+ " none statistic mode is set.") | ||
.linebreak() | ||
.text( | ||
"Note, When this mode is enabled, the sdk in reading engine requires at least" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1、Change the "When" to "when".
2、the Paimon sdk in reading engine requires at least version 0.9.1 or 1.0.0 or higher?
@@ -55,4 +55,8 @@ public class SystemFields { | |||
public static boolean isSystemField(int fieldId) { | |||
return fieldId >= SYSTEM_FIELD_ID_START; | |||
} | |||
|
|||
public static boolean isSystemField(String field) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not add KEY_FIELD_PREFIX to SYSTEM_FIELD_NAMES?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What to solve?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If KEY_FIELD_PREFIX is not in SYSTEM_FIELD_NAMES, then the funtion name "isSystemField" is inappropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So add KEY_FIELD_PREFIX to SYSTEM_FIELD_NAMES, what problem can be solved? Can you write a ut demo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A solution is using a SYSTEM_FIELD_PREFIXS
, and always using starWith. But it is not good for performance.
Let it go now.
*/ | ||
public class ProjectedArray implements InternalArray { | ||
|
||
protected final int[] indexMapping; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The indexMapping、array and ProjectedArray should be private?
@@ -101,8 +103,8 @@ public KeyValueDataFileWriter( | |||
this.schemaId = schemaId; | |||
this.level = level; | |||
|
|||
this.keyStatsConverter = new SimpleStatsConverter(keyType); | |||
this.valueStatsConverter = new SimpleStatsConverter(valueType); | |||
this.keyStatsConverter = new SimpleStatsConverter(keyType, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.keyStatsConverter = new SimpleStatsConverter(keyType);
private final InternalArray array; | ||
private final long notFoundValue; | ||
|
||
protected NullCountsEvoArray(int[] indexMapping, InternalArray array, long notFoundValue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why protected?
private final InternalArray array; | ||
private final long notFoundValue; | ||
|
||
protected NullCountsEvoArray(int[] indexMapping, InternalArray array, long notFoundValue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why protected?
dataFileSerializer.deserializeList(view), | ||
dataFileSerializer.deserializeList(view), | ||
dataFileSerializer.deserializeList(view)), | ||
fileDeserializer.get(), fileDeserializer.get(), fileDeserializer.get()), | ||
new IndexIncrement( | ||
indexEntrySerializer.deserializeList(view), | ||
indexEntrySerializer.deserializeList(view))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Condition 'version <= 2' is always 'true'
@@ -316,6 +317,9 @@ private static FunctionWithIOException<DataInputView, DataFileMeta> getFileMetaS | |||
DataFileMeta08Serializer serializer = new DataFileMeta08Serializer(); | |||
return serializer::deserialize; | |||
} else if (version == 2) { | |||
DataFileMeta09Serializer serializer = new DataFileMeta09Serializer(); | |||
return serializer::deserialize; | |||
} else if (version == 3) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static final int DATA_FILE_META_VERSION_1 = 1;
public static final int DATA_FILE_META_VERSION_2= 2;
public static final int DATA_FILE_META_VERSION_3= 3;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
@@ -1026,7 +1026,7 @@ public class CoreOptions implements Serializable { | |||
public static final String STATS_MODE_SUFFIX = "stats-mode"; | |||
|
|||
public static final ConfigOption<String> METADATA_STATS_MODE = | |||
key("metadata." + STATS_MODE_SUFFIX) | |||
key("metadata.stats-mode") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can become more intuitive in the code.
null); | ||
List<DataFileMeta> dataFiles = Collections.singletonList(dataFile); | ||
|
||
LinkedHashMap<String, Pair<Integer, Integer>> dvRanges = new LinkedHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Map<String, Pair<Integer, Integer>> dvRanges = new LinkedHashMap<>();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IndexFileMeta requires a LinkedHashMap
Please modify this doc : https://paimon.apache.org/docs/master/flink/sql-ddl/#specify-statistics-mode |
A question : metadata.stats-dense-store=true |
public static final ConfigOption<Boolean> METADATA_STATS_DENSE_STORE = | ||
key("metadata.stats-dense-store") | ||
.booleanType() | ||
.defaultValue(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the default value is not true?
You are worry about that many users are using old versions of Paimon sdk in their reading engine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is true
Correct! |
Looks good to me! +1 |
Purpose
At present, Paimon uses
BinaryRow
to store statistical information. Generally, it is not a problem, but some businesses have fields with over 3000 columns.The
BinaryRow
structure has a characteristic that each field occupies a fixed 8 bytes, and for a 3000 column table, oneBinaryRow
has 25kb of storage.SimpleStats
has 3BinaryRow
, then it will be 100 kb storage. 100000 files have GB level storage. This is unacceptable.This PR:
metadata.stats-dense-store
to a dense mode to storeSimpleStats
andvalueStatsCols
inDataFileMeta
.metadata.stats-mode
=none
, thenvalueStatsCols
will be empty list, andSimpleStats
is empty.fields.b.stats-mode
=full
to enable stats for specific columns to enable data skipping, the meta storage will only containb
column.Tests
AppendOnlyFileStoreTableTest
PrimaryKeyFileStoreTableTest
API and Format
Documentation