[paimon-flink-cdc] Add the latest_schema state at schema evolution operator ，Reduce the latest schema access frequency #4535

GangYang-HX · 2024-11-15T07:34:02Z

Purpose

When a Paimon table has many fields and write parallelism is high, reduce how often the latest schema is read, to improve throughput during a job's cold start.

Tests

Case-1: Observe whether the checkpoint time of schema evolution changes.

Conclusion: After the optimization, schema evolution basically completes within seconds, or even milliseconds.

Case-2: Check the logs to see whether a large number of schema reads still occur.

Conclusion: Schema reads dropped from hundreds of thousands to 115.

API and Format

org.apache.paimon.flink.sink.cdc.UpdatedDataFieldsProcessFunction#processElement

Documentation

Before the schema evolution operator calls org.apache.paimon.flink.sink.cdc.UpdatedDataFieldsProcessFunctionBase#extractSchemaChanges, add a check that confirms whether a field update really needs to be triggered.

Add a list used to determine whether a column has actually been updated: List<DataField> latestSchemaList
Add a ListState from which the list is restored directly when the task recovers from state: ListState<DataField> latestSchemaListState
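
A minimal sketch of the check described above, not the PR's actual code. It assumes a simplified `Field` class in place of Paimon's `DataField` and a hypothetical `SchemaChangeFilter` helper. The cached list holds the latest known fields, and only incoming fields that differ from it (ignoring the field id) are passed on as real updates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Simplified stand-in for Paimon's DataField (assumption for illustration only).
final class Field {
    final int id;
    final String name;
    final String type;
    final String description;

    Field(int id, String name, String type, String description) {
        this.id = id;
        this.name = name;
        this.type = type;
        this.description = description;
    }
}

// Hypothetical helper: decides which incoming fields actually require a schema change.
final class SchemaChangeFilter {
    private final List<Field> latestSchemaList = new ArrayList<>();

    SchemaChangeFilter(List<Field> latestFields) {
        latestSchemaList.addAll(latestFields);
    }

    // True when a cached field matches on name, type and description; the id is ignored.
    private boolean containsIgnoreId(Field candidate) {
        return latestSchemaList.stream()
                .anyMatch(f -> f.name.equals(candidate.name)
                        && f.type.equals(candidate.type)
                        && Objects.equals(f.description, candidate.description));
    }

    // Returns only the fields that are not already part of the cached schema.
    List<Field> actualUpdatedFields(List<Field> updated) {
        List<Field> result = new ArrayList<>();
        for (Field f : updated) {
            if (!containsIgnoreId(f)) {
                result.add(f);
            }
        }
        return result;
    }
}
```

With this shape, a record that repeats the existing schema yields an empty list, so no schema read or schema change has to be issued for it.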

wwj6591812 · 2024-11-15T12:00:25Z

...ink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/UpdatedDataFieldsProcessFunction.java


    private final SchemaManager schemaManager;

    private final Identifier identifier;

+    private ListState<DataField> latestSchemaListState;
+    private List<DataField> latestSchemaList;


latestSchemaListState cannot be final because it cannot be initialized in the constructor.

I mean latestSchemaList may be final, not latestSchemaListState.

wwj6591812 · 2024-11-15T12:00:44Z

...ink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/UpdatedDataFieldsProcessFunction.java

+        actualUpdatedDataFields.forEach(field -> latestSchemaList.add(field));
+    }
+
+    @Override


Please add a unit test.

Added: SchemaEvolutionTest

wwj6591812 · 2024-11-15T12:02:28Z

...ink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/UpdatedDataFieldsProcessFunction.java

+            latestSchemaListState.get().forEach(dataField -> latestSchemaList.add(dataField));
+        } else {
+            RowType oldRowType = schemaManager.latest().get().logicalRowType();
+            oldRowType.getFields().forEach(dataField -> latestSchemaList.add(dataField));


latestSchemaList.addAll(oldRowType.getFields());

wwj6591812 · 2024-11-15T12:04:09Z

...ink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/UpdatedDataFieldsProcessFunction.java

            applySchemaChange(schemaManager, schemaChange, identifier);
        }
+        actualUpdatedDataFields.forEach(field -> latestSchemaList.add(field));


latestSchemaList.addAll(actualUpdatedDataFields);

wwj6591812 · 2024-11-15T12:05:00Z

...ink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/UpdatedDataFieldsProcessFunction.java

+                                new ListStateDescriptor<>(
+                                        "latest-schema-list-state", DataField.class));
+        if (context.isRestored()) {
+            latestSchemaListState.get().forEach(dataField -> latestSchemaList.add(dataField));


latestSchemaListState.get().forEach(latestSchemaList::add);

JingsongLi · 2024-11-22T06:51:51Z

Hi @GangYang-HX , do we really need to put the schema to the state? Maybe just a field in memory is OK?

GangYang-HX · 2024-11-25T03:08:42Z

Well, I considered that before. Without the state, restoring from a checkpoint costs one extra latest-schema access.
In fact, one extra access should not have much impact. I will adjust it later.

JingsongLi · 2024-11-25T05:22:26Z

...ink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/UpdatedDataFieldsProcessFunction.java

@@ -37,25 +44,54 @@
 * be 1.
 */
 public class UpdatedDataFieldsProcessFunction
-        extends UpdatedDataFieldsProcessFunctionBase<List<DataField>, Void> {
+        extends UpdatedDataFieldsProcessFunctionBase<List<DataField>, Void>
+        implements CheckpointedFunction {


It doesn't need to be CheckpointedFunction

JingsongLi · 2024-11-25T05:23:10Z

...ink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/UpdatedDataFieldsProcessFunction.java


    private final SchemaManager schemaManager;

    private final Identifier identifier;

+    private final List<DataField> latestSchemaList;


@Nullable
private List<DataField> latestSchemaList;

JingsongLi · 2024-11-25T05:24:12Z

...ink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/UpdatedDataFieldsProcessFunction.java

+    }
+
+    private boolean dataFieldContainIgnoreId(DataField dataField) {
+        return latestSchemaList.stream()


if (latestSchemaList == null) {
    RowType rowType = schemaManager.latest().get().logicalRowType();
    // Assign a new list here; calling addAll on the null reference would throw.
    latestSchemaList = new ArrayList<>(rowType.getFields());
}
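
A minimal sketch of the lazy in-memory approach suggested here, under stated assumptions: `LazyLatestFields` is a hypothetical class, field names are plain strings, and the `Supplier` stands in for the `schemaManager.latest()` lookup. It illustrates that the schema is read once on first use, and that after a restart the field is null again, which costs exactly one more read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical cache: loads the latest fields once, on first use.
final class LazyLatestFields {
    // Stands in for the schemaManager.latest() lookup (assumption).
    private final Supplier<List<String>> latestSchemaLoader;
    // Null until the first access, mirroring a @Nullable field.
    private List<String> latestSchemaList;
    // Counts how many times the underlying schema was read.
    private int loads = 0;

    LazyLatestFields(Supplier<List<String>> latestSchemaLoader) {
        this.latestSchemaLoader = latestSchemaLoader;
    }

    List<String> get() {
        if (latestSchemaList == null) {
            latestSchemaList = new ArrayList<>(latestSchemaLoader.get());
            loads++;
        }
        return latestSchemaList;
    }

    int loadCount() {
        return loads;
    }
}
```

Because nothing is written to Flink state in this variant, the function does not need to implement CheckpointedFunction.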

JingsongLi · 2024-11-25T05:26:46Z

...ink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/UpdatedDataFieldsProcessFunction.java


    private final SchemaManager schemaManager;

    private final Identifier identifier;

+    private final List<DataField> latestSchemaList;


Maybe it can be:

Set<FieldIdentifier> latestFields;

FieldIdentifier {
    String name;
    DataType type;
    String description;
}
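
One way the suggested `FieldIdentifier` could look, as a sketch only. The class body below is an assumption, and `String` stands in for Paimon's `DataType` so the example is self-contained. Value-based `equals` and `hashCode` are what make membership checks against a `Set` work.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Sketch of a value object identifying a field by name, type and description.
final class FieldIdentifier {
    private final String name;
    private final String type; // String stands in for Paimon's DataType (assumption)
    private final String description;

    FieldIdentifier(String name, String type, String description) {
        this.name = name;
        this.type = type;
        this.description = description;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof FieldIdentifier)) {
            return false;
        }
        FieldIdentifier that = (FieldIdentifier) o;
        return Objects.equals(name, that.name)
                && Objects.equals(type, that.type)
                && Objects.equals(description, that.description);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, type, description);
    }
}

// Hypothetical usage: a set of identifiers answers "is this field already known?".
final class LatestFieldsExample {
    static Set<FieldIdentifier> initialFields() {
        Set<FieldIdentifier> latestFields = new HashSet<>();
        latestFields.add(new FieldIdentifier("a", "INT", null));
        return latestFields;
    }
}
```

Compared with scanning a list, a hash-based set keeps the lookup cost roughly constant as the number of table fields grows.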

JingsongLi · 2024-11-26T02:29:22Z

...ink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/UpdatedDataFieldsProcessFunction.java

+         * non-SchemaChange.AddColumn scenario. Otherwise, the previously existing fields cannot be
+         * modified again.
+         */
+        updateLatestFields();


Can we just add updatedDataFields to latestFields?

No. That was the logic in earlier revisions, but testing showed it is risky.
Reason: FieldIdentifier only guarantees the uniqueness of <name, type, description>. If any of these attributes is changed back and forth, the check will misjudge.
For example: field A's type is int and is then changed to string. Changing it back to int will not work, because latestFields already contains an element like <A, int, description>.
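
The int, string, int example can be reproduced with a small sketch. `StaleCacheDemo` and its methods are hypothetical, and a `List<String>` triple stands in for `FieldIdentifier`. The append-only variant keeps the stale entry and misses the revert, while rebuilding the set from the latest schema detects it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StaleCacheDemo {
    // A <name, type, description> triple; List gives value-based equals/hashCode.
    static List<String> key(String name, String type, String description) {
        return Arrays.asList(name, type, description);
    }

    // Append-only strategy: updated fields are only ever added to the cache.
    static boolean appendOnlyDetectsRevert() {
        Set<List<String>> latestFields = new HashSet<>();
        latestFields.add(key("A", "INT", "description"));    // initial schema
        latestFields.add(key("A", "STRING", "description")); // A changed to STRING
        // Reverting A to INT: the stale <A, INT> entry makes it look unchanged.
        return !latestFields.contains(key("A", "INT", "description"));
    }

    // Refresh strategy: the cache is rebuilt from the latest schema after a change.
    static boolean refreshDetectsRevert() {
        Set<List<String>> latestFields = new HashSet<>();
        latestFields.add(key("A", "INT", "description"));    // initial schema
        latestFields.clear();                                 // rebuild after the change
        latestFields.add(key("A", "STRING", "description")); // only the current state remains
        // Reverting A to INT is now correctly seen as a change.
        return !latestFields.contains(key("A", "INT", "description"));
    }
}
```

Here `appendOnlyDetectsRevert()` returns false and `refreshDetectsRevert()` returns true, which matches the reason given for refreshing latestFields instead of appending to it.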

JingsongLi · 2024-11-28T12:02:14Z

+1

Add the latest_schema state at schema evolution operator ，reduce the …

2de7389

…latest schema access frequency

wwj6591812 reviewed Nov 15, 2024

gang3.yang added 3 commits November 18, 2024 14:47

add ut:SchemaEvolutionTest

8721106

Adjust the format to pass the checkstyle check

59e77a9

up TableTestBase，extends TableTestBase

cd4e459

Remove the state to save the latest schema, and only use memory field…

831ee6a

… to save

JingsongLi reviewed Nov 25, 2024

Remove CheckpointedFunction, add FieldIdentifier, and initialize late…

f780eff

…stFields in the open method

JingsongLi reviewed Nov 26, 2024

JingsongLi merged commit 2f93b7b into apache:master Nov 28, 2024
12 checks passed
