
[core] Add auto increase sequence padding #2512

Closed
wants to merge 1 commit into from

Conversation


@Zouxxyy Zouxxyy (Contributor) commented Dec 15, 2023

Purpose

Linked issue: close #2471

Add a new sequence padding mode, inc-seq, which pads the sequence field with an auto-increase sequence, to solve the following problems:

When the user defines sequence.field:

  1. For records with the same sequence value (for example, when sequence.field is a second-precision timestamp and multiple records arrive within the same second), their order cannot be determined.
  2. When performing row-level changes, sequence.field must stay the same, so the order of the changes cannot be guaranteed.

A better solution would be to introduce a new comparison field. This PR is a trade-off: it reuses the _SEQUENCE_NUMBER and divides it into two parts. The upper 32 bits store the auto-increment sequence, and the lower 32 bits store the user-defined sequence.field (converted to a long).

Notice !!!

  1. Since only 32 bits store the user sequence, it is truncated to 32 bits. This loses accuracy: for timestamps, only second precision is supported, up to 2106-02-07T06:28:15.
  2. The auto-increment sequence is limited to 32 bits.
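The split described above can be sketched as follows. This is a minimal illustration of the layout, not the PR's actual code; `pack`, `incPart`, and `userPart` are illustrative names:

```java
// Sketch of the _SEQUENCE_NUMBER layout described above:
// upper 32 bits = auto-increment sequence, lower 32 bits = user sequence.
class SequencePadding {
    static final long USER_SEQ_MASK = 0xFFFFFFFFL; // lower 32 bits

    // Pack the two parts; the user value is truncated to 32 bits,
    // which is where the precision loss described in the notice comes from.
    static long pack(long incSeq, long userSeq) {
        return (incSeq << 32) | (userSeq & USER_SEQ_MASK);
    }

    static long incPart(long packed) {
        return packed >>> 32; // unsigned shift recovers the counter
    }

    static long userPart(long packed) {
        return packed & USER_SEQ_MASK;
    }

    public static void main(String[] args) {
        long packed = pack(7L, 1_700_000_000L); // epoch seconds fit in 32 bits
        System.out.println(incPart(packed));  // 7
        System.out.println(userPart(packed)); // 1700000000
    }
}
```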

Tests

API and Format

Documentation

@Zouxxyy Zouxxyy marked this pull request as draft December 15, 2023 01:38
@Zouxxyy Zouxxyy marked this pull request as ready for review December 15, 2023 06:26
@@ -459,7 +459,9 @@ public class CoreOptions implements Serializable {
         text(
                 "3. \"millis-to-micro\": Pads the sequence field that indicates time with precision of milli-second to micro-second."),
         text(
-                "4. Composite pattern: for example, \"second-to-micro,row-kind-flag\"."))
+                "4. \"inc-seq\": Pads the sequence field with auto increase sequence."),
Contributor:

So for inc-seq, must int and long sequence fields be in seconds? You should document this.

-                "4. Composite pattern: for example, \"second-to-micro,row-kind-flag\"."))
+                "4. \"inc-seq\": Pads the sequence field with auto increase sequence."),
+        text(
+                "5. Composite pattern: for example, \"second-to-micro,row-kind-flag\"."))
Contributor:

Is inc-seq,row-kind-flag supported? I think this may be the most popular composite pattern.

Contributor (Author):

inc-seq + row-kind-flag would further reduce the accuracy. Could there be a situation where +U comes first and then -U comes within the same second?

Contributor:

It should throw an exception...

if (maxSequenceNumber == -1) {
    // Fresh state: start the auto-increment sequence from zero.
    this.nextIncSequenceNumber = 0;
} else if (incSeqPadding) {
    // Keep the auto-increment part (upper 32 bits) and advance it by one step.
    this.nextIncSequenceNumber = (maxSequenceNumber & INC_SEQ_MASK) + INC;
}
Contributor:

We should merge this into SequenceGenerator; some refactoring is needed.


spark.sql("DELETE FROM T WHERE id = 1")

val rows1 = spark.sql("SELECT * FROM T").collectAsList()
Contributor:

Please use checkDataset or checkAnswer instead of collectAsList + assertThat.

+                "4. \"inc-seq\": Pads the sequence field with auto increase sequence."),
Contributor:

seconds-and-inc

@hekaifei (Contributor):

@Zouxxyy If the amount of data is relatively large, say 5 billion records, and the auto-increment is only 32 bits, what happens when it is not enough?

@Zouxxyy (Contributor, Author) commented Dec 29, 2023:

> @Zouxxyy If the amount of data is relatively large, say 5 billion records, and the auto-increment is only 32 bits, what happens when it is not enough?

@hekaifei The system will crash; the current design only supports 2,147,483,647 records per bucket. Are your 5 billion records in one bucket or in total?
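Rough arithmetic on that per-bucket limit; the write rate below is an assumed number for illustration, not from the PR:

```java
// Back-of-envelope lifetime of a 32-bit per-bucket counter at a steady
// write rate, using the 2,147,483,647 limit stated above.
class CounterLifetime {
    // Days until the counter is exhausted at ratePerSec writes per second.
    static long days(long limit, long ratePerSec) {
        return limit / ratePerSec / 86_400; // 86,400 seconds per day
    }

    public static void main(String[] args) {
        long limit = Integer.MAX_VALUE; // 2,147,483,647 records per bucket
        System.out.println(days(limit, 1_000L) + " days"); // prints "24 days"
    }
}
```

At a sustained 1,000 writes per second into one bucket, the counter lasts under a month, which illustrates why a long-running, frequently updated job is a concern.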

@hekaifei (Contributor):

> @Zouxxyy If the amount of data is relatively large, say 5 billion records, and the auto-increment is only 32 bits, what happens when it is not enough?

> @hekaifei The system will crash; the current design only supports 2,147,483,647 records per bucket. Are your 5 billion records in one bucket or in total?

@Zouxxyy A few million per bucket right now. My worry is that with a long-running job and frequently updated data, 32 bits will not be enough.

@hekaifei (Contributor):

@Zouxxyy There are several tables with more than 5 billion records, divided into 1,000 to 2,000 buckets.

@hekaifei (Contributor):

@Zouxxyy I think userSeq should be the upper 32 bits. Otherwise, when the same key is inserted later with a smaller userSeq, the earlier data will be overwritten, which is not expected.

@Zouxxyy (Contributor, Author) commented Jan 2, 2024:

@hekaifei

> A few million per bucket right now. My worry is that with a long-running job and frequently updated data, 32 bits will not be enough.

Yes, frequent modifications will also cause this solution to break down.

> Otherwise, when the same key is inserted later with a smaller userSeq, the earlier data will be overwritten.

When you want row-level updates, you must use 'merge-engine' = 'deduplicate', which means the larger userSeq overrides the smaller one.

@hekaifei (Contributor) commented Jan 3, 2024:

@Zouxxyy

> When you want row-level updates, you must use 'merge-engine' = 'deduplicate', which means the larger userSeq overrides the smaller one.

In the inc-seq solution, the upper 32 bits are the auto-increasing sequence, which can cause a smaller userSeq to override a larger one.
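This point can be checked with quick arithmetic under the layout proposed in this PR (a minimal sketch; the `pack` helper is illustrative, not the PR's code):

```java
// With the auto-increment counter in the upper 32 bits, the counter
// dominates comparisons: a later write with a smaller user sequence still
// gets the larger packed value, so 'deduplicate' keeps the later row.
class IncSeqOrdering {
    static long pack(long incSeq, long userSeq) {
        return (incSeq << 32) | (userSeq & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        long first  = pack(1L, 200L); // earlier write, larger userSeq
        long second = pack(2L, 100L); // later write, smaller userSeq
        System.out.println(second > first); // prints "true"
    }
}
```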

@JingsongLi (Contributor):

The final solution should be #2811.

I will close this PR; it was a temporary solution.

@JingsongLi JingsongLi closed this Feb 23, 2024
Successfully merging this pull request may close these issues.

[Bug] spark deletes dynamic bucket table, the result is incorrect