[core] Add auto increase sequence padding #2512
Conversation
@@ -459,7 +459,9 @@ public class CoreOptions implements Serializable {
                 text(
                         "3. \"millis-to-micro\": Pads the sequence field that indicates time with precision of milli-second to micro-second."),
                 text(
-                        "4. Composite pattern: for example, \"second-to-micro,row-kind-flag\"."))
+                        "4. \"inc-seq\": Pads the sequence field with auto increase sequence."),
So for inc-seq, must int and long values be seconds? You should document this.
"4. Composite pattern: for example, \"second-to-micro,row-kind-flag\".")) | ||
"4. \"inc-seq\": Pads the sequence field with auto increase sequence."), | ||
text( | ||
"5. Composite pattern: for example, \"second-to-micro,row-kind-flag\".")) |
Is "inc-seq,row-kind-flag" supported? I think that would be the most popular composite pattern.
inc-seq + row-kind-flag will further reduce the accuracy. Could there be a situation where +U comes first and then -U comes, within the same second?
Should throw exception...
if (maxSequenceNumber == -1) {
    this.nextIncSequenceNumber = 0;
} else if (incSeqPadding) {
    this.nextIncSequenceNumber = (maxSequenceNumber & INC_SEQ_MASK) + INC;
We should merge this into SequenceGenerator. There should be some refactoring.
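The derivation of the next auto-increment sequence quoted above can be sketched as a standalone method. Note the constants INC_SEQ_MASK and INC are not defined in this thread; the values below are assumptions consistent with the upper-32-bit layout described in the PR description, and the class name is hypothetical:

```java
public class IncSeqSketch {

    // Assumed: the upper 32 bits of _SEQUENCE_NUMBER hold the auto-increment part.
    static final long INC_SEQ_MASK = 0xFFFFFFFF00000000L;
    // Assumed: one increment step bumps the upper 32 bits by one.
    static final long INC = 1L << 32;

    // Given the last committed sequence number, keep its increment part
    // and advance it by one step; an empty bucket (-1) starts from zero.
    static long nextIncSequenceNumber(long maxSequenceNumber) {
        if (maxSequenceNumber == -1) {
            return 0;
        }
        return (maxSequenceNumber & INC_SEQ_MASK) + INC;
    }

    public static void main(String[] args) {
        // increment part 2 with some user sequence in the lower bits
        long max = 0x00000002_0000ABCDL;
        // next value carries increment part 3 and clears the user part
        System.out.println(nextIncSequenceNumber(max) >>> 32); // 3
        System.out.println(nextIncSequenceNumber(-1L));        // 0
    }
}
```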
spark.sql("DELETE FROM T WHERE id = 1")

val rows1 = spark.sql("SELECT * FROM T").collectAsList()
Please use checkDataset or checkAnswer to replace collectAsList + assertThat.
@@ -459,7 +459,9 @@ public class CoreOptions implements Serializable {
                 text(
                         "3. \"millis-to-micro\": Pads the sequence field that indicates time with precision of milli-second to micro-second."),
                 text(
-                        "4. Composite pattern: for example, \"second-to-micro,row-kind-flag\"."))
+                        "4. \"inc-seq\": Pads the sequence field with auto increase sequence."),
seconds-and-inc
@Zouxxyy If the amount of data is relatively large and the number of rows reaches 5 billion, while the auto-increment is only 32 bits, what will happen if it is not enough?
@hekaifei The system will crash; the current design only supports 2,147,483,647 items per bucket. Are your 5 billion rows in one bucket or in total?
@Zouxxyy A few million per bucket right now. I worry that if the job keeps running and the data is updated frequently, 32 bits will not be enough.
@Zouxxyy There are several tables with more than 5 billion rows, divided into 1,000 to 2,000 buckets.
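A back-of-envelope check of the 32-bit headroom discussed above (the figures are taken from this thread; the helper method is illustrative, not part of the PR):

```java
public class IncSeqCapacity {

    // Roughly how many times every row in a bucket could be rewritten
    // before a 32-bit per-bucket increment counter overflows.
    static long rewritesBeforeOverflow(long totalRows, int buckets) {
        long rowsPerBucket = totalRows / buckets;
        return Integer.MAX_VALUE / rowsPerBucket; // 2,147,483,647 increments available
    }

    public static void main(String[] args) {
        // Thread's numbers: 5 billion rows over 1,000 buckets = 5 million rows/bucket.
        // The counter survives ~429 full rewrites of the bucket, so a long-running,
        // frequently updating job can still exhaust the 32-bit space over time.
        System.out.println(rewritesBeforeOverflow(5_000_000_000L, 1000)); // 429
    }
}
```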
@Zouxxyy I think userSeq should be the upper 32 bits; otherwise, when the same key is inserted later with a smaller userSeq than before, the earlier data will be overwritten, which is not expected.
Yes, frequent modifications will also cause the solution to collapse.
When you want to use row-level update, you must use 'merge-engine' = 'deduplicate', which means that the larger userSeq overrides the smaller one.
In the inc-seq solution, the upper 32 bits are the auto-increasing sequence, which can make a record with a smaller userSeq override one with a larger userSeq.
The final solution should be #2811. I will close this PR; this PR is a temporary solution.
Purpose
Linked issue: close #2471
Add a new seq pad mode, inc-seq, which pads the sequence field with an auto increase sequence, to solve the following problems:

1. When the user defines sequence.field (for example, sequence.field is a timestamp with second precision, and multiple records fall in the same second), the order of records cannot be determined.
2. Where the sequence.field of records must be the same, their order cannot be guaranteed.

A better solution is to introduce a new comparison field; this PR is a trade-off. It reuses the _SEQUENCE_NUMBER and divides it into two parts: one part stores the auto-increment sequence (upper 32 bits) and the other stores the user-defined sequence.field transformed to long (lower 32 bits).

Notice !!!
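The 64-bit split described above can be sketched as a pack/unpack pair. The class and method names are hypothetical, not the PR's actual code; only the layout (increment in the upper 32 bits, user sequence in the lower 32 bits) comes from the PR description:

```java
public class PaddedSequence {

    // Pack an auto-increment counter (upper 32 bits) together with the
    // user-defined sequence.field transformed to long (lower 32 bits).
    static long pack(long incSeq, long userSeq) {
        return (incSeq << 32) | (userSeq & 0xFFFFFFFFL);
    }

    static long incPart(long packed) {
        return packed >>> 32;
    }

    static long userPart(long packed) {
        return packed & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        long epochSecond = 1700000000L; // two writes within the same second
        long first = pack(1, epochSecond);
        long second = pack(2, epochSecond);
        // the auto-increment part breaks the tie between equal userSeq values
        System.out.println(first < second); // true
    }
}
```

Note the caveat raised in the discussion: because the increment occupies the upper bits, a later write always compares greater, even when its userSeq is smaller than the earlier one's.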
Tests
API and Format
Documentation