Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] first_value semantic conflict #3393

Open
2 tasks done
luowanghaoyun opened this issue May 24, 2024 · 5 comments
Open
2 tasks done

[Bug] first_value semantic conflict #3393

luowanghaoyun opened this issue May 24, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@luowanghaoyun
Copy link
Contributor

Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

0.8

Compute Engine

Flink

Minimal reproduce step

I have read this issues: #3020

I discovered this phenomenon:

CREATE TABLE T (
    k INT,
    a INT,
    b VARCHAR,
    c VARCHAR,
    d VARCHAR,
PRIMARY KEY (k) NOT ENFORCED)
 WITH ('merge-engine'='aggregation', 
'fields.b.aggregate-function'='first_value',
'fields.c.aggregate-function'='first_non_null_value',
'fields.d.aggregate-function'='first_not_null_value',
 'sequence.field' = 'a'
);
INSERT INTO T VALUES (1, 4, '4', '4', '4')
INSERT INTO T VALUES (1, 2, '2', '2', '2')
INSERT INTO T VALUES (1, 3, '3', '3', '3')

The result is [1, 4, 2, 2, 2] , as expected

If I manually trigger a compaction before the third insert, like:

INSERT INTO T VALUES (1, 4, '4', '4', '4')
INSERT INTO T VALUES (1, 2, '2', '2', '2')
CALL sys.compact('default.t', '', '', '', 'sink.parallelism=4')
INSERT INTO T VALUES (1, 3, '3', '3', '3')

The result is [1, 4, 3, 3, 3] , and there is a semantic problem.
For local debugging, in this case, there are only two values merged ​​([1, 3, 3, 3, 3], [1, 4, 2, 2, 2]) when querying.
It seems that after compaction, the previous sequence information was lost.

What doesn't meet your expectations?

The Doc. description is vague:

1716547483757

I'm not sure if there's something wrong with my understanding. Should the oldest sequence be maintained? Otherwise the compaction will lead to inconsistent first_value semantics

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@luowanghaoyun luowanghaoyun added the bug Something isn't working label May 24, 2024
@Aitozi
Copy link
Contributor

Aitozi commented May 29, 2024

Thanks @luowanghaoyun for report this. IMO, the aggregate function for first, first_non_null_value, last_non_null_value whose value are not align with the latest sequence may encounter this problem, because the actual "version number " of the value is not kept.

I try to introduce an exclusive sequence for these three kind aggregation field, it should solve this problem, as posted in #3101 But I think it's a bit burden for end user, especially for the aggregation table, the default aggregation function is last_non_null_value. And the exclusive sequence enforce user to config an individual sequence field for these three kind aggregation field.

@liyubin117
Copy link
Contributor

@Aitozi Why not keep the earliest sequence under the mentioned scenarios? I think sequence.field should also be constrainted with the aggregation function.

@Aitozi
Copy link
Contributor

Aitozi commented May 29, 2024

@liyubin117 If we have serval aggregation function fields, a single sequence.field is not enough. The first_value need earliest sequence, but last_value need the latest sequence

@liyubin117
Copy link
Contributor

@Aitozi Thanks for claring that! it is actually more complex than I thought.

@luowanghaoyun
Copy link
Contributor Author

@Aitozi hi, when is this issue expected to be fixed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants