
[Bug] Query result duplicate primary key #3841

Open
herefree opened this issue Jul 30, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@herefree
Contributor

herefree commented Jul 30, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

0.7.0-incubating

Compute Engine

Flink 1.18.0

Minimal reproduce step

We have a Flink job writing data to a Paimon table. The table options are:
+---------------------------+---------+
| key                       | value   |
+---------------------------+---------+
| bucket                    | 8       |
| scan.remove-normalize     | true    |
| deduplicate.ignore-delete | true    |
| changelog-producer        | none    |
| file.format               | parquet |
+---------------------------+---------+
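
For reference, a table with these options would typically be declared in Flink SQL roughly as below. This is only a sketch: the table and column names are placeholders, not the actual schema; only the WITH options mirror the configuration above.

-- Sketch only: names are placeholders, options mirror the reported configuration.
CREATE TABLE pk_table (
    id   STRING NOT NULL,
    col1 STRING,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'bucket' = '8',
    'scan.remove-normalize' = 'true',
    'deduplicate.ignore-delete' = 'true',
    'changelog-producer' = 'none',
    'file.format' = 'parquet'
);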

What doesn't meet your expectations?

After the job had been running for some time, we queried the table in batch mode, and some of the query results contained duplicate primary keys.
[screenshot: query results showing rows with duplicate primary keys]
Even after upgrading the Paimon version used for the query, the table still returns duplicate primary keys.
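
One way to check this in batch mode is a grouped count over the primary key. The query below is only illustrative; the table name is a placeholder.

-- Illustrative duplicate check; table name is a placeholder.
SET 'execution.runtime-mode' = 'batch';

SELECT id, COUNT(*) AS cnt
FROM pk_table
GROUP BY id
HAVING COUNT(*) > 1;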

Anything else?

I want to know what causes this problem. Is it caused by the writer operator? Does a later version fix this issue?

Are you willing to submit a PR?

  • I'm willing to submit a PR!
herefree added the bug label on Jul 30, 2024
@eric666666
Contributor

eric666666 commented Jul 30, 2024

This may happen if you changed the bucket number without overwriting the table first.

@herefree
Contributor Author

This may happen if you changed the bucket number without overwriting the table first.

The bucket number has not been modified since the table was created.

@xuzifu666
Member

xuzifu666 commented Jul 30, 2024

What does your table schema look like, and did you delete any data before? deduplicate.ignore-delete is set to true. @herefree

@herefree
Contributor Author

What does your table schema look like, and did you delete any data before? deduplicate.ignore-delete is set to true. @herefree

{
  "id" : 2,
  "fields" : [ {
    "id" : 0,
    "name" : "",
    "type" : "STRING NOT NULL",
    "description" : ""
  }, {
    "id" : 1,
    "name" : "",
    "type" : "STRING",
    "description" : ""
  },

  ......

  {
    "id" : 54,
    "name" : "",
    "type" : "STRING",
    "description" : ""
  }, {
    "id" : 55,
    "name" : "",
    "type" : "STRING",
    "description" : ""
  }, {
    "id" : 56,
    "name" : "",
    "type" : "STRING",
    "description" : ""
  }, {
    "id" : 57,
    "name" : "",
    "type" : "STRING",
    "description" : ""
  }, {
    "id" : 58,
    "name" : "",
    "type" : "STRING",
    "description" : ""
  }, {
    "id" : 59,
    "name" : "",
    "type" : "STRING",
    "description" : ""
  }, {
    "id" : 60,
    "name" : "",
    "type" : "STRING",
    "description" : ""
  }, {
    "id" : 61,
    "name" : "",
    "type" : "STRING",
    "description" : ""
  }, {
    "id" : 62,
    "name" : "",
    "type" : "STRING",
    "description" : ""
  }, {
    "id" : 63,
    "name" : "",
    "type" : "STRING",
    "description" : ""
  } ],
  "highestFieldId" : 63,
  "partitionKeys" : [ ],
  "primaryKeys" : [ "id" ],
  "options" : {
    "bucket" : "8",
    "scan.remove-normalize" : "true",
    "deduplicate.ignore-delete" : "true",
    "changelog-producer" : "none",
    "file.format" : "parquet"
  },
  "comment" : "",
  "timeMillis" : 1722325416673
}
I didn't delete data before, but the changelog of the upstream table may contain -D (delete) records. I set deduplicate.ignore-delete to true only because I don't want -D records written into this table, and so that some Flink jobs do not receive -D records when consuming it.

@herefree
Contributor Author

[screenshot: the duplicate rows fall within the same bucket]
I also found the duplicate data within the same bucket.
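
If it helps with diagnosis, the file layout per bucket can be inspected through Paimon's $files system table. The column names here are assumed from the Paimon documentation, and the table name is a placeholder.

-- Sketch: list data files per bucket and level ($files columns assumed from the Paimon docs).
SELECT bucket, level, file_path, record_count
FROM `pk_table$files`
ORDER BY bucket, level;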

@herefree
Contributor Author

herefree commented Aug 1, 2024

What does your table schema look like, and did you delete any data before? deduplicate.ignore-delete is set to true. @herefree

After setting deduplicate.ignore-delete = false, I no longer see duplicate primary keys, but I am not sure whether later versions fix this problem.
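
For reference, the option can be changed on an existing table with an ALTER TABLE statement; the table name below is a placeholder.

-- Sketch: disable ignoring of delete records on an existing table (placeholder table name).
ALTER TABLE pk_table SET ('deduplicate.ignore-delete' = 'false');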

@discivigour
Contributor

@herefree Could you give detailed minimal reproduction steps so that we can reproduce this bug?
