TLDR: A conditional delete on a partitioned Iceberg table yields incorrect results and deletes more records than expected.
I encountered the following issue while experimenting with pyiceberg, trying to delete some records from a partitioned table based on a condition. I am using the latest version of pyiceberg (0.7.0) with the Glue catalog.
I have a pandas DataFrame with 100 records, populated as follows:
- id: a random integer between 1 and 10,000
- created_at: a random date between 2024-01-01 and 2024-02-01
- relevancy_score: a random float between 0 and 1, drawn from a beta distribution with alpha=2, beta=20 (so most values fall below 0.1)
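The accompanying notebook is not reproduced in the issue text, so the setup below is a minimal sketch: the catalog configuration, the table identifier docs_db.docs, and the exact schema are assumptions, and the day partition on created_at is inferred from the files/days behavior described further down.

import numpy as np
import pandas as pd
from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import DayTransform
from pyiceberg.types import DateType, DoubleType, LongType, NestedField

# Build the 100-record DataFrame described above
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "id": rng.integers(1, 10_001, size=100),
    "created_at": (pd.Timestamp("2024-01-01")
                   + pd.to_timedelta(rng.integers(0, 32, size=100), unit="D")).date,
    "relevancy_score": rng.beta(2.0, 20.0, size=100),
})

# Create a table partitioned by day(created_at) in the Glue catalog
catalog = load_catalog("glue", **{"type": "glue"})
schema = Schema(
    NestedField(field_id=1, name="id", field_type=LongType(), required=False),
    NestedField(field_id=2, name="created_at", field_type=DateType(), required=False),
    NestedField(field_id=3, name="relevancy_score", field_type=DoubleType(), required=False),
)
spec = PartitionSpec(
    PartitionField(source_id=2, field_id=1000, transform=DayTransform(), name="created_at_day")
)
iceberg_table = catalog.create_table("docs_db.docs", schema=schema, partition_spec=spec)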
I append my df to the table:

import pyarrow as pa

arrow_schema = iceberg_table.schema().as_arrow()
docs_table = pa.Table.from_pandas(df, schema=arrow_schema)

# Append the data to the Iceberg table
iceberg_table.append(docs_table)
and I get the following info:
from pyiceberg.expressions import GreaterThanOrEqual, LessThan

total_records = iceberg_table.scan().to_pandas().shape
print(f"Total records after append: {total_records}")

lower_before = iceberg_table.scan(row_filter=LessThan("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score < 0.1 before deletion: {lower_before}")

greater_before = iceberg_table.scan(row_filter=GreaterThanOrEqual("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score >= 0.1 before deletion: {greater_before}")
Total records after append: (100, 3)
Records with relevancy_score < 0.1 before deletion: (64, 3)
Records with relevancy_score >= 0.1 before deletion: (36, 3)
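The delete call itself is missing from the issue text above; reconstructed from the before/after counts, it was presumably a predicate delete along these lines (the exact expression is an assumption, not verbatim from the notebook):

# Delete all records with relevancy_score >= 0.1;
# the 64 records below 0.1 should survive the deletion
iceberg_table.delete(GreaterThanOrEqual("relevancy_score", 0.1))

After the deletion I check the counts again: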
# Verify the total number of records in the table after deletion
remaining_records = iceberg_table.scan().to_pandas().shape
print(f"Total records after deletion: {remaining_records}")
# Verify the number of records that meet the relevancy_score condition after deletion
lower_after = iceberg_table.scan(row_filter=LessThan("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score < 0.1 after deletion: {lower_after}")
greater_after = iceberg_table.scan(row_filter=GreaterThanOrEqual("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score >= 0.1 after deletion: {greater_after}")
and the results are quite unexpected:
Total records after deletion: (26, 3)
Records with relevancy_score < 0.1 after deletion: (26, 3)
Records with relevancy_score >= 0.1 after deletion: (0, 3)
Given the delete predicate, 64 records (those with relevancy_score < 0.1) should have survived, yet only 26 remain. After checking the manifest files after the deletion, what seems to be happening is:
- The snapshot correctly keeps the files/days that contain no records with relevancy_score >= 0.1.
- The snapshot correctly marks as deleted the files/days that contain no records with relevancy_score < 0.1.
- The snapshot incorrectly drops entirely the files/days that contain both records with relevancy_score >= 0.1 and records with relevancy_score < 0.1, instead of rewriting them to keep only the surviving rows.
For example, consider the following slice of the dataset, all falling in the 2024-01-02 partition:

id    created_at  relevancy_score
808   2024-01-02  0.18115229995825877
1424  2024-01-02  0.13995060752152988
8026  2024-01-02  0.05992063209461162
The .avro manifest files post-deletion do not reference any data files of the 2024-01-02 partition whatsoever, when the partition should have been rewritten to keep the record with id=8026 (relevancy_score ≈ 0.06 < 0.1).
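One way to see which data files survive (a sketch; the notebook may have read the Avro manifests directly) is to list the files the current snapshot still references:

# Print every data file (and its partition) planned by a full scan;
# after the faulty delete, nothing from the 2024-01-02 partition appears
for task in iceberg_table.scan().plan_files():
    print(task.file.file_path, task.file.partition)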
I'm attaching an accompanying notebook that I used to reproduce the issue.
I did not encounter a similar issue when working with a non-partitioned table.