
Newly created table does not detect commit failures #1366

Open
HaraldVanWoerkom opened this issue Nov 25, 2024 · 7 comments

Apache Iceberg version

0.7.1

Please describe the bug 🐞

I have multiple processes that write to a table. They sometimes clash, resulting in a failed commit, which I can retry (I use tenacity for that). However, if the table is freshly created, two concurrent writes are NOT detected. Both pass, while only one of them is written (probably the last one wins, I did not verify).
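The retry pattern described above can be sketched as follows. This is a minimal stdlib sketch, not the actual application code: the real code would use tenacity and catch pyiceberg's CommitFailedException, both of which are replaced by stand-ins here.

```python
import random
import time


class CommitFailedException(Exception):
    """Stand-in for pyiceberg.exceptions.CommitFailedException."""


def append_with_retry(append, max_attempts=5, base_delay=0.01):
    """Retry an append whose commit may clash with a concurrent writer."""
    for attempt in range(1, max_attempts + 1):
        try:
            return append()
        except CommitFailedException:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter before retrying the commit.
            time.sleep(base_delay * (2 ** attempt) * random.random())


# Example: an append whose commit clashes twice before succeeding.
attempts = []

def flaky_append():
    attempts.append(1)
    if len(attempts) < 3:
        raise CommitFailedException("snapshot changed")
    return "committed"

print(append_with_retry(flaky_append))  # -> committed
```

The point of the retry is that two appends do not logically conflict, so simply re-running the commit against the reloaded table state is safe.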

This probably has to do with the fact that a newly created table has no snapshot, and the commit-clash detection checks that the snapshot ID has not changed. Presumably nothing checks that, if no snapshot was present when the write started, there is still no snapshot at commit time.
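The suspected hole can be illustrated with a toy compare-and-swap model (not pyiceberg or Nessie code; all names here are invented for illustration). If the catalog treats an expected snapshot ID of None as "don't check", two first commits against an empty table both pass, and one silently overwrites the other:

```python
class ToyCatalog:
    """Toy model of a catalog ref holding a current snapshot ID."""

    def __init__(self):
        self.current_snapshot_id = None  # freshly created table: no snapshot

    def commit(self, expected_snapshot_id, new_snapshot_id, none_means_dont_care):
        # Buggy interpretation: skip the check when the expectation is None.
        if none_means_dont_care and expected_snapshot_id is None:
            pass
        elif self.current_snapshot_id != expected_snapshot_id:
            raise RuntimeError("CommitFailed: concurrent update detected")
        self.current_snapshot_id = new_snapshot_id


# Two writers both read the empty table (snapshot ID None), then commit.
buggy = ToyCatalog()
buggy.commit(None, 101, none_means_dont_care=True)
buggy.commit(None, 102, none_means_dont_care=True)  # silently wins; 101 is lost

strict = ToyCatalog()
strict.commit(None, 101, none_means_dont_care=False)
try:
    strict.commit(None, 102, none_means_dont_care=False)
except RuntimeError as e:
    print(e)  # the second writer gets a commit failure, as expected
```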

As a work-around I now add a dummy row after the table is created and then delete that row. This is a bit clunky; is there a simple way to force-create a snapshot? It seems pyiceberg tries to prevent empty snapshots, which is great unless I really want a new snapshot.

Fokko (Contributor) commented Nov 25, 2024

Hey @HaraldVanWoerkom thanks for reaching out here. Two questions:

  • Which catalog are you using?
  • Have you tried updating to 0.8.0 to check whether the issue persists?

Each table has a UUID, and that is checked when the table is created. When a table is dropped and re-created, the UUID changes. If the table is being created for the first time, the catalog ensures that no table exists yet (and fails otherwise).

HaraldVanWoerkom (Author) commented:

Hi @Fokko,

We are using Nessie 0.99.0.
I have not tried 0.8.0 yet; I'll try that.

I get the impression that we are talking about different things. I create a new table and keep it empty. Then I spawn two processes that write into that table using table.append. I expect one of the two processes to fail if they commit at the same time (see also issue 269, which you submitted: optimistic concurrency sometimes leads to failures), and since they each write a large piece of data, a clash is likely.

What I see is that neither of the processes fails, but the data from one of the processes is not written.

Both processes continue doing new append calls and these get a proper CommitFailedException when the commit clashes. The problem only occurs on a commit on an empty table.

HaraldVanWoerkom (Author) commented:

Hi @Fokko,
I verified with 0.8.0, the problem is still there.

kevinjqliu (Contributor) commented:

@HaraldVanWoerkom if you have a reproducible process, can you check whether this occurs with another catalog? For example, the REST catalog we use for integration tests:

rest:
  image: tabulario/iceberg-rest

We've seen multiple issues related to Nessie and how it represents snapshots internally. For example,
#1105

Fokko (Contributor) commented Nov 26, 2024

As a side note, we don't have a proper retry strategy in PyIceberg yet. Two appends should not cause a conflict right away, because they don't interfere. Instead, when a writer gets the CommitFailedException, it should reload the manifest list, re-apply the fast-append, and retry the commit.

Thanks for the additional context, and for testing with PyIceberg 0.8.0 as well, that's very helpful. Let me dig into the code to see if I can track down the issue and determine whether it is on the PyIceberg side or the Nessie side of things.

Fokko (Contributor) commented Nov 26, 2024

Here's the logic:

return (
    (
        AddSnapshotUpdate(snapshot=snapshot),
        SetSnapshotRefUpdate(
            snapshot_id=self._snapshot_id, parent_snapshot_id=self._parent_snapshot_id, ref_name="main", type="branch"
        ),
    ),
    (AssertRefSnapshotId(snapshot_id=self._transaction.table_metadata.current_snapshot_id, ref="main"),),
)

We always set the requirement, even if the current-snapshot-id is null.
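Under the semantics the interface documents, a catalog-side check for this requirement would look roughly like the following. This is a sketch under that reading, not the actual pyiceberg or Nessie code, and the function name is invented:

```python
def assert_ref_snapshot_id(current_snapshot_id, required_snapshot_id):
    """Validate an AssertRefSnapshotId-style requirement.

    A required snapshot ID of None means the ref must currently have
    NO snapshot (the table is still empty), not "skip the check".
    """
    if required_snapshot_id is None:
        if current_snapshot_id is not None:
            raise RuntimeError("CommitFailed: table is no longer empty")
    elif current_snapshot_id != required_snapshot_id:
        raise RuntimeError("CommitFailed: ref moved to a different snapshot")
```

With this reading, the second of two concurrent first commits fails the None check instead of being waved through.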

HaraldVanWoerkom (Author) commented:

Hi @Fokko,

Thanks for taking the time to look into this.

If I understand correctly, in the case of an empty table, you send an update request to the catalog with a requirement that the update should only be applied if the snapshot ID is None.

This sounds like a potential misinterpretation of the interface. I can see how somebody might interpret None as "don't care". That would be a wrong interpretation, as the interface description clearly specifies what None means: https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/update/__init__.py#L609.

So I should talk to the Nessie people about this.

One remaining question: until this is resolved, I would like to have a way to add a snapshot to an empty table. What would be the simplest way to do this (preferably via the pyiceberg API)?
