[Feature] Introduce deletion vectors for primary key table #2898

Zouxxyy · 2024-02-23T10:05:46Z

Motivation

Position deletion is one way to implement a Merge-On-Read (MOR) structure, and it has been adopted by other formats such as Iceberg and Delta. By combining it with Paimon's LSM tree, we can create a new mode, unique to Paimon, built on a deletion vectors index file (a deletion vector is a bitmap that identifies which row ids are deleted).

Under this mode, writing incurs extra overhead (lookups and writing the deletion vectors index file), but reading can retrieve data directly using "data + filter with deletion vector", avoiding the cost of merging different files. Furthermore, this mode can be easily integrated into native engine solutions such as Spark + Gluten in the future, thereby significantly enhancing read performance.
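To make the trade-off concrete, here is a minimal sketch of the idea, not Paimon's implementation. All names (`DataFile`, `DeletionVectorTable`, `key_index`) are hypothetical, a plain Python set stands in for the bitmap, and a dict stands in for the LSM lookup. The write path pays for a lookup and a deletion vector update; the read path only scans files and skips deleted positions.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """An immutable data file: rows are (key, value) pairs addressed by position."""
    name: str
    rows: list


@dataclass
class DeletionVectorTable:
    """Toy primary-key table; every name here is hypothetical, not a Paimon API."""
    files: list = field(default_factory=list)
    # file name -> set of deleted row positions (a real system would use a bitmap)
    deletion_vectors: dict = field(default_factory=dict)
    # key -> (file name, position) of the live row; stands in for the LSM lookup
    key_index: dict = field(default_factory=dict)

    def write(self, name, rows):
        """Write path: look up each key and mark its previous row as deleted."""
        for position, (key, _value) in enumerate(rows):
            previous = self.key_index.get(key)
            if previous is not None:
                old_file, old_position = previous
                self.deletion_vectors.setdefault(old_file, set()).add(old_position)
            self.key_index[key] = (name, position)
        self.files.append(DataFile(name, list(rows)))

    def read(self):
        """Read path: scan each file and skip deleted positions; no merge needed."""
        for data_file in self.files:
            deleted = self.deletion_vectors.get(data_file.name, set())
            for position, row in enumerate(data_file.rows):
                if position not in deleted:
                    yield row


table = DeletionVectorTable()
table.write("f0", [(1, "a"), (2, "b")])
table.write("f1", [(2, "b2"), (3, "c")])  # key 2 is overwritten
print(list(table.read()))
```

After the second write, position 1 of `f0` is marked as deleted, so the reader returns only the latest row for key 2 without comparing rows across files.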

PIP: https://cwiki.apache.org/confluence/x/Tws4EQ

JingsongLi · 2024-03-14T09:51:22Z

The following items remain:

Since there is no need to merge when reading in this mode, we can support filter pushdown of non-PK fields!
Support DV with partial-update and aggregate. The current implementation does not appear to work for these.
Support DV with first-row.
Documentation for using deletion vectors mode.
The RoaringBitmap dependency should be bundled into paimon-common.
AvroBulkFormat should return RecordWithPositionIterator.
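The first item above can be illustrated with a small sketch. It assumes a simplified model in which files are lists of `(key, value)` rows ordered oldest first; the function names are hypothetical and are not Paimon APIs. In a classic merge-on-read scan, dropping rows by a value filter before the merge can hide the newest version of a key and expose a stale one. With deletion vectors, stale rows are already excluded by position, so each file can be filtered on its own.

```python
def merge_on_read(files, predicate, push_down):
    """Classic MOR scan: for each key, the row from the newest file wins."""
    latest = {}
    for rows in files:  # oldest file first
        for key, value in rows:
            if push_down and not predicate(value):
                continue  # dropped at file level, before the merge sees it
            latest[key] = value
    return sorted((k, v) for k, v in latest.items() if predicate(v))


def read_with_deletion_vectors(files, deletion_vectors, predicate):
    """DV scan: filter every file independently and skip deleted positions."""
    result = []
    for index, rows in enumerate(files):
        deleted = deletion_vectors.get(index, set())
        for position, (key, value) in enumerate(rows):
            if position not in deleted and predicate(value):
                result.append((key, value))
    return sorted(result)


# Key 2 is updated from 50 to 5, so its old row (file 0, position 1) is deleted.
files = [[(1, 10), (2, 50)], [(2, 5)]]
deletion_vectors = {0: {1}}


def at_least_20(value):
    return value >= 20


print(merge_on_read(files, at_least_20, push_down=False))  # correct answer
print(merge_on_read(files, at_least_20, push_down=True))   # stale row leaks
print(read_with_deletion_vectors(files, deletion_vectors, at_least_20))
```

In this example the pushed-down merge returns the outdated row `(2, 50)`, while the deletion vector scan matches the result of the unfiltered merge.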

JingsongLi · 2024-03-28T05:52:11Z

Thanks @Zouxxyy , all finished!

Zouxxyy added the enhancement New feature or request label Feb 23, 2024

Zouxxyy mentioned this issue Feb 27, 2024

[core][DV] Support obtain row position when reading orc and parquet #2909

Closed

Zouxxyy changed the title ~~[Feature] Introduce position delete mode~~ [Feature] Introduce deletion vectors mode Feb 27, 2024

This was referenced Feb 28, 2024

[core] Introduce RecordWithPositionIterator interface #2916

Merged

[core] Introduce deletion vector #2923

Merged

Zouxxyy self-assigned this Feb 29, 2024

This was referenced Mar 7, 2024

[core] Integrate deletion vector to reader and writer #2958

Merged

[core] Introduce deletion files to DataSplit #2988

Merged

[core] Fix streaming & batch read dv table #3001

Merged

JingsongLi closed this as completed in #3001 Mar 13, 2024

JingsongLi reopened this Mar 14, 2024

Zouxxyy mentioned this issue Mar 15, 2024

[core] Dv table supports value filter pushdown #3024

Merged

This was referenced Mar 27, 2024

[core] Shade roaringbitmap dependency into paimon-common #3100

Merged

[core] Support dv with avro format #3105

Merged

[doc] Introduce deletion vector documentation page #3107

Merged

JingsongLi changed the title ~~[Feature] Introduce deletion vectors mode~~ [Feature] Introduce deletion vectors for primary key table Mar 28, 2024

JingsongLi closed this as completed Mar 28, 2024
