LanceDB - Remove Orphaned Chunks (#1620)
* Add tests for LanceDB chunking and merging functionality
* Add TSplitter type alias for LanceDB document splitting function
* Refine typing for chunks
* Add type definitions for chunk splitter function and related types
* Remove unused ChunkInputT, ChunkOutputT, and TSplitter type definitions
* Implement efficient update strategy for chunked documents in LanceDB
* Implement efficient update strategy for chunked documents in LanceDB
* Refactor LanceDB client and tests for improved readability and type safety
* Linting
* Add document_id parameter to lancedb_adapter and update merge logic
* Remove resolved comments
* Implement efficient orphan removal for chunked documents in LanceDB
* Implement efficient update strategy for chunked documents in LanceDB
* Add test for removing orphaned records in LanceDB
* Update LanceDB orphaned records removal test for chunked documents
* Set test pipeline as dev mode
* Fix write disposition check in LanceDBRemoveOrphansJob execute method
* Add FollowupJob trait to LoadLanceDBJob
* Fix file type
* Fix file typing
* Add test for removing orphaned records in LanceDB root table
* Enhance LanceDB test to cover nested child removal and update scenarios
* Use doc id hint for top level tables
* Only join on join columns for orphan removal job
* Add ollama to supported embedding providers and test orphaned record removal with embeddings
* Add merge_key to document resource for efficient updates in LanceDB
* Formatting
* Set default file size to 128MB
* Only use parquet loader file formats
* Import pyarrow.parquet
* Remove recommended file size from LanceDB destination capabilities
* Update LanceDB client to use more efficient batch processing methods on loading for Load Jobs
* Refactor unique identifier handling for LanceDB tables
* Optimize UUID column generation for LanceDB tables
* Refactor LanceDBClient to use string type hints for Table
* Minor refactor
* Implement efficient schema update with Nullability support
* Optimize orphaned chunks removal for large datasets
* Projection pushdown
* Format
* Prevent primary key and document ID hint conflict in merge disposition
* Add recommended file size for LanceDB destination
* Improve comment clarity for projection push-down in LanceDB
* Update to new load interface
* Remove unnecessary LanceDBLoadJob attributes
* Change instance attributes to `run` method as variables
* Schedule follow up reference job
* Add follow up lancedb remove orphan job skeleton
* Write empty follow up file
* Write parquet
* Add support for reference file format in LanceDB destination
* Handle parent table name resolution if it doesn't exist in Lance db remove orphan job
* Refactor specialised orphan follow up job back to reference job
* Refactor orphan removal for chunked documents
* Fix dlt system table check for name instead of object
* Implement staging methods
* Override staging client methods
* Docs
* Override staging client methods
* Delete with inserts
* Keep with batch reader
* Remove Lancedb client's staging implementation
* Insert in memory arrow table. This will be optimized
* Rename classes to the new job implementation classes
* Use namedtuple for table chain to improve readability
* Remove orphans by loading all ancestor IDs simultaneously
* Fix doc_id adapter
* Fix doc_id adapter
* Revert to previous
* Revert "Remove orphans by loading all ancestor IDs simultaneously"
  This reverts commit 06e04d9.
* Remove doc_id hint
* Infer merge key if not supplied from provided primary key
* Remove unused utility functions
* Remove LanceDB doc ID hints and use schema normalizer
* LanceDB writes strange code
* Minor Formatting
* Support compound primary and merge keys
* Remove old comment
* - Change default vector column name to "vector" to conform with lancedb standard
  - Add search tests with tantivy as search engine
* Format and fix linting
* Add custom embedding function registration test
* Spawn process in test to make sure registry can be deserialized from arrow files
* Simplify null string handling
* Change NULL string replacement with random string, doc clarification
* Update default vector column name in docs
* Set `remove_orphans` flag to False on tests that don't require it
* Implement starter arrow string placeholder function
* Add test for empty arrow string element vectorised replacement utility function
* Handle NULL values in addition to empty strings in arrow substitution method
* More efficient empty value replacement with canonical arrow usage
* Format
* Bump pyarrow version
* Use pa.nulls instead of [None]*len
* Update tests
* Invert remove orphans flag
* Implement root table orphan deletion, only integer doc_ids
* Cater for string ids as well in doc_id removal process
* Fix test with wrong primary key
* Just send list of ids as is. don't pc.compute on client end
* Extract schema matching into utils
* Add utils
* Pass all tests
* Minor format and cleanup
* Docs
* Amend replace test to test with large number of records to catch race conditions with replace disposition
* Fix replace race conditions by delegating truncation to dlt
* Update lock file
* Refactor type mapping and schema handling in LanceDB client
* Change 'complex' column type to 'json' in LanceDB client
* update lock file
* fixes generating lancedb literals
* verifies merge key early, fixes column override in adapters
* fixes linting errors

---------

Signed-off-by: Marcel Coetzee <[email protected]>
Co-authored-by: Marcin Rudolf <[email protected]>
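
For readers unfamiliar with the term, the orphan removal described in this commit can be illustrated with a small, library-free sketch. It is not the code from this change and does not use the dlt or LanceDB APIs. The function name `remove_orphaned_chunks` and the column names `doc_id` and `chunk_id` are hypothetical, chosen only for illustration. The assumed rule is: when a document is re-loaded with merge disposition, any previously stored chunk of that document that is absent from the new load is an orphan and is dropped, while chunks of documents not in the load are left alone.

```python
from typing import Dict, List

Row = Dict[str, str]


def remove_orphaned_chunks(
    existing: List[Row],
    incoming: List[Row],
    merge_key: str = "doc_id",      # hypothetical column identifying the document
    unique_key: str = "chunk_id",   # hypothetical column identifying a chunk
) -> List[Row]:
    """Merge `incoming` into `existing`, dropping orphaned chunks.

    A chunk is treated as orphaned when its document appears in the
    incoming load but the chunk itself does not.
    """
    reloaded_docs = {row[merge_key] for row in incoming}
    incoming_ids = {row[unique_key] for row in incoming}

    # Keep rows of untouched documents, plus rows that are re-sent in this load.
    kept = [
        row
        for row in existing
        if row[merge_key] not in reloaded_docs or row[unique_key] in incoming_ids
    ]

    # Upsert: incoming rows replace stored rows with the same unique key.
    merged = {row[unique_key]: row for row in kept}
    for row in incoming:
        merged[row[unique_key]] = row
    return list(merged.values())


existing = [
    {"doc_id": "1", "chunk_id": "a", "text": "alpha"},
    {"doc_id": "1", "chunk_id": "b", "text": "beta"},
    {"doc_id": "1", "chunk_id": "c", "text": "gamma"},
    {"doc_id": "2", "chunk_id": "d", "text": "delta"},
]
incoming = [
    {"doc_id": "1", "chunk_id": "a", "text": "alpha v2"},
    {"doc_id": "1", "chunk_id": "e", "text": "epsilon"},
]

result = remove_orphaned_chunks(existing, incoming)
print([row["chunk_id"] for row in result])  # ['a', 'd', 'e']
```

Chunks `b` and `c` are removed because document `1` was re-loaded without them; chunk `d` survives because document `2` was not part of the load.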