Don't use Custom Embedding Functions #1771

Pipboyguy · 2024-08-29T20:58:17Z

Description

The OpenAI embedding service doesn't accept empty strings as input. We used to work around this by overriding the whole OpenAIEmbedding function.

This caused more problems than it solved: the LanceDB registry doesn't keep track of the custom function well, and the Arrow metadata parsing and deserialisation involved is very finicky.

This PR simplifies the fix by replacing empty strings with a placeholder that should be semantically dissimilar to almost all queries. Ideally, the embedding vectors of empty strings would be pinned at the origin, but that should be handled upstream by LanceDB.
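
As a rough illustration of the replacement described above, here is a minimal sketch in plain Python. The constant name `EMPTY_STRING_PLACEHOLDER` appears in this PR's diff; the function name `replace_empty_strings` and the `embedding_fields` parameter are hypothetical and are not taken from the dlt codebase.

```python
from typing import Any, Dict, List, Sequence

# Constant name as seen in the PR diff; the value here is only illustrative.
EMPTY_STRING_PLACEHOLDER = "__EMPTY_STRING_PLACEHOLDER__"


def replace_empty_strings(
    records: List[Dict[str, Any]],
    embedding_fields: Sequence[str],
    placeholder: str = EMPTY_STRING_PLACEHOLDER,
) -> List[Dict[str, Any]]:
    """Return copies of `records` where empty strings in the fields that
    will be embedded are swapped for `placeholder`.

    Hypothetical helper: None values and non-embedded fields are left alone.
    """
    cleaned: List[Dict[str, Any]] = []
    for record in records:
        new_record = dict(record)  # shallow copy, the input is not mutated
        for field in embedding_fields:
            if new_record.get(field) == "":
                new_record[field] = placeholder
        cleaned.append(new_record)
    return cleaned
```

Only exact empty strings are replaced in this sketch; whether whitespace-only strings need the same treatment depends on how the embedding service handles them, which is not covered here.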

The default vector column name is also changed to "vector", matching LanceDB's default vector column name, which makes onboarding and setup easier.
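
The fallback to the new default can be pictured with a small stand-alone sketch. It mirrors the `... or "vector"` pattern used in this PR's test utilities, but reads from a plain mapping instead of `dlt.config`; the function name and the mapping key layout are assumptions made for illustration.

```python
from typing import Mapping, Optional

# New default from this PR, chosen to match LanceDB's default column name.
DEFAULT_VECTOR_FIELD_NAME = "vector"


def resolve_vector_field_name(config: Mapping[str, Optional[str]]) -> str:
    """Return the configured vector column name, or the default.

    Hypothetical helper: `or` falls back both when the key is missing
    (None) and when it is set to an empty string.
    """
    return config.get("vector_field_name") or DEFAULT_VECTOR_FIELD_NAME
```

With this pattern an explicitly configured name still wins, so existing setups that set their own vector field name are unaffected by the new default.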

Related Issues

Fixes LanceDB destination: can't query generated tables #1765

Additional Context

See lancedb/lancedb#1577

netlify · 2024-08-29T20:58:34Z

✅ Deploy Preview for dlt-hub-docs ready!

Name	Link
🔨 Latest commit	`9a347e6`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/66d3421eb1e42d00088bb147
😎 Deploy Preview	https://deploy-preview-1771--dlt-hub-docs.netlify.app

zilto · 2024-08-29T21:21:03Z

dlt/destinations/impl/lancedb/lancedb_client.py

@@ -731,6 +720,19 @@ def run(self) -> None:
        with FileStorage.open_zipsafe_ro(self._file_path) as f:
            records: List[DictStrAny] = [json.loads(line) for line in f]

+        # Replace empty strings with placeholder string if OpenAI is used.


Can't tell the impact on performance, but I think it's a good fix until there's progress on the LanceDB issue!

I don't know how frequently you'd hit an empty string when embedding, but it might be worth mentioning in the docs?

@Pipboyguy didn't we switch the format to parquet? I think that is in a PR that is still in review. Anyway, we'll soon be able to use pa.compute to replace those.

@rudolfix yes indeed, it does make it a bit tricky to implement a fix considering the switch in format.

@zilto agreed, will add a doc entry for this!

zilto · 2024-08-29T21:21:37Z

tests/load/lancedb/utils.py

@@ -52,7 +52,7 @@ def assert_table(
        "_dlt_id",
        "_dlt_load_id",
        dlt.config.get("destination.lancedb.credentials.id_field_name", str) or "id__",
-        dlt.config.get("destination.lancedb.credentials.vector_field_name", str) or "vector__",
+        dlt.config.get("destination.lancedb.credentials.vector_field_name", str) or "vector",


I think using vector is nice because it aligns with the lancedb defaults.

rudolfix

Thanks for working on this and for the tests! My single request is to reconsider the "dissimilar" token.

rudolfix · 2024-08-30T09:24:26Z

dlt/destinations/impl/lancedb/lancedb_client.py

@@ -81,6 +78,7 @@

 TIMESTAMP_PRECISION_TO_UNIT: Dict[int, str] = {0: "s", 3: "ms", 6: "us", 9: "ns"}
 UNIT_TO_TIMESTAMP_PRECISION: Dict[str, int] = {v: k for k, v in TIMESTAMP_PRECISION_TO_UNIT.items()}
+EMPTY_STRING_PLACEHOLDER = "__EMPTY_STRING_PLACEHOLDER__"


Use some random string. Who knows what kind of tokenizer may be used against it... OpenAI may embed this as separate words.

Ahh, good point! You're right, I'll replace it with a randomly generated string.
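
One way to produce such a random placeholder is sketched below using only the standard library. This is an assumption about how a string like this could be generated, not the code that was committed; the function name `make_random_placeholder` is hypothetical.

```python
import secrets
import string


def make_random_placeholder(length: int = 20) -> str:
    """Build a random alphanumeric string with no separators.

    Hypothetical helper: avoiding underscores and dictionary words such as
    "EMPTY" or "PLACEHOLDER" reduces the chance that a tokenizer splits the
    value into meaningful words.
    """
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))
```

In practice the string would probably be generated once and then hardcoded as a constant, so that every load uses the same placeholder.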

rudolfix · 2024-08-30T09:29:18Z

dlt/destinations/impl/lancedb/lancedb_client.py

@@ -731,6 +720,19 @@ def run(self) -> None:
        with FileStorage.open_zipsafe_ro(self._file_path) as f:
            records: List[DictStrAny] = [json.loads(line) for line in f]

+        # Replace empty strings with placeholder string if OpenAI is used.


@Pipboyguy didn't we switch the format to parquet? I think that is in a PR that is still in review. Anyway, we'll soon be able to use pa.compute to replace those.


rudolfix

LGTM!

Pipboyguy added 5 commits August 29, 2024 13:33

- d1e4173: Change default vector column name to "vector" to conform with lancedb standard; add search tests with tantivy as search engine
- 613f5bc: Format and fix linting
- 703c4a8: Add custom embedding function registration test
- c07c8fc: Spawn process in test to make sure registry can be deserialized from arrow files
- 8afa7e1: Simplify null string handling

All five commits are signed off by Marcel Coetzee.

Pipboyguy added the bug, destination, and community labels Aug 29, 2024

Pipboyguy self-assigned this Aug 29, 2024

Pipboyguy linked an issue Aug 29, 2024 that may be closed by this pull request

LanceDB destination: can't query generated tables #1765

Closed

Pipboyguy requested review from rudolfix and sh-rp and removed request for rudolfix August 29, 2024 20:58

zilto reviewed Aug 29, 2024


rudolfix requested changes Aug 30, 2024


- 2395432: Change NULL string replacement with random string, doc clarification (signed off by Marcel Coetzee)

Pipboyguy requested review from rudolfix and zilto August 30, 2024 12:32

- 9a347e6: Update default vector column name in docs (signed off by Marcel Coetzee)

rudolfix approved these changes Sep 1, 2024


rudolfix merged commit dd973c5 into devel Sep 3, 2024
55 of 56 checks passed

rudolfix deleted the 1765-lancedb-destination-cant-query-generated-tables branch September 3, 2024 08:20

rudolfix mentioned this pull request Sep 13, 2024

1.0.0 announcement and release notes #1778

Closed

QianZhu mentioned this pull request Nov 25, 2024

Empty String Causes Embedding Error with OpenAI Endpoint lancedb/lancedb#1577

Closed
