
Don't use Custom Embedding Functions #1771

Merged

Conversation

Pipboyguy
Collaborator

@Pipboyguy Pipboyguy commented Aug 29, 2024

Description

The OpenAI embedding service doesn't accept empty strings as input. We used to deal with this by overriding the whole OpenAIEmbedding function.

This caused more grief than it fixed, since the LanceDB registry doesn't keep track of the override well and the Arrow metadata parsing and de-serialisation are very finicky.

We simplify the fix by replacing empty strings with a placeholder that should be very semantically dissimilar to 99.9% of queries. Ideally, the null strings' embedding vectors themselves should be pinned at the origin, but that should be handled upstream in LanceDB.
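A minimal sketch of the placeholder approach described above, assuming records are plain dicts and the columns to embed are known up front. The helper name `replace_empty_strings` and its `embedding_fields` parameter are hypothetical and are not taken from this PR; only the `EMPTY_STRING_PLACEHOLDER` constant appears in the diff.

```python
from typing import Any, Dict, Iterable, List

# Placeholder value as it appears in the first revision of this PR's diff.
EMPTY_STRING_PLACEHOLDER = "__EMPTY_STRING_PLACEHOLDER__"


def replace_empty_strings(
    records: Iterable[Dict[str, Any]],
    embedding_fields: Iterable[str],
    placeholder: str = EMPTY_STRING_PLACEHOLDER,
) -> List[Dict[str, Any]]:
    """Return copies of `records` with "" replaced in `embedding_fields`.

    Hypothetical helper: only exact empty strings are replaced; None and
    whitespace-only values are left untouched.
    """
    fields = set(embedding_fields)
    cleaned: List[Dict[str, Any]] = []
    for record in records:
        new_record = dict(record)  # shallow copy, the input is not mutated
        for name in fields:
            if new_record.get(name) == "":
                new_record[name] = placeholder
        cleaned.append(new_record)
    return cleaned


records = [{"id": 1, "text": ""}, {"id": 2, "text": "hello"}]
cleaned = replace_empty_strings(records, ["text"])
```

Only the fields that are sent to the embedding service need the substitution; other empty strings can stay as they are.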

The default vector column name is also changed to "vector" to match LanceDB's default vector column name, which makes onboarding and setup easier.

Related Issues

Additional Context

See lancedb/lancedb#1577

…db standard

- Add search tests with tantivy as search engine

Signed-off-by: Marcel Coetzee <[email protected]>
@Pipboyguy Pipboyguy added the bug (Something isn't working), destination (Issue related to new destinations), and community (This issue came from slack community workspace) labels Aug 29, 2024
@Pipboyguy Pipboyguy self-assigned this Aug 29, 2024
@Pipboyguy Pipboyguy linked an issue Aug 29, 2024 that may be closed by this pull request
@Pipboyguy Pipboyguy requested review from rudolfix and sh-rp and removed request for rudolfix August 29, 2024 20:58

netlify bot commented Aug 29, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit 9a347e6
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/66d3421eb1e42d00088bb147
😎 Deploy Preview https://deploy-preview-1771--dlt-hub-docs.netlify.app

@@ -731,6 +720,19 @@ def run(self) -> None:
        with FileStorage.open_zipsafe_ro(self._file_path) as f:
            records: List[DictStrAny] = [json.loads(line) for line in f]

        # Replace empty strings with placeholder string if OpenAI is used.
Collaborator

Can't tell the impact on performance, but I think it's a good fix until there's progress on the LanceDB issue!

I don't know how frequently you'd hit an empty string when embedding, but it might be worth mentioning in the docs?

Collaborator

@Pipboyguy didn't we switch the format to parquet? I think it is in a PR that is still in review. Anyway, we'll be able to use pa.compute to replace those soon.

Collaborator Author

@rudolfix yes indeed, it does make it a bit tricky to implement a fix considering the switch in format.

Collaborator Author

@zilto agreed, will add a doc entry for this!

@@ -52,7 +52,7 @@ def assert_table(
"_dlt_id",
"_dlt_load_id",
dlt.config.get("destination.lancedb.credentials.id_field_name", str) or "id__",
- dlt.config.get("destination.lancedb.credentials.vector_field_name", str) or "vector__",
+ dlt.config.get("destination.lancedb.credentials.vector_field_name", str) or "vector",
Collaborator

I think using vector is nice because it aligns with the lancedb defaults.
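The `configured value or default` pattern in the test above can be sketched in isolation. This assumes the configuration is available as a plain mapping; `resolve_vector_field_name` and `DEFAULT_VECTOR_FIELD_NAME` are hypothetical names used for illustration, and only the config key and the "vector" default come from the diff.

```python
from typing import Mapping, Optional

# New default from this PR; the previous default was "vector__".
DEFAULT_VECTOR_FIELD_NAME = "vector"
CONFIG_KEY = "destination.lancedb.credentials.vector_field_name"


def resolve_vector_field_name(config: Mapping[str, Optional[str]]) -> str:
    """Return the configured vector column name, or the default if unset or empty."""
    return config.get(CONFIG_KEY) or DEFAULT_VECTOR_FIELD_NAME


default_name = resolve_vector_field_name({})
custom_name = resolve_vector_field_name({CONFIG_KEY: "embedding"})
```

Because `or` is used, an empty string in the config also falls back to the default.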

Collaborator

@rudolfix rudolfix left a comment


Thanks for working on this and for the tests! My single request is to reconsider the "dissimilar" token.

@@ -81,6 +78,7 @@

TIMESTAMP_PRECISION_TO_UNIT: Dict[int, str] = {0: "s", 3: "ms", 6: "us", 9: "ns"}
UNIT_TO_TIMESTAMP_PRECISION: Dict[str, int] = {v: k for k, v in TIMESTAMP_PRECISION_TO_UNIT.items()}
EMPTY_STRING_PLACEHOLDER = "__EMPTY_STRING_PLACEHOLDER__"
Collaborator

Use some random string. Who knows what kind of tokenizer may be used against it... OpenAI may embed this as separate words.

Collaborator Author

Ahh, good point! You're right, I'll replace it with a randomly generated string.
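One way to produce such a random string, as a sketch only: the function name `generate_placeholder`, the length of 20, and the alphanumeric alphabet are assumptions, not what the PR ended up using.

```python
import secrets
import string


def generate_placeholder(length: int = 20) -> str:
    """Return a random alphanumeric string with no recognisable words (hypothetical helper)."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


# Generate a candidate once; the value differs on every call.
candidate = generate_placeholder()
```

If the placeholder has to be recognised again later, for example to map it back to an empty string, the generated value would need to be stored as a fixed constant rather than regenerated on every run.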


@Pipboyguy Pipboyguy requested review from rudolfix and zilto August 30, 2024 12:32
Collaborator

@rudolfix rudolfix left a comment


LGTM!

@rudolfix rudolfix merged commit dd973c5 into devel Sep 3, 2024
55 of 56 checks passed
@rudolfix rudolfix deleted the 1765-lancedb-destination-cant-query-generated-tables branch September 3, 2024 08:20

Successfully merging this pull request may close these issues.

LanceDB destination: can't query generated tables
3 participants