-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't use Custom Embedding Functions #1771
Don't use Custom Embedding Functions #1771
Conversation
…db standard - Add search tests with tantivy as search engine Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
…arrow files Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
✅ Deploy Preview for dlt-hub-docs ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
@@ -731,6 +720,19 @@ def run(self) -> None: | |||
with FileStorage.open_zipsafe_ro(self._file_path) as f: | |||
records: List[DictStrAny] = [json.loads(line) for line in f] | |||
|
|||
# Replace empty strings with placeholder string if OpenAI is used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't tell the impact on performance, but I think it's a good fix until there's progress on the LanceDB issue!
I don't know how frequently you'd hit an empty string when embedding, but it might be worth mentioning in the docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Pipboyguy didn't we switch the format to parquet? I think it is in PR that is still in review. anyway we'll be able to use pa.compute
to replace those soon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rudolfix yes indeed, it does make it a bit tricky to implement a fix considering the switch in format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zilto agreed, will add a doc entry for this!
@@ -52,7 +52,7 @@ def assert_table( | |||
"_dlt_id", | |||
"_dlt_load_id", | |||
dlt.config.get("destination.lancedb.credentials.id_field_name", str) or "id__", | |||
dlt.config.get("destination.lancedb.credentials.vector_field_name", str) or "vector__", | |||
dlt.config.get("destination.lancedb.credentials.vector_field_name", str) or "vector", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using vector
is nice because it aligns with the lancedb defaults.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for working on this and for the tests! My single request is to reconsider the "dissimilar" token
@@ -81,6 +78,7 @@ | |||
|
|||
TIMESTAMP_PRECISION_TO_UNIT: Dict[int, str] = {0: "s", 3: "ms", 6: "us", 9: "ns"} | |||
UNIT_TO_TIMESTAMP_PRECISION: Dict[str, int] = {v: k for k, v in TIMESTAMP_PRECISION_TO_UNIT.items()} | |||
EMPTY_STRING_PLACEHOLDER = "__EMPTY_STRING_PLACEHOLDER__" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use some random string. who knows what kind of tokenizer may be used against it... openAI may embed this as separate words
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh good point! You're right I'll replace with randomly gen string
@@ -731,6 +720,19 @@ def run(self) -> None: | |||
with FileStorage.open_zipsafe_ro(self._file_path) as f: | |||
records: List[DictStrAny] = [json.loads(line) for line in f] | |||
|
|||
# Replace empty strings with placeholder string if OpenAI is used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Pipboyguy didn't we switch the format to parquet? I think it is in PR that is still in review. anyway we'll be able to use pa.compute
to replace those soon
Signed-off-by: Marcel Coetzee <[email protected]>
Signed-off-by: Marcel Coetzee <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Description
OpenAI embedding service doesn't accept empty string bodies. We used to deal with this by overriding the whole
OpenAIEmbedding
function.This caused more grief than it fixed since the LanceDB registry doesn't keep track of it well, with very finicky Arrow metadata parsing and de-serialisation.
We simplify this fix by simply replacing empty strings with a placeholder that should be very semantically dissimilar to 99.9% of queries. Ideally, the null strings' embedding vectors themselves should be pinned at the origin, but this should be handled by upstream LanceDB.
The default vector column name is also changed to simply "vector" to coincide with LanceDB's default vector name to make onboarding and setup easier.
Related Issues
Additional Context
See lancedb/lancedb#1577