Vector search on a LanceDB table generated by dlt is broken. Here's a simple query:
import os
import lancedb

dlt_lancedb_uri = os.environ["DESTINATION__LANCEDB__CREDENTIALS__URI"]
lancedb_con = lancedb.connect(dlt_lancedb_uri)
lancedb_table = lancedb_con.open_table("my_table")
query = lancedb_table.search("My very important question")
results = query.to_list()
Expected behavior
Since I expect downstream users to import lancedb and not worry about how data is ingested, dlt should adopt a different strategy for embedding function registration:
Document the need to import dlt.destinations.impl.lancedb.models before querying data from lancedb.
Have dlt/destinations/impl/lancedb/__init__.py import dlt/destinations/impl/lancedb/models.py. This should make importing dlt.destinations.lancedb sufficient (I believe?).
Avoid needing a custom PatchedOpenAIEmbeddings and rely on embedding functions natively supported by lancedb.
Collaborate with lancedb to modify the stored pyarrow.Schema's metadata to include the required module imports (i.e., the dlt submodule). This would add the embedding function to the LanceDB registry at deserialization, before trying to retrieve the function from the registry.
This error was painful to debug because nothing points to dlt as the source. Renaming the registered function to openai_dlt_patch would be of great help.
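To illustrate why the import matters, here is a minimal sketch of a registry that is populated as a side effect of defining (or importing) a decorated function. ToyRegistry, lookup_error and fake_embedding are hypothetical names made up for this example; this is not LanceDB's or dlt's actual code, only the general pattern that I assume is behind the error.

```python
from typing import Callable, Dict, List, Optional


class ToyRegistry:
    """Hypothetical stand-in for an embedding function registry."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable] = {}

    def register(self, alias: str) -> Callable:
        # Registration happens when the decorated function is defined,
        # i.e. when the module that contains it is imported.
        def decorator(func: Callable) -> Callable:
            self._functions[alias] = func
            return func

        return decorator

    def get(self, alias: str) -> Callable:
        if alias not in self._functions:
            raise KeyError(f'"{alias}" is not in registry')
        return self._functions[alias]


registry = ToyRegistry()


def lookup_error(alias: str) -> Optional[str]:
    """Return the lookup error message, or None if the alias is registered."""
    try:
        registry.get(alias)
    except KeyError as exc:
        return str(exc)
    return None


# A process that never imported the defining module cannot resolve the alias.
error_before = lookup_error("openai_patched")


# Stand-in for `import dlt.destinations.impl.lancedb.models`:
# defining the decorated function is what fills the registry.
@registry.register("openai_patched")
def fake_embedding(texts: List[str]) -> List[List[float]]:
    # Placeholder "embedding": one number per text, for illustration only.
    return [[float(len(text))] for text in texts]


error_after = lookup_error("openai_patched")
```

In this toy version the first lookup fails and the second succeeds, which mirrors what I see before and after importing the dlt module.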
Steps to reproduce
1. Configure the lancedb destination (credentials, embedding function, etc.)
2. Use the lancedb destination with the lancedb_adapter to ingest data
3. In a separate process (script, notebook, REPL, etc.), import lancedb only and access a generated table that has an embed column
4. Query the table using lancedb's .search() (vector search)
5. The query should fail with an error saying "openai_patched" is not in registry
Fix
6. Import dlt.destinations.impl.lancedb.models
7. Retry steps 4 and 5; the query should now work
By manually checking the LanceDB embedding function registry, you can see the "openai_patched" function defined by dlt being registered.
import lancedb.embeddings.registry as embedding_registry_module

registry = embedding_registry_module.get_registry()
registry._functions  # dictionary of {func_name: func} of type Dict[str, Callable]
registry.get("openai_patched")
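As a stopgap for downstream code, the import from step 6 can be wrapped in a small guard that runs before opening the table. This is only a sketch: ensure_embedding_functions_registered is a hypothetical helper name, and the module path is the one mentioned above, which may change between dlt versions.

```python
import importlib


def ensure_embedding_functions_registered(
    module_name: str = "dlt.destinations.impl.lancedb.models",
) -> bool:
    """Import the module that registers dlt's embedding functions.

    Returns True if the import succeeded and False if the module is not
    importable (for example, dlt is not installed in this environment).
    The default module path is taken from this issue and is an assumption
    about where dlt performs the registration.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True
```

Calling ensure_embedding_functions_registered() before lancedb.connect(...) keeps the workaround in one place, so it is easy to remove once registration no longer depends on importing a dlt submodule.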
dlt version
0.5.3
dlt code: https://github.com/dlt-hub/dlt/blob/devel/dlt/destinations/impl/lancedb/models.py
lancedb code: https://github.com/lancedb/lancedb/blob/main/python/python/lancedb/embeddings/registry.py
Operating system
Linux
Runtime environment
Local
Python version
3.11
dlt data source
Not relevant
dlt destination
No response
Other deployment details
dlt destination is lancedb
Additional information
No response