LanceDB destination: can't query generated tables #1765

Closed
zilto opened this issue Aug 28, 2024 · 0 comments · Fixed by #1771

zilto commented Aug 28, 2024

dlt version

0.5.3

Describe the problem

Problem

Vector search on a LanceDB table generated by dlt fails when the querying process imports only lancedb. Here's a simple query that triggers the error:

import os
import lancedb

# URI of the LanceDB database that dlt wrote to
dlt_lancedb_uri = os.environ["DESTINATION__LANCEDB__CREDENTIALS__URI"]
lancedb_con = lancedb.connect(dlt_lancedb_uri)
lancedb_table = lancedb_con.open_table("my_table")
# Text query: LanceDB needs the table's embedding function to vectorize it
query = lancedb_table.search("My very important question")
results = query.to_list()

Expected behavior

Downstream users should be able to import lancedb and query a table without worrying about how the data was ingested, so dlt should adopt a different strategy for embedding function registration. Some options:

  1. Document that dlt.destinations.impl.lancedb.models must be imported before querying data from lancedb
  2. Have dlt/destinations/impl/lancedb/__init__.py import dlt/destinations/impl/lancedb/models.py. Importing dlt.destinations.lancedb should then be sufficient (I believe?)
  3. Avoid needing a custom PatchedOpenAIEmbeddings and rely on natively-supported lancedb functions
  4. Collaborate with lancedb to modify the stored pyarrow.Schema's metadata to include the required module imports (i.e., the dlt submodule). This would add the embedding function to the LanceDB registry at deserialization, before lancedb tries to retrieve the function from the registry.

This error was painful to debug because nothing in it points to dlt as the source. Renaming the registered function to openai_dlt_patch would help a great deal.
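
To illustrate why the lookup depends on an import, here is a toy model of a name-based registry. It is not LanceDB's actual implementation: ToyRegistry, table_metadata, resolve and import_ingestion_side_module are hypothetical names made up for this sketch, and the "embedding" is a dummy value. The only assumption carried over from the issue is that the table refers to its embedding function by name ("openai_patched") and that the name is registered as a side effect of importing a module.

```python
from typing import Callable, Dict, List


class ToyRegistry:
    """Hypothetical stand-in for a name -> embedding function registry."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable[[str], List[float]]] = {}

    def register(self, name: str):
        """Decorator that stores the decorated function under `name`."""
        def decorator(func):
            self._functions[name] = func
            return func
        return decorator

    def get(self, name: str):
        try:
            return self._functions[name]
        except KeyError:
            # A message that names the missing import is easier to debug.
            raise KeyError(
                f"Embedding function {name!r} is not registered. "
                "Import the module that defines it before querying."
            ) from None


registry = ToyRegistry()

# The stored table only carries the *name* of the embedding function.
table_metadata = {"embedding_function": "openai_patched"}


def resolve(metadata: Dict[str, str]):
    """Look up the embedding function named in the table metadata."""
    return registry.get(metadata["embedding_function"])


def import_ingestion_side_module() -> None:
    """Simulates the side effect of importing the module that registers the function."""
    @registry.register("openai_patched")
    def embed(text: str) -> List[float]:
        return [float(len(text))]  # dummy vector, not a real embedding
```

Calling resolve(table_metadata) before import_ingestion_side_module() raises KeyError; calling it afterwards returns the registered function. Options 1 and 2 above both amount to making sure that import has happened before the lookup.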

Steps to reproduce

  1. Configure the lancedb destination (credentials, embedding function, etc.)
  2. Use the lancedb destination with the lancedb_adapter to ingest data
  3. In a separate process (script, notebook, REPL, etc.), import lancedb only and access a generated table that has an embed column.
  4. Query the table using lancedb's .search() (vector search)
  5. The query fails with an error saying "openai_patched" is not in the registry

Fix
  6. Import dlt.destinations.impl.lancedb.models in the querying process
  7. Retry step 4; the query should now work
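
A minimal sketch of the workaround as a helper that a downstream script could call before querying. The module path comes from this issue; the helper name register_dlt_embedding_functions and the choice to return a boolean instead of raising are assumptions made for illustration.

```python
import importlib

# Module path taken from this issue; importing it registers dlt's
# patched embedding function as a side effect.
DLT_LANCEDB_MODELS = "dlt.destinations.impl.lancedb.models"


def register_dlt_embedding_functions(module_name: str = DLT_LANCEDB_MODELS) -> bool:
    """Import the registering module; return False if it cannot be imported."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True
```

Call register_dlt_embedding_functions() once at the top of the querying script, before lancedb_table.search(...). A False return means dlt (or its lancedb extras) is not importable in that environment, so the query would still fail.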

By manually checking the LanceDB embedding function registry, you can see the "openai_patched" function defined by dlt being registered.

import lancedb.embeddings.registry as embedding_registry_module

registry = embedding_registry_module.get_registry()
registry._functions  # dictionary of {func_name: func} of type Dict[str, Callable]
registry.get("openai_patched")

dlt code: https://github.com/dlt-hub/dlt/blob/devel/dlt/destinations/impl/lancedb/models.py
lancedb code: https://github.com/lancedb/lancedb/blob/main/python/python/lancedb/embeddings/registry.py

Operating system

Linux

Runtime environment

Local

Python version

3.11

dlt data source

Not relevant

dlt destination

No response

Other deployment details

dlt destination is lancedb

Additional information

No response
