Allow for "reduce" steps in LanceDB adapter #1699
Comments
@zilto thanks for this idea. a few random (probably I miss some background) comments:
that would create… btw. with @Pipboyguy we are trying to support chunking in some more or less unified way #1587
I agree with the motivation of the cited issue! But to add more context:
This suggests tracking the lineage from "document" to "chunk" to "context".
For RAG, I intend to use "contexts" for first-pass vector search, then use the "context-chunk" lineage to filter out "contexts" that have "too much in common" and increase the information content of the text passed to the LLM. Over time, it's valuable to log which "context" and underlying "chunk" are high signal for downstream uses. More concretely: a user asks a question about dlt. You want documentation to be embedded in large "contexts" to have good recall; the LLM should then be able to extract the right info from the "context" and generate an answer. However, it's still fuzzy "what" was useful to the LLM or user. The above lineage would show when retrieved "contexts" overlap in their underlying "chunks".
Didn't think of that! While it handles relationships, I would have duplicated "chunks" stored, no?
@zilto it seems we will be picking your brain a lot :) our goal is to support chunked documents with "merge" write disposition (where only a subset of documents will be updated). I'll get back to this topic tomorrow. we need to move forward...
Feature description
Allow the LanceDB and other vector DB adapters to specify a "contextualize" or rolling-window operation to join partitioned text chunks before applying the embedding function.
Are you a dlt user?
Yes, I'm already a dlt user.
Use case
context
The constructs of `@dlt.resource` and `@dlt.transformer` are very convenient for document ingestion for NLP/LLM use cases. The `@dlt.resource` returns the full text and the `@dlt.transformer` can chunk it (into paragraphs, for example). The LanceDB and other vector DB adapters make it easy to embed the full-text and the chunked-text columns. We get something like this: a "Full-text" table and a "Chunks (3 words)" table.
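To make the pattern concrete, here is a minimal sketch of that ingestion flow; the document text and the 3-word chunking are illustrative, not taken from the original tables:

```python
import dlt

@dlt.resource
def documents():
    # illustrative document ("Full-text" table)
    yield {"document_id": "doc-1",
           "text": "the quick brown fox jumps over the lazy dog"}

@dlt.transformer(data_from=documents)
def chunks(doc, chunk_size: int = 3):
    # partition the full text into fixed-size word chunks ("Chunks (3 words)" table)
    words = doc["text"].split()
    for n, i in enumerate(range(0, len(words), chunk_size)):
        yield {
            "document_id": doc["document_id"],
            "chunk_id": f"{doc['document_id']}-{n}",
            "text": " ".join(words[i : i + chunk_size]),
        }
```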
limitations
However, embedding these "partitioned" chunks is often low value for RAG. A common operation is "contextualizing" chunks, which consists of a rolling-window operation (with window-size and stride/overlap parameters). For instance, LanceDB has `contextualize()`, but it requires converting the data to a pandas DataFrame. Let's illustrate a "2-chunk window" that turns the previous chunks into a "Contexts" table.
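A plain-Python sketch of the rolling-window idea (window 2, stride 1); this shows the concept only and is not LanceDB's `contextualize()` API:

```python
def contextualize(chunks: list[str], window: int = 2, stride: int = 1) -> list[str]:
    # join each window of consecutive chunks into one overlapping "context"
    return [
        " ".join(chunks[i : i + window])
        for i in range(0, max(len(chunks) - window + 1, 1), stride)
    ]

chunks = ["the quick brown", "fox jumps over", "the lazy dog"]
print(contextualize(chunks))
# ['the quick brown fox jumps over', 'fox jumps over the lazy dog']
```

Note that each chunk (except the first and last) appears in two contexts, which is exactly the shared-chunk relationship discussed below.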
AFAIK, dlt doesn't provide a clear API for normalizing the `chunk_id` and the `context_id` columns. The "contextualize" operation could be implemented directly in a single `@dlt.transformer`, but it would only capture the `document_id -> context_id` lineage and miss the fact that "contextualized chunks" aren't independent; they share underlying chunks.
Proposed solution
adding a "reducer" step
I was able to hack around to receive a batch of "chunks" and use `dlt.mark.with_table_name` to dispatch both a "context" table and a "relation" table from the same `@dlt.transformer`, producing a "Contexts" table and a "Chunks-to-contexts keys" table. Mock code:
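The original snippet did not survive, so below is a hedged reconstruction of what such a transformer could look like. The table and column names (`contexts`, `chunks_to_contexts`, `chunk_id`, `context_id`) are illustrative, and it assumes the upstream resource yields one list of chunk rows per document; `dlt.mark.with_table_name` is the existing dlt primitive for routing items to tables:

```python
import hashlib
import dlt

@dlt.transformer  # bind later, e.g.: chunks | contextualized
def contextualized(chunk_batch: list[dict], window: int = 2, stride: int = 1):
    for i in range(0, max(len(chunk_batch) - window + 1, 1), stride):
        batch = chunk_batch[i : i + window]
        chunk_ids = [c["chunk_id"] for c in batch]
        # derive a stable context_id by hashing the underlying chunk-id set
        context_id = hashlib.sha256("|".join(chunk_ids).encode()).hexdigest()

        # one row for the "Contexts" table...
        yield dlt.mark.with_table_name(
            {"context_id": context_id,
             "text": " ".join(c["text"] for c in batch)},
            "contexts",
        )
        # ...and one relation row per underlying chunk for the keys table
        for chunk_id in chunk_ids:
            yield dlt.mark.with_table_name(
                {"context_id": context_id, "chunk_id": chunk_id},
                "chunks_to_contexts",
            )
```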
There's probably room for a generic `@dlt.reducer` that automatically manages the primary/foreign keys based on the other resources' metadata, handles the key-set hashing, and dispatches results to tables. Given that this could be a can of worms, it could be tested and refined while hidden behind the `lancedb_adapter`. The API could be expanded along these lines:
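The expanded-API snippet is missing from the issue text; the following is a hypothetical illustration of the proposal. `embed` is the adapter's existing argument, while `reduce` (and its window/stride options) is the proposed addition and does not exist in dlt today:

```python
from dlt.destinations.adapters import lancedb_adapter

# hypothetical: `reduce` would trigger the rolling-window step and
# auto-create the contexts table plus the chunks-to-contexts keys table
lancedb_adapter(
    chunks,
    embed="text",
    reduce={"window": 2, "stride": 1},
)
```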
This would reproduce the above logic by creating the chunks table as defined by the user (the `chunks` resource) and creating the second table automatically.
Related issues
No response