
[FEATURE] Ability to chunk the documents and generate multiple embeddings using k-NN nested fields #482

Closed · vamshin opened this issue Nov 2, 2023 · 16 comments

Labels: enhancement, Features (Introduces a new unit of functionality that satisfies a requirement), neural-search, v2.13.0

vamshin (Member) commented Nov 2, 2023

Is your feature request related to a problem?

Today, ingestion processors in the Neural plugin only consider the tokens supported by the model; the rest of the tokens are discarded. If a passage is longer than that, it is possible that some of the information is lost, which hampers relevancy.

What solution would you like?

Ingestion processors should be able to break the passage into chunks based on the token limit, get the relevant embeddings from the model, and store them in k-NN nested fields.
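For illustration, a minimal sketch of how such a chained pipeline could look, assuming a hypothetical `text_chunking` processor (the processor name, parameters, and field names below are illustrative placeholders, not a finalized API):

```python
# Hypothetical sketch: a chunking processor splits long text into a list of
# passages, and the existing text_embedding processor then produces one vector
# per passage, stored under a nested k-NN field. Processor and field names
# here are illustrative assumptions.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

pipeline = {
    "description": "Chunk long passages, then embed each chunk",
    "processors": [
        {
            # Hypothetical chunking processor: split 'body' by token limit
            "text_chunking": {
                "algorithm": {"fixed_token_length": {"token_limit": 384}},
                "field_map": {"body": "body_chunks"},
            }
        },
        {
            # Existing neural-search embedding processor: given the list field
            # produced above, it would emit one embedding per chunk
            "text_embedding": {
                "model_id": "<model_id>",  # placeholder for a deployed model
                "field_map": {"body_chunks": "body_chunks_embedding"},
            }
        },
    ],
}

client.ingest.put_pipeline(id="chunk-and-embed", body=pipeline)
```

The target index would then map `body_chunks_embedding` as a nested field of `knn_vector` so that each chunk's embedding can be matched independently at query time.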

vamshin added the enhancement and Features (Introduces a new unit of functionality that satisfies a requirement) labels on Nov 2, 2023
vamshin removed the untriaged label on Nov 2, 2023
vamshin changed the title from "[FEATURE]Ability to generate multiple embeddings by chunk the data and ingest using k-NN nested fields ." to "[FEATURE]Ability to generate multiple embeddings by chunk the data and ingest using k-NN nested fields." on Dec 1, 2023
sam-herman commented

@vamshin do you mind if I take a stab at this one? Feel free to assign it to me.

vamshin changed the title from "[FEATURE]Ability to generate multiple embeddings by chunk the data and ingest using k-NN nested fields." to "[FEATURE]Ability to chunk the documents and generate multiple embeddings using k-NN nested fields." on Dec 11, 2023
vamshin (Member, Author) commented Dec 11, 2023

@samuel-oci thanks for your interest. Would you like to put up an RFC on how you would approach this? We can let the community provide feedback and finalize the approach before starting development. What do you think?

sam-herman commented

Sure thing, I will put something up for community review (RFC) before sending any PRs.

sam-herman commented Dec 13, 2023

Before I go ahead with creating the full RFC, I would like to put down some of my thoughts here and get feedback from folks regarding chunking techniques:

  1. Naive (token limit) - the naive approach blindly chunks the document text based on the model's token limit. We can benchmark this approach; however, my intuition is that it might not be the best idea, since we would be cropping sentences and paragraphs in an arbitrary way, which could lead to malformed sentences and inconsistent results.
  2. Heuristic/dynamic - we can try to apply heuristics to a paragraph before chunking it; that means we would have to at least identify the end of a sentence or the end of a paragraph and chunk based on that. I am open to ideas and suggestions here if anyone has static parsers for paragraphs/sentences that they recommend for identifying intact paragraphs and sentences. Another option is to load an ML library via DJL and employ it in the chunking process (I will do some research on that one, but I'm open to all suggestions as well).

Whichever technique we choose, we might want a mechanism that gives the user a degree of choice between static/dynamic/naive (token length) chunking, and we would need to add those configs to the neural-search plugin. A rough sketch of the two options follows.
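For concreteness, a minimal Python sketch of both (illustrative only; a real implementation would count tokens with the model's own tokenizer rather than whitespace splitting):

```python
import re

def naive_chunks(text: str, token_limit: int) -> list[str]:
    """Naive chunking: cut strictly every `token_limit` tokens, even if
    that crops a sentence mid-way (whitespace stands in for a tokenizer)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + token_limit])
            for i in range(0, len(tokens), token_limit)]

def sentence_aware_chunks(text: str, token_limit: int) -> list[str]:
    """Heuristic chunking: pack whole sentences into a chunk, starting a
    new chunk only when the next sentence would exceed the limit."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > token_limit:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```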

dylan-tong-aws commented Dec 13, 2023

Hey all, many customers are asking us to provide this feature out of the box, in conjunction with the k-NN query-side functionality to merge/re-assemble document chunks.

I was planning to publish a feature proposal using our standard template. In the interim, this is what I've collected from customers:

  1. Today, users have been using basic custom chunking logic, and the general feedback is that these methods are sufficient. Users are implementing static chunking strategies, chunking by "length" or by "token count". Some add extra intelligence to split on specific tokens to ensure complete sentences. LangChain's text splitter provides a good example of useful options. Their text splitters allow you to specify a degree of overlap, which can be helpful (see the sketch after this list).

  2. The fanciest methods I have come across involve using an ML model to identify optimal splits. In my opinion, these aren't scenarios that need to be supported in a first release, and they should be decoupled from a generic text chunker. One method involves using a model to evaluate the similarity between adjacent sentences and make a best effort to keep chunks semantically coherent. We are planning to add support for a generic ML processor that could probably help enable these use cases. You could imagine a pipeline where an ML model helps inject split tokens into the documents before passing the docs into this text chunker.

  3. We have users/customers/partners who want to decouple ingest processing from their search clusters. There are a plethora of great data prep/processing tools out there, and users have a variety of preferences. For instance, OpenSearch has DataPrepper and the managed AWS offering, Amazon OpenSearch Ingestion Service. We plan to ensure that these ingest processors can run on these tools. Similarly, we want to ensure these capabilities are extensible so that these processors could run on or be orchestrated by 3P tools like Apache Spark and various ML platforms.
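As a quick reference for the overlap option mentioned in item 1, LangChain's splitter exposes chunk size and overlap directly. A minimal sketch using LangChain's documented text-splitter API (parameter values are just examples):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Example values only: for this splitter, sizes and overlap are in characters.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,                     # shared context between adjacent chunks
    separators=["\n\n", "\n", ". ", " "],  # prefer paragraph/sentence breaks
)

document = "..."  # a long passage to split
chunks = splitter.split_text(document)
```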

sam-herman commented

@dylan-tong-aws thank you for the very useful customer feedback.

> Users that implement custom chunking logic today are usually using basic methods, and it suffices for their needs. It's a static chunking method by "length" or by "token count". Some add extra intelligence to split on specific tokens to ensure complete sentences. LangChain's text splitter provides a good example of useful options. Their text splitters allow you to specify a degree of overlap, which can be helpful.

> The fanciest methods I have come across involve using an ML model to identify optimal splits. In my opinion, these aren't scenarios that need to be supported in a first release, and they should be decoupled from a generic text chunker. One method involves using a model to evaluate the similarity between adjacent sentences and make a best effort to keep splits semantically coherent. We are planning to add support for a generic ML processor that could probably help enable these use cases. You could imagine a pipeline where an ML model helps inject split tokens into the documents before passing the docs into this text chunker.

Agreed, the first release needs to be super simple, without too much emphasis on sophisticated libraries.

> We have users/customers/partners who want to decouple ingest processing from their search clusters. There are a plethora of great data prep/processing tools out there, and users have a variety of preferences. For instance, OpenSearch has DataPrepper and the managed AWS offering, Amazon OpenSearch Ingestion Service. We plan to ensure that these ingest processors can run on these tools. Similarly, we want to ensure these capabilities are extensible so that this processing could run on or be orchestrated by 3P tools like Apache Spark and various ML platforms.

I am thinking of these items as dealing with the multi-vector query and the underlying data structures that support multiple chunks in a document:
opensearch-project/k-NN#675
opensearch-project/k-NN#1065

And this item as a way to have a quick, out-of-the-box mechanism that chunks a large document for you when it exceeds a model's token limit.

Is that in line with your thoughts as well?

sam-herman commented Jan 22, 2024

Hi @vamshin, what's the next step on this one? Should I just publish an RFC? I would like to understand the assignment on this ticket before committing time to an RFC. If you can assign it to me, I can go ahead with the RFC.

ben-gineer commented Jan 22, 2024

We'd love to see this feature. If we can help in any way, it would save us from having to implement something separate, only to discard it.

One question: could this already be implemented with an ingest Painless script processor (splitting on dot and space with a regex)?
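For illustration, such a workaround might look roughly like the sketch below (field and pipeline names are hypothetical; `splitOnToken` is used instead of a regex because Painless regex support is restricted by default):

```python
# Sketch of the script-processor workaround suggested above: split a text
# field on ". " into a list of sentence chunks before any embedding step.
# Field and pipeline names are hypothetical.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.ingest.put_pipeline(
    id="sentence-split",
    body={
        "processors": [
            {
                "script": {
                    "lang": "painless",
                    # splitOnToken avoids Painless regex, which is gated by
                    # the script.painless.regex.enabled setting
                    "source": "ctx.body_chunks = ctx.body.splitOnToken('. ');",
                }
            }
        ]
    },
)
```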

vamshin (Member, Author) commented Jan 23, 2024

@samuel-oci please go ahead with the RFC. I have assigned this issue to you.

@ben-gineer yes, this logic would be part of the current ingestion processors (text-embedding).

yuye-aws (Member) commented Jan 23, 2024

Hi @samuel-oci! I have created an RFC issue for document chunking: #548. You can merge your proposal into that RFC so that we can implement this chunking feature together. If you have any problems, feel free to ask me. Thanks!

dylan-tong-aws commented

@samuel-oci, @vamshin, can you guys please ensure we have an example of how this feature will work with our embedding processors and the query-side support for chunking (opensearch-project/k-NN#1065)?

Specifically, let's say this processor produces a new field of strings; how can the embedding processor(s) be chained together to generate a nested field of vectors in an ingest pipeline?

sam-herman commented

> @samuel-oci, @vamshin, can you guys please ensure we have an example of how this feature will work with our embedding processors and the query-side support for chunking (opensearch-project/k-NN#1065)?
>
> Specifically, let's say this processor produces a new field of strings; how can the embedding processor(s) be chained together to generate a nested field of vectors in an ingest pipeline?

@dylan-tong-aws yes, that will need to be included in the RFC. Currently, to avoid duplication of work, I am using #548 as the place to capture all comments.

model-collapse (Collaborator) commented

Please check out the design issue here: #548

hdhalter commented Mar 6, 2024

@vamshin - is documentation needed for 2.13?

vamshin (Member, Author) commented Mar 6, 2024

@hdhalter yes, we need documentation. @model-collapse, are you working on this?

navneet1v (Collaborator) commented

@vamshin can we close this issue?

Project status: 2.13.0 (Launched)

9 participants