
[FEATURE] Ability to chunk the documents and generate multiple embeddings using k-NN nested fields #482

Closed · vamshin opened this issue Nov 2, 2023 · 16 comments

Labels: enhancement, Features (Introduces a new unit of functionality that satisfies a requirement), neural-search, v2.13.0

vamshin (Member) commented Nov 2, 2023

Is your feature request related to a problem?

Today, ingestion processors in the Neural plugin only consider the tokens supported by the model; the rest of the tokens are discarded. If a passage is longer than that, it is possible that some of the information is lost, which hampers relevancy.

What solution would you like?

Ingestion processors should be able to break the passage into chunks based on the token limit, get the relevant embeddings from the model, and store them in k-NN nested fields.
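For illustration, a minimal sketch of how such a chained pipeline could look, assuming a hypothetical `text_chunking` processor (the processor name, parameters, and field names below are illustrative placeholders, not a finalized API):

```python
# Hypothetical sketch: a chunking processor splits long text into a list of
# passages, and the existing text_embedding processor then produces one vector
# per passage, stored under a nested k-NN field. Processor and field names
# here are illustrative assumptions.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

pipeline = {
    "description": "Chunk long passages, then embed each chunk",
    "processors": [
        {
            # Hypothetical chunking processor: split 'body' by token limit
            "text_chunking": {
                "algorithm": {"fixed_token_length": {"token_limit": 384}},
                "field_map": {"body": "body_chunks"},
            }
        },
        {
            # Existing neural-search embedding processor: given the list field
            # produced above, it would emit one embedding per chunk
            "text_embedding": {
                "model_id": "<model_id>",  # placeholder for a deployed model
                "field_map": {"body_chunks": "body_chunks_embedding"},
            }
        },
    ],
}

client.ingest.put_pipeline(id="chunk-and-embed", body=pipeline)
```

The target index would then map `body_chunks_embedding` as a nested field of `knn_vector` so that each chunk's embedding can be matched independently at query time.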

vamshin added the enhancement and Features (Introduces a new unit of functionality that satisfies a requirement) labels on Nov 2, 2023
vamshin removed the untriaged label on Nov 2, 2023
vamshin changed the title from "[FEATURE]Ability to generate multiple embeddings by chunk the data and ingest using k-NN nested fields ." to "[FEATURE]Ability to generate multiple embeddings by chunk the data and ingest using k-NN nested fields." on Dec 1, 2023
sam-herman commented

@vamshin do you mind if I take a stab at this one? Feel free to assign it to me.

vamshin changed the title from "[FEATURE]Ability to generate multiple embeddings by chunk the data and ingest using k-NN nested fields." to "[FEATURE]Ability to chunk the documents and generate multiple embeddings using k-NN nested fields." on Dec 11, 2023
vamshin (Member, Author) commented Dec 11, 2023

@samuel-oci thanks for your interest. Would you like to put up an RFC on how you would approach this? We can let the community provide feedback and finalize the approach before starting development. What do you think?

sam-herman commented

Sure thing, I will put something up for community review (RFC) before sending any PRs.

sam-herman commented Dec 13, 2023

Before I go ahead with creating the full RFC, I would like to put down some of my thoughts here and get feedback from folks regarding chunking techniques:

  1. Naive (token limit) - the naive approach blindly chunks the document text based on the model's token limit. We can benchmark this approach; however, my intuition is that it might not be the best idea, since we would be cropping sentences and paragraphs in an arbitrary way, which could lead to malformed sentences and inconsistent results.
  2. Heuristic/dynamic - we can try to apply heuristics to a paragraph before chunking it; that means we would have to at least identify the end of a sentence or the end of a paragraph and chunk based on that. I am open to ideas and suggestions here if anyone has static parsers for paragraphs/sentences that they recommend for identifying intact paragraphs and sentences. Another option is to load an ML library via DJL and employ it in the chunking process (I will do some research on that one, but I'm open to all suggestions as well).

Whichever technique we choose, we might want a mechanism that gives the user a degree of choice between static/dynamic/naive (token length) chunking, and we would need to add those configs to the neural-search plugin. A rough sketch of the two options follows.
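For concreteness, a minimal Python sketch of both (illustrative only; a real implementation would count tokens with the model's own tokenizer rather than whitespace splitting):

```python
import re

def naive_chunks(text: str, token_limit: int) -> list[str]:
    """Naive chunking: cut strictly every `token_limit` tokens, even if
    that crops a sentence mid-way (whitespace stands in for a tokenizer)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + token_limit])
            for i in range(0, len(tokens), token_limit)]

def sentence_aware_chunks(text: str, token_limit: int) -> list[str]:
    """Heuristic chunking: pack whole sentences into a chunk, starting a
    new chunk only when the next sentence would exceed the limit."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > token_limit:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```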

dylan-tong-aws commented Dec 13, 2023

Hey all, many customers are asking us to provide this feature out of the box, in conjunction with the k-NN query-side functionality to merge/re-assemble document chunks.

I was planning to publish a feature proposal using our standard template. In the interim, this is what I've collected from customers:

  1. Today, users have been using basic custom chunking logic, and the general feedback is that these methods are sufficient. Users are implementing static chunking strategies, chunking by "length" or by "token count". Some add extra intelligence to split on specific tokens to ensure complete sentences. LangChain's text splitter provides a good example of useful options. Their text splitters allow you to specify a degree of overlap, which can be helpful (see the sketch after this list).

  2. The fanciest methods I have come across involve using an ML model to identify optimal splits. In my opinion, these aren't scenarios that need to be supported in a first release, and they should be decoupled from a generic text chunker. One method involves using a model to evaluate the similarity between adjacent sentences and make a best effort to keep chunks semantically coherent. We are planning to add support for a generic ML processor that could probably help enable these use cases. You could imagine a pipeline where an ML model helps inject split tokens into the documents before passing the docs into this text chunker.

  3. We have users/customers/partners who want to decouple ingest processing from their search clusters. There are a plethora of great data prep/processing tools out there, and users have a variety of preferences. For instance, OpenSearch has DataPrepper and the managed AWS offering, Amazon OpenSearch Ingestion Service. We plan to ensure that these ingest processors can run on these tools. Similarly, we want to ensure these capabilities are extensible so that these processors could run on or be orchestrated by 3P tools like Apache Spark and various ML platforms.
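As a quick reference for the overlap option mentioned in item 1, LangChain's splitter exposes chunk size and overlap directly. A minimal sketch using LangChain's documented text-splitter API (parameter values are just examples):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Example values only: for this splitter, sizes and overlap are in characters.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,                     # shared context between adjacent chunks
    separators=["\n\n", "\n", ". ", " "],  # prefer paragraph/sentence breaks
)

document = "..."  # a long passage to split
chunks = splitter.split_text(document)
```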

sam-herman commented

@dylan-tong-aws thank you for the very useful customer feedback.

> Users that implement custom chunking logic today are usually using basic methods, and it suffices for their needs. It's a static chunking method by "length" or by "token count". Some add extra intelligence to split on specific tokens to ensure complete sentences. LangChain's text splitter provides a good example of useful options. Their text splitters allow you to specify a degree of overlap, which can be helpful.

> The fanciest methods I have come across involve using an ML model to identify optimal splits. In my opinion, these aren't scenarios that need to be supported in a first release, and they should be decoupled from a generic text chunker. One method involves using a model to evaluate the similarity between adjacent sentences and make a best effort to keep splits semantically coherent. We are planning to add support for a generic ML processor that could probably help enable these use cases. You could imagine a pipeline where an ML model helps inject split tokens into the documents before passing the docs into this text chunker.

Agreed, the first release needs to be super simple, without too much emphasis on sophisticated libraries.

> We have users/customers/partners who want to decouple ingest processing from their search clusters. There are a plethora of great data prep/processing tools out there, and users have a variety of preferences. For instance, OpenSearch has DataPrepper and the managed AWS offering, Amazon OpenSearch Ingestion Service. We plan to ensure that these ingest processors can run on these tools. Similarly, we want to ensure these capabilities are extensible so that this processing could run on or be orchestrated by 3P tools like Apache Spark and various ML platforms.

I am thinking of these items as dealing with the multi-vector query and the underlying data structures that support multiple chunks in a document:
opensearch-project/k-NN#675
opensearch-project/k-NN#1065

And this item as a way to have a quick, out-of-the-box mechanism that chunks a large document for you when it exceeds a model's token limit.

Is that in line with your thoughts as well?

sam-herman commented Jan 22, 2024

Hi @vamshin, what's the next step on this one? Should I just publish an RFC? I would like to understand the assignment on this ticket before committing time to an RFC. If you can assign it to me, I can go ahead with the RFC.

ben-gineer commented Jan 22, 2024

We'd love to see this feature. If we can help in any way, it would save us from having to implement something separate, only to discard it.

One question: could this already be implemented with an ingest Painless script processor (splitting on dot and space with a regex)?
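For illustration, such a workaround might look roughly like the sketch below (field and pipeline names are hypothetical; `splitOnToken` is used instead of a regex because Painless regex support is restricted by default):

```python
# Sketch of the script-processor workaround suggested above: split a text
# field on ". " into a list of sentence chunks before any embedding step.
# Field and pipeline names are hypothetical.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.ingest.put_pipeline(
    id="sentence-split",
    body={
        "processors": [
            {
                "script": {
                    "lang": "painless",
                    # splitOnToken avoids Painless regex, which is gated by
                    # the script.painless.regex.enabled setting
                    "source": "ctx.body_chunks = ctx.body.splitOnToken('. ');",
                }
            }
        ]
    },
)
```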

vamshin (Member, Author) commented Jan 23, 2024

@samuel-oci please go ahead with the RFC. I have assigned this issue to you.

@ben-gineer yes, this logic would be part of the current ingestion processors (text-embedding).

yuye-aws (Member) commented Jan 23, 2024

Hi @samuel-oci! I have created an RFC issue for document chunking: #548. You can merge your proposal into that RFC so that we can implement this chunking feature together. If you have any problems, feel free to ask me. Thanks!

dylan-tong-aws commented

@samuel-oci, @vamshin, can you guys please ensure we have an example of how this feature will work with our embedding processors and the query-side support for chunking (opensearch-project/k-NN#1065)?

Specifically, let's say this processor produces a new field of strings; how can the embedding processor(s) be chained together to generate a nested field of vectors in an ingest pipeline?

sam-herman commented

> @samuel-oci, @vamshin, can you guys please ensure we have an example of how this feature will work with our embedding processors and the query-side support for chunking (opensearch-project/k-NN#1065)?
>
> Specifically, let's say this processor produces a new field of strings; how can the embedding processor(s) be chained together to generate a nested field of vectors in an ingest pipeline?

@dylan-tong-aws yes, that will need to be included in the RFC. Currently, to avoid duplication of work, I am using #548 as the place to capture all comments.

model-collapse (Collaborator) commented

Please check out the design issue here: #548

hdhalter commented Mar 6, 2024

@vamshin - is documentation needed for 2.13?

vamshin (Member, Author) commented Mar 6, 2024

@hdhalter yes, we need documentation. @model-collapse, are you working on this?

navneet1v (Collaborator) commented

@vamshin can we close this issue?

Project status: 2.13.0 (Launched)

9 participants