[FEATURE] Ability to chunk the documents and generate multiple embeddings using k-NN nested fields. #482
@vamshin do you mind if I take a stab at this one? Feel free to assign it to me.
@samuel-oci thanks for your interest. Would you like to put up an RFC on how you would approach this? We can let the community provide feedback and finalize the approach before starting development. What do you think?
Sure thing, I will put something up for community review (RFC) before sending any PRs.
Before I go ahead with the creation of the full RFC, I would like to put some of my thoughts here and get some feedback from folks regarding chunking techniques:
Whichever technique we choose, we might want a mechanism that gives the user a certain degree of choice (static, dynamic, or naive token-length chunking), and we would need to add those configs to the neural-search plugin.
Hey all, many customers are asking for us to provide this feature out-of-the-box in conjunction with the k-NN query-side functionality to merge/re-assemble document chunks. I was planning to publish a feature proposal using our standard template. In the interim, this is what I've collected from customers:
@dylan-tong-aws thank you for the very useful customer feedback.
Agreed, the first release needs to be super simple, without too much emphasis on sophisticated libraries.
I am thinking of those items as dealing with the multi-vector query and the underlying data structure that supports multiple chunks in a document, and of this item as a quick out-of-the-box way to chunk a large document for you when it exceeds the model's token limit. Is that in line with your thoughts as well?
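To make the query-side items concrete, here is a minimal sketch of what a multi-vector search over chunk embeddings could look like, assuming a hypothetical nested field `passage_embeddings` with a `knn` vector subfield and a placeholder model ID; the `score_mode` controls how chunk scores roll up to the parent document:

```json
GET /my-chunked-index/_search
{
  "query": {
    "nested": {
      "path": "passage_embeddings",
      "score_mode": "max",
      "query": {
        "neural": {
          "passage_embeddings.knn": {
            "query_text": "how do I chunk long documents?",
            "model_id": "<model_id>",
            "k": 10
          }
        }
      }
    }
  }
}
```

With `score_mode: max`, a parent document is ranked by its best-matching chunk, which is one simple way to "re-assemble" chunked documents at query time.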
Hi @vamshin, what's the next step on this one? Should I just publish an RFC? I would like to understand what the assignment on this ticket is before committing time to an RFC.
We'd love to see this feature. If we can help in any way, it would save us having to implement something separate, only to discard it. One question: could this already be implemented with an ingest Painless script processor (splitting on dot and space in a regex)?
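For what it's worth, a naive version of this does seem possible today with a script processor; a minimal sketch (the field names `passage_text` and `passage_chunks` are hypothetical, and `splitOnToken` avoids the Painless regex cluster setting) could look like:

```json
PUT _ingest/pipeline/naive-chunking
{
  "description": "Naive sentence chunking via a Painless script processor (sketch; field names hypothetical)",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "ctx.passage_chunks = ctx.passage_text.splitOnToken('. ');"
      }
    }
  ]
}
```

The obvious limitation is that this splits only on sentence boundaries and has no awareness of the model's token limit, which is what the proposed processor would handle.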
@samuel-oci please go ahead with the RFC; I have assigned this issue to you. @ben-gineer yes, this logic would be part of the current ingestion processors (text-embedding).
Hi @samuel-oci! I have created an RFC issue for document chunking: #548. You can merge your proposal into the RFC so that we can implement this chunking feature together. If you have any problems, feel free to ask me. Thanks!
@samuel-oci, @vamshin, can you please ensure we have an example of how this feature will work with our embedding processors and the query-side support for chunking (opensearch-project/k-NN#1065)? Specifically, let's say this processor produces a nested field of strings; how can the embedding processor(s) be chained together to generate a nested field of vectors in an ingest pipeline?
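For illustration, one way such chaining could look in an ingest pipeline (a sketch using the `text_chunking` and `text_embedding` processor shapes discussed in the RFC; field names, parameters, and the model ID are placeholders):

```json
PUT _ingest/pipeline/chunk-then-embed
{
  "description": "Chunk a long passage, then embed each chunk (sketch; parameters are assumptions)",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunks"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": {
          "passage_chunks": "passage_embeddings"
        }
      }
    }
  ]
}
```

Here `text_chunking` writes a list of strings to `passage_chunks`, and `text_embedding` maps that list to a nested field of vectors in `passage_embeddings`.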
@dylan-tong-aws yes, that will need to be included in the RFC. Currently, in order to avoid duplication of work, I am using #548 as the place to capture all comments.
Please check out the design issue here: #548
@vamshin - is documentation needed for 2.13?
@hdhalter yes, we need documentation. @model-collapse are you working on this?
@vamshin can we close this issue?
Is your feature request related to a problem?
Today, the ingestion processors in the Neural Search plugin only consider the number of tokens supported by the model, and the rest of the tokens are discarded. If there is a longer passage, it is possible that some of the information is lost, which hampers relevancy.
What solution would you like?
Ingestion processors should be able to break the passage into chunks based on the token limit, get the relevant embeddings from the model, and store them in k-NN nested fields.
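As a rough sketch of the target data layout (field names, the dimension, and the `knn` subfield are assumptions for illustration), the index mapping could pair the chunk text with a nested vector field:

```json
PUT /my-chunked-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "passage_text": { "type": "text" },
      "passage_chunks": { "type": "text" },
      "passage_embeddings": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": 768
          }
        }
      }
    }
  }
}
```

Each chunk then contributes one vector under `passage_embeddings`, and a nested k-NN search can score documents by their best-matching chunk.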