[FEATURE] Refactor on data validation and extraction from customer's documents in several processors #660
Comments
This is a very early refactoring idea, so the statement above is not necessarily the final change. Let's discuss it further under this issue, especially with the owners of the related code. I would like to hear your thoughts on this.
That makes sense, since other processors might need the same support for customer configuration. We can consider this as a whole and make it more generic across different processors.
I checked @mingshl's issue and PR again, and it seems her proposal targets a different case. Simply put:
@zane-neo The dot path notation for nested objects will be supported in ML inference processors. We will support remote models first, and future development will include local models. The reason is that the input datasets for local models are planned to be unified with those of remote models, which will make it convenient to support many features, including ML inference processors.
I think there is some divergence here. In ML inference processors, the main responsibility is to call inference and add the results to the documents or search responses. The only required field is model_id, so the use cases must involve running inference with a model. The ML inference processor itself doesn't need to write a Painless script.
Yeah, Painless scripts are part of the connector, which conforms to the current remote inference implementation. The main point here is that the ML inference processor is based on
Closing this issue. Feel free to reopen it if there are any concerns.
Is your feature request related to a problem?
Background
Currently, in neural search, we validate and extract data from the user's index documents and send it to the model for inference. We use a recursive approach to check the data structure based on the user's configuration. For example, below is the user's configuration for text_embedding, and the field_map is the important part: it represents a mapping between the original key and the target key (the embedding field key).
The above configuration assumes the user's original document has the structure below:
We support four shapes of field value: a raw string, a map with string leaves, a list of strings, and a list of maps with string leaves.
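As a rough illustration of the recursive check described above, here is a minimal sketch in Python. It is not the project's code: the function names are hypothetical, and it assumes a strict reading of the four shapes in which maps may nest but every leaf must be a string and lists must be homogeneous.

```python
from typing import Any


def is_string_leaf_map(value: Any) -> bool:
    """True for a dict whose leaves, at any depth, are all strings."""
    if not isinstance(value, dict):
        return False
    return all(
        isinstance(child, str) or is_string_leaf_map(child)
        for child in value.values()
    )


def is_supported_value(value: Any) -> bool:
    """Check a source field value against the four supported shapes."""
    if isinstance(value, str):
        return True  # raw string
    if isinstance(value, dict):
        return is_string_leaf_map(value)  # map with string leaves
    if isinstance(value, list):
        # Assumption: a list is either all strings or all string-leaf maps.
        return all(isinstance(item, str) for item in value) or all(
            is_string_leaf_map(item) for item in value
        )
    return False  # numbers, nested lists, None, and so on
```

A mixed list such as ["a", {"b": "c"}] is rejected under this reading. Whether the real processors are stricter or looser in that case is not stated in this issue.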
Problem statement
Several processors use the same configuration to validate and extract field content, including InferenceProcessor, TextImageEmbeddingProcessor, and TextChunkingProcessor, which leads to duplicated code across these classes.
A more critical issue is that whenever we implement a new feature for this validation and extraction, we have to duplicate it as well. For example, #110 requested dot support in the user's configuration, and we would need to implement it in multiple places. That is a code smell, and it is time to refactor our code.
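To make the duplication cost concrete, here is a hedged sketch of how dot support could live in one shared helper instead of being repeated in each processor. The function names and the choice to expand dotted keys into nested maps are assumptions for illustration. The issue does not specify how #110 should be implemented, and the sketch is in Python rather than the project's own language.

```python
from typing import Any, Dict


def _merge(into: Dict[str, Any], key: str, value: Any) -> None:
    """Insert value under key, merging maps and rejecting conflicts."""
    if key not in into:
        into[key] = value
    elif isinstance(into[key], dict) and isinstance(value, dict):
        for child_key, child_value in value.items():
            _merge(into[key], child_key, child_value)
    else:
        raise ValueError(f"conflicting definitions for key {key!r}")


def expand_dotted_keys(field_map: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite keys such as "a.b" into nested maps: {"a": {"b": ...}}."""
    expanded: Dict[str, Any] = {}
    for key, target in field_map.items():
        value = expand_dotted_keys(target) if isinstance(target, dict) else target
        parts = key.split(".")
        # Wrap the value from the innermost path segment outwards.
        for part in reversed(parts[1:]):
            value = {part: value}
        _merge(expanded, parts[0], value)
    return expanded
```

With a helper like this, each processor would normalize its configuration once and keep its existing nested-map traversal, so a later change to dot handling touches a single place.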
Proposal
The proposal is to extract this code into a common place. With a reasonable design, we can reduce the code duplication, and all future enhancements will go into a single place.
Note that not every piece can be made common, since different processors have different logic. For example, the buildxxxKeyAndOriginalValue methods of TextImageEmbeddingProcessor and InferenceProcessor support different types, so this part is not easy to reuse. In this case we will need abstract methods, with each processor providing its own implementation: we add an abstract class, and the different processors extend it.
Some code is almost the same, e.g. the validateEmbeddingFieldsValue/validateFieldsValue methods of TextImageEmbeddingProcessor, InferenceProcessor, and TextChunkingProcessor, which differ only in minor ways. In this case we can extract the code into a common method in a common class with slight changes, and use composition to reduce the code duplication.
What solution would you like?
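The two mechanisms from the proposal, an abstract class for the steps that differ and a shared helper used by composition for the steps that are nearly identical, could be sketched roughly as follows. All class and method names are hypothetical, the validation is simplified to flat string fields, and Python stands in for the project's own language.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple


class FieldValidator:
    """Shared validation, reused by composition (hypothetical helper)."""

    def __init__(self, allow_blank: bool = False) -> None:
        # Stand-in for the "minor differences" between processors.
        self.allow_blank = allow_blank

    def validate(self, document: Dict[str, Any], field_map: Dict[str, str]) -> None:
        for source_key in field_map:
            value = document.get(source_key)
            if value is None:
                continue  # a missing field is simply skipped in this sketch
            if not isinstance(value, str):
                raise ValueError(f"field {source_key!r} must be a string")
            if not self.allow_blank and not value.strip():
                raise ValueError(f"field {source_key!r} must not be blank")


class BaseFieldProcessor(ABC):
    """Common flow lives here; subclasses supply only the differing step."""

    def __init__(self, field_map: Dict[str, str], validator: FieldValidator) -> None:
        self.field_map = field_map
        self.validator = validator

    def process(self, document: Dict[str, Any]) -> Dict[str, Any]:
        self.validator.validate(document, self.field_map)
        extracted: Dict[str, Any] = {}
        for source_key, target_key in self.field_map.items():
            if document.get(source_key) is not None:
                key, value = self.build_key_and_value(target_key, document[source_key])
                extracted[key] = value
        return extracted

    @abstractmethod
    def build_key_and_value(self, target_key: str, value: str) -> Tuple[str, Any]:
        """Processor-specific step that cannot be shared."""


class WholeTextProcessor(BaseFieldProcessor):
    def build_key_and_value(self, target_key: str, value: str) -> Tuple[str, Any]:
        return target_key, value  # pass the text through unchanged


class WordSplitProcessor(BaseFieldProcessor):
    def build_key_and_value(self, target_key: str, value: str) -> Tuple[str, Any]:
        return target_key, value.split()  # split the text on whitespace
```

Passing the validator in as a constructor argument keeps the per-processor differences as configuration, so a processor that tolerates blank text would use FieldValidator(allow_blank=True) instead of carrying its own copy of the validation loop.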