[RFC] OpenSearch Search Quality Evaluation Framework #15354
Comments
I'm wondering if we can leverage some of the join work we're already doing to build the indices for the implicit judgement calculations. Can construction of those indices be modeled as a @penghuo -- you've been integrating the SQL plugin with Spark. Is there something we could do to spin up a Spark job to do these calculations and persist the results back into OpenSearch?
A question I have is: where will this be implemented, and how tightly or loosely coupled will the implementation be with OpenSearch? I understand that UBI is a project and any interested search engine can implement UBI. Currently OpenSearch is the only search engine so far that has the initial implementation for UBI. But technically, I could use the UBI support available in OpenSearch even if my current search engine is Solr or something else. Let's say my search engine is Solr and I have an OpenSearch cluster just for UBI. I can create a dummy index in OpenSearch and send an empty-result query to OpenSearch, in addition to my actual search query to Solr that serves my user. The empty-result query would be sent to OpenSearch after Solr has responded with the docIds. An example is sketched in the code block at the end of this comment. Another alternate option is to not use the OpenSearch UBI plugin but send this information to Why should I do this? For the following reasons,
And I'm saying all this without any bias towards OpenSearch, thinking about the recommendations I would give if I were an independent search consultant. So going back to my initial question - if this search quality framework can be implemented search-engine agnostic, OpenSearch can be used as a UBI hub (for lack of a better word). So the question "what/how is this going to be implemented" takes precedence over "where is this going to be implemented". By "where", I mean: is it going to be an OpenSearch plugin, OR an extension of the OpenSearch UBI plugin, OR OpenSearch core, OR an independent repo (as independent as
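A rough sketch of the "empty-result query" idea described above: after Solr answers the user, send a query to a dummy OpenSearch index so the UBI plugin logs it. The `ext.ubi` field names shown are assumptions based on the UBI plugin's documented request extension, not content from this thread.

```python
import requests

solr_doc_ids = ["doc-1", "doc-7", "doc-42"]  # returned by the real Solr query

# Empty-result query against a dummy index; we only want the query logged.
requests.post(
    "http://localhost:9200/ubi_dummy_index/_search",
    json={
        "query": {"match_none": {}},
        "ext": {
            "ubi": {  # field names are assumptions, check the UBI plugin docs
                "query_id": "q-123",
                "client_id": "client-abc",
                "user_query": "laptop sleeve 13 inch",
            }
        },
    },
    timeout=10,
)
# Click/view events for the Solr doc IDs would then be sent to ubi_events
# separately, referencing the same query_id.
```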
Some quick comments. Re "Currently OpenSearch is the only search engine so far that has the initial implementation for UBI.": The server-side component of UBI collects queries and responses at the server. This has been implemented so far for OpenSearch, Solr, and Elasticsearch. On the analysis side, you can send the data anywhere you want. We generally recommend sending it to mass storage (S3 in the AWS context) because, as you say, it can get very voluminous. You can then load it into OpenSearch (or, for that matter, Redshift if you like) for analysis. We will be building tools on top of OpenSearch, but since the schema is well defined, others can build tools where they like. We are still working out exactly how to implement the analysis functionality. My current hypothesis is that most analysts will want to use Python as their main tool, querying OpenSearch or a DBMS for bulk data operations. Where exactly the Python code will run is still an open question.
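A hedged sketch of the analyst workflow described above: pull UBI events out of OpenSearch in bulk and hand them to pandas for analysis. The index name follows the UBI schema; treat the rest as assumptions rather than guarantees.

```python
import pandas as pd
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Scroll through the ubi_events index so large volumes are handled in batches.
events = helpers.scan(client, index="ubi_events", query={"query": {"match_all": {}}})
df = pd.DataFrame([hit["_source"] for hit in events])

print(df.head())
```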
@aswath86 thanks for following up. I think you asked two big things: one about the value proposition of UBI, and one about the Search Quality Eval Framework. To your first question... Yes! In talking about UBI, I've met two large organizations that use Solr as the search engine for their application and have OpenSearch for all their logs, analytics, and other data sources. Both of them are interested in the prospect of using UBI without needing to throw out their existing Solr investment, while leveraging their OpenSearch setup even more! I'm hoping that UBI support will ship in Solr in the near future with a pipeline to send that data to an OpenSearch backend (apache/solr#2452 if you are curious). This can be expanded to many other search engines. To the second question:
In some ways, I think this is a question to be answered by more experienced OpenSearch maintainers ;-). From my perspective, I see some real tension in the community on this exact question applied to many areas. If you recall, we wanted to ship UBI as part of core OpenSearch, and what we heard is "let's put LESS into core and more into plugins". However, then you look at ML Commons, which, while technically a single "plugin", really looks like a very rich independent ecosystem bundled up into a plugin. I can see a path that involves us expanding out the UI aspects (dashboards, visualizations, etc.) to the existing We are aiming to wrap the first phase by end of year, so we need to figure out what is shippable by then, while also starting to think about what we do next year, and how big do we dream?
I want to share that we had a good discussion on COEC calculations, and that it is closing in on done. We'd like to get the documentation on how to calculate COEC-based implicit judgements into the 2.18 release if possible.
GitHub project - https://github.com/o19s/opensearch-search-quality-evaluation |
[RFC] OpenSearch Search Quality Evaluation Framework
Introduction
User Behavior Insights (UBI) provides OpenSearch users with the ability to capture user behavior data for improving search relevance. Implemented as an OpenSearch plugin, UBI can connect queries with user behaviors per its defined schema. This data allows insight into judgments derived from observed user behaviors.
This RFC proposes development of an evaluation framework that uses the UBI-collected data to improve search result quality through the calculation of implicit judgments.
Thanks to the following collaborators on this RFC:
Problem Statement
As a search relevance engineer, understanding the quality of search results over time as changes to data, algorithms, and the underlying platform occur is extremely difficult, yet also critical to building robust search experiences. This is a common, long-standing problem that is notoriously difficult to solve. This is especially true for small organizations or any organization without a dedicated search team. Collecting the data and making effective use of it can be a time-consuming activity.
With the collected user data being the source of implicit judgements, there are challenges to keep in mind when calculating them, e.g. position bias (users tend to click on documents presented at the top) and presentation bias (users cannot click on what is not presented, so no data is collected for it).
Proposal
We propose developing a framework for evaluating search quality by calculating implicit judgments based on data collected by UBI, in order to optimize and improve search result quality. Ultimately, we would like the framework to perform automatic optimization by consuming UBI data, calculating implicit judgments, and then providing search tuning without manual interaction. This automation will help make the functionality usable by organizations of all sizes.
We propose modeling implicit judgments on the statistic "Clicks Over Expected Clicks" (COEC) (H. Cheng and E. Cantú-Paz, 2010). We chose this model for its confirmability and for its handling of position bias - a bias omnipresent in search applications.
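For reference, a common formulation of COEC for a query-document pair $(q, d)$ is shown below; the notation is ours, and the exact definition the framework adopts may differ.

$$\mathrm{COEC}(q,d) = \frac{\sum_{i \in I(q,d)} c_i}{\sum_{i \in I(q,d)} \overline{\mathrm{CTR}}(r_i)}$$

where $I(q,d)$ is the set of impressions of $d$ for $q$, $c_i \in \{0,1\}$ indicates whether impression $i$ was clicked, $r_i$ is the rank position at which it was shown, and $\overline{\mathrm{CTR}}(r)$ is the average click-through rate at rank $r$ across all queries.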
For teams that already have an approach to calculating implicit judgements, or that want to calculate implicit judgements with a different approach, we provide ways to integrate these judgements into the framework. In the future we envision supporting extensions that enable such calculations inside the framework.
Collecting Implicit Judgments
The UBI plugin already captures the information needed to derive implicit judgments, storing the information in two OpenSearch indexes: `ubi_queries` for search requests and search responses, and `ubi_events` for events.

Data Transformation and Calculations
The data required includes the query, the position of the search result, whether or not the search result was clicked, and a user-selectable field whose value consistently and uniquely identifies the search result. The data to be used, whether collected by UBI or not, will need to be transformed into this format.
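Illustrative only: one possible shape for a transformed interaction record. The field names below are assumptions for the sketch, not a schema defined by this RFC; the point is the four pieces of information the paragraph above calls for.

```python
interaction = {
    "user_query": "laptop sleeve 13 inch",   # the query
    "object_id": "SKU-123456",               # user-selectable unique result identifier
    "position": 2,                           # rank of the result in the response
    "clicked": True,                         # whether the result was clicked
}
```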
These operations will take place outside of OpenSearch to avoid tight coupling with OpenSearch. Not requiring the installation of a plugin will let users keep their existing judgment pipelines if they already have them.
The illustration below shows an overview of how implicit judgements are calculated. The behavioral data necessary for implicit judgements comes from users interacting with the search platform (searching, clicking on results). In the illustration we assume UBI is the tool used for collecting this data.
The search quality evaluation framework initially retrieves all seen and clicked documents for a configurable window of historical time from a user-configurable source, which by default will be the UBI indexes but will also support an Amazon S3 bucket. To calculate implicit judgements with COEC as the underlying model, two statistics are calculated:
With these two intermediate indexes the final judgements are calculated and stored in a third index.
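A compact sketch of a COEC-style implicit judgment calculation using pandas, to make the two intermediate statistics concrete: a rank-aggregated CTR per position, and per query/document clicks versus expected clicks. Column names follow the illustrative record shape above and are not part of the RFC.

```python
import pandas as pd

def coec_judgments(interactions: pd.DataFrame) -> pd.DataFrame:
    """interactions: one row per impression with columns
    user_query, object_id, position, clicked (0/1 or bool)."""
    # Statistic 1: average click-through rate per result position.
    rank_ctr = interactions.groupby("position")["clicked"].mean()

    # Statistic 2: per query/document, observed clicks and expected clicks,
    # where each impression contributes the average CTR of the position
    # at which it was shown.
    interactions = interactions.assign(
        expected=interactions["position"].map(rank_ctr)
    )
    grouped = interactions.groupby(["user_query", "object_id"]).agg(
        clicks=("clicked", "sum"),
        expected_clicks=("expected", "sum"),
    )

    # COEC: clicks over expected clicks; > 1.0 means the document attracted
    # more clicks than its positions alone would predict.
    grouped["judgment"] = grouped["clicks"] / grouped["expected_clicks"]
    return grouped.reset_index()
```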
For all three indexes, a new one is created for each of the above mentioned steps. An alias is created for the latest successful calculation. After successfully calculating implicit judgements the previous indexes can be removed when no longer needed. This is configurable to enable OpenSearch users to store implicit judgements calculated from different source data.
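A minimal sketch (using opensearch-py) of the index-per-run plus alias pattern described above; the index and alias names are placeholders, not names chosen by the RFC.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

run_id = "2024-09-30"
new_index = f"judgments-{run_id}"
alias = "judgments-latest"

# Atomically repoint the alias at the newly calculated judgments index.
# The "remove" action assumes a previous run already created the alias;
# the very first run would only need the "add" action.
client.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "judgments-*", "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]
})
```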
Using the Implicit Judgments
The implicit judgments can be used to calculate search metrics (e.g. nDCG), enable offline evaluation, and are designed to be used to train LTR models.
The primary goal of the search quality evaluation framework is to assess the search result quality with the calculated implicit judgements.
As such the framework provides several features:
Create a query sample
To assess the search result quality of a system, a subset of real-world queries is typically chosen and evaluated. The search quality evaluation framework can take queries stored in the `ubi_queries` index (or another index with data stored in a compatible way) and apply Probability-Proportional-to-Size sampling (PPTSS) to generate a frequency-weighted query sample. The size of the resulting sample is configurable, with the default value set to 3,000.

Download, upload or change a query sample
For OpenSearch users who already have a query sample, it is possible to upload it to the search quality evaluation framework directly. Changing an existing query sample is possible by downloading it, making the desired changes, and uploading it again. Storing multiple query samples is possible.
Calculate metrics based on a query sample
Having a query sample and a set of implicit judgements enables calculating search result quality metrics. Supported metrics are the classic Information Retrieval metrics such as nDCG, AP, and friends. We are looking at the metrics used in the TREC eval project, and will either look for a Java library or reimplement the metrics ourselves.
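A small sketch of nDCG@k computed from graded judgments, just to make the metric concrete; this is the textbook formula, not code from the framework.

```python
import math

def dcg(gains, k):
    # gains: judgment value of each returned document, in ranked order.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(gains, k=10):
    ideal = sorted(gains, reverse=True)
    denom = dcg(ideal, k)
    return dcg(gains, k) / denom if denom > 0 else 0.0

print(ndcg([0.9, 0.0, 0.4, 0.7], k=4))
```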
Together with the actual search metrics, the search quality evaluation framework calculates statistical measures that let users assess statistical significance by doing a t-test or calculating the p-value. Users specify which runs to compare and the system calculates the t-score and p-value. This is done by passing the unique IDs under which the test results are stored in the corresponding OpenSearch index.
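A sketch of the significance test described above: compare per-query metric values for two stored runs with a paired t-test. The values shown are made up for illustration; in practice they would be fetched from the metrics index by run ID.

```python
from scipy import stats

# Per-query nDCG values for two runs, aligned on the same query sample.
run_a = [0.62, 0.55, 0.71, 0.48, 0.66]
run_b = [0.65, 0.59, 0.70, 0.53, 0.69]

t_score, p_value = stats.ttest_rel(run_a, run_b)
print(f"t = {t_score:.3f}, p = {p_value:.3f}")
```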
By choosing the query sample to evaluate, users can run evaluation jobs on different samples, e.g. to see the effects of changes on specialized query samples.
Track calculated search result quality over time
Every metric that is calculated is stored in an index within OpenSearch. That way it is possible to measure and visualize the progression of search metrics for a query sample over time, e.g. with a dedicated dashboard.
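A minimal sketch of persisting a calculated metric so it can be charted over time; the index name and document shape are illustrative assumptions, not defined by the RFC.

```python
from datetime import datetime, timezone
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# One document per (query sample, metric, run), keyed by timestamp for dashboards.
client.index(index="search_quality_metrics", body={
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "query_sample": "sample-2024-09",
    "metric": "ndcg@10",
    "value": 0.63,
})
```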
Changes to UBI
We may find that changes to UBI are necessary. We will strive not to change UBI's current data formats, instead opting to use UBI "out of the box" as much as possible so as not to interfere with other users of UBI.
We do NOT require that the OpenSearch UBI plugin be enabled in order to use this tooling. As long as you have data conforming to the UBI schema and provide it to the tooling, then you can use these features. However, having UBI collect the signal data is the most seamless way to get implicit judgements calculated.
GitHub Repository
This work will initially be built outside of the OpenSearch GitHub project, but we hope to transfer the repository to "living" inside the OpenSearch GitHub organization as soon as possible. We are tentatively calling the repository `search-quality-evaluation`.

Roadmap
We will attempt to follow the OpenSearch release schedule, with incremental progress shipping in OpenSearch releases. This will enable early use and feedback.
Conclusion
The availability of the UBI plugin and the data it collects provides opportunities for improving search result quality. The method presented in this RFC is not the only way the data can be leveraged; rather, it describes one method that has shown success in several industries and types of search environments. We believe this makes it a good method for a first implementation. Once executed, we hope this will lead to the implementation of additional "pluggable" methods that give OpenSearch users a choice in how to model their data and improve search result quality.