Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-16667 #4

Open
wants to merge 38 commits into
base: main
Choose a base branch
from
Open

Conversation

aruggero
Copy link
Collaborator

@aruggero aruggero commented Mar 29, 2023

https://issues.apache.org/jira/browse/SOLR-16667

Description

Currently, the feature vector cache is used only for logging purposes in the learning-to-rank model.
It would be useful to integrate the cache usage also in the reranking phase to speed up the process.

Solution

A new learning-to-rank feature vector cache has been added to speed up the reranking process. Before this contribution, there was a unique ltr cache for logging purposes, that cache has been removed and a unique new cache has been added for both logging and reranking.
Currently the new cache stores a key defined by: the feature store name, the features definition (in the feature store), and the document id.
The cache is defined in the Solr config as:

<query>
  <featureVectorCache class="solr.CaffeineCache" size="4096" initialSize="2048" autowarmCount="0" />
</query>

Then, the cache is used in org.apache.solr.ltr.LTRScoringQuery.ModelWeight.ModelScorer.FeatureTraversalScorer#fillFeaturesInfo
If no hit happens in the cache, the old behavior is maintained and the feature vector is calculated from scratch.

A change has also been made in org.apache.solr.ltr.model.LinearModel#score and org.apache.solr.ltr.model.NeuralNetworkModel.DefaultLayer#calculateOutput in order to be able to manage NaN values.
When asking for a sparse/dense feature vector format, we would like to:

  • show all the features values in the dense format (with defaults)
  • show only the computed feature values in the sparse format (no defaults)

To apply this behavior we need to differentiate between a "default" value and a computed value that is equal to the default.
Suppose to have a boolean feature, in this case, if the feature value is not defined we will assign the default one (zero and no computation done), but zero is also the value given when the feature is false (here the computation is done).
How to differentiate the two cases?
The user can differentiate the two cases by defining NaN as the default value of that feature. In this way he will see:

  • dense: all the features values (with defaults)
  • sparse: only the computed values (also the computed zero values)

Here the need to manage these NaN values in the linear model and in the neural model (their behavior has not been changed).

Tests

A test has been added to check that the feature vectors' of the results returned after a hit in the cache are the same returned when computed from scratch: org/apache/solr/ltr/TestFeatureVectorCache.java
A test has been added to check the new sparse/format behavior: org.apache.solr.ltr.feature.TestFeatureLogging#testDefaultNaNFeatureExtraction
Some tests have been changed to correctly match the default format (sparse or dense) chosen when starting up the test: org/apache/solr/ltr/feature/TestFieldValueFeature.java

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

aruggero and others added 29 commits February 17, 2023 11:38
…(not in the right order with respect to FeaturesInfo)
Copy link
Member

@alessandrobenedetti alessandrobenedetti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments, but the overall pull request looks OK!


public abstract String makeFeatureVector(LTRScoringQuery.FeatureInfo[] featuresInfo);

private static int fvCacheKey(LTRScoringQuery scoringQuery, int docid) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the generation of the cache key has been moved from this class, motivation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is no longer invoked inside the Logger (see row 81 below), therefore I moved it where it is called (in LTRScoringQuery)

Copy link
Collaborator Author

@aruggero aruggero Apr 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also this is now called when creating the LTRScoringQuery (query part of the key) and added as a private variable in the LTRScoringQuery object

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants