SOLR-16667 #4

aruggero · 2023-03-29T11:56:09Z

https://issues.apache.org/jira/browse/SOLR-16667

Description

Currently, the feature vector cache in the learning-to-rank module is used only for logging purposes.
It would be useful to also use the cache in the reranking phase to speed up the process.

Solution

A new learning-to-rank feature vector cache has been added to speed up the reranking process. Before this contribution, there was a single LTR cache, used only for logging. That cache has been removed, and a single new cache now serves both logging and reranking.
Currently, the cache key is built from: the feature store name, the feature definitions (in the feature store), and the document id.
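As a rough illustration of how such a key could be assembled, here is a minimal sketch. It assumes, following the review discussion further down, that the query-dependent part (feature store name and feature definitions) is hashed once and then combined with each document id. The class name `FeatureVectorKeySketch`, the method `keyFor`, and the use of plain strings for the feature definitions are hypothetical and are not taken from the patch.

```java
import java.util.List;
import java.util.Objects;

/** Hypothetical sketch of a feature vector cache key; not the code in this PR. */
final class FeatureVectorKeySketch {
  // Query-dependent part of the key, computed once when the object is created.
  private final int queryPartHash;

  FeatureVectorKeySketch(String featureStoreName, List<String> featureDefinitions) {
    // Assumption: the feature definitions are represented here as plain strings.
    this.queryPartHash = Objects.hash(featureStoreName, featureDefinitions);
  }

  /** Combines the precomputed query part with the per-document id. */
  int keyFor(int docId) {
    return 31 * queryPartHash + docId;
  }
}
```

Hashing the query part once avoids repeating the same work for every document that is reranked by the same query.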
The cache is defined in the Solr config as:

<query>
  <featureVectorCache class="solr.CaffeineCache" size="4096" initialSize="2048" autowarmCount="0" />
</query>

The cache is then used in org.apache.solr.ltr.LTRScoringQuery.ModelWeight.ModelScorer.FeatureTraversalScorer#fillFeaturesInfo.
On a cache miss, the old behavior is maintained and the feature vector is calculated from scratch.
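The hit-or-compute flow described above can be sketched as follows. This is only an illustration: a plain `HashMap` stands in for the Solr cache so that the example is self-contained, and the names `CachedFeatureVectorsSketch`, `featuresFor`, and `extractor` are hypothetical rather than taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical sketch of "use the cached vector, otherwise compute it"; not the code in this PR. */
final class CachedFeatureVectorsSketch {
  // Stand-in for the configured featureVectorCache (assumption: keyed by an int key).
  private final Map<Integer, float[]> cache = new HashMap<>();
  // Counts how many times the feature vector had to be computed from scratch.
  private int computations = 0;

  float[] featuresFor(int key, int docId, IntFunction<float[]> extractor) {
    float[] cached = cache.get(key);
    if (cached != null) {
      return cached; // cache hit: reuse the stored feature vector
    }
    // Cache miss: fall back to the old behavior and compute from scratch.
    float[] computed = extractor.apply(docId);
    computations++;
    cache.put(key, computed);
    return computed;
  }

  int computations() {
    return computations;
  }
}
```

Calling `featuresFor` twice with the same key runs the extractor only once; the second call returns the stored vector.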

A change has also been made in org.apache.solr.ltr.model.LinearModel#score and org.apache.solr.ltr.model.NeuralNetworkModel.DefaultLayer#calculateOutput so that they can handle NaN values.
When a sparse or dense feature vector format is requested, we would like to:

show all the feature values in the dense format (with defaults)
show only the computed feature values in the sparse format (no defaults)

To apply this behavior, we need to differentiate between a "default" value and a computed value that happens to equal the default.
Suppose we have a boolean feature. If the feature value is not defined, the default is assigned (zero, with no computation done), but zero is also the value produced when the feature is false (here the computation is done).
How can the two cases be differentiated?
The user can differentiate them by defining NaN as the default value of that feature. They will then see:

dense: all the feature values (with defaults)
sparse: only the computed values (including computed zero values)

Hence the need to handle these NaN values in the linear model and in the neural network model (their behavior has not been changed).
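To make the distinction concrete, here is a minimal sketch of how a NaN default separates "not computed" from "computed and equal to zero". Everything in it is an assumption for illustration: the class `NaNDefaultSketch`, the `name=value` output layout, and in particular the choice to let a NaN feature contribute nothing to a linear score, which is one plausible reading of "managing NaN" and not necessarily what the patch does.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of sparse/dense output with NaN defaults; not the code in this PR. */
final class NaNDefaultSketch {
  /** Dense format: every feature is shown, including those still at the NaN default. */
  static String dense(String[] names, float[] values) {
    List<String> parts = new ArrayList<>();
    for (int i = 0; i < names.length; i++) {
      parts.add(names[i] + "=" + values[i]);
    }
    return String.join(",", parts);
  }

  /** Sparse format: only computed values are shown; a computed zero is kept. */
  static String sparse(String[] names, float[] values) {
    List<String> parts = new ArrayList<>();
    for (int i = 0; i < names.length; i++) {
      if (!Float.isNaN(values[i])) {
        parts.add(names[i] + "=" + values[i]);
      }
    }
    return String.join(",", parts);
  }

  /** Assumption: a NaN feature adds nothing to the score instead of turning it into NaN. */
  static float linearScore(float[] values, float[] weights) {
    float score = 0f;
    for (int i = 0; i < values.length; i++) {
      if (!Float.isNaN(values[i])) {
        score += values[i] * weights[i];
      }
    }
    return score;
  }
}
```

With features `isBook=0.0` (computed, false), `popularity=NaN` (never computed), and `hasDiscount=1.0`, the dense string lists all three while the sparse string omits only `popularity`.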

Tests

A test has been added to check that the feature vectors of the results returned after a cache hit are the same as those computed from scratch: org/apache/solr/ltr/TestFeatureVectorCache.java
A test has been added to check the new sparse/dense format behavior: org.apache.solr.ltr.feature.TestFeatureLogging#testDefaultNaNFeatureExtraction
Some tests have been changed to correctly match the default format (sparse or dense) chosen when the test starts up: org/apache/solr/ltr/feature/TestFieldValueFeature.java

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Reference Guide.

alessandrobenedetti

Some minor comments, but the overall pull request looks OK!

alessandrobenedetti · 2023-04-13T10:28:44Z

solr/modules/ltr/src/java/org/apache/solr/ltr/FeatureLogger.java

-
-  public abstract String makeFeatureVector(LTRScoringQuery.FeatureInfo[] featuresInfo);
-
-  private static int fvCacheKey(LTRScoringQuery scoringQuery, int docid) {


The generation of the cache key has been moved out of this class. What is the motivation?

aruggero · 2023-04-14

This method is no longer invoked inside the logger (see row 81 below), so I moved it to where it is called (in LTRScoringQuery).

aruggero · 2023-04-14

Also, it is now called when creating the LTRScoringQuery (for the query part of the key), and the result is stored as a private variable in the LTRScoringQuery object.

solr/modules/ltr/src/java/org/apache/solr/ltr/LTRScoringQuery.java

solr/solr-ref-guide/modules/query-guide/pages/learning-to-rank.adoc

aruggero and others added 29 commits February 17, 2023 11:38

First changes for cache integration in ltr ranking

9e9ae26

Added cache in sparse model scorer

dbb2931

Removed logSingleHit

5e3895f

Removed throws error because not arisen

5e81f2c

Added the new featureVectorCache in the SolrIndexSearcher

84a4646

Merge remote-tracking branch 'UpStream/main' into ltr_feature_vector_cache_ranking

d137fd1

Removed unuseful piece of code in scoreSingleHit

b6bdba8

added comments for topN parameter

27ed1cc

Removed unthrown exceptions

f78fead

Fixed cache usage in LTRScoringQuery score of both sparse and dense scorers

f6415a2

Alternative approach with only floats in the feature vector to cache (not in the right order with respect to FeaturesInfo)

ca30f38

Merge remote-tracking branch 'upstream/main' into alternative_solution_for_feature_vector_cache

0ec2533

first draft

09ec0b7

Reversed changes not related to feature vector cache

01a09d3

Refactoring

f5ea825

Adjusted tests with dense returned feature format and fixed document Id in feature vector cache

bad9950

Fixed another test with dense format

61740bc

moved context to featureTraversalRescorer

07c4bbc

Reversed randomic format tests

5a745bf

Fixed tests with random feature format

c125c1b

Added NaN check also in vector extraction for isUsed condition

6b8e01f

Gradlew tidy

25ea7ac

Changes cache key for feature vector: features definition + efi + docId

b198688

Divided query part of the feature vector key from the document part

62af786

Fixed problem with fvKey. Put featureStoreName instead of features (which are empty if only logging is required)

8c07a05

Added features definition in the feature vector key

0433606

added space

a80baaa

Added test for feature vector cache and checked if enabled in LTRScoringQuery before using it

c579b02

Removed last featureVectorCache configuration from config test files

fb934d7

aruggero requested a review from alessandrobenedetti March 29, 2023 11:56

aruggero and others added 5 commits March 29, 2023 14:58

Added documentation for new cache and sparse format

d5161f6

minor changes to start pipeline

b020225

minor changes to start pipeline

e67af0c

minor changes to start pipeline

00db16b

refactor

bb5cdcf

alessandrobenedetti approved these changes Apr 13, 2023

Sease and others added 4 commits April 13, 2023 22:11

fixed test

c9900b1

gradlew tidy

1d3f326

Created a separate method for efi hash

f8f068e

Changed isUsed in isDefaultValue

a6d459f
