Rank Profile

rmerizalde edited this page Apr 8, 2013 · 10 revisions

OpenCommerceSearch (OCS) uses the ExtendedDismax (extended disjunction max) query parser by default. This means a search runs across multiple fields, each of which can carry a different weight.

Schema

The index schema will vary from one application to another. The out-of-the-box schema has six searchable fields:

  • text: this field contains any terms that will be boosted the same regardless of the original field. For example: product title, brand name
  • highest: this field contains terms that will be used to apply the highest boost. For example: product id
  • high: this field contains terms that will be used to apply a higher boost. For example: product year
  • medium: this field contains terms that will be used to apply a medium boost. For example: the name of the leaf category a product is assigned to
  • low: this field contains terms that will be used to apply a low boost. For example: the names of the ancestor categories of the leaf category a product is assigned to, product size
  • lowest: this field contains terms that will be used to apply the lowest boost. For example: product color, other features or attributes

Default profile

By default, OCS uses the following weights:

text^2.5 highest^3 high^2 medium^1.6 low^1.3 lowest^1.1
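
The weights are written in the field^boost notation that Solr's dismax parsers accept in the qf parameter. As a rough illustration of how such a string breaks down into per-field boosts, here is a minimal sketch; parse_weights is a hypothetical helper written for this page, not part of OCS or Solr.

```python
def parse_weights(spec):
    """Split a 'field^boost field^boost ...' string into a dict.

    Hypothetical helper for illustration only; a field without an
    explicit boost is assumed to default to 1.0.
    """
    weights = {}
    for token in spec.split():
        field, _, boost = token.partition("^")
        weights[field] = float(boost) if boost else 1.0
    return weights


DEFAULT_PROFILE = "text^2.5 highest^3 high^2 medium^1.6 low^1.3 lowest^1.1"
weights = parse_weights(DEFAULT_PROFILE)

# Print the fields from the strongest boost to the weakest.
for field, boost in sorted(weights.items(), key=lambda item: -item[1]):
    print(field, boost)
```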

Let's analyze the following debug log output for the query 'fleece jacket' against an index using this profile:

<str name="TNF7394-FUSPK-S03M">
1.4134583 = (MATCH) sum of:
  0.90818536 = (MATCH) max of:
    0.32088697 = (MATCH) weight(text:fleec^2.5 in 1751) [DefaultSimilarity], result of:
      0.32088697 = score(doc=1751,freq=1.0 = termFreq=1.0
), product of:
        0.2954988 = queryWeight, product of:
          2.5 = boost
          4.3436656 = idf(docFreq=5703, maxDocs=161553)
          0.027211927 = queryNorm
        1.0859164 = fieldWeight in 1751, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          4.3436656 = idf(docFreq=5703, maxDocs=161553)
          0.25 = fieldNorm(doc=1751)
    0.90818536 = (MATCH) weight(highest:fleec^3.0 in 1751) [DefaultSimilarity], result of:
      0.90818536 = score(doc=1751,freq=1.0 = termFreq=1.0
), product of:
        0.38507253 = queryWeight, product of:
          3.0 = boost
          4.7169576 = idf(docFreq=3926, maxDocs=161553)
          0.027211927 = queryNorm
        2.3584788 = fieldWeight in 1751, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          4.7169576 = idf(docFreq=3926, maxDocs=161553)
          0.5 = fieldNorm(doc=1751)
  0.50527304 = (MATCH) max of:
    0.50527304 = (MATCH) weight(low:jacket^1.1 in 1751) [DefaultSimilarity], result of:
      0.50527304 = score(doc=1751,freq=2.0 = termFreq=2.0
), product of:
        0.22110957 = queryWeight, product of:
          1.1 = boost
          7.3867865 = idf(docFreq=271, maxDocs=161553)
          0.027211927 = queryNorm
        2.2851703 = fieldWeight in 1751, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          7.3867865 = idf(docFreq=271, maxDocs=161553)
          0.21875 = fieldNorm(doc=1751)
    0.12614335 = (MATCH) weight(medium:jacket^1.5 in 1751) [DefaultSimilarity], result of:
      0.12614335 = score(doc=1751,freq=1.0 = termFreq=1.0
), product of:
        0.12836081 = queryWeight, product of:
          1.5 = boost
          3.1447194 = idf(docFreq=18917, maxDocs=161553)
          0.027211927 = queryNorm
        0.9827248 = fieldWeight in 1751, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          3.1447194 = idf(docFreq=18917, maxDocs=161553)
          0.3125 = fieldNorm(doc=1751)
    0.45389915 = (MATCH) weight(highest:jacket^3.0 in 1751) [DefaultSimilarity], result of:
      0.45389915 = score(doc=1751,freq=1.0 = termFreq=1.0
), product of:
        0.27222934 = queryWeight, product of:
          3.0 = boost
          3.334682 = idf(docFreq=15644, maxDocs=161553)
          0.027211927 = queryNorm
        1.667341 = fieldWeight in 1751, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          3.334682 = idf(docFreq=15644, maxDocs=161553)
          0.5 = fieldNorm(doc=1751)
</str>

The score of the document is 1.4134583, which is the sum of the weight for the term fleece (0.90818536) and the weight for the term jacket (0.50527304):

1.4134583 = (MATCH) sum of:
  0.90818536 = (MATCH) max of:
  0.50527304 = (MATCH) max of:
</str>

Let's take a closer look at how the weight for 'fleece' is calculated. Each term can be found in multiple fields, and the search picks the highest-scoring field for that term. In this example, the term was found in two fields: text and highest. The weight for highest (0.90818536) trumps the weight for text (0.32088697).

    0.32088697 = (MATCH) weight(text:fleec^2.5 in 1751) [DefaultSimilarity], result of:
    0.90818536 = (MATCH) weight(highest:fleec^3.0 in 1751) [DefaultSimilarity], result of:
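
The "sum of max" structure in the log can be reproduced with a few lines of Python. This is a minimal sketch that takes the per-field scores straight from the debug output; dismax_score is a hypothetical name, and the sketch assumes a plain maximum per term, which is what the log shows.

```python
# Per-field scores copied from the debug output, grouped by query term.
clause_scores = {
    "fleec": {"text": 0.32088697, "highest": 0.90818536},
    "jacket": {"low": 0.50527304, "medium": 0.12614335, "highest": 0.45389915},
}


def best_field(field_scores):
    """Return the (field, score) pair with the highest score for one term."""
    return max(field_scores.items(), key=lambda item: item[1])


def dismax_score(scores_by_term):
    """Sum, over all query terms, the best field score for each term."""
    return sum(best_field(fields)[1] for fields in scores_by_term.values())


for term, fields in clause_scores.items():
    print(term, best_field(fields))
print(dismax_score(clause_scores))
```

Note that for 'jacket' the winning field is low rather than highest, even though highest has the larger boost: the term is rarer in low (docFreq=271) and appears twice there, which outweighs the boost difference.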

Lastly, let's look at how the weight for highest was computed:

      0.90818536 = score(doc=1751,freq=1.0 = termFreq=1.0), product of:
        0.38507253 = queryWeight, product of:
          3.0 = boost
          4.7169576 = idf(docFreq=3926, maxDocs=161553)
          0.027211927 = queryNorm
        2.3584788 = fieldWeight in 1751, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          4.7169576 = idf(docFreq=3926, maxDocs=161553)
          0.5 = fieldNorm(doc=1751)

The query weight is the product of the field boost in the rank profile, the inverse document frequency (idf) and the query norm (3.0 * 4.7169576 * 0.027211927 = 0.38507253).
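
As a quick check of that product, the sketch below recomputes the query weight for each clause in the log. The function name is illustrative only, and the inputs are the rounded values printed in the debug output, so results agree with the log only up to float rounding.

```python
QUERY_NORM = 0.027211927  # queryNorm, identical for every clause in the log


def query_weight(boost, idf, query_norm=QUERY_NORM):
    """queryWeight = boost * idf * queryNorm, as listed in the debug output."""
    return boost * idf * query_norm


# (field:term, boost, idf) triples copied from the log.
clauses = [
    ("text:fleec", 2.5, 4.3436656),
    ("highest:fleec", 3.0, 4.7169576),
    ("low:jacket", 1.1, 7.3867865),
    ("medium:jacket", 1.5, 3.1447194),
    ("highest:jacket", 3.0, 3.334682),
]

for name, boost, idf in clauses:
    print(f"{name}: {query_weight(boost, idf):.7f}")
```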

The idf is a measure of how rare the term is across the index. The value is calculated as log(numDocs/(docFreq+1)) + 1, where numDocs is the total number of documents in the index (shown as maxDocs in the log), docFreq is the number of documents in which the term appears, and log is the natural logarithm. This means the more documents a term appears in, the lower its idf, making common terms less relevant.
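
The idf formula is easy to verify against the docFreq and maxDocs values printed in the log. A minimal sketch (the function name is chosen here for illustration):

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency: ln(numDocs / (docFreq + 1)) + 1."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


NUM_DOCS = 161553  # maxDocs in the debug output

# docFreq values taken from the log, from the rarest term to the most common.
for doc_freq in (271, 3926, 5703, 15644, 18917):
    print(doc_freq, round(idf(doc_freq, NUM_DOCS), 4))
```

The rarest term (docFreq=271) gets the largest idf and the most common one (docFreq=18917) the smallest, matching the 7.3867865 and 3.1447194 values in the log up to float rounding.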

The queryNorm is a normalization factor that makes scores from different queries comparable. It doesn't affect ranking because, for a given query, all ranked documents are multiplied by the same factor.

The field weight is the product of the term frequency (tf), the inverse document frequency, and the field norm (1.0 * 4.7169576 * 0.5 = 2.3584788).
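
The same product can be checked for other clauses in the log. The sketch below uses an illustrative function name and the tf, idf and fieldNorm values exactly as printed in the debug output.

```python
def field_weight(tf, idf, field_norm):
    """fieldWeight = tf * idf * fieldNorm, as listed in the debug output."""
    return tf * idf * field_norm


# highest:fleec, the clause analyzed above.
print(field_weight(1.0, 4.7169576, 0.5))

# low:jacket, where the term occurs twice and the log reports tf = 1.4142135.
print(field_weight(1.4142135, 7.3867865, 0.21875))
```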

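The tf is derived from the number of times the term appears in the document (in this case, in the document's field). With the DefaultSimilarity shown in the log, it is the square root of the raw count: tf(freq=2.0) is reported as 1.4142135. Documents with a higher number of occurrences produce a higher score.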
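
In the log, tf(freq=1.0) is 1.0 and tf(freq=2.0) is 1.4142135, which is consistent with taking the square root of the raw term count. The sketch below assumes that square-root damping; the function name is illustrative.

```python
import math


def tf(freq):
    """Term frequency factor, assumed here to be sqrt(freq) as the log suggests."""
    return math.sqrt(freq)


# Extra occurrences still raise the score, but with diminishing returns.
for freq in (1, 2, 4, 9):
    print(freq, tf(freq))
```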

The field norm encapsulates index-time boosts (document and field) and length factors. The norm is the product of the document boost, the field boost and the length norm. In this example, no index boosts were added. The length norm is calculated as 1/sqrt(numTerms), where numTerms is the number of terms in the field. In this particular case, the highest field contained the four terms 'Infant Girls' Fleece Jackets', so the calculation was 1 * 1 * (1/sqrt(4)) = 0.5.
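
A minimal sketch of the norm calculation described above. The function and parameter names are chosen for illustration, and both boosts default to 1.0 to reflect that no index-time boosts were applied in this example.

```python
import math


def length_norm(num_terms):
    """Length normalization: shorter fields get a larger norm."""
    return 1.0 / math.sqrt(num_terms)


def field_norm(num_terms, doc_boost=1.0, field_boost=1.0):
    """fieldNorm = document boost * field boost * length norm."""
    return doc_boost * field_boost * length_norm(num_terms)


# 'Infant Girls' Fleece Jackets' has four terms.
print(field_norm(4))
```

Lucene stores norms in a compact, lossy encoding, so values read back from the index (such as 0.21875 or 0.3125 elsewhere in the log) will not always equal 1/sqrt(numTerms) exactly.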

Finally, the weight is the product of the query weight (0.38507253) and the field weight (2.3584788), which gives 0.90818536.
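
Putting the pieces together, the whole clause score can be recomputed from the raw inputs in the log. This is a sketch under the assumptions stated in the text (natural-log idf, square-root tf); term_score and its parameter names are hypothetical, and the results match the log only up to float rounding.

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency: ln(numDocs / (docFreq + 1)) + 1."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def term_score(boost, doc_freq, num_docs, query_norm, freq, field_norm):
    """Score of one field:term clause = queryWeight * fieldWeight."""
    term_idf = idf(doc_freq, num_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


# highest:fleec, using the raw inputs from the debug output.
highest_fleec = term_score(
    boost=3.0, doc_freq=3926, num_docs=161553,
    query_norm=0.027211927, freq=1.0, field_norm=0.5,
)

# text:fleec, the field that lost the "max of" comparison.
text_fleec = term_score(
    boost=2.5, doc_freq=5703, num_docs=161553,
    query_norm=0.027211927, freq=1.0, field_norm=0.25,
)

print(highest_fleec, text_fleec)
```

Because idf appears in both the query weight and the field weight, it contributes to the clause score squared, which is why rare terms have such a strong influence on ranking.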
