Issue 481: Use case insensitive filter and add case insensitive string type. #496

kaladay · 2023-01-11T16:17:19Z

Description

Cannot add a filter to the solr.StrField.
According to the Solr documentation, a filter can only be added to something tokenized, and a solr.StrField does not allow tokenization. This uses a solr.TextField instead.
Several fields need to have case insensitive searches. Two new types, string_ci and strings_ci, are added that use the KeywordTokenizer. The KeywordTokenizer is essentially a pretend tokenizer. It treats the whole string as a single token, which is effectively the same as not having a tokenizer. The documentation even references the KeywordTokenizer as the method of disabling the tokenizer.

Fields that should be case insensitive are moved from string to string_ci and strings to strings_ci respectively.
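
A minimal sketch of what such field types could look like in the managed-schema. Only the names string_ci and strings_ci, the solr.TextField class, and the KeywordTokenizer come from the description above. The LowerCaseFilterFactory as the case insensitive filter, the attribute values, and the field name `subject` are assumptions for illustration, not the merged schema.

```xml
<!-- Single-valued case insensitive string: the whole value becomes one token,
     which is then lowercased. An <analyzer> without a type attribute applies
     at both index and query time. -->
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Multi-valued variant of the same type. -->
<fieldType name="strings_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true" multiValued="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field moved from string to string_ci. -->
<field name="subject" type="string_ci" indexed="true" stored="true"/>
```

Because the indexed terms change, existing documents would need to be reindexed after a switch like this, which is consistent with treating it as a breaking change.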

There are potential performance concerns with using solr.TextField rather than solr.StrField due to the loss of the docValues optimization feature.

This change requires a change to the Solr core data structure.
I consider this a breaking change.

see: https://solr.apache.org/guide/7_7/field-types-included-with-solr.html#field-types-included-with-solr
see: https://solr.apache.org/guide/7_7/field-type-definitions-and-properties.html#field-type-definitions-and-properties
see: https://solr.apache.org/guide/7_7/field-properties-by-use-case.html#field-properties-by-use-case
see: https://solr.apache.org/guide/7_7/tokenizers.html#keyword-tokenizer
see: https://solr.apache.org/guide/7_7/docvalues.html

Fixes #481

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

How Has This Been Tested?

Manually

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my code
My changes generate no new warnings
New and existing unit tests pass locally with my changes

coveralls · 2023-01-11T16:21:30Z

Coverage: 45.24% (+0.03%) from 45.215% when pulling 753b83a on 481-case_sensitive into 850bc32 on staging.

ghost · 2023-01-11T17:22:44Z

#481

The suggested approach was a TextField using KeywordTokenizerFactory.

Additionally, it was suggested to separate index time and query time with two analyzers.

For example:

    <fieldType name="whole_strings" class="solr.TextField" omitNorms="true" sortMissingLast="true" multiValued="true">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

This would simply change all fields of type whole_strings to match case insensitively by lowercasing at query time. For exact, case sensitive searches, it is common to expect the search term to be wrapped in double quotes, enforcing an exact match on the search field.
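
One caveat with the whole_strings definition as written: the lowercase filter runs only at query time. A value indexed as `Texas` would be stored as the term `Texas`, while a query for `Texas` or `texas` would be analyzed to `texas`, so the two would likely not match. A hedged variant, assuming LowerCaseFilterFactory is the intended filter and keeping the other attributes unchanged, applies it in both analyzers:

```xml
<fieldType name="whole_strings" class="solr.TextField" omitNorms="true" sortMissingLast="true" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lowercase at index time so indexed terms line up with lowercased queries. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

When both chains are identical like this, a single `<analyzer>` element without a type attribute would behave the same way. Keeping them separate only matters if index time and query time processing are expected to diverge later.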

The question is: what undesired changes in search behavior would result from making all of these fields case insensitive? A minimal-change approach could mean minimal changes to the versioned schema rather than minimal changes to search behavior. Adding field types is (obviously) not a minimal change to the versioned schema, and the resulting change in search behavior may still be minimal in terms of anticipated or expected search terms.

ghost · 2023-01-11T17:30:14Z

Not sure we need the additional field types. What behavior changes are there without the additional field types?

solr/config/managed-schema

…x date_created. The `strings_ci` is close enough to `whole_strings`, just use `whole_strings`. There is no `whole_string`. Rename `string_ci` to `whole_string`. To better prevent future problems, document these custom field types. The date_created is not multi-valued so use `whole_string`.

kaladay requested review from jeremythuff, a user and rmathew1011 January 11, 2023 16:17

kaladay linked an issue Jan 11, 2023 that may be closed by this pull request

Searches are case-sensitive for special searches, like Subject search. #481

Open

ghost reviewed Jan 11, 2023

kaladay requested a review from a user January 11, 2023 21:09

ghost approved these changes Jan 12, 2023

kaladay merged commit 2fae79e into staging Jan 12, 2023

kaladay deleted the 481-case_sensitive branch January 27, 2023 14:52