Searching Configuration
Searching options define what Picky does with your query before actually searching. E.g. where does it split the search terms?
Searching is defined in app/application.rb, like all other search-specific options.
Define the default querying behaviour, or the querying behaviour for a specific search, by calling searching(options) with various options (described below, in the Ruby 1.9 hash style):
class PickySearch < Application
  # ...
  # This sets the default behaviour, which can be overridden in a Search
  # for that specific Search.
  #
  searching removes_characters: /[^a-zA-Z0-9\s\/\-\,\\&\\"\~\*\:]/,
            stopwords: /\b(and|the|of|it|in|for)\b/,
            splits_text_on: /[\s\/\-\,\\&]+/,
            removes_characters_after_splitting: /\./,
            substitutes_characters_with: CharacterSubstituters::WestEuropean.new,
            normalizes_words: [
              [/deoxyribonucleic/i, 'DN'],
              [/\b(\w*)str(eet)?/, '\1st']
            ],
            maximum_tokens: 4
  # ...
  search = Search.new some_index do
    searching removes_characters: /[^a-zA-Z0-9\s\/\-\,\\&\\"\~\*\:]/,
              stopwords: /\b(and|the|of|it|in|for)\b/,
              splits_text_on: /[\s\/\-\,\\&]+/,
              removes_characters_after_splitting: /\./,
              substitutes_characters_with: CharacterSubstituters::WestEuropean.new,
              normalizes_words: [
                [/deoxyribonucleic/i, 'DN'],
                [/\b(\w*)str(eet)?/, '\1st']
              ],
              maximum_tokens: 4
  end
end
With this configuration, Picky will:
- not remove letters, numbers, whitespace, /, -, the comma, &, ", ~, *, and :. (Note that Picky needs you to pass through ", ~, *, and :.)
- remove the words and, the, of, it, in, and for from the query text (unless one of them is the only word in the query).
- split the query text on whitespace, /, -, the comma, and &. “peter & fish-bone” would become “peter”, “fish”, “bone”.
- remove periods from each word after splitting. This works like removes_characters, but is applied to each split word.
- substitute certain West European special characters, e.g. “ü” to “ue”, or “ø” to “o” (useful if you have them indexed that way).
- normalize “Deoxyribonucleic” (and similar) to “DN”, and “…street”, “…str” to “…st”.
- finally, let at most 4 tokens through (usually, 3 words are more than enough for Picky to find something).
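The effect of these steps can be approximated in plain Ruby, without Picky, to see what each pattern does to a query. This is only a sketch under assumptions: preprocess_query is a hypothetical helper, the character substitution step and the lone-stopword exception are left out, every normalization rule is applied in turn, and Picky's actual internals may differ.

```ruby
# Patterns copied from the example configuration.
EXAMPLE_REMOVES_CHARACTERS = /[^a-zA-Z0-9\s\/\-\,\\&\\"\~\*\:]/
EXAMPLE_STOPWORDS          = /\b(and|the|of|it|in|for)\b/
EXAMPLE_SPLITS_TEXT_ON     = /[\s\/\-\,\\&]+/
EXAMPLE_REMOVES_AFTER      = /\./
EXAMPLE_NORMALIZES_WORDS   = [
  [/deoxyribonucleic/i, 'DN'],
  [/\b(\w*)str(eet)?/, '\1st']
]
EXAMPLE_MAXIMUM_TOKENS = 4

# Hypothetical helper (not part of Picky): applies the filters one after another.
def preprocess_query(query)
  text  = query.gsub(EXAMPLE_REMOVES_CHARACTERS, '')  # drop characters not on the whitelist
  text  = text.gsub(EXAMPLE_STOPWORDS, '')            # drop stopwords
  words = text.split(EXAMPLE_SPLITS_TEXT_ON).reject(&:empty?)
  words = words.map do |word|
    # Assumption: every rule is applied to every word, in order.
    normalized = EXAMPLE_NORMALIZES_WORDS.reduce(word) do |current, (pattern, replacement)|
      current.gsub(pattern, replacement)
    end
    normalized.gsub(EXAMPLE_REMOVES_AFTER, '')        # remove periods from each word
  end
  words.first(EXAMPLE_MAXIMUM_TOKENS)                 # let at most 4 tokens through
end

p preprocess_query('peter & fish-bone')
# => ["peter", "fish", "bone"]
p preprocess_query('the mainstreet of deoxyribonucleic acid')
# => ["mainst", "DN", "acid"]
```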
Note: The options are almost the same as in the Indexing Configuration.
The options are:
- substitutes_characters_with(substituter)
- removes_characters(regexp)
- stopwords(regexp)
- splits_text_on(regexp)
- removes_characters_after_splitting(regexp)
- normalizes_words(array of [regexp, replacement])
- maximum_tokens(amount)
By default, one of these is defined:
splits_text_on(/\s/)
So, if none of the above options is defined, Picky only splits on whitespace (\s).
First, the text is processed, then split into words, and finally made into tokens.
The options are described below in the order in which the filters are applied.
substitutes_characters_with(substituter)
This is the very first step. Here, characters can be replaced in the text using a character substituter.
A character substituter is an object that has a #substitute(text) method that returns a text.
Currently, there is only CharacterSubstituters::WestEuropean.new.
Example:
substitutes_characters_with: CharacterSubstituters::WestEuropean.new
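Since a substituter is described as any object with a #substitute(text) method, writing your own should be possible. The following is a minimal sketch: UmlautSubstituter and its replacement table are hypothetical and not part of Picky.

```ruby
# Hypothetical substituter: relies only on the documented contract,
# i.e. an object responding to #substitute(text) and returning a text.
class UmlautSubstituter
  REPLACEMENTS = { 'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ø' => 'o' }.freeze

  def substitute(text)
    # String#gsub with a hash replaces each match by its mapped value.
    text.gsub(/[äöüø]/, REPLACEMENTS)
  end
end

puts UmlautSubstituter.new.substitute('zürich smørrebrød')
# prints "zuerich smorrebrod"
```

An instance would then presumably be passed as substitutes_characters_with: UmlautSubstituter.new.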
removes_characters(regexp)
Defines what characters are removed from the search text.
Example:
removes_characters: /[0-9]/
Use this if you don’t want any digits to make it to the search engine.
Note that the regexp is case sensitive, so /[a-z]/ will only remove lowercase letters.
Also note that Picky needs :, ", ~, and * to function properly, so please don’t remove these.
If you wish to define a whitelist, use [^...], e.g. /[^ïôåñëäöüa-zA-Z0-9\s\/\-\,\&\.\"\~\*\:]/i.
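To check a pattern before putting it into the configuration, it can be tried on a string in plain Ruby. This sketch assumes that removal behaves like String#gsub with an empty replacement; the constant names are hypothetical.

```ruby
DIGIT_BLACKLIST  = /[0-9]/                       # removes every digit
LOWERCASE_ONLY   = /[a-z]/                       # case sensitive: uppercase survives
SIMPLE_WHITELIST = /[^a-zA-Z0-9\s\"\~\*\:]/      # removes everything NOT listed

puts 'route 66'.gsub(DIGIT_BLACKLIST, '')
# prints "route " (the digits are gone, the space stays)
puts 'Picky'.gsub(LOWERCASE_ONLY, '')
# prints "P"
puts 'title:"hello, world!"~2'.gsub(SIMPLE_WHITELIST, '')
# prints title:"hello world"~2 (the characters Picky needs are kept)
```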
stopwords(regexp)
Defines what words are removed from the text, after removing specific characters.
Example:
stopwords: /\b(and|the|of|it|in|for)\b/i
This would remove a number of stopwords, case-insensitively.
Note that if a stopword occurs alone, e.g. "and", it is not removed from the query.
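The lone-word exception can be illustrated in plain Ruby. This is a sketch of the described behaviour, not Picky's implementation: remove_stopwords is a hypothetical helper, and re-joining the remaining words is only done to tidy the output.

```ruby
STOPWORD_PATTERN = /\b(and|the|of|it|in|for)\b/i

# Hypothetical helper: removes stopwords unless the query is a single word.
def remove_stopwords(text, stopwords = STOPWORD_PATTERN)
  return text if text.split.size <= 1          # a lone word is left untouched
  text.gsub(stopwords, '').split.join(' ')     # re-joining only tidies the gaps
end

puts remove_stopwords('lord of the rings')
# prints "lord rings"
puts remove_stopwords('and')
# prints "and"
```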
splits_text_on(regexp)
Defines how the text is split into tokens. Tokens are what Picky works with and tries to find in indexes.
So, if you define splits_text_on(/\s/), then Picky will split the input text "my beautiful query" into the tokens [:my, :beautiful, :query].
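The splitting can be tried out with String#split. In this sketch, tokens_for is a hypothetical helper, and symbols stand in for tokens only to mirror the notation used here; Picky's real tokens may be richer objects.

```ruby
# Hypothetical helper: splits text on a pattern and represents tokens as symbols.
def tokens_for(text, splits_text_on = /\s/)
  text.split(splits_text_on).reject(&:empty?).map(&:to_sym)
end

p tokens_for('my beautiful query')
# => [:my, :beautiful, :query]
p tokens_for('peter & fish-bone', /[\s\/\-\,\\&]+/)
# => [:peter, :fish, :bone]
```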
normalizes_words(array of [regexp, replacement])
Defines rules for replacing words after splitting.
Example:
normalizes_words: [[/\$(\w+)/i, '\1 dollars']]
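What such a rule does to a single word can be checked with String#gsub, where \1 refers to the first capture group. In this sketch, normalize_word is a hypothetical helper that applies every rule in turn; whether Picky stops after the first matching rule is not stated here.

```ruby
NORMALIZATION_RULES = [[/\$(\w+)/i, '\1 dollars']]

# Hypothetical helper: applies each [regexp, replacement] pair to the word.
def normalize_word(word, rules = NORMALIZATION_RULES)
  rules.reduce(word) { |current, (pattern, replacement)| current.gsub(pattern, replacement) }
end

puts normalize_word('$5')
# prints "5 dollars"
puts normalize_word('five')
# prints "five" (no rule matches, so the word is unchanged)
```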
removes_characters_after_splitting(regexp)
This is the same as removes_characters (see above), but applied after splitting.
maximum_tokens(amount)
Defines the maximum number of tokens that make it through.
If somebody is searching for "hi my name is peter and I would like to search for a car", and maximum_tokens: 4 is set, only "hi my name is" makes it through.
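The truncation corresponds to keeping only the first few tokens of the split query. A minimal sketch, assuming the default whitespace split; limit_tokens is a hypothetical helper:

```ruby
# Hypothetical helper: splits on whitespace and keeps the first maximum_tokens words.
def limit_tokens(text, maximum_tokens)
  text.split(/\s/).first(maximum_tokens)
end

p limit_tokens('hi my name is peter and I would like to search for a car', 4)
# => ["hi", "my", "name", "is"]
```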