# Indexing configuration
Indexing options define what Picky does with your data before it is indexed, for example where it splits the text into the words that get indexed.

Like all other index-specific options, indexing is defined in `app/application.rb`.

Define the default indexing behaviour, or the indexing behaviour for a specific index, by calling `indexing(options)` with various options (described below, in Ruby 1.9 hash style):
```ruby
class PickySearch < Application

  # ...

  indexing removes_characters: /[^a-zA-Z0-9\.]/,
           stopwords: /\b(and|or|in|on|is|has)\b/,
           splits_text_on: /\s/,
           removes_characters_after_splitting: /\./,
           substitutes_characters_with: CharacterSubstituters::WestEuropean.new,
           normalizes_words: [
             [/(.*)hausen/, '\1hn'],
             [/\b(\w*)str(eet)?/, '\1st']
           ]

  # ...

  index = Index::Memory.new(:some_index) do
    # ...
    indexing removes_characters: /[^a-zA-Z0-9\.]/,
             stopwords: /\b(and|or|in|on|is|has)\b/,
             splits_text_on: /\s/,
             removes_characters_after_splitting: /\./,
             substitutes_characters_with: CharacterSubstituters::WestEuropean.new,
             normalizes_words: [
               [/(.*)hausen/, '\1hn'],
               [/\b(\w*)str(eet)?/, '\1st']
             ]
    # ...
  end

end
```
This example:
- does not remove letters, numbers, or periods; everything else is removed.
- removes the words and, or, in, on, is, and has from the indexed text (unless such a word is the only word occurring in the data).
- splits text on whitespace, so "fish market" is indexed as "fish" and "market".
- removes characters after splitting, in the same way as `removes_characters`, but on each split word (here, periods).
- substitutes certain West European special characters, e.g. "ü" with "ue" and "ø" with "o", so they are indexed in that normalized form.
- normalizes "Petershausen" (and similar) to "Petershn", and "…street" / "…str" to "…st".
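To see roughly what these options amount to, here is a plain-Ruby sketch applying similar transformations by hand. This is not Picky's actual tokenizer (Picky also lowercases and symbolizes tokens, which is omitted here); it only illustrates the order in which the filters run:

```ruby
# Illustrative only: a plain-Ruby approximation of the filters above,
# not Picky's internal tokenizer. The space is kept in the character
# whitelist here so that splitting still has something to split on.
text = "Fish and chips in Petershausen on Bakerstreet."

text   = text.gsub(/[^a-zA-Z0-9\. ]/, '')                      # removes_characters
text   = text.gsub(/\b(and|or|in|on|is|has)\b/, '')            # stopwords
tokens = text.split(/\s+/)                                     # splits_text_on
tokens = tokens.map { |t| t.gsub(/\./, '') }                   # removes_characters_after_splitting
tokens = tokens.map { |t| t.sub(/(.*)hausen/, '\1hn') }        # normalizes_words, rule 1
tokens = tokens.map { |t| t.sub(/\b(\w*)str(eet)?/, '\1st') }  # normalizes_words, rule 2

p tokens  # => ["Fish", "chips", "Petershn", "Bakerst"]
```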
Note: The options are almost the same as in the Searching Configuration.
The options are:
- `removes_characters(regexp)`
- `stopwords(regexp)`
- `splits_text_on(regexp)`
- `removes_characters_after_splitting(regexp)`
- `substitutes_characters_with(substituter)`
- `normalizes_words(array of [regexp, replacement])`
- `rejects_token_if(a_lambda)`
By default, only one of these is defined:

```ruby
splits_text_on(/\s/)
```

So if none of the above options is defined, Picky simply splits text on whitespace (`/\s/`).
First the text is processed as a whole, then it is split into words, and finally the words are made into tokens. The sections below are in the order in which the filters are applied.
## substitutes_characters_with

This is the very first step. Characters in the text are replaced using a character substituter: an object with a `#substitute(text)` method that returns the substituted text.

Currently, Picky ships with only `CharacterSubstituters::WestEuropean.new` (see [CharacterSubstituters|Charactersubstituters-configuration]).

Example:

```ruby
substitutes_characters_with: CharacterSubstituters::WestEuropean.new
```
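Since a substituter only needs to respond to `#substitute(text)`, you can plug in your own. A minimal sketch (the class name and the character mapping are made up for illustration):

```ruby
# Hypothetical custom substituter: any object with #substitute(text) works.
class ScandinavianSubstituter
  MAPPING = { 'ø' => 'o', 'å' => 'aa', 'æ' => 'ae' }.freeze

  # Returns a new text with the special characters replaced.
  def substitute(text)
    text.gsub(/[øåæ]/) { |char| MAPPING[char] }
  end
end

p ScandinavianSubstituter.new.substitute('smørrebrød på vej')  # => "smorrebrod paa vej"
```

It would then be passed in as `indexing substitutes_characters_with: ScandinavianSubstituter.new`.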
## removes_characters

Defines which characters are removed from the indexed text.

Example:

```ruby
removes_characters: /[0-9]/
```

removes all numbers, if you don't want any number to make it into the search engine.

Note that the regexp is case sensitive, so `/[a-z]/` will only remove lowercase letters. Also note that Picky needs `:`, `"`, `~`, and `*` to function properly, so please don't remove these.

If you wish to define a whitelist, use `[^...]`, e.g. `/[^ïôåñëäöüa-zA-Z0-9\s\/\-\,\&\.\"\~\*\:]/i`.
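Conceptually this option acts like a global `gsub` over the text; a quick plain-Ruby illustration (not Picky's internals):

```ruby
# removes_characters: /[0-9]/ conceptually strips every digit from the text.
text    = "4 fish, 2 chips"
cleaned = text.gsub(/[0-9]/, '')
p cleaned  # digits gone, all other characters untouched
```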
## stopwords

Defines which words are removed from the text, after `removes_characters` has been applied.

Example:

```ruby
stopwords: /\b(and|the|of|it|in|for)\b/i
```

removes these stopwords, case-insensitively.

Note that if a stopword occurs alone, i.e. the text is just `"and"`, it is not removed from the text.
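This, too, is conceptually a `gsub`; a plain-Ruby sketch of what the stopword filter does to the text (Picky's special case for a lone stopword is not reproduced here):

```ruby
# Stopwords are blanked out of the text before it is split into tokens.
stopwords = /\b(and|the|of|it|in|for)\b/i
text      = "The quick fox and the lazy dog"
filtered  = text.gsub(stopwords, '').squeeze(' ').strip
p filtered  # => "quick fox lazy dog"
```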
## splits_text_on

Defines how the text is split into tokens. Tokens are what Picky works with and tries to find in the indexes.

So if you define `splits_text_on(/\s/)`, Picky will split the input text `"my beautiful query"` into the tokens `[:my, :beautiful, :query]`.
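A plain-Ruby approximation of that split (Picky also symbolizes the resulting words, which is imitated here):

```ruby
# Split on whitespace, then turn each word into a symbol, as in the tokens above.
text   = "my beautiful query"
tokens = text.split(/\s/).map(&:to_sym)
p tokens  # => [:my, :beautiful, :query]
```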
## normalizes_words

Defines rules for replacing words after splitting. Each rule is a `[regexp, replacement]` pair.

Example:

```ruby
normalizes_words: [[/\$(\w+)/i, '\1 dollars']]
```
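In effect, each rule is applied to each token in turn; a plain-Ruby sketch using the dollar rule above (not Picky's internals):

```ruby
# Each [regexp, replacement] rule is applied to every token.
rules  = [[/\$(\w+)/i, '\1 dollars']]
tokens = ['$100', 'cash']

normalized = tokens.map do |token|
  rules.each { |regexp, replacement| token = token.sub(regexp, replacement) }
  token
end
p normalized  # => ["100 dollars", "cash"]
```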
## removes_characters_after_splitting

This is the same as `removes_characters` (see above), but is applied to each word after splitting.
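Conceptually, the same character removal, but run per token; a plain-Ruby sketch:

```ruby
# removes_characters_after_splitting: /\./ conceptually strips periods
# from each already-split token.
tokens   = ['U.S.A.', 'market.']
stripped = tokens.map { |token| token.gsub(/\./, '') }
p stripped  # => ["USA", "market"]
```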
## rejects_token_if

Defines a lambda that can reject tokens. This is the last step.

Example:

```ruby
rejects_token_if: lambda { |token| token.blank? || token == :i_dont_like_this_token }
```
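The lambda is called once per token, and the token is dropped whenever it returns true; a plain-Ruby sketch (using `empty?` instead of Picky's `blank?` so it runs standalone, and a made-up rejected token):

```ruby
# Tokens for which the lambda returns true are dropped from the index.
rejects_token_if = lambda { |token| token.to_s.empty? || token == :amazon }

tokens = [:book, :amazon, :kindle]
kept   = tokens.reject { |token| rejects_token_if.call(token) }
p kept  # => [:book, :kindle]
```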