Contributing CharacterSubstituters
Glad you’d like to add a character substituter to Picky!
A character substituter is what you use to normalize single characters. For example, you might want your indexer or query to convert umlauts into their non-umlaut equivalents:
ü -> ue
(Yes, a single character can be normalized into multiple ones)
This is already built in, and you can use it as follows:
# For indexing:
default_indexing substitutes_characters_with: CharacterSubstituters::WestEuropean.new
# For querying:
default_querying substitutes_characters_with: CharacterSubstituters::WestEuropean.new
However, the West European character substituter only converts characters such as ö into oe and ç into c. It defines no conversion for Polish or Russian characters, for example.
So you might want to write your own. It’s easy:
- Check the available character substituters to see whether the one you need already exists.
- If yes, use (and improve) it :)
- If not, fork the repository and follow the instructions below.
Every character substituter should implement the substitute(text) method. This method is called by the indexer and/or query.
- substitute(text)
  # Substitute characters in the text and return a new text.
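As an illustration of that interface, here is a minimal sketch of a custom substituter. The class name PolishSubstituter and its mapping table are hypothetical and not part of Picky; the only thing taken from the text above is that the object responds to substitute(text) and returns a new text.

```ruby
# Hypothetical example class, not part of Picky.
# It maps Polish letters with diacritics to plain ASCII letters.
class PolishSubstituter
  # Assumed mapping for illustration; extend or adjust as needed.
  MAPPING = {
    'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n',
    'ó' => 'o', 'ś' => 's', 'ź' => 'z', 'ż' => 'z',
    'Ą' => 'A', 'Ć' => 'C', 'Ę' => 'E', 'Ł' => 'L', 'Ń' => 'N',
    'Ó' => 'O', 'Ś' => 'S', 'Ź' => 'Z', 'Ż' => 'Z'
  }.freeze

  # One regular expression that matches any key of the mapping.
  PATTERN = Regexp.union(MAPPING.keys)

  # Returns a new string; the given text is left unchanged.
  def substitute(text)
    text.gsub(PATTERN, MAPPING)
  end
end
```

Such an object could then presumably be passed to the substitutes_characters_with option in the same way as CharacterSubstituters::WestEuropean.new above.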
This is how the West European character substituter implements the substitute method (at the time of this writing). See also the spec to see what it does.
def substitute text
  trans = @chars.new(text).normalize(:kd)
  # substitute special cases
  #
  trans.gsub!('ß', 'ss')
  # substitute umlauts (of A,O,U,a,o,u)
  #
  trans.gsub!(/([AOUaou])\314\210/u, '\1e')
  # get rid of acutes, graves and …
  #
  trans.unpack('U*').select { |cp|
    cp < 0x0300 || cp > 0x035F
  }.pack('U*')
end
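The same steps can be sketched without the @chars helper, using Ruby's built-in String#unicode_normalize (available since Ruby 2.2). This is not Picky's implementation: the method name substitute_sketch is hypothetical, and the sketch assumes that :kd above stands for compatibility decomposition (NFKD) and that \314\210 is the UTF-8 byte sequence of the combining diaeresis U+0308.

```ruby
# Hypothetical standalone sketch of the steps shown above.
def substitute_sketch(text)
  # Decompose characters: "ü" becomes "u" followed by U+0308.
  trans = text.unicode_normalize(:nfkd)

  # Special case: the sharp s has no decomposition.
  trans = trans.gsub('ß', 'ss')

  # Umlauts of A, O, U, a, o, u: replace the combining diaeresis with "e".
  trans = trans.gsub(/([AOUaou])\u0308/, '\1e')

  # Drop the remaining combining marks in the range U+0300..U+035F.
  trans.unpack('U*').reject { |cp| cp.between?(0x0300, 0x035F) }.pack('U*')
end

puts substitute_sketch('Müller')  # prints "Mueller"
puts substitute_sketch('café')    # prints "cafe"
puts substitute_sketch('Straße')  # prints "Strasse"
```

Decomposing first means one rule per combining mark is enough, rather than one rule per accented letter.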