Contributing CharacterSubstituters

floere edited this page Nov 21, 2010 · 5 revisions

Glad you’d like to add a character substituter to Picky!

What is it?

A character substituter normalizes single characters. For example, you might want your indexer or query to convert umlauts into their non-umlaut equivalents:
ü -> ue (yes, a single character can be normalized into several)
This is already built in, and you can use it as follows:

  # For indexing:
  default_indexing substitutes_characters_with: CharacterSubstituters::WestEuropean.new
  
  # For querying:
  default_querying substitutes_characters_with: CharacterSubstituters::WestEuropean.new

However, the West European character substituter only changes ö into oe, ç into c, and similar. There is, for example, no conversion defined for Polish or Russian characters.

So you might want to write your own. It’s easy.

How to do it

  1. Check the available Sources to see whether the substituter already exists.
  2. If yes, use (and improve) it :)
  3. If not, fork the repository and follow the instructions below.

Every character substituter should implement the substitute(text) method. This method is called by the indexer and/or query.

  • substitute(text) # Substitute characters in the text and return a new text.
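
As an illustration of that interface, here is a minimal sketch of a substituter for Polish characters. The class name `PolishSubstituter` and the mapping table are assumptions made up for this example; they are not part of Picky. The only thing taken from the text above is the contract: `substitute(text)` returns a new text.

```ruby
# Hypothetical example: neither the class name nor the mapping is part of Picky.
class PolishSubstituter
  # Illustrative mapping from Polish letters to plain ASCII letters.
  MAPPING = {
    'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n',
    'ó' => 'o', 'ś' => 's', 'ź' => 'z', 'ż' => 'z',
    'Ą' => 'A', 'Ć' => 'C', 'Ę' => 'E', 'Ł' => 'L', 'Ń' => 'N',
    'Ó' => 'O', 'Ś' => 'S', 'Ź' => 'Z', 'Ż' => 'Z'
  }.freeze

  # One pattern that matches any of the mapped characters.
  PATTERN = Regexp.union(MAPPING.keys)

  # Returns a new string; the original text is left untouched.
  def substitute(text)
    text.gsub(PATTERN, MAPPING)
  end
end

puts PolishSubstituter.new.substitute('Łódź') # prints "Lodz"
```

A lookup table keeps the sketch short and makes the covered characters easy to review. It only handles precomposed characters, so input in decomposed form would need normalizing first.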

Example

This is how the West European character substituter implements the substitute method (at the time of this writing). See also the spec to see what it does.

  def substitute text
    trans = @chars.new(text).normalize(:kd)
    
    # substitute special cases
    #
    trans.gsub!('ß', 'ss')
    
    # substitute umlauts (of A,O,U,a,o,u)
    #
    trans.gsub!(/([AOUaou])\314\210/u, '\1e')
    
    # get rid of acutes, graves and …
    #
    trans.unpack('U*').select { |cp|
      cp < 0x0300 || cp > 0x035F
    }.pack('U*')
  end
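
For readers who want to try the same three steps outside of Picky, the following is a rough standalone sketch. It assumes Ruby 2.2 or later, where `String#unicode_normalize` is available, and uses it in place of the `@chars` wrapper above. The class name `WestEuropeanSketch` is hypothetical, and this is not Picky's actual implementation.

```ruby
# Hypothetical standalone sketch; assumes Ruby >= 2.2 and UTF-8 input.
class WestEuropeanSketch
  def substitute(text)
    # Decompose characters, e.g. "ü" becomes "u" followed by U+0308.
    trans = text.unicode_normalize(:nfkd)

    # Special case: ß has no decomposition, so replace it explicitly.
    trans = trans.gsub('ß', 'ss')

    # Umlauts of A, O, U, a, o, u: base letter + combining diaeresis -> letter + "e".
    trans = trans.gsub(/([AOUaou])\u0308/, '\1e')

    # Drop the remaining combining marks in U+0300..U+035F (acutes, graves, cedillas).
    trans.gsub(/[\u0300-\u035F]/, '')
  end
end

sub = WestEuropeanSketch.new
puts sub.substitute('Müller') # prints "Mueller"
puts sub.substitute('façade') # prints "facade"
puts sub.substitute('Straße') # prints "Strasse"
```

The order matters: the umlaut rule has to run before the combining marks are stripped, otherwise ü would end up as u instead of ue.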