Handling Unicode
This page answers all your questions regarding “How to make Picky work with my Unicode character sets”.
The basic generator will generate a project that does not work with e.g. Japanese or Cyrillic character sets.
This is because indexing will remove all characters not defined in the example’s “negative” (“remove not these”) regexp:
indexing removes_characters: /[^a-z0-9\s]/i
(whitespace, including newlines, is not removed so the text can be split on it later on). What this means is that every character other than an ASCII letter, a digit, or whitespace is simply removed, including all Japanese or Cyrillic characters.
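To see the effect outside of Picky, the pattern can be tried on plain strings. This is only a sketch: it assumes the option behaves like a simple substitution with an empty string, which approximates the indexing step rather than reproducing Picky's internals, and the sample strings are made up for illustration.

```ruby
# The generator's default pattern: remove everything that is not an
# ASCII letter, a digit, or whitespace.
ascii_only = /[^a-z0-9\s]/i

# Assumption: indexing behaves roughly like a gsub with an empty string.
latin    = "Hello World 42".gsub(ascii_only, '')
cyrillic = "Привет мир 42".gsub(ascii_only, '')

p latin     # the Latin text survives unchanged
p cyrillic  # only the whitespace and the digits are left
```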
If you wish Picky to index your Cyrillic characters, you need to tell it to do so:
indexing removes_characters: /[^\p{Cyrillic}0-9\s]/i
This means: “Indexing removes characters, but not Cyrillic or numeric ones, or whitespace (including newlines)”.
The Ruby documentation has more on Unicode character classes: http://www.ruby-doc.org/core-1.9.3/Regexp.html (look for \p{<character_class_name>}).
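The Cyrillic pattern can be tried the same way. The sketch below again assumes a plain substitution stands in for the indexing step. Note that the pattern as written no longer lists a-z, so Latin letters are removed; the second pattern, which is an illustration and not part of the generated project, keeps both scripts.

```ruby
# Pattern from the text: keep Cyrillic letters, digits, and whitespace.
cyrillic_only = /[^\p{Cyrillic}0-9\s]/i

# Illustrative variant (an assumption, not from the generator):
# keep Latin letters as well.
cyrillic_and_latin = /[^\p{Cyrillic}a-z0-9\s]/i

text = "Привет, World 42!"

kept_cyrillic = text.gsub(cyrillic_only, '')
kept_both     = text.gsub(cyrillic_and_latin, '')

p kept_cyrillic  # Cyrillic and digits remain, "World" is gone
p kept_both      # both words remain, punctuation is gone
```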
Picky uses downcase! to make searches case insensitive. There are two reasons why this might cause problems with Unicode strings.
Ruby versions before 2.4 only know how to make ASCII characters lower case. As an example, see issue 76: Ruby does not downcase Cyrillic characters there.
Equivalence between characters is more complicated than with ASCII strings, for both a technical and a cultural reason. For the full discussion of the issues, see: Unicode equivalence.
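The technical side of the equivalence problem can be sketched with a single accented letter. The example assumes Ruby 2.2 or later, where String#unicode_normalize is part of the standard library; that method is not mentioned in the original discussion and is used here only to illustrate the point.

```ruby
# The same visible character "é", encoded in two different ways.
composed   = "\u00E9"   # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # two code points: "e" plus COMBINING ACUTE ACCENT

# Compared as they are, the strings differ, so a search for one
# would not find the other.
equal_raw = (composed == decomposed)

# After normalizing both to the same form (NFC here), they compare equal.
equal_normalized =
  composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)

p equal_raw
p equal_normalized
```

Whatever normalization is chosen needs to be applied to both the indexed text and the query, otherwise the two sides still will not match.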
Mitya solved the Cyrillic downcasing case from issue 76 the following way: http://github.com/floere/picky/issues/76#issuecomment-5280965, reprinted in full:
require 'unicode' # gem

class String
  def downcase
    Unicode::downcase(self)
  end
  def downcase!
    self.replace downcase
  end
end
Manfred Stienstra helpfully offers a few other solutions: http://github.com/floere/picky/issues/76#issuecomment-5280616. For example, if you used his Unichars lib:
require 'unichars' # gem

class String
  def downcase
    Unichars.new(self).normalize.downcase
  end
  def downcase!
    self.replace downcase
  end
end
The libraries mentioned differ in scope and performance. We suggest you use the one that fits your needs.
Remember: If all else fails, you can always override String#downcase! to have Ruby and Picky work as you need them to.
class String
  def downcase
    # Your correct downcase implementation that also works for Klingon
  end
  def downcase!
    self.replace downcase
  end
end
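Before committing to an override, it can help to check whether the Ruby in use needs one at all. The following is a minimal sketch: the helper name downcases_cyrillic? is hypothetical, and the check only covers Cyrillic, so it says nothing about other scripts or about normalization.

```ruby
# Hypothetical helper: reports whether the built-in String#downcase
# already lower-cases Cyrillic letters on this Ruby.
def downcases_cyrillic?
  "ПРИВЕТ".downcase == "привет"
end

if downcases_cyrillic?
  puts "Built-in downcase handles Cyrillic; an override may be unnecessary."
else
  puts "Built-in downcase is ASCII-only here; consider an override like the ones above."
end
```

Ruby 2.4 and later ship a Unicode-aware downcase, so the first branch is the expected one there.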