Name2CType data wrong for many Indic scripts? #146
Comments
See "Why do some Unicode combining markers (like \u0BCD) not match [:alpha:] in Ruby?" on Stack Overflow for a discussion, partially reproduced below. The two characters in question are (I have marked some interesting things in bold):
The Ruby documentation for the While not explicitly documented, it makes sense to equate the On the other hand, the documentation for Onigmo does explicitly specify the workings of
So, what seems to be going on, is that the Unicode Consortium does not consider U+0BCD to be alphabetic, and therefore, Onigmo and Ruby do not classify it as |
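The distinction drawn here, between a character's general category and its Alphabetic property, can be checked directly from Ruby. The following is a minimal sketch, assuming a Ruby version with `String#match?` (2.4 or later) and a UTF-8 source file; the variable names are illustrative only and the results depend on the Unicode data bundled with the Ruby build in use.

```ruby
# Compare general category (Mn) with alphabetic classification
# for the two Tamil characters discussed in this thread.
ii     = "\u0BC0" # TAMIL VOWEL SIGN II
virama = "\u0BCD" # TAMIL SIGN VIRAMA

results = [ii, virama].map do |ch|
  {
    codepoint:   format("U+%04X", ch.ord),
    mn:          ch.match?(/\p{Mn}/),       # general category Mark, Nonspacing
    alpha_prop:  ch.match?(/\p{Alpha}/),    # Unicode property syntax
    posix_alpha: ch.match?(/[[:alpha:]]/)   # POSIX bracket syntax
  }
end

results.each { |row| puts row.inspect }
```

If the explanation above holds, both rows report `mn: true`, while only U+0BC0 reports true for the two alphabetic checks.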
Thanks, Joerg.
Given |
I found this when trying to use Ruby Regexp on Tamil Unicode codepoint data.
Notice that both `\u0BC0` and `\u0BCD` are combining vowel markers in the `Mark, Nonspacing [Mn]` character category, which should match the `[:alpha:]` class. But `\u0BCD` does not seem to match the class. Stack Overflow told me Ruby uses Onigmo under the hood, and I found the following excerpt in `name2ctype.h` in `CR_Alpha`, `CR_Alnum`, etc.
Notice the missing `0x0bcd`.
P.S. I found a number of other missing Indic codepoints as well in that file. If you agree this is a bug I can look in the file some more and do an audit. Thanks!