Add reversion for unicode confusables #20
This should help take care of, if not close out, #2.
Hi, with your pull request I get the following error with Python 3: Possible workaround: use this instead
Hmm. I haven't checked this in a while since it was never merged, so it's possible I was expecting bytes and am getting a str instead. I'll jump into this when I have time and see if I can update it for the latest phishing_catcher.
Let me know how it goes. Should I close this PR?
@x0rz I finally had a chance to update it for the latest version, and fixed some things that would've caused issues otherwise (no idea why I used I also moved
Pretty cool! I'm merging this :)
The change seems incomplete
Hey @ant1. There's a lot to be done in terms of confusables. Unfortunately, the confusables list from the Unicode Consortium is a bit tough to work with, and as it stands I had to manually spelunk around for possible candidates outside of the explicitly designated ones. Even with the list they provide, I'm not positive that everything could be covered, because of how it is organized. To explicitly cover your example,
The issue was fixed in 1d52ad2
This uses the list mentioned in Unicode TR39 to take care of many lookalike characters. It may be possible to automatically parse the provided confusables list in the future, but some characters look similar without being marked as confusables of each other. E.g., in the list, 𝞳 is marked as a confusable of ĸ, but ĸ isn't marked as a confusable of k or K. Might not be an issue, but something to consider.