Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reversion for unicode confusables #20

Merged
merged 3 commits into from
Sep 6, 2018

Conversation

caffeinewriter
Copy link

This uses the list mentioned in Unicode TR39 to take care of many lookalike characters. It may be possible to automatically parse the provided confusables list in the future, but some characters are similar looking, but not marked as a confusable for that character. E.g. In the list, 𝞳 is marked as a confusable of ĸ, but ĸ isn't marked as a confusable of k or K. Might not be an issue, but something to consider.

@caffeinewriter
Copy link
Author

caffeinewriter commented Nov 14, 2017

This should help take care of, if not close out #2

@ant1
Copy link

ant1 commented Aug 28, 2018

Hi,

With your pull request, I have the following error with python 3:
'str' object has no attribute 'decode'

Possible work around: use this instead
domain = bytearray(domain,'ascii').decode('idna')

@caffeinewriter
Copy link
Author

Hmm. I haven't checked this in a while since it was never merged, so it's possible I was expecting bytes, and am getting a str instead. I'll jump into this when I have time, and see if I can update it for the latest phishing_catcher.

@x0rz
Copy link
Owner

x0rz commented Aug 30, 2018

Let me know how it goes. Should I close this PR?
(sorry I wasn't very reactive on this one)

@caffeinewriter
Copy link
Author

caffeinewriter commented Sep 5, 2018

@x0rz I finally had a chance to update it for the latest version, and fixed some things that would've caused issues otherwise (no idea why I used str as a variable name in the first place). So the PR should still be good. If there's any issues/changes you need for it to be mergeable, let me know.

I also moved unconfuse earlier in the chain so it'll catch disguised fake TLDs. (E.g. .ċom-account-management)

@x0rz
Copy link
Owner

x0rz commented Sep 6, 2018

Pretty cool! I'm merging this :)

@x0rz x0rz merged commit 9bc0035 into x0rz:master Sep 6, 2018
@ant1
Copy link

ant1 commented Sep 7, 2018

The change seems incomplete

>>> from confusables import unconfuse
>>> unconfuse('xn--lockchin-66gn.info')
'xn--lockchin-66gn.info'
>>> print(b'xn--lockchin-66gn.info'.decode('idna'))
вlockchаin.info
>>> print(unconfuse(b'xn--lockchin-66gn.info'.decode('idna')))
blockchain.info

@caffeinewriter
Copy link
Author

caffeinewriter commented Sep 7, 2018

Hey @ant1 There's a lot to be done in terms of confusables. Unfortunately, the confusable list from the Unicode consortium is a bit tough to work with, and as is, I had to manually spelunk around for possible candidates outside of the explicitly designated ones. Even with the list they provide, I'm not positive if everything could be covered due to the organization of it.

To explicitly cover your example, в \u0432 CYRILLIC SMALL LETTER VE is only marked as a confusable with ʙ \u0299 LATIN LETTER SMALL CAPITAL B, but neither of those are marked as confusables with B \u0042 LATIN CAPITAL LETTER B.

@ant1
Copy link

ant1 commented Sep 8, 2018

The issue was fixed in 1d52ad2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants