Detect unreadable Unicode text
Ever encountered text like this while crawling the internet or digging through your raw corpus?
srtytyrtyrty
Á¶À̽ÃƼ, ¡®3on3 ÇÁ¸®½ºÅ¸ÀÏ¡¯ 2Á¾ÀÇ ¿¡µð¼Ç ¹øµé Ãâ½Ã
��>+ٽT}$@�������Э�����ٗ_���=���e��
This is what flawunicode aims to pick out for you. flawunicode scores each piece of Unicode text from -1 to 1, indicating the "completeness" of the text. If a text scores below 0.4, it is likely not readable by a human.
import flawunicode

# Gibberish scores low (below the 0.4 threshold).
text = "fdsfdxvdhjkf"
flawunicode.detect(text)
>> 0.2727272727272727

# Readable text scores higher.
flawunicode.detect("Hello World!")
>> 0.6439393939393939
The underlying statistics come from a news corpus in the Currents API database, so social-network-style text may receive a low score. In that case, just compute character bi-gram frequencies from your own corpus and it should work fine.
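How to compute those frequencies is not specified here, but counting character bi-grams takes only the standard library. A minimal sketch (how flawunicode would consume the resulting table is an assumption this README does not cover):

from collections import Counter

def bigram_frequencies(corpus):
    # Count adjacent character pairs across all documents,
    # then normalize counts into relative frequencies.
    counts = Counter()
    for doc in corpus:
        counts.update(zip(doc, doc[1:]))
    total = sum(counts.values())
    return {"".join(pair): n / total for pair, n in counts.items()}

corpus = ["Hello World!", "Hello again"]
freqs = bigram_frequencies(corpus)
print(freqs["He"])  # relative frequency of the "He" bi-gram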