space-diff

Description

space-diff is a tool that highlights inconsistencies in word segmentation within spaced texts (such as training corpora) for any spaceless orthography.

This tool is pure Python and requires Python 3.7+.
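To make "segmentation inconsistency" concrete, here is a minimal sketch of the idea in Python. It is not space-diff's actual implementation: the function name `find_inconsistencies` and the `max_len` parameter are hypothetical, and the sketch only compares spans whose edges fall on token boundaries. It records every way a given unspaced string is split across the corpus and keeps the strings that are split in more than one way.

```python
from collections import defaultdict


def find_inconsistencies(lines, max_len=4):
    """Map each unspaced string to the distinct ways it is segmented.

    Hypothetical helper for illustration; not part of space-diff.
    Only strings with more than one segmentation are returned.
    """
    variants = defaultdict(set)
    for line in lines:
        tokens = line.split()
        # Consider every run of 1..max_len adjacent tokens.
        for start in range(len(tokens)):
            longest = min(start + max_len, len(tokens))
            for end in range(start + 1, longest + 1):
                window = tokens[start:end]
                # Key: the text with spaces removed.
                # Value: the segmentation as it appears in the corpus.
                variants["".join(window)].add(" ".join(window))
    return {text: ways for text, ways in variants.items() if len(ways) > 1}


corpus = ["我 喜歡 台北", "我 喜 歡 台北"]
for text, ways in sorted(find_inconsistencies(corpus).items()):
    print(text, sorted(ways))
```

Note that longer spans containing an inconsistently split string are reported as well, so a real tool would likely want to collapse those into the shortest span that differs.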

Installation

pip install space-diff

Usage/Tutorial

The project's homepage includes two sample corpora of segmented traditional Chinese (adapted from Universal Dependencies' Chinese corpora), which this tutorial uses so that you can follow along. The instructions below assume that you have already installed space-diff and downloaded the sample corpora.

Command line usage

You can simply call the tool at the command line as follows:

$ space-diff [-h] [-d] corp [corp ...]

with the optional -h/--help argument, the optional -d/--digits argument, and one or more corpus files of segmented text.
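For readers curious how such an interface is typically declared, the following is a sketch of an argparse parser that accepts the same arguments as the usage line above. It is an assumed reconstruction, not the tool's source: the function name `build_parser` and the help strings are made up for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented usage line.

    Assumed reconstruction; the real space-diff parser may differ.
    """
    parser = argparse.ArgumentParser(prog="space-diff")
    # -h/--help is added by argparse automatically.
    parser.add_argument(
        "-d", "--digits",
        action="store_true",
        help="ignore digits when searching",
    )
    # nargs="+" requires at least one corpus file.
    parser.add_argument(
        "corp",
        nargs="+",
        help="corpus file of segmented text",
    )
    return parser


args = build_parser().parse_args(["-d", "sample_corp_a.txt", "sample_corp_b.txt"])
print(args.digits, args.corp)
```

With a parser of this shape, the flag can be given before or after the corpus files.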

Using the sample data

By running:

$ space-diff sample_corp_a.txt sample_corp_b.txt

you will see that the program reports its progress as it processes the files and then prints a human-readable summary of its findings. A sample of this output is shown in sample_output.png in the project repository.

This output lets you manually review each instance of segmentation inconsistency and note which are errors and which reflect inherent variation. The idea is to then fix the actual errors in your corpora before training a segmenter (or some other stochastic tool) on that data.

Using your own data

For your own data, pass the files to space-diff, separated by spaces and with their paths if necessary, and optionally redirect the output to a file of your choice:

$ space-diff ~/path/to/thisfile.txt ~/path/to/another.txt ~/path/to/third.txt > ~/Desktop/seg_inconsistency.txt
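If you would rather drive the tool from a script, the same invocation can be sketched in Python with the standard subprocess module. The helper names `build_command` and `save_report` are hypothetical; the sketch relies only on the command-line interface documented here and assumes space-diff is installed and on your PATH.

```python
import os
import subprocess


def build_command(corpus_paths, ignore_digits=False):
    """Return the argument list for one space-diff run (hypothetical helper)."""
    if not corpus_paths:
        raise ValueError("at least one corpus file is required")
    command = ["space-diff"]
    if ignore_digits:
        command.append("-d")
    # No shell is involved, so "~" has to be expanded explicitly.
    command.extend(os.path.expanduser(str(path)) for path in corpus_paths)
    return command


def save_report(corpus_paths, report_path, ignore_digits=False):
    """Run space-diff and write its standard output to report_path."""
    with open(os.path.expanduser(report_path), "w", encoding="utf-8") as report:
        subprocess.run(
            build_command(corpus_paths, ignore_digits),
            stdout=report,
            check=True,  # raise CalledProcessError on a non-zero exit status
        )
```

A call such as `save_report(["~/path/to/thisfile.txt", "~/path/to/another.txt"], "~/Desktop/seg_inconsistency.txt")` would then mirror the shell redirection shown above.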

Excluding digits

By default, the tool treats strings like 12, 712, 1 20, and 1220 as inconsistent segmentations of a 'multi-character' token 12. If you wish to declutter the output by leaving out numerical cases like these, pass space-diff the -d flag to ignore digits in its searching.

$ space-diff -d sample_corp_a.txt sample_corp_b.txt

or

$ space-diff sample_corp_a.txt sample_corp_b.txt --digits
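As a rough illustration of what ignoring digits could look like, the sketch below drops any candidate string that contains a digit. This is one plausible reading of the flag, not a description of how space-diff implements it; `has_digit` and `drop_digit_cases` are hypothetical names.

```python
def has_digit(text):
    """Return True if any character in text is a digit."""
    return any(ch.isdigit() for ch in text)


def drop_digit_cases(candidates):
    """Keep only the candidates that contain no digits (hypothetical filter)."""
    return [candidate for candidate in candidates if not has_digit(candidate)]


print(drop_digit_cases(["12", "1 20", "喜歡", "台北"]))
```

Because `str.isdigit` is false for Chinese numerals such as 十二, a filter like this would leave them in the output.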

License

GNU GPLv3 - see LICENSE file for details.

Contact

Blake Perry Smith middlename DOT lastname+'b' AT gmail
