irisa-text-normalizer

Text normalisation tools from the IRISA lab (https://github.com/glecorve/irisa-text-normalizer)

Synopsis

The tools provided here are split into 3 steps:

Tokenisation (adding blanks around punctuation marks, dealing with special cases like URLs, etc.)
Generic normalisation (producing homogeneous texts in which (almost) no information has been lost and in which tags have been added for some entities)
Specific normalisation (projection of the generic texts into specific forms)
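
To make the first step concrete, here is a toy illustration of what "adding blanks around punctuation marks" means. It is not the tokenizer shipped in this repository: `toy_tokenise` is a hypothetical helper built on `sed`, and it deliberately leaves the period alone, since periods in URLs, abbreviations and decimal numbers are the kind of special case the real tokenizer exists to handle.

```shell
#!/usr/bin/env bash
# Toy illustration only: NOT the IRISA tokenizer.
# Puts a blank on each side of a few punctuation marks, then squeezes
# repeated blanks and trims both ends of the line.
toy_tokenise() {
  printf '%s\n' "$1" \
    | sed -E 's/([,;:!?()])/ \1 /g' \
    | sed -E 's/ +/ /g; s/^ //; s/ $//'
}

toy_tokenise 'Hello, world! (really?)'
# prints: Hello , world ! ( really ? )
```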

How to cite

@misc{lecorve2017normalizer,
  title={The IRISA Text Normalizer},
  author={Lecorv{\'e}, Gw{\'e}nol{\'e}},
  howpublished={\url{https://github.com/glecorve/irisa-text-normalizer}},
  year={2017}
}

Supported languages:

English
French

Commands

LANGUAGE="en" # (or "fr")
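
Since every command below builds its paths from `$LANGUAGE`, a small guard can catch a mistyped value early. This is only a sketch: `check_language` is a hypothetical helper, not part of the repository, and it accepts exactly the two language codes listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only for the two supported language codes.
check_language() {
  case "$1" in
    en|fr) return 0 ;;
    *)
      printf 'Unsupported language: %s (expected "en" or "fr")\n' "$1" >&2
      return 1
      ;;
  esac
}

LANGUAGE="en"
check_language "$LANGUAGE" || exit 1
```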

Tokenisation

perl bin/$LANGUAGE/basic-tokenizer.pl examples/$LANGUAGE/text.raw > examples/$LANGUAGE/text.tokenized.txt

Generic normalisation

perl bin/$LANGUAGE/start-generic-normalisation.pl examples/$LANGUAGE/text.tokenized.txt > examples/$LANGUAGE/text.norm.step1.txt
# <-- Here you may wish to run some extra tool -->
perl bin/$LANGUAGE/end-generic-normalisation.pl examples/$LANGUAGE/text.norm.step1.txt > examples/$LANGUAGE/text.norm.step2.txt

or simply:

bash bin/$LANGUAGE/generic-normalisation.sh examples/$LANGUAGE/text.tokenized.txt
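
The two-step form leaves room for an extra tool between the start and end scripts. The sketch below wraps that pattern in a function. `generic_with_hook` is a hypothetical name, and the assumption that the extra tool reads standard input and writes standard output is ours, not something the repository specifies. The two Perl scripts are called exactly as in the commands above, with a file argument and redirected output.

```shell
#!/usr/bin/env bash
# Sketch: generic normalisation with an optional extra tool between the two steps.
# Usage: generic_with_hook <lang> <tokenized-input> <output> [extra command...]
# Assumption: the extra command filters stdin to stdout.
generic_with_hook() {
  local lang=$1 input=$2 output=$3
  shift 3
  # With no extra command, fall back to `cat` (pass the text through unchanged).
  if [ "$#" -eq 0 ]; then set -- cat; fi

  local step1 hooked status
  step1=$(mktemp) || return 1
  hooked=$(mktemp) || { rm -f "$step1"; return 1; }

  perl "bin/$lang/start-generic-normalisation.pl" "$input" > "$step1" \
    && "$@" < "$step1" > "$hooked" \
    && perl "bin/$lang/end-generic-normalisation.pl" "$hooked" > "$output"
  status=$?

  # Remove the intermediate files whatever the outcome.
  rm -f "$step1" "$hooked"
  return "$status"
}

# Example (run from the repository root; my-extra-tool is a placeholder):
# generic_with_hook "$LANGUAGE" "examples/$LANGUAGE/text.tokenized.txt" \
#   "examples/$LANGUAGE/text.norm.step2.txt" my-extra-tool
```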

2 examples of specific normalisations

perl bin/$LANGUAGE/specific-normalisation.pl cfg/asr.cfg examples/$LANGUAGE/text.norm.step2.txt > examples/$LANGUAGE/text.asr.txt
perl bin/$LANGUAGE/specific-normalisation.pl cfg/tts.cfg examples/$LANGUAGE/text.norm.step2.txt > examples/$LANGUAGE/text.tts.txt
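
Both commands follow the same pattern, so producing several specific forms from one generic text can be written as a loop. The following is a minimal sketch under that assumption. `specific_output_path` and `specific_all` are hypothetical helpers, and the output naming (`text.<config name>.txt`) simply mirrors the two examples above.

```shell
#!/usr/bin/env bash
# Derive the output path for a configuration file:
# cfg/asr.cfg -> examples/<lang>/text.asr.txt
specific_output_path() {
  local lang=$1 cfg=$2 name
  name=$(basename "$cfg" .cfg)
  printf 'examples/%s/text.%s.txt\n' "$lang" "$name"
}

# Run the specific normalisation once per configuration file,
# stopping at the first failure.
specific_all() {
  local lang=$1 input=$2 cfg
  shift 2
  for cfg in "$@"; do
    perl "bin/$lang/specific-normalisation.pl" "$cfg" "$input" \
      > "$(specific_output_path "$lang" "$cfg")" || return 1
  done
}

# Example (run from the repository root):
# specific_all "$LANGUAGE" "examples/$LANGUAGE/text.norm.step2.txt" cfg/asr.cfg cfg/tts.cfg
```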

Create your own configuration for specific normalisation

perl bin/$LANGUAGE/specific-normalisation.pl -h
