2.2.0: Add support for tokenizers (#566)
* Add support for tokenizers
1 parent 5304068, commit e88a65e
Showing 84 changed files with 2,759 additions and 1,495 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
Helper functionality | ||
==================== | ||
|
||
Helper | ||
------ | ||
|
||
.. currentmodule:: montreal_forced_aligner.tokenization.tokenizer | ||
|
||
.. autosummary:: | ||
:toctree: generated/ | ||
|
||
TokenizerRewriter | ||
TokenizerArguments | ||
TokenizerFunction |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
|
||
.. _tokenization_api: | ||
|
||
Tokenizers | ||
========== | ||
|
||
Tokenizers add spaces as word boundaries for orthographic systems that do not normally mark them (e.g., Japanese, Chinese, Thai). | ||
|
||
.. currentmodule:: montreal_forced_aligner.models | ||
|
||
.. autosummary:: | ||
:toctree: generated/ | ||
|
||
TokenizerModel | ||
|
||
.. toctree:: | ||
|
||
training | ||
tokenizer | ||
helper |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
|
||
.. _tokenizer_api: | ||
|
||
Corpus tokenizer | ||
================= | ||
|
||
.. currentmodule:: montreal_forced_aligner.tokenization.tokenizer | ||
|
||
.. autosummary:: | ||
:toctree: generated/ | ||
|
||
CorpusTokenizer | ||
TokenizerValidator |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
|
||
.. _tokenizer_model_training_api: | ||
|
||
Training tokenizer models | ||
========================= | ||
|
||
.. currentmodule:: montreal_forced_aligner.tokenization.trainer | ||
|
||
.. autosummary:: | ||
:toctree: generated/ | ||
|
||
TokenizerTrainer -- Trainer for tokenizer models on text corpora |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,3 +9,4 @@ Workflows | |
transcription/index | ||
segmentation/index | ||
diarization/index | ||
tokenization/index |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
|
||
.. _tokenize_cli: | ||
|
||
Tokenize utterances ``(mfa tokenize)`` | ||
========================================= | ||
|
||
Use a model trained with :ref:`train_tokenizer_cli` to tokenize a corpus (i.e., insert spaces as word boundaries for orthographic systems that do not require them). | ||
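As an illustration of what "inserting spaces as word boundaries" means for a single utterance, here is a minimal Python sketch. It is not part of this commit and does not use the MFA API: the function name ``insert_boundaries`` and the idea of passing boundary positions as a list of indices are assumptions made for the example. A trained tokenizer model would be what decides those positions.

```python
from typing import Iterable


def insert_boundaries(text: str, boundaries: Iterable[int]) -> str:
    """Insert a space before each character index listed in ``boundaries``.

    Hypothetical helper for illustration only. Indices are Python string
    (code point) offsets; out-of-range or edge indices are ignored.
    """
    # Keep only cut points that fall strictly inside the string.
    cut_points = sorted({b for b in boundaries if 0 < b < len(text)})
    pieces = []
    start = 0
    for cut in cut_points:
        pieces.append(text[start:cut])
        start = cut
    pieces.append(text[start:])
    return " ".join(pieces)


# Boundary positions here are hand-picked, standing in for a model's output.
tokenized = insert_boundaries("私は学生です", [1, 2, 4])
print(tokenized)
```

The characters of the utterance are left untouched and stay in order; only the spaces are new.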
|
||
Command reference | ||
----------------- | ||
|
||
.. click:: montreal_forced_aligner.command_line.tokenize:tokenize_cli | ||
:prog: mfa tokenize | ||
:nested: full | ||
|
||
|
||
API reference | ||
------------- | ||
|
||
- :ref:`tokenization_api` |
docs/source/user_guide/corpus_creation/train_tokenizer.rst (24 changes: 24 additions & 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
|
||
.. _train_tokenizer_cli: | ||
|
||
Train a word tokenizer ``(mfa train_tokenizer)`` | ||
================================================ | ||
|
||
A tokenizer is trained as a simplified sequence-to-sequence model, similar to G2P, with the following differences: | ||
|
||
* Both the input and output symbols are graphemes | ||
* Each symbol can only output itself | ||
* The only insertion allowed is the space character | ||
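The constraints listed above can be sketched in a few lines of Python. This is an illustrative sketch, not code from this commit or from the MFA API: the helper names ``make_training_pair`` and ``is_valid_tokenization`` are hypothetical, and the assumption is that training data comes from text that already contains word boundaries.

```python
from typing import Tuple


def make_training_pair(spaced_text: str) -> Tuple[str, str]:
    """Build an (input, target) pair from text that already has spaces.

    Hypothetical helper: the target is the whitespace-normalized text and
    the input is the same grapheme sequence with every space removed.
    """
    target = " ".join(spaced_text.split())
    source = target.replace(" ", "")
    return source, target


def is_valid_tokenization(source: str, output: str) -> bool:
    """Check that ``output`` differs from ``source`` only by inserted spaces.

    Every input symbol must be emitted unchanged and in order, so removing
    the spaces from the output has to give back the input exactly.
    """
    return output.replace(" ", "") == source


source, target = make_training_pair("the  cat sat")
print(source, "->", target)
print(is_valid_tokenization(source, target))
print(is_valid_tokenization(source, "the kat sat"))
```

Because symbols can only output themselves, the model never has to learn a grapheme-to-grapheme mapping. Its only decision at each position is whether to insert a space.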
|
||
Command reference | ||
----------------- | ||
|
||
.. click:: montreal_forced_aligner.command_line.train_tokenizer:train_tokenizer_cli | ||
:prog: mfa train_tokenizer | ||
:nested: full | ||
|
||
|
||
API reference | ||
------------- | ||
|
||
- :ref:`tokenization_api` |