A Step by step manual of NiuTrans.Syntax
The NiuTrans system is a data-driven MT system, i.e., it requires data for training and/or tuning. Users must prepare the following data files before running the system.
a). Training data: bilingual sentence-pairs and word alignments.
b). Tuning data: source sentences with one or more reference translations.
c). Test data: new source sentences to be translated.
d). Evaluation data: reference translations of test sentences.
In the NiuTrans package, sample files are offered for experimenting with the system and studying the format requirements. They are located in "NiuTrans/sample-data/sample-submission-version".
sample-submission-version/
-- TM-training-set/                  # word-aligned bilingual corpus (100,000 sentence pairs)
   -- chinese.txt                    # source sentences
   -- english.txt                    # target sentences (case-removed)
   -- Alignment.txt                  # word alignments of the sentence pairs
   -- chinese.tree.txt               # parse trees of the source sentences
   -- english.tree.txt               # parse trees of the target sentences
-- LM-training-set/
   -- e.lm.txt                       # monolingual corpus for training the language model (100K target sentences)
-- Dev-set/
   -- Niu.dev.txt                    # development dataset for weight tuning (400 sentences)
   -- Niu.dev.tree.txt               # development dataset with tree annotation (on the source side)
-- Test-set/
   -- Niu.test.txt                   # test dataset (1K sentences)
   -- Niu.test.tree.txt              # test dataset with tree annotation
-- Reference-for-evaluation/
   -- Niu.test.reference             # references of the test sentences (1K sentences)
-- description-of-the-sample-data    # a description of the sample data
- Format: please unpack "NiuTrans/sample-data/sample.tar.gz" and refer to "description-of-the-sample-data" for more information about the data format.
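As a rough sketch of the layout, the snippet below builds a toy two-sentence parallel corpus. The "i-j" alignment convention (source word i linked to target word j, 0-based) is an assumption here; "description-of-the-sample-data" remains the authoritative format reference, and the pinyin/English content is invented for illustration.

```shell
# Toy illustration only: three line-parallel files, one sentence per line.
# The "srcIndex-tgtIndex" alignment pairs are an assumed convention; see
# description-of-the-sample-data for the real format.
mkdir -p /tmp/toy
printf 'wo ai ni\nni hao\n'     > /tmp/toy/chinese.txt
printf 'i love you\nhello\n'    > /tmp/toy/english.txt
printf '0-0 1-1 2-2\n0-0 1-0\n' > /tmp/toy/Alignment.txt
# All three files must have the same number of lines:
wc -l /tmp/toy/chinese.txt /tmp/toy/english.txt /tmp/toy/Alignment.txt
```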
- In the following, these data files are used to illustrate how to run the NiuTrans system (e.g., how to train MT models, tune feature weights, and decode test sentences).
- Instructions (perl is required; Cygwin is additionally required for Windows users).
string-to-tree
$> cd NiuTrans/sample-data/
$> tar xzf sample.tar.gz
$> cd ../
$> mkdir -p work/model.syntax.s2t/
$> cd scripts/
$> perl NiuTrans-syntax-train-model.pl \
-model s2t \
-src ../sample-data/sample-submission-version/TM-training-set/chinese.txt \
-tgt ../sample-data/sample-submission-version/TM-training-set/english.txt \
-aln ../sample-data/sample-submission-version/TM-training-set/Alignment.txt \
-ttree ../sample-data/sample-submission-version/TM-training-set/english.tree.txt \
-out ../work/model.syntax.s2t/syntax.string2tree.rule
tree-to-string
$> cd NiuTrans/sample-data/
$> tar xzf sample.tar.gz
$> cd ../
$> mkdir -p work/model.syntax.t2s/
$> cd scripts/
$> perl NiuTrans-syntax-train-model.pl \
-model t2s \
-src ../sample-data/sample-submission-version/TM-training-set/chinese.txt \
-stree ../sample-data/sample-submission-version/TM-training-set/chinese.tree.txt \
-tgt ../sample-data/sample-submission-version/TM-training-set/english.txt \
-aln ../sample-data/sample-submission-version/TM-training-set/Alignment.txt \
-out ../work/model.syntax.t2s/syntax.tree2string.rule
tree-to-tree
$> cd NiuTrans/sample-data/
$> tar xzf sample.tar.gz
$> cd ../
$> mkdir -p work/model.syntax.t2t/
$> cd scripts/
$> perl NiuTrans-syntax-train-model.pl \
-model t2t \
-src ../sample-data/sample-submission-version/TM-training-set/chinese.txt \
-stree ../sample-data/sample-submission-version/TM-training-set/chinese.tree.txt \
-tgt ../sample-data/sample-submission-version/TM-training-set/english.txt \
-ttree ../sample-data/sample-submission-version/TM-training-set/english.tree.txt \
-aln ../sample-data/sample-submission-version/TM-training-set/Alignment.txt \
-out ../work/model.syntax.t2t/syntax.tree2tree.rule
"-model" specifies the SMT translation model; it determines what type of rules is extracted, and its value can be "s2t", "t2s" or "t2t".
"-src", "-tgt" and "-aln" specify the source sentences, the target sentences and the alignments between them (one sentence per line).
"-stree" specifies the parse trees of source sentences.
"-ttree" specifies the parse trees of target sentences.
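The three training invocations above differ only in which parse-tree files they pass. The hypothetical wrapper below makes that mapping explicit; note that it only echoes the command rather than running the real script, and the "-out" file name is simplified relative to the manual's examples.

```shell
# Hypothetical helper: map a model type to the tree-file options it needs.
# echo is used instead of invoking NiuTrans-syntax-train-model.pl, so this
# sketch only prints the command line it would run.
train_syntax() {
    model=$1
    tm=../sample-data/sample-submission-version/TM-training-set
    case $model in
        s2t) trees="-ttree $tm/english.tree.txt" ;;                          # string-to-tree: target trees only
        t2s) trees="-stree $tm/chinese.tree.txt" ;;                          # tree-to-string: source trees only
        t2t) trees="-stree $tm/chinese.tree.txt -ttree $tm/english.tree.txt" ;;  # tree-to-tree: both
        *)   echo "unknown model: $model" >&2; return 1 ;;
    esac
    echo perl NiuTrans-syntax-train-model.pl -model "$model" \
         -src "$tm/chinese.txt" -tgt "$tm/english.txt" \
         -aln "$tm/Alignment.txt" $trees \
         -out "../work/model.syntax.$model/syntax.$model.rule"
}
train_syntax s2t | tee /tmp/train_cmd.txt
```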
- Output
string-to-tree
Output: three files are generated and placed in "NiuTrans/work/model.syntax.s2t/":
- syntax.string2tree.rule # syntax rule table
- syntax.string2tree.rule.bina # binarized rule table for the decoder
- syntax.string2tree.rule.unbina # unbinarized rule table for the decoder
tree-to-string
Output: three files are generated and placed in "NiuTrans/work/model.syntax.t2s/":
- syntax.tree2string.rule # syntax rule table
- syntax.tree2string.rule.bina # binarized rule table for the decoder
- syntax.tree2string.rule.unbina # unbinarized rule table for the decoder
tree-to-tree
Output: three files are generated and placed in "NiuTrans/work/model.syntax.t2t/":
- syntax.tree2tree.rule # syntax rule table
- syntax.tree2tree.rule.bina # binarized rule table for the decoder
- syntax.tree2tree.rule.unbina # unbinarized rule table for the decoder
- Note: Please enter the "NiuTrans/scripts/" directory before running the script "NiuTrans-syntax-train-model.pl".
- Instructions
$> cd ../
$> mkdir work/lm/
$> cd scripts/
$> perl NiuTrans-training-ngram-LM.pl \
-corpus ../sample-data/sample-submission-version/LM-training-set/e.lm.txt \
-ngram 3 \
-vocab ../work/lm/lm.vocab \
-lmbin ../work/lm/lm.trie.data
"-ngram" specifies the order of n-gram LM. E.g. "-ngram 3" indicates a 3-gram language model.
"-vocab" specifies where the target-side vocabulary is generated.
"-lmbin" specifies where the language model file is generated.
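To make "-ngram 3" concrete, the toy snippet below enumerates the trigrams of a two-sentence corpus, which is the raw material a 3-gram model is estimated from. This is an illustration only: the real tool builds a smoothed model and a binary trie, not plain counts.

```shell
# Illustration of what a 3-gram LM is trained on: slide a 3-word window
# over each sentence and count the resulting trigrams.
printf 'the cat sat on the mat\nthe cat ran\n' > /tmp/e.lm.toy.txt
awk '{ for (i = 1; i <= NF - 2; i++) print $i, $(i+1), $(i+2) }' \
    /tmp/e.lm.toy.txt | sort | uniq -c | sort -rn
```

The first sentence yields four trigrams and the second one, so five trigrams in total.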
- Output: two files are generated and placed in "NiuTrans/work/lm/":
- lm.vocab # target-side vocabulary
- lm.trie.data # binary-encoded language model
- Instructions
string-to-tree
$> cd NiuTrans/scripts/
$> mkdir -p ../work/config/
$> perl NiuTrans-syntax-generate-mert-config.pl \
-model s2t \
-syntaxrule ../work/model.syntax.s2t/syntax.string2tree.rule.bina \
-lmdir ../work/lm/ \
-nref 1 \
-ngram 3 \
-out ../work/config/NiuTrans.syntax.s2t.user.config
tree-to-string
$> cd NiuTrans/scripts/
$> mkdir -p ../work/config/
$> perl NiuTrans-syntax-generate-mert-config.pl \
-model t2s \
-syntaxrule ../work/model.syntax.t2s/syntax.tree2string.rule.bina \
-lmdir ../work/lm/ \
-nref 1 \
-ngram 3 \
-out ../work/config/NiuTrans.syntax.t2s.user.config
tree-to-tree
$> cd NiuTrans/scripts/
$> mkdir -p ../work/config/
$> perl NiuTrans-syntax-generate-mert-config.pl \
-model t2t \
-syntaxrule ../work/model.syntax.t2t/syntax.tree2tree.rule.bina \
-lmdir ../work/lm/ \
-nref 1 \
-ngram 3 \
-out ../work/config/NiuTrans.syntax.t2t.user.config
"-model" specifies which model's rules are used for MERT; its value can be "s2t", "t2s" or "t2t".
"-syntaxrule" specifies the path to the syntactic rule table.
"-lmdir" specifies the directory that holds the n-gram language model and the target-side vocabulary.
"-nref" specifies how many reference translations per source-sentence are provided.
"-ngram" specifies the order of n-gram language model.
"-out" specifies the output (i.e. a config file).
- Output
string-to-tree
Output: a configuration file is generated and placed in "NiuTrans/work/config". Users can modify this generated config file as needed.
- NiuTrans.syntax.s2t.user.config # configuration file for MERT and decoding
tree-to-string
Output: a configuration file is generated and placed in "NiuTrans/work/config".
- NiuTrans.syntax.t2s.user.config # configuration file for MERT and decoding
tree-to-tree
Output: a configuration file is generated and placed in "NiuTrans/work/config".
- NiuTrans.syntax.t2t.user.config # configuration file for MERT and decoding
- Instructions (perl is required).
string-to-tree
$> cd NiuTrans/scripts/
$> perl NiuTrans-syntax-mert-model.pl \
-model s2t \
-config ../work/config/NiuTrans.syntax.s2t.user.config \
-dev ../sample-data/sample-submission-version/Dev-set/Niu.dev.tree.txt \
-nref 1 \
-round 2 \
-log ../work/syntax-s2t-mert-model.log
tree-to-string
$> cd NiuTrans/scripts/
$> perl NiuTrans-syntax-mert-model.pl \
-model t2s \
-config ../work/config/NiuTrans.syntax.t2s.user.config \
-dev ../sample-data/sample-submission-version/Dev-set/Niu.dev.tree.txt \
-nref 1 \
-round 2 \
-log ../work/syntax-t2s-mert-model.log
tree-to-tree
$> cd NiuTrans/scripts/
$> perl NiuTrans-syntax-mert-model.pl \
-model t2t \
-config ../work/config/NiuTrans.syntax.t2t.user.config \
-dev ../sample-data/sample-submission-version/Dev-set/Niu.dev.tree.txt \
-nref 1 \
-round 2 \
-log ../work/syntax-t2t-mert-model.log
"-model" specifies which model's rules are used for MERT; its value can be "s2t", "t2s" or "t2t".
"-config" specifies the configuration file generated in the previous steps.
"-dev" specifies the development dataset (or tuning set) for weight tuning.
"-nref" specifies how many reference translations per source-sentence are provided.
"-round" specifies how many rounds the MERT performs (by default, 1 round = 10 MERT iterations).
"-log" specifies the log file generated by MERT.
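Since MERT writes the tuned weights to the last line of the "-config" file (see Output below), they can be inspected with tail. The toy config below is a stand-in; its key names and weight values are made up for illustration.

```shell
# Sketch: pull the last line (where MERT records the optimized weights)
# out of a config file.  The file content here is invented.
printf 'param="ngram" value="3"\nweights 0.12 -0.05 1.00\n' \
    > /tmp/NiuTrans.toy.config
tail -n 1 /tmp/NiuTrans.toy.config
```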
- Output: after MERT, the optimized feature weights are automatically recorded in the last line of the "-config" file. The updated config can then be used to decode new sentences.
- Instructions (perl is required). Commands for all three models are given below.
string-to-tree
$> cd NiuTrans/scripts/
$> mkdir -p ../work/syntax.trans.result/
$> perl NiuTrans-syntax-decoder-model.pl \
-model s2t \
-config ../work/config/NiuTrans.syntax.s2t.user.config \
-test ../sample-data/sample-submission-version/Test-set/Niu.test.tree.txt \
-output ../work/syntax.trans.result/Niu.test.syntax.s2t.translated.en.txt
tree-to-string
$> cd NiuTrans/scripts/
$> mkdir -p ../work/syntax.trans.result/
$> perl NiuTrans-syntax-decoder-model.pl \
-model t2s \
-config ../work/config/NiuTrans.syntax.t2s.user.config \
-test ../sample-data/sample-submission-version/Test-set/Niu.test.tree.txt \
-output ../work/syntax.trans.result/Niu.test.syntax.t2s.translated.en.txt
tree-to-tree
$> cd NiuTrans/scripts/
$> mkdir -p ../work/syntax.trans.result/
$> perl NiuTrans-syntax-decoder-model.pl \
-model t2t \
-config ../work/config/NiuTrans.syntax.t2t.user.config \
-test ../sample-data/sample-submission-version/Test-set/Niu.test.tree.txt \
-output ../work/syntax.trans.result/Niu.test.syntax.t2t.translated.en.txt
"-model" specifies which model's rules are used by the decoder; its value can be "s2t", "t2s" or "t2t".
"-config" specifies the configuration file.
"-test" specifies the test dataset (one sentence per line).
"-output" specifies the translation result file (the result is dumped to "stdout" if this option is not specified).
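A quick sanity check after decoding: the decoder should emit exactly one translation per input sentence, so the "-output" file should be line-parallel with the test set. The toy files below stand in for the real test and translation files.

```shell
# Sketch: confirm the translation output has one line per test sentence.
# Toy stand-ins are used instead of the real Niu.test files.
printf 's1\ns2\ns3\n' > /tmp/Niu.test.toy.txt
printf 't1\nt2\nt3\n' > /tmp/Niu.trans.toy.txt
if [ "$(wc -l < /tmp/Niu.test.toy.txt)" -eq "$(wc -l < /tmp/Niu.trans.toy.txt)" ]; then
    echo "line-parallel"
else
    echo "line count mismatch" >&2
fi
```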
- Output
string-to-tree
Output: a new file is generated in "NiuTrans/work/syntax.trans.result":
- Niu.test.syntax.s2t.translated.en.txt # 1-best translation of the test sentences
tree-to-string
Output: a new file is generated in "NiuTrans/work/syntax.trans.result":
- Niu.test.syntax.t2s.translated.en.txt # 1-best translation of the test sentences
tree-to-tree
Output: a new file is generated in "NiuTrans/work/syntax.trans.result":
- Niu.test.syntax.t2t.translated.en.txt # 1-best translation of the test sentences
- Instructions (perl is required)
string-to-tree
$> perl NiuTrans-generate-xml-for-mteval.pl \
-1f ../work/syntax.trans.result/Niu.test.syntax.s2t.translated.en.txt \
-tf ../sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
-rnum 1
$> perl mteval-v13a.pl \
-r ref.xml \
-s src.xml \
-t tst.xml
tree-to-string
$> perl NiuTrans-generate-xml-for-mteval.pl \
-1f ../work/syntax.trans.result/Niu.test.syntax.t2s.translated.en.txt \
-tf ../sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
-rnum 1
$> perl mteval-v13a.pl \
-r ref.xml \
-s src.xml \
-t tst.xml
tree-to-tree
$> perl NiuTrans-generate-xml-for-mteval.pl \
-1f ../work/syntax.trans.result/Niu.test.syntax.t2t.translated.en.txt \
-tf ../sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
-rnum 1
$> perl mteval-v13a.pl \
-r ref.xml \
-s src.xml \
-t tst.xml
"-1f" specifies the file of the 1-best translations of the test dataset.
"-tf" specifies the file containing the source sentences and their reference translations for the test dataset.
"-rnum" specifies how many reference translations per test sentence are provided.
"-r" specifies the file of the reference translations.
"-s" specifies the file of source sentences.
"-t" specifies the file of (1-best) translations generated by the MT system.
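The evaluation script prints several metrics; the BLEU line can be pulled out with grep for logging or comparison across runs. The saved toy output below stands in for a real mteval run, and its exact wording is an assumption (the real report's phrasing may differ).

```shell
# Sketch: extract the BLEU line from a (fabricated) mteval-style report.
cat > /tmp/mteval.toy.out <<'EOF'
Evaluation of any-to-en translation using:
NIST score = 6.1001  BLEU score = 0.2277 for system "NiuTrans"
EOF
grep 'BLEU score' /tmp/mteval.toy.out
```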
- Output: the IBM-version BLEU score is displayed. If everything goes well, you should obtain scores of about 0.2277 (s2t), 0.2205 (t2s) and 0.1939 (t2t) on the sample data set.
- Note: the script "mteval-v13a.pl" relies on the package XML::Parser. If XML::Parser is not installed on your system, run the following commands to install it.
$> su root
$> tar xzf XML-Parser-2.41.tar.gz
$> cd XML-Parser-2.41/
$> perl Makefile.PL
$> make
$> make install