A Step-by-Step Manual of NiuTrans.Syntax

liyinqiao edited this page May 14, 2018 · 1 revision

1. Data Preparation

  • NiuTrans is a data-driven MT system: it needs data for training and tuning. Users must prepare the following data files before running the system.

    a). Training data: bilingual sentence-pairs and word alignments.

    b). Tuning data: source sentences with one or more reference translations.

    c). Test data: some new sentences.

    d). Evaluation data: reference translations of test sentences.

In the NiuTrans package, some sample files are offered for experimenting with the system and studying the format requirements. After unpacking "NiuTrans/sample-data/sample.tar.gz", they are located in "NiuTrans/sample-data/sample-submission-version".

sample-submission-version/
  -- TM-training-set/                   # word-aligned bilingual corpus (100,000 sentence-pairs)
       -- chinese.txt                   # source sentences
       -- english.txt                   # target sentences (case-removed)
       -- Alignment.txt                 # word alignments of the sentence-pairs
       -- chinese.tree.txt              # parse trees of source sentences
       -- english.tree.txt              # parse trees of target sentences
  -- LM-training-set/
       -- e.lm.txt                      # monolingual corpus for training language model (100K target sentences)
  -- Dev-set/
       -- Niu.dev.txt                   # development dataset for weight tuning (400 sentences)
       -- Niu.dev.tree.txt              # development dataset with tree annotation (on source sentences)
  -- Test-set/
       -- Niu.test.txt                  # test dataset (1K sentences)
       -- Niu.test.tree.txt             # test dataset with tree annotation
  -- Reference-for-evaluation/
       -- Niu.test.reference            # references of the test sentences (1K sentences)
  -- description-of-the-sample-data     # a description of the sample data
  • Format: please unpack "NiuTrans/sample-data/sample.tar.gz", and refer to "description-of-the-sample-data" to find more information about data format.

  • In the following, the above data files are used to illustrate how to run the NiuTrans system (e.g. how to train MT models, tune feature weights, and decode test sentences).
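
Before training, it can help to confirm that the parallel files are really line-aligned. The sketch below assumes one sentence, alignment, or parse tree per line in each file. The name check_parallel is a hypothetical helper and is not part of the NiuTrans package.

```shell
# check_parallel FILE...
# Hypothetical helper (not shipped with NiuTrans).
# Prints the shared line count if every file has the same number of lines;
# otherwise reports the first mismatch on stderr and returns non-zero.
# Assumption: one sentence, alignment, or parse tree per line.
# Note: "wc -l" counts newline characters, so a final line without a
# trailing newline is not counted.
check_parallel() {
    expected=""
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "cannot read: $f" >&2
            return 2
        fi
        n=$(wc -l < "$f" | tr -d '[:space:]')
        if [ -z "$expected" ]; then
            expected=$n
        elif [ "$n" -ne "$expected" ]; then
            echo "line count mismatch: $f has $n lines, expected $expected" >&2
            return 1
        fi
    done
    echo "$expected"
}

# Example, run from sample-submission-version/TM-training-set/:
#   check_parallel chinese.txt english.txt Alignment.txt \
#                  chinese.tree.txt english.tree.txt
```

A mismatch usually means that a sentence was dropped or split during preprocessing, which would shift every later alignment.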

2. Obtaining Syntactic Transfer Rules

  • Instructions (perl is required; Windows users also need Cygwin).

string-to-tree

$> cd NiuTrans/sample-data/
$> tar xzf sample.tar.gz
$> cd ../
$> mkdir work/model.syntax.s2t/ -p
$> cd scripts/
$> perl NiuTrans-syntax-train-model.pl \
        -model s2t \
        -src   ../sample-data/sample-submission-version/TM-training-set/chinese.txt \
        -tgt   ../sample-data/sample-submission-version/TM-training-set/english.txt \
        -aln   ../sample-data/sample-submission-version/TM-training-set/Alignment.txt \
        -ttree ../sample-data/sample-submission-version/TM-training-set/english.tree.txt \
        -out   ../work/model.syntax.s2t/syntax.string2tree.rule

tree-to-string

$> cd NiuTrans/sample-data/
$> tar xzf sample.tar.gz
$> cd ../
$> mkdir work/model.syntax.t2s/ -p
$> cd scripts/
$> perl NiuTrans-syntax-train-model.pl \
        -model t2s \
        -src   ../sample-data/sample-submission-version/TM-training-set/chinese.txt \
        -stree ../sample-data/sample-submission-version/TM-training-set/chinese.tree.txt \
        -tgt   ../sample-data/sample-submission-version/TM-training-set/english.txt \
        -aln   ../sample-data/sample-submission-version/TM-training-set/Alignment.txt \
        -out   ../work/model.syntax.t2s/syntax.tree2string.rule

tree-to-tree

$> cd NiuTrans/sample-data/
$> tar xzf sample.tar.gz
$> cd ../
$> mkdir work/model.syntax.t2t/ -p
$> cd scripts/
$> perl NiuTrans-syntax-train-model.pl \
        -model t2t \
        -src   ../sample-data/sample-submission-version/TM-training-set/chinese.txt \
        -stree ../sample-data/sample-submission-version/TM-training-set/chinese.tree.txt \
        -tgt   ../sample-data/sample-submission-version/TM-training-set/english.txt \
        -ttree ../sample-data/sample-submission-version/TM-training-set/english.tree.txt \
        -aln   ../sample-data/sample-submission-version/TM-training-set/Alignment.txt \
        -out   ../work/model.syntax.t2t/syntax.tree2tree.rule

"-model" specifies the SMT translation model, which determines the type of rules that are generated. Its value can be "s2t", "t2s" or "t2t".

"-src", "-tgt" and "-aln" specify the source sentences, the target sentences and the alignments between them (one sentence per line).

"-stree" specifies the parse trees of source sentences.

"-ttree" specifies the parse trees of target sentences.
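
Judging from the three sample commands, the tree options depend on the model: s2t takes only "-ttree", t2s takes only "-stree", and t2t takes both. The sketch below writes that mapping down. The name tree_options is a hypothetical helper, and the mapping is inferred from the sample commands rather than from the script's own documentation.

```shell
# tree_options MODEL STREE TTREE
# Hypothetical helper (not shipped with NiuTrans).
# Prints the tree-related options that the sample commands use for MODEL.
# Assumption: paths contain no whitespace, because the result is meant to
# be expanded unquoted on a command line.
tree_options() {
    model=$1
    stree=$2
    ttree=$3
    case "$model" in
        s2t) echo "-ttree $ttree" ;;
        t2s) echo "-stree $stree" ;;
        t2t) echo "-stree $stree -ttree $ttree" ;;
        *)
            echo "unknown model: $model (expected s2t, t2s or t2t)" >&2
            return 1
            ;;
    esac
}

# Example:
#   tree_options t2s chinese.tree.txt english.tree.txt
```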

  • Output

string-to-tree

Output: three files are generated and placed in "NiuTrans/work/model.syntax.s2t/":

- syntax.string2tree.rule                    # syntax rule table
- syntax.string2tree.rule.bina               # binarized rule table for decoder
- syntax.string2tree.rule.unbina             # unbinarized rule table for decoder

tree-to-string

Output: three files are generated and placed in "NiuTrans/work/model.syntax.t2s/":

- syntax.tree2string.rule                    # syntax rule table
- syntax.tree2string.rule.bina               # binarized rule table for decoder
- syntax.tree2string.rule.unbina             # unbinarized rule table for decoder

tree-to-tree

Output: three files are generated and placed in "NiuTrans/work/model.syntax.t2t/":

- syntax.tree2tree.rule                      # syntax rule table
- syntax.tree2tree.rule.bina                 # binarized rule table for decoder
- syntax.tree2tree.rule.unbina               # unbinarized rule table for decoder
  • Note: Please enter the "NiuTrans/scripts/" directory before running the script "NiuTrans-syntax-train-model.pl".
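
A quick way to confirm that training produced all three files is to test each one for existence and non-zero size. This is only a sketch: check_rule_tables is a hypothetical helper, and a non-empty file does not guarantee that its contents are valid.

```shell
# check_rule_tables RULE_FILE
# Hypothetical helper (not shipped with NiuTrans).
# RULE_FILE is the path that was passed to "-out", for example
#   ../work/model.syntax.s2t/syntax.string2tree.rule
# Reports each of RULE_FILE, RULE_FILE.bina and RULE_FILE.unbina, and
# returns non-zero if any of them is missing or empty.
check_rule_tables() {
    base=$1
    rc=0
    for f in "$base" "$base.bina" "$base.unbina"; do
        if [ -s "$f" ]; then
            echo "ok       $f"
        else
            echo "missing  $f"
            rc=1
        fi
    done
    return $rc
}
```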

3. Training N-gram Language Model

  • Instructions
$> cd ../
$> mkdir work/lm/
$> cd scripts/
$> perl NiuTrans-training-ngram-LM.pl \
        -corpus ../sample-data/sample-submission-version/LM-training-set/e.lm.txt \
        -ngram  3 \
        -vocab  ../work/lm/lm.vocab \
        -lmbin  ../work/lm/lm.trie.data

"-ngram" specifies the order of n-gram LM. E.g. "-ngram 3" indicates a 3-gram language model.

"-vocab" specifies where the target-side vocabulary is generated.

"-lmbin" specifies where the language model file is generated.

  • Output: two files are generated and placed in "NiuTrans/work/lm/":
- lm.vocab                            # target-side vocabulary
- lm.trie.data                        # binary-encoded language model

4. Generating Configuration File

  • Instructions

string-to-tree

$> cd NiuTrans/scripts/
$> mkdir ../work/config/ -p
$> perl NiuTrans-syntax-generate-mert-config.pl \
        -model      s2t \
        -syntaxrule ../work/model.syntax.s2t/syntax.string2tree.rule.bina \
        -lmdir      ../work/lm/ \
        -nref       1 \
        -ngram      3 \
        -out        ../work/config/NiuTrans.syntax.s2t.user.config

tree-to-string

$> cd NiuTrans/scripts/
$> mkdir ../work/config/ -p
$> perl NiuTrans-syntax-generate-mert-config.pl \
        -model      t2s \
        -syntaxrule ../work/model.syntax.t2s/syntax.tree2string.rule.bina \
        -lmdir      ../work/lm/ \
        -nref       1 \
        -ngram      3 \
        -out        ../work/config/NiuTrans.syntax.t2s.user.config

tree-to-tree

$> cd NiuTrans/scripts/
$> mkdir ../work/config/ -p
$> perl NiuTrans-syntax-generate-mert-config.pl \
        -model      t2t \
        -syntaxrule ../work/model.syntax.t2t/syntax.tree2tree.rule.bina \
        -lmdir      ../work/lm/ \
        -nref       1 \
        -ngram      3 \
        -out        ../work/config/NiuTrans.syntax.t2t.user.config

"-model" specifies the type of rules used for MERT. Its value can be "s2t", "t2s" or "t2t".

"-syntaxrule" specifies the path to the syntactic rule table.

"-lmdir" specifies the directory that holds the n-gram language model and the target-side vocabulary.

"-nref" specifies how many reference translations per source-sentence are provided.

"-ngram" specifies the order of n-gram language model.

"-out" specifies the output (i.e. a config file).

  • Output

string-to-tree

Output: a configuration file is generated and placed in "NiuTrans/work/config". Users can modify this generated config file as needed.

- NiuTrans.syntax.s2t.user.config           # configuration file for MERT and decoding

tree-to-string

Output: a configuration file is generated and placed in "NiuTrans/work/config".

- NiuTrans.syntax.t2s.user.config            # configuration file for MERT and decoding

tree-to-tree

Output: a configuration file is generated and placed in "NiuTrans/work/config".

- NiuTrans.syntax.t2t.user.config           # configuration file for MERT and decoding
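
The three config-generation commands differ only in the model name and in the rule-table path, so the path can be derived from the model. The sketch below follows the file names used in this manual. The name bina_rule_table is a hypothetical helper, and the relative paths assume the current directory is "NiuTrans/scripts/".

```shell
# bina_rule_table MODEL
# Hypothetical helper (not shipped with NiuTrans).
# Prints the path of the binarized rule table for MODEL, following the
# naming used in this manual and assuming the current directory is
# NiuTrans/scripts/.
bina_rule_table() {
    case "$1" in
        s2t) name=string2tree ;;
        t2s) name=tree2string ;;
        t2t) name=tree2tree ;;
        *)
            echo "unknown model: $1 (expected s2t, t2s or t2t)" >&2
            return 1
            ;;
    esac
    echo "../work/model.syntax.$1/syntax.$name.rule.bina"
}

# Example: generate one config file per model.
#   for m in s2t t2s t2t; do
#       perl NiuTrans-syntax-generate-mert-config.pl \
#           -model $m -syntaxrule "$(bina_rule_table $m)" \
#           -lmdir ../work/lm/ -nref 1 -ngram 3 \
#           -out ../work/config/NiuTrans.syntax.$m.user.config
#   done
```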

5. Weight Tuning

  • Instructions (perl is required).

string-to-tree

$> cd NiuTrans/scripts/
$> perl NiuTrans-syntax-mert-model.pl \
        -model  s2t \
        -config ../work/config/NiuTrans.syntax.s2t.user.config \
        -dev    ../sample-data/sample-submission-version/Dev-set/Niu.dev.tree.txt \
        -nref   1 \
        -round  2 \
        -log    ../work/syntax-s2t-mert-model.log

tree-to-string

$> cd NiuTrans/scripts/
$> perl NiuTrans-syntax-mert-model.pl \
        -model  t2s \
        -config ../work/config/NiuTrans.syntax.t2s.user.config \
        -dev    ../sample-data/sample-submission-version/Dev-set/Niu.dev.tree.txt \
        -nref   1 \
        -round  2 \
        -log    ../work/syntax-t2s-mert-model.log

tree-to-tree

$> cd NiuTrans/scripts/
$> perl NiuTrans-syntax-mert-model.pl \
        -model  t2t \
        -config ../work/config/NiuTrans.syntax.t2t.user.config \
        -dev    ../sample-data/sample-submission-version/Dev-set/Niu.dev.tree.txt \
        -nref   1 \
        -round  2 \
        -log    ../work/syntax-t2t-mert-model.log

"-model" specifies the type of rules used for MERT. Its value can be "s2t", "t2s" or "t2t".

"-config" specifies the configuration file generated in the previous steps.

"-dev" specifies the development dataset (or tuning set) for weight tuning.

"-nref" specifies how many reference translations per source-sentence are provided.

"-round" specifies how many rounds of MERT are performed (by default, 1 round = 10 MERT iterations).

"-log" specifies the log file generated by MERT.

  • Output: After MER training, the optimized feature weights are automatically recorded in the "-config" file (last line). Then, the config can be used to decode new sentences.
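
Because MERT writes the tuned weights back into the "-config" file, it may be worth keeping a copy of the untuned config and then inspecting the last line after tuning. The sketch below does both. The helper names and the ".before-mert" suffix are hypothetical and are not part of NiuTrans.

```shell
# backup_config CONFIG
# Hypothetical helper: copy CONFIG to CONFIG.before-mert, unless that copy
# already exists, so the untuned settings can be compared later.
backup_config() {
    [ -e "$1.before-mert" ] || cp -p "$1" "$1.before-mert"
}

# tuned_weights CONFIG
# Hypothetical helper: print the last line of CONFIG, which is where this
# manual says the optimized feature weights are recorded.
tuned_weights() {
    tail -n 1 "$1"
}

# Example, run from NiuTrans/scripts/:
#   backup_config ../work/config/NiuTrans.syntax.t2s.user.config   # before MERT
#   tuned_weights ../work/config/NiuTrans.syntax.t2s.user.config   # after MERT
```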

6. Decoding Test Sentences

  • Instructions (perl is required).

string-to-tree

$> cd NiuTrans/scripts/
$> mkdir ../work/syntax.trans.result/ -p
$> perl NiuTrans-syntax-decoder-model.pl \
        -model  s2t \
        -config ../work/config/NiuTrans.syntax.s2t.user.config \
        -test   ../sample-data/sample-submission-version/Test-set/Niu.test.tree.txt \
        -output ../work/syntax.trans.result/Niu.test.syntax.s2t.translated.en.txt

tree-to-string

$> cd NiuTrans/scripts/
$> mkdir ../work/syntax.trans.result/ -p
$> perl NiuTrans-syntax-decoder-model.pl \
        -model  t2s \
        -config ../work/config/NiuTrans.syntax.t2s.user.config \
        -test   ../sample-data/sample-submission-version/Test-set/Niu.test.tree.txt \
        -output ../work/syntax.trans.result/Niu.test.syntax.t2s.translated.en.txt

tree-to-tree

$> cd NiuTrans/scripts/
$> mkdir ../work/syntax.trans.result/ -p
$> perl NiuTrans-syntax-decoder-model.pl \
        -model  t2t \
        -config ../work/config/NiuTrans.syntax.t2t.user.config \
        -test   ../sample-data/sample-submission-version/Test-set/Niu.test.tree.txt \
        -output ../work/syntax.trans.result/Niu.test.syntax.t2t.translated.en.txt

"-model" specifies the type of rules used by the decoder. Its value can be "s2t", "t2s" or "t2t".

"-config" specifies the configuration file.

"-test" specifies the test dataset (one sentence per line).

"-output" specifies the translation result file (the result is dumped to "stdout" if this option is not specified).

  • Output

string-to-tree

Output: a new file is generated in "NiuTrans/work/syntax.trans.result":

- Niu.test.syntax.s2t.translated.en.txt                 # 1-best translation of the test sentences

tree-to-string

Output: a new file is generated in "NiuTrans/work/syntax.trans.result":

- Niu.test.syntax.t2s.translated.en.txt                # 1-best translation of the test sentences

tree-to-tree

Output: a new file is generated in "NiuTrans/work/syntax.trans.result":

- Niu.test.syntax.t2t.translated.en.txt                # 1-best translation of the test sentences

7. Evaluation

  • Instructions (perl is required)

string-to-tree

$> perl NiuTrans-generate-xml-for-mteval.pl \
        -1f   ../work/syntax.trans.result/Niu.test.syntax.s2t.translated.en.txt \
        -tf   ../sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
        -rnum 1
$> perl mteval-v13a.pl \
        -r    ref.xml \
        -s    src.xml \
        -t    tst.xml

tree-to-string

$> perl NiuTrans-generate-xml-for-mteval.pl \
        -1f   ../work/syntax.trans.result/Niu.test.syntax.t2s.translated.en.txt \
        -tf   ../sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
        -rnum 1
$> perl mteval-v13a.pl \
        -r    ref.xml \
        -s    src.xml \
        -t    tst.xml

tree-to-tree

$> perl NiuTrans-generate-xml-for-mteval.pl \
        -1f   ../work/syntax.trans.result/Niu.test.syntax.t2t.translated.en.txt \
        -tf   ../sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
        -rnum 1
$> perl mteval-v13a.pl \
        -r    ref.xml \
        -s    src.xml \
        -t    tst.xml

"-1f" specifies the file of the 1-best translations of the test dataset.

"-tf" specifies the file of the source sentences and their reference translations of the test dataset.

"-rnum" specifies how many reference translations per test sentence are provided.

"-r" specifies the file of the reference translations.

"-s" specifies the file of source sentences.

"-t" specifies the file of (1-best) translations generated by the MT system.

  • Output: The IBM-version BLEU score is displayed. If everything goes well, you will obtain scores of about 0.2277, 0.2205, and 0.1939 for the sample data set.

  • Note: The script mteval-v13a.pl relies on the Perl package XML::Parser. If XML::Parser is not installed on your system, run the following commands to install it.

$> su root
$> tar xzf XML-Parser-2.41.tar.gz
$> cd XML-Parser-2.41/
$> perl Makefile.PL
$> make install
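
To find out whether the installation is needed at all, one option is to ask perl to load the module and check the exit status. This is a minimal sketch: have_perl_module is a hypothetical helper name, while the "-M" switch is standard perl.

```shell
# have_perl_module MODULE
# Hypothetical helper: succeed if perl can load MODULE, fail otherwise
# (including when perl itself is not on the PATH).
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Example:
#   if have_perl_module XML::Parser; then
#       echo "XML::Parser is installed"
#   else
#       echo "XML::Parser is missing; install it before running mteval-v13a.pl"
#   fi
```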