kabychow/go-mosestokenizer

Moses Tokenizer for GoLang

A Go implementation of the tokenizer and normalizer from the Moses decoder.

Installation

go get github.com/khaibin/go-mosestokenizer

Usage

package main

import (
    "fmt"

    "github.com/khaibin/go-mosestokenizer"
    "github.com/khaibin/go-mosestokenizer/nonbreaking_prefix"
)

func main() {
    text := "This is a string"
    lang := "en"

    // Tokenize and get the result as []string
    fmt.Println(mosestokenizer.Tokenize(text, lang))

    // Tokenize and get the result as string
    fmt.Println(mosestokenizer.TokenizeAsString(text, lang))

    // Normalization
    fmt.Println(mosestokenizer.Normalize(text, lang))

    prefix := "mr"
    prefixLang := "en"

    // Returns true if string is non-breaking prefix
    fmt.Println(nonbreaking_prefix.Find(prefix, prefixLang))

    // Returns true if string is non-breaking numeric only prefix
    fmt.Println(nonbreaking_prefix.FindNumeric(prefix, prefixLang))

    // Constants
    //   perluniprops.ALPHA
    //   perluniprops.NUM
    //   perluniprops.ALNUM
}
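
To illustrate why a tokenizer consults non-breaking prefixes at all, here is a minimal standalone sketch. It is not this library's implementation: `examplePrefixes`, `splitTrailingPeriod`, and `tokenizeSketch` are hypothetical names, the prefix set is a tiny hand-written sample, and the only rule modeled is "split a trailing period from a token unless the token is a known prefix".

```go
package main

import (
	"fmt"
	"strings"
)

// examplePrefixes is a tiny hypothetical prefix set for illustration only;
// it does not reproduce the library's per-language lists.
var examplePrefixes = map[string]bool{"Mr": true, "Mrs": true, "Dr": true}

// splitTrailingPeriod separates a final "." from a token unless the part
// before the period is a known non-breaking prefix.
func splitTrailingPeriod(token string, prefixes map[string]bool) []string {
	if !strings.HasSuffix(token, ".") || token == "." {
		return []string{token}
	}
	stem := strings.TrimSuffix(token, ".")
	if prefixes[stem] {
		// Keep abbreviations such as "Mr." in one piece.
		return []string{token}
	}
	return []string{stem, "."}
}

// tokenizeSketch splits text on whitespace and applies
// splitTrailingPeriod to every resulting field.
func tokenizeSketch(text string, prefixes map[string]bool) []string {
	var out []string
	for _, field := range strings.Fields(text) {
		out = append(out, splitTrailingPeriod(field, prefixes)...)
	}
	return out
}

func main() {
	tokens := tokenizeSketch("Mr. Smith arrived.", examplePrefixes)
	fmt.Println(strings.Join(tokens, " ")) // prints "Mr. Smith arrived ."
}
```

The sketch keeps "Mr." intact while detaching the sentence-final period from "arrived". A real tokenizer handles many more cases (numeric-only prefixes, ellipses, language-specific rules), which is what the library's functions above are for.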

Publications

The segmentation methods are described in:

Rico Sennrich, Barry Haddow and Alexandra Birch (2016): Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT
