
Learning Cross-Lingual Phonological and Orthographic Adaptations

This repo hosts the code needed to reproduce the results of our paper Neural Machine Translation based Word Transduction Mechanisms for Low-Resource Languages, recently accepted at the Journal of Language Modelling.

[Figure: encoder-decoder (enc-dec) model architecture]


Generating char2vec from pre-trained Hindi fastText embeddings

The pre-trained Hindi fastText word vectors can be downloaded from here. This repo contains two methods for generating character embeddings from them:

  1. Running the file generate_char2vec.py generates the character vectors for 71 Devanagari characters from the pre-trained word vectors. The outputs can be found in char2vec.txt. (See the first sketch below.)
  2. Running the file char_rnn.py trains a language model over the hindi-wikipedia-articles-55000 dataset (i.e., predicting the 30th character given a sequence of 29 consecutive characters). The embedding weights are then retained to extract the character-level embeddings. (See the second sketch below.)
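
For illustration, the snippet below sketches one common way to derive character vectors from word vectors: averaging the fastText vectors of every word that contains a given character. The averaging scheme and the file paths are assumptions for illustration, not necessarily the exact procedure in generate_char2vec.py.

```python
# Sketch: derive character vectors by averaging fastText word vectors.
# The averaging scheme and paths are illustrative assumptions.
import numpy as np

def load_word_vectors(path):
    """Parse a fastText .vec text file into {word: vector}."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "count dim" header line
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def char2vec(word_vectors):
    """Average the vectors of all words containing each character."""
    sums, counts = {}, {}
    for word, vec in word_vectors.items():
        for ch in set(word):
            sums[ch] = sums.get(ch, 0) + vec
            counts[ch] = counts.get(ch, 0) + 1
    return {ch: sums[ch] / counts[ch] for ch in sums}

vectors = load_word_vectors("cc.hi.300.vec")  # path is illustrative
chars = char2vec(vectors)
with open("char2vec.txt", "w", encoding="utf-8") as out:
    for ch, vec in chars.items():
        out.write(ch + " " + " ".join(f"{x:.4f}" for x in vec) + "\n")
```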
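Similarly, here is a minimal sketch of the language-model route taken by char_rnn.py, where the embedding matrix learned while predicting the 30th character is kept as the character embeddings. The network size, optimizer, and embedding dimension are illustrative assumptions.

```python
# Sketch of a character-level language model whose embedding matrix is
# retained as the char embeddings; hyperparameters are assumptions.
from tensorflow import keras

SEQ_LEN, VOCAB_SIZE, EMB_DIM = 29, 71, 300  # 29-char context, 71 Devanagari chars

model = keras.Sequential([
    keras.layers.Input(shape=(SEQ_LEN,)),
    keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, name="char_emb"),
    keras.layers.LSTM(256),
    keras.layers.Dense(VOCAB_SIZE, activation="softmax"),  # scores the 30th character
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# X: (n_windows, 29) integer-encoded contexts; y: (n_windows,) next-character ids
# model.fit(X, y, epochs=10)

# After training, the learned embedding matrix doubles as the char embeddings:
char_embeddings = model.get_layer("char_emb").get_weights()[0]  # shape (71, 300)
```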

Models Used

We experimented with four variants of sequence-to-sequence models for our project:

  • Peeky Seq2seq Model: Run the file peeky_Seq2seq.py. The implementation is based on Sequence to Sequence Learning with Keras. (A sketch of the peeky decoding idea follows this list.)

  • Alignment Model (AM): Run the file attentionDecoder.py. Following the work of Bahdanau et al. [1], the file attention_decoder.py contains the custom Keras layer built on the TensorFlow backend. The original implementation can be found here. A good blog post guiding the use of this implementation can be found here. (A numpy sketch of the alignment score follows this list.)

  • Hierarchical Attention Model (HAM): Run the file attentionEncoder.py. Inspired by the work of Yang et al. [2]. The original implementation can be found here.

  • Transformer Network: generate_data_for_tensor2tensor.py generates the data in the format required by the Transformer network [3]. This data is needed when registering your own dataset as a tensor2tensor problem (see this for further reading; a registration sketch follows this list). For a detailed look at installation and usage, visit their official GitHub page.
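
The following is a minimal sketch of the "peeky" idea, in which the encoder's summary vector is concatenated to the decoder input at every timestep. Layer sizes, sequence lengths, and names are illustrative assumptions rather than the exact model in peeky_Seq2seq.py.

```python
# Sketch of a peeky encoder-decoder: the encoder summary is repeated and
# fed to every decoder step. Sizes are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

VOCAB, EMB, LATENT, SRC_LEN, TGT_LEN = 71, 128, 256, 20, 20  # assumed sizes

# Encoder: summarise the source word into a final LSTM state.
enc_in = keras.Input(shape=(SRC_LEN,))
enc_emb = layers.Embedding(VOCAB, EMB)(enc_in)
_, state_h, state_c = layers.LSTM(LATENT, return_state=True)(enc_emb)

# Decoder: the "peek" repeats the encoder summary at every output step.
dec_in = keras.Input(shape=(TGT_LEN,))
dec_emb = layers.Embedding(VOCAB, EMB)(dec_in)
peek = layers.RepeatVector(TGT_LEN)(state_h)
dec_seq = layers.Concatenate()([dec_emb, peek])
dec_out = layers.LSTM(LATENT, return_sequences=True)(
    dec_seq, initial_state=[state_h, state_c])
probs = layers.TimeDistributed(layers.Dense(VOCAB, activation="softmax"))(dec_out)

model = keras.Model([enc_in, dec_in], probs)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```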
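Next, a small numpy sketch of the additive alignment score of Bahdanau et al. [1], which an attention decoder such as the one in attention_decoder.py computes at each output step; weight shapes and names here are illustrative.

```python
# Additive attention: e_j = v^T tanh(W_a s_prev + U_a h_j), then softmax
# over e and a weighted sum of encoder states. Shapes are illustrative.
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v):
    """s_prev: (d,) previous decoder state; H: (T, d) encoder states."""
    scores = np.tanh(s_prev @ W_a + H @ U_a) @ v   # (T,) alignment scores
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                           # softmax attention weights
    context = alpha @ H                            # (d,) weighted context vector
    return context, alpha

d, T = 4, 6
rng = np.random.default_rng(0)
context, alpha = additive_attention(
    rng.normal(size=d), rng.normal(size=(T, d)),
    rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d))
```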
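Finally, a hedged sketch of how a custom problem is typically registered with tensor2tensor so that the files produced by generate_data_for_tensor2tensor.py can feed the Transformer; the class name, vocabulary size, and input file are assumptions for illustration.

```python
# Sketch of registering a text-to-text problem with tensor2tensor;
# class name, vocab size, and data file are illustrative assumptions.
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry

@registry.register_problem
class TransduceHindiBhojpuri(text_problems.Text2TextProblem):
    """Hindi -> Bhojpuri word transduction as a text-to-text problem."""

    @property
    def approx_vocab_size(self):
        return 2 ** 8  # small, character-level vocabulary

    @property
    def is_generate_per_split(self):
        return False  # let tensor2tensor split train/dev itself

    def generate_samples(self, data_dir, tmp_dir, dataset_split):
        # Yield {"inputs": ..., "targets": ...} pairs, e.g. read from the
        # files produced by generate_data_for_tensor2tensor.py.
        with open("hindi_bhojpuri_pairs.tsv", encoding="utf-8") as f:
            for line in f:
                src, tgt = line.rstrip("\n").split("\t")
                yield {"inputs": src, "targets": tgt}
```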

Evaluation metrics

  • bleu_score.py measures the BLEU score between the transduced and the actual Bhojpuri words, averaged over the entire output file.

  • word_accuracy.py measures the proportion of correctly transduced words in the output file.

  • measure_distance.py measures the Soundex similarity between the actual and transduced Bhojpuri word pairs, averaged over the output file. A good blog post explaining the implementation can be found here. (A combined sketch of all three metrics follows this list.)
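
For concreteness, here is a combined sketch of the three metrics. Character-level BLEU via NLTK and exact Soundex-code matching via the jellyfish library are assumptions for illustration; in particular, plain Soundex expects romanized (Latin-script) input, so measure_distance.py may use a Devanagari-adapted variant.

```python
# Illustrative re-implementations of the three metrics; not the repo's code.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import jellyfish  # assumes word pairs were romanized to Latin script first

def word_accuracy(predicted, actual):
    """Proportion of exactly matching transduced/actual word pairs."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def char_bleu(predicted, actual):
    """Character-level BLEU, averaged over all word pairs."""
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu([list(a)], list(p), smoothing_function=smooth)
              for p, a in zip(predicted, actual)]
    return sum(scores) / len(scores)

def soundex_similarity(predicted, actual):
    """Fraction of pairs whose Soundex codes match (a simple proxy)."""
    same = sum(jellyfish.soundex(p) == jellyfish.soundex(a)
               for p, a in zip(predicted, actual))
    return same / len(actual)
```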

Citation

If our code has been helpful in your research, please consider citing our work:

@article{jha2018neural,
  title={Neural Machine Translation based Word Transduction Mechanisms for Low-Resource Languages},
  author={Jha, Saurav and Sudhakar, Akhilesh and Singh, Anil Kumar},
  journal={arXiv preprint arXiv:1811.08816},
  year={2018}
}

References

[1] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, abs/1409.0473.

[2] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. J., & Hovy, E. H. (2016). Hierarchical Attention Networks for Document Classification. HLT-NAACL.

[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. NIPS.
