This repo hosts the code necessary to reproduce the results of our paper, "Neural Machine Translation based Word Transduction Mechanisms for Low-Resource Languages" (recently accepted at the Journal of Language Modelling).

The pre-trained Hindi character vectors can be downloaded from here. This repo provides two methods for generating these character embeddings:
- Running the file `generate_char2vec.py` generates character vectors for 71 Devanagari characters from the pre-trained word vectors. The output can be found in `char2vec.txt`.
- Running the file `char_rnn.py` trains a language model over the `hindi-wikipedia-articles-55000` corpus (i.e., predicting the 30th character given a sequence of 29 consecutive characters). The embedding weights are then retained to extract the character-level embeddings.
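As a rough illustration of the sliding-window objective described above, training pairs can be sliced out of the corpus as follows (a minimal sketch; the exact preprocessing in `char_rnn.py` may differ):

```python
def make_char_lm_pairs(text, window=29):
    """Slice a corpus into (input, target) pairs: each input is a run of
    `window` consecutive characters; the target is the character that follows."""
    pairs = []
    for i in range(len(text) - window):
        pairs.append((text[i:i + window], text[i + window]))
    return pairs

# Toy example on Latin text; the repo applies the same idea to Devanagari,
# with window=29 so the model predicts the 30th character.
pairs = make_char_lm_pairs("abcdefgh", window=3)
# pairs[0] == ("abc", "d")
```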
We experimented with four variants of sequence-to-sequence models for our project:
- Peeky Seq2seq Model: run the file `peeky_Seq2seq.py`. The implementation is based on Sequence to Sequence Learning with Keras.
- Alignment Model (AM): run the file `attentionDecoder.py`. Following the work of Bahdanau et al. [1], the file `attention_decoder.py` contains a custom Keras layer built on the TensorFlow backend. The original implementation can be found here. A good blog post guiding the use of this implementation can be found here.
- Hierarchical Attention Model (HAM): run the file `attentionEncoder.py`. Inspired by the work of Yang et al. [2]. The original implementation can be found here.
- Transformer Network: `generate_data_for_tensor2tensor.py` generates the data in the format required by the Transformer network [3]. The data is required when registering your own dataset (see this for further reading). For details on installation and usage, visit the official tensor2tensor GitHub page.
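For intuition, the alignment mechanism of Bahdanau et al. [1] used by the AM variant can be sketched in NumPy: an additive score is computed between the decoder state and each encoder state, normalised with a softmax, and the context vector is the resulting weighted average of the encoder states. This is a toy sketch with made-up dimensions, not the repo's Keras layer:

```python
import numpy as np

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, att_dim = 5, 8, 8, 6  # toy sizes, chosen arbitrarily

# Random encoder states (one per source character) and a decoder state.
enc_states = rng.normal(size=(T, enc_dim))
dec_state = rng.normal(size=(dec_dim,))

# Parameters of the additive (Bahdanau) score function; learned in practice.
W_a = rng.normal(size=(att_dim, dec_dim))
U_a = rng.normal(size=(att_dim, enc_dim))
v_a = rng.normal(size=(att_dim,))

# e_t = v_a . tanh(W_a s + U_a h_t): one score per encoder position.
scores = np.tanh(dec_state @ W_a.T + enc_states @ U_a.T) @ v_a

# Softmax over positions yields the alignment weights.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Context vector: attention-weighted average of the encoder states.
context = weights @ enc_states
```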
- `bleu_score.py` measures the BLEU score between the transduced and the actual Bhojpuri words, averaged over the entire output file.
- `word_accuracy.py` measures the proportion of correctly transduced words in the output file.
- `measure_distance.py` measures the Soundex similarity score between the actual and transduced Bhojpuri word pairs, averaged over the output file. A good blog post explaining the implementation can be found here.
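A word-accuracy computation of the kind `word_accuracy.py` performs might look like the following (a minimal sketch assuming one transduced/reference word per line position; the actual script's file format may differ):

```python
def word_accuracy(predicted, reference):
    """Proportion of transduced words that exactly match the reference."""
    if not reference:
        raise ValueError("reference list must be non-empty")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

# Hypothetical Bhojpuri word pairs for illustration only.
acc = word_accuracy(["ghar", "paani", "khet"], ["ghar", "pani", "khet"])
# 2 of the 3 transduced words match the reference exactly
```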
If our code was helpful in your research, consider citing our work:
@article{jha2018neural,
title={Neural Machine Translation based Word Transduction Mechanisms for Low-Resource Languages},
author={Jha, Saurav and Sudhakar, Akhilesh and Singh, Anil Kumar},
journal={arXiv preprint arXiv:1811.08816},
year={2018}
}
[1] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, abs/1409.0473.
[2] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A.J., & Hovy, E.H. (2016). Hierarchical Attention Networks for Document Classification. HLT-NAACL.
[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. NIPS.