This repo hosts the code necessary to reproduce the results of our paper, "Neural Machine Translation based Word Transduction Mechanisms for Low-Resource Languages" (recently accepted at the Journal of Language Modelling).

The pre-trained Hindi character vectors can be downloaded from here. This repo provides two methods for generating these character embeddings:
- Running the file `generate_char2vec.py` generates character vectors for 71 Devanagari characters from the pre-trained word vectors. The output can be found in `char2vec.txt`.
- Running the file `char_rnn.py` trains a language model over the `hindi-wikipedia-articles-55000` corpus (i.e., predicting the 30th character given a sequence of 29 consecutive characters). The embedding weights are then retained to extract the character-level embeddings.
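As a rough illustration of the sliding-window objective described above, training pairs can be sliced out of the corpus as follows (a minimal sketch; the exact preprocessing in `char_rnn.py` may differ):

```python
def make_char_lm_pairs(text, window=29):
    """Slice a corpus into (input, target) pairs: each input is a run of
    `window` consecutive characters; the target is the character that follows."""
    pairs = []
    for i in range(len(text) - window):
        pairs.append((text[i:i + window], text[i + window]))
    return pairs

# Toy example on Latin text; the repo applies the same idea to Devanagari,
# with window=29 so the model predicts the 30th character.
pairs = make_char_lm_pairs("abcdefgh", window=3)
# pairs[0] == ("abc", "d")
```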
We experimented with four variants of sequence-to-sequence models for our project:
- Peeky Seq2seq Model: run the file `peeky_Seq2seq.py`. The implementation is based on Sequence to Sequence Learning with Keras.
- Alignment Model (AM): run the file `attentionDecoder.py`. Following the work of Bahdanau et al. [1], the file `attention_decoder.py` contains a custom Keras layer built on the TensorFlow backend. The original implementation can be found here. A good blog post guiding the use of this implementation can be found here.
- Hierarchical Attention Model (HAM): run the file `attentionEncoder.py`. Inspired by the work of Yang et al. [2]. The original implementation can be found here.
- Transformer Network: `generate_data_for_tensor2tensor.py` generates the data in the format required by the Transformer network [3]. The data is required when registering your own dataset (see this for further reading). For details on installation and usage, visit the official tensor2tensor GitHub page.
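For intuition, the alignment mechanism of Bahdanau et al. [1] used by the AM variant can be sketched in NumPy: an additive score is computed between the decoder state and each encoder state, normalised with a softmax, and the context vector is the resulting weighted average of the encoder states. This is a toy sketch with made-up dimensions, not the repo's Keras layer:

```python
import numpy as np

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, att_dim = 5, 8, 8, 6  # toy sizes, chosen arbitrarily

# Random encoder states (one per source character) and a decoder state.
enc_states = rng.normal(size=(T, enc_dim))
dec_state = rng.normal(size=(dec_dim,))

# Parameters of the additive (Bahdanau) score function; learned in practice.
W_a = rng.normal(size=(att_dim, dec_dim))
U_a = rng.normal(size=(att_dim, enc_dim))
v_a = rng.normal(size=(att_dim,))

# e_t = v_a . tanh(W_a s + U_a h_t): one score per encoder position.
scores = np.tanh(dec_state @ W_a.T + enc_states @ U_a.T) @ v_a

# Softmax over positions yields the alignment weights.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Context vector: attention-weighted average of the encoder states.
context = weights @ enc_states
```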
- `bleu_score.py` measures the BLEU score between the transduced and the actual Bhojpuri words, averaged over the entire output file.
- `word_accuracy.py` measures the proportion of correctly transduced words in the output file.
- `measure_distance.py` measures the Soundex similarity score between the actual and transduced Bhojpuri word pairs, averaged over the output file. A good blog post explaining the implementation can be found here.
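A word-accuracy computation of the kind `word_accuracy.py` performs might look like the following (a minimal sketch assuming one transduced/reference word per line position; the actual script's file format may differ):

```python
def word_accuracy(predicted, reference):
    """Proportion of transduced words that exactly match the reference."""
    if not reference:
        raise ValueError("reference list must be non-empty")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

# Hypothetical Bhojpuri word pairs for illustration only.
acc = word_accuracy(["ghar", "paani", "khet"], ["ghar", "pani", "khet"])
# 2 of the 3 transduced words match the reference exactly
```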
If our code was helpful in your research, consider citing our work:
@article{jha2018neural,
title={Neural Machine Translation based Word Transduction Mechanisms for Low-Resource Languages},
author={Jha, Saurav and Sudhakar, Akhilesh and Singh, Anil Kumar},
journal={arXiv preprint arXiv:1811.08816},
year={2018}
}
[1] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, abs/1409.0473.
[2] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A.J., & Hovy, E.H. (2016). Hierarchical Attention Networks for Document Classification. HLT-NAACL.
[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. NIPS.