Final Report: https://docs.google.com/document/d/1kv8xii1NR8fFiUgijCGUAp0I_tkiU4294Vrt3HfXN68/edit?usp=sharing
Completed Trained Model (323 MB):
- https://drive.google.com/u/1/uc?id=1MkRyroGmtHM75ZdaTgJ4bScemO75qAST&export=download
Datasets used:
- https://www.kaggle.com/team-ai/japaneseenglish-bilingual-corpus
- https://nlp.stanford.edu/projects/jesc/ (Processed data file needed for project.ipynb available here: https://drive.google.com/u/0/uc?id=19_jCHSv3AYqFOiXdzi1MSwaYNfIfQJSh&export=download)
Final submission: project.py - a sequence-to-sequence model with attention, built with Keras
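The attention mechanism that several of the files below use can be sketched in a few lines of NumPy. This is a generic dot-product attention for illustration only, not the project's actual Keras or PyTorch layer; all names here are made up:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dot_product_attention(decoder_state, encoder_states):
    """Return a context vector: a weighted sum of encoder states,
    with weights given by dot-product similarity scores."""
    scores = encoder_states @ decoder_state   # (seq_len,)
    weights = softmax(scores)                 # attention distribution
    context = weights @ encoder_states        # (hidden_dim,)
    return context, weights

# Toy example: 4 encoder timesteps, hidden size 3.
enc = np.random.rand(4, 3)
dec = np.random.rand(3)
context, weights = dot_product_attention(dec, enc)
```

At each decoding step the decoder's hidden state attends over all encoder states, so the model can focus on the relevant Japanese tokens when emitting each English token.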
src folder - file descriptions:
- encDecoderGRU.py - encoder and decoder with pretrained vector embeddings
- encDecoderLSTM.py - basic sequence-to-sequence model using torchtext for the embedding layer
- genFileWikipedia.py - generates wikipedia_raw (Japanese-English sentence pairs)
- torchAttn_v1.py - PyTorch tutorial; spaCy for Japanese tokenization, string-encoded Japanese as model input, attention layer
- torchAttn_v2.py - PyTorch tutorial; Fugashi for Japanese tokenization, string-encoded Japanese as model input, attention layer
- torchAttn_v3.py - PyTorch tutorial; Fugashi for Japanese tokenization, UniDic objects fed into the language dictionary, attention layer
- torchTT.ipynb - basic seq2seq model with pretrained Word2Vec embeddings and Fugashi for tokenization
- torchTT.py - basic seq2seq model using torchtext for the embedding layer and spaCy for tokenization
- util.py - text preprocessing of the corpus files, producing the final clean corpus files (NLTK, text sanitization)
- z_genPair.py - creates the pickle file, with start-of-sequence and end-of-sequence tokens, to be loaded by the model
- z_genVectors.py - trains gensim word-embedding models
- z_seq2seq_translation_tutorial.py - PyTorch tutorial modified to translate Japanese to English using attention
- z_translateWEEB.py - PyTorch tutorial using pretrained word embeddings (gensim) to translate Japanese to English
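A minimal sketch of what a pair-generation script like z_genPair.py might do, assuming pairs are stored as (Japanese, English) tuples wrapped with start/end sentinel tokens. The token strings and file name are illustrative, not taken from the repo:

```python
import pickle

SOS, EOS = "<SOS>", "<EOS>"  # assumed sentinel tokens

def make_pairs(ja_sentences, en_sentences):
    """Wrap each aligned sentence pair with start/end tokens."""
    return [
        (f"{SOS} {ja} {EOS}", f"{SOS} {en} {EOS}")
        for ja, en in zip(ja_sentences, en_sentences)
    ]

pairs = make_pairs(["こんにちは"], ["hello"])

# Serialize so the training script can load pairs without re-tokenizing.
with open("pairs.pkl", "wb") as f:
    pickle.dump(pairs, f)

with open("pairs.pkl", "rb") as f:
    loaded = pickle.load(f)
```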
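Corpus sanitization of the kind util.py performs might look like the stdlib-only sketch below. The real script also uses NLTK; these particular normalization and regex steps are assumptions for illustration:

```python
import re
import unicodedata

def clean_line(text):
    """Normalize Unicode, lowercase, strip stray markup,
    and collapse whitespace - typical corpus-cleaning steps."""
    text = unicodedata.normalize("NFKC", text)  # unify full-/half-width forms
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)        # drop stray HTML tags
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

cleaned = clean_line("  Hello,\u3000<b>World</b>!  ")
```

NFKC normalization matters for Japanese text because it maps full-width characters (e.g. the ideographic space U+3000) to their ASCII equivalents before tokenization.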