Final Report: https://docs.google.com/document/d/1kv8xii1NR8fFiUgijCGUAp0I_tkiU4294Vrt3HfXN68/edit?usp=sharing
Completed Trained Model (323 MB):
- https://drive.google.com/u/1/uc?id=1MkRyroGmtHM75ZdaTgJ4bScemO75qAST&export=download
Datasets used:
- https://www.kaggle.com/team-ai/japaneseenglish-bilingual-corpus
- https://nlp.stanford.edu/projects/jesc/ (Processed data file needed for project.ipynb available here: https://drive.google.com/u/0/uc?id=19_jCHSv3AYqFOiXdzi1MSwaYNfIfQJSh&export=download)
Final submission: project.py - a sequence-to-sequence model with attention, built with Keras
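The attention mechanism that several of the files below use can be sketched in a few lines of NumPy. This is a generic dot-product attention for illustration only, not the project's actual Keras or PyTorch layer; all names here are made up:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dot_product_attention(decoder_state, encoder_states):
    """Return a context vector: a weighted sum of encoder states,
    with weights given by dot-product similarity scores."""
    scores = encoder_states @ decoder_state   # (seq_len,)
    weights = softmax(scores)                 # attention distribution
    context = weights @ encoder_states        # (hidden_dim,)
    return context, weights

# Toy example: 4 encoder timesteps, hidden size 3.
enc = np.random.rand(4, 3)
dec = np.random.rand(3)
context, weights = dot_product_attention(dec, enc)
```

At each decoding step the decoder's hidden state attends over all encoder states, so the model can focus on the relevant Japanese tokens when emitting each English token.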
src folder - file descriptions:
- encDecoderGRU.py - encoder and decoder with pretrained vector embeddings
- encDecoderLSTM.py - basic sequence-to-sequence model using torchtext for the embedding layer
- genFileWikipedia.py - generates wikipedia_raw (Japanese-English sentence pairs)
- torchAttn_v1.py - PyTorch tutorial; spaCy for Japanese tokenization, string-encoded Japanese as model input, attention layer
- torchAttn_v2.py - PyTorch tutorial; Fugashi for Japanese tokenization, string-encoded Japanese as model input, attention layer
- torchAttn_v3.py - PyTorch tutorial; Fugashi for Japanese tokenization, UniDic objects fed into the language dictionary, attention layer
- torchTT.ipynb - basic seq2seq model with pretrained Word2Vec embeddings and Fugashi for tokenization
- torchTT.py - basic seq2seq model using torchtext for the embedding layer and spaCy for tokenization
- util.py - text preprocessing of the corpus files, producing the final clean corpus files (NLTK, text sanitization)
- z_genPair.py - creates the pickle file, with start-of-sequence and end-of-sequence tokens, to be loaded by the model
- z_genVectors.py - trains gensim word-embedding models
- z_seq2seq_translation_tutorial.py - PyTorch tutorial modified to translate Japanese to English using attention
- z_translateWEEB.py - PyTorch tutorial using pretrained word embeddings (gensim) to translate Japanese to English
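A minimal sketch of what a pair-generation script like z_genPair.py might do, assuming pairs are stored as (Japanese, English) tuples wrapped with start/end sentinel tokens. The token strings and file name are illustrative, not taken from the repo:

```python
import pickle

SOS, EOS = "<SOS>", "<EOS>"  # assumed sentinel tokens

def make_pairs(ja_sentences, en_sentences):
    """Wrap each aligned sentence pair with start/end tokens."""
    return [
        (f"{SOS} {ja} {EOS}", f"{SOS} {en} {EOS}")
        for ja, en in zip(ja_sentences, en_sentences)
    ]

pairs = make_pairs(["こんにちは"], ["hello"])

# Serialize so the training script can load pairs without re-tokenizing.
with open("pairs.pkl", "wb") as f:
    pickle.dump(pairs, f)

with open("pairs.pkl", "rb") as f:
    loaded = pickle.load(f)
```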
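Corpus sanitization of the kind util.py performs might look like the stdlib-only sketch below. The real script also uses NLTK; these particular normalization and regex steps are assumptions for illustration:

```python
import re
import unicodedata

def clean_line(text):
    """Normalize Unicode, lowercase, strip stray markup,
    and collapse whitespace - typical corpus-cleaning steps."""
    text = unicodedata.normalize("NFKC", text)  # unify full-/half-width forms
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)        # drop stray HTML tags
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

cleaned = clean_line("  Hello,\u3000<b>World</b>!  ")
```

NFKC normalization matters for Japanese text because it maps full-width characters (e.g. the ideographic space U+3000) to their ASCII equivalents before tokenization.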