Department of Linguistics, UC Davis

Overview

Expanding linguistic resources for speech enabled systems

tokenizer.py

Wikipedia database dump tokenizer

  1. Download the Wikipedia database dump.
  2. Use WikiExtractor to extract and clean text from the Wikipedia database dump:
   $ git clone https://github.com/attardi/wikiextractor.git
  3. Install the WikiExtractor script:
   $ (sudo) python setup.py install
  4. Apply the script to the Wikipedia database dump:
   $ WikiExtractor.py (your-database-dump).xml.bz2
  5. Specify the database path in tokenizer.py, then run it:
   $ python tokenizer.py
  6. After the script runs, tokens are stored in tokenizer_tokens.txt unless otherwise specified. The script also writes tokenizer_tokens2.txt, in which infrequent tokens (those appearing <= 10 times) are replaced with an 'UNK' token.
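The rare-token replacement described in the last step can be sketched as follows. This is a minimal illustration, not the code in tokenizer.py: the function name `replace_rare_tokens` and its parameters are hypothetical, and the only detail taken from this README is the threshold (tokens appearing <= 10 times become 'UNK').

```python
from collections import Counter

UNK = "UNK"  # placeholder token named in this README


def replace_rare_tokens(tokens, max_rare_count=10, unk=UNK):
    """Return a copy of `tokens` with infrequent tokens replaced by `unk`.

    A token is treated as infrequent when it appears `max_rare_count`
    times or fewer (hypothetical helper, not taken from tokenizer.py).
    """
    counts = Counter(tokens)
    return [tok if counts[tok] > max_rare_count else unk for tok in tokens]


# Tiny demonstration corpus: "the" appears 12 times, "cat" 11 times,
# and "zyzzyva" once, so only "zyzzyva" falls at or below the threshold.
tokens = ["the"] * 12 + ["cat"] * 11 + ["zyzzyva"]
replaced = replace_rare_tokens(tokens)
```

Counting once with `Counter` and then rewriting in a second pass keeps the whole operation linear in the number of tokens.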

inspect_words.py

Inspect Wikipedia database dump model with Word2Vec and create vocabulary files

  1. Build fastText using the following commands:
   $ git clone https://github.com/facebookresearch/fastText.git
   $ cd fastText
   $ make
  2. Learn word vectors for the Wikipedia database dump articles:
   $ ./fasttext skipgram -input results.txt -output model
  3. The command above saves two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters.
  4. Download questions-words.txt. This file contains approximately 19,500 analogies, divided into several categories, which are used to perform a sanity check of the Wikipedia database dump model.
  5. The following commands train the Wikipedia database dump model and store its "words" in the vocabulary directory. They also run a sanity check of the model against questions-words.txt and output the accuracy of the model's predictions for each analogy category, as well as the overall accuracy.
   $ mkdir vocabulary
   $ python inspect_words.py
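To make the sanity check concrete, here is a rough, standard-library-only sketch of reading a model.vec file and scoring analogies per category. It is not the code in inspect_words.py. It assumes the usual fastText .vec layout (a header line with vocabulary size and dimension, then one word and its vector per line) and the usual questions-words.txt layout (category lines starting with ":" followed by four-word analogy lines). All function names are hypothetical, matching is case-sensitive, and the vector-offset scoring is a simplified variant that is far slower than a library implementation.

```python
import math
from collections import defaultdict


def load_vec(path):
    """Read a fastText-style .vec text file into {word: [floats]}."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        header = f.readline().split()  # "<vocab_size> <dimension>"
        dim = int(header[1])
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if len(values) != dim:
                continue  # skip malformed lines
            vectors[word] = [float(v) for v in values]
    return vectors


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the word closest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, analogy_lines):
    """Return ({category: accuracy}, overall_accuracy).

    Analogies containing a word missing from `vectors` are skipped.
    """
    correct, total = defaultdict(int), defaultdict(int)
    category = "uncategorized"
    for line in analogy_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):  # e.g. ": capital-common-countries"
            category = line[1:].strip()
            continue
        words = line.split()
        if len(words) != 4 or any(w not in vectors for w in words):
            continue
        a, b, c, expected = words
        total[category] += 1
        correct[category] += predict(vectors, a, b, c) == expected
    per_category = {cat: correct[cat] / total[cat] for cat in total}
    attempted = sum(total.values())
    overall = sum(correct.values()) / attempted if attempted else 0.0
    return per_category, overall
```

A call such as `analogy_accuracy(load_vec("model.vec"), open("questions-words.txt"))` would then yield the per-category and overall figures, assuming both files are in the working directory.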

word2vec.py

Applying Word2Vec to Wikipedia database dump

  1. Uses gensim's Wikipedia parser (WikiCorpus) to extract and tokenize the bz2-compressed Wikipedia database dump.
  2. Trains a model with gensim's Word2Vec (gensim does not have GPU support) and saves it as word2vec.model.
  3. Performs a sanity check with the file specified in analogy_path (questions-words.txt).
   $ python word2vec.py
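The three steps above might look roughly like the sketch below. It is an illustration under assumptions, not the contents of word2vec.py: it assumes gensim 4.x (where the vector dimension parameter is `vector_size`), the hyperparameter values are placeholders, and the names `RestartableTexts` and `train_and_check` are hypothetical. gensim is imported inside the function so the helper class can be used without it installed.

```python
class RestartableTexts:
    """Wrap a corpus so it can be iterated more than once.

    Word2Vec makes several passes over its input, while
    WikiCorpus.get_texts() returns a one-shot generator.
    """

    def __init__(self, corpus):
        self.corpus = corpus

    def __iter__(self):
        return iter(self.corpus.get_texts())


def train_and_check(dump_path, analogy_path, model_path="word2vec.model"):
    """Train Word2Vec on a bz2 Wikipedia dump and run the analogy check."""
    # Imported lazily; assumes gensim 4.x is installed.
    from gensim.corpora import WikiCorpus
    from gensim.models import Word2Vec

    # dictionary={} skips building a gensim Dictionary, which would
    # otherwise require an extra full pass over the dump.
    wiki = WikiCorpus(dump_path, dictionary={})

    # Placeholder hyperparameters, not the values used by word2vec.py.
    model = Word2Vec(
        sentences=RestartableTexts(wiki),
        vector_size=100,
        window=5,
        min_count=5,
        workers=4,
    )
    model.save(model_path)

    # Returns the overall accuracy and a list of per-section results.
    overall, sections = model.wv.evaluate_word_analogies(analogy_path)
    for section in sections:
        attempted = len(section["correct"]) + len(section["incorrect"])
        if attempted:
            accuracy = len(section["correct"]) / attempted
            print(f"{section['section']}: {accuracy:.3f}")
    print(f"overall: {overall:.3f}")
    return overall
```

Training on a full Wikipedia dump is CPU-bound and can take hours, so it may be worth trying the function on a small dump first.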

Built With

Contributors