Department of Linguistics, UC Davis
Expanding linguistic resources for speech-enabled systems
Wikipedia database dump tokenizer
- First, download a Wikipedia database dump (available from https://dumps.wikimedia.org)
- Use WikiExtractor to extract and clean text from the Wikipedia database dump
$ git clone https://github.com/attardi/wikiextractor.git
- Install the script by running
$ (sudo) python setup.py install
- Apply the script to the Wikipedia database dump
$ WikiExtractor.py (your-database-dump).xml.bz2
- Specify the database path in the script and run tokenizer.py
$ python tokenizer.py
- After running the script, tokens are stored in tokenizer_tokens.txt, unless otherwise specified. The script also outputs tokenizer_tokens2.txt, in which infrequent tokens (those appearing <= 10 times) are replaced with an 'UNK' token.
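- The infrequent-token replacement can be sketched as follows. This is a minimal sketch assuming one whitespace-tokenized sentence per line and the default output file names; the exact logic of tokenizer.py may differ.

    from collections import Counter

    # Count how often each token occurs in the tokenizer output.
    counts = Counter()
    with open('tokenizer_tokens.txt', encoding='utf-8') as f:
        for line in f:
            counts.update(line.split())

    # Rewrite the corpus, replacing tokens that occur <= 10 times with 'UNK'.
    with open('tokenizer_tokens.txt', encoding='utf-8') as f, \
         open('tokenizer_tokens2.txt', 'w', encoding='utf-8') as out:
        for line in f:
            tokens = ['UNK' if counts[t] <= 10 else t for t in line.split()]
            out.write(' '.join(tokens) + '\n')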
Inspect Wikipedia database dump model with Word2Vec and create vocabulary files
- Build fastText using the following commands
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make
- Learn word vectors for the Wikipedia database dump articles
$ ./fasttext skipgram -input results.txt -output model
- Running the command above will save two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters.
- Download questions-words.txt. This file contains approximately 19,500 analogies, divided into several categories, that will be used to perform a sanity check of the Wikipedia database dump model.
- Running the following commands will train the Wikipedia database dump model and store the vocabulary words in the vocabulary directory. They will also perform a sanity check of the model using questions-words.txt, printing the model's prediction accuracy for each analogy category as well as its overall accuracy (a minimal sketch of this step follows the commands below).
$ mkdir vocabulary
$ python inspect_words.py
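- A minimal sketch of the inspection step, loading the fastText vectors with gensim and scoring the analogy set; inspect_words.py may differ, and the vocabulary file name words.txt is an assumption.

    from gensim.models import KeyedVectors

    # Load the text-format vectors produced by fastText (model.vec).
    vectors = KeyedVectors.load_word2vec_format('model.vec', binary=False)

    # Write the vocabulary, one word per line, into the vocabulary directory
    # created above (gensim 4.x API; the file name is an assumption).
    with open('vocabulary/words.txt', 'w', encoding='utf-8') as f:
        f.write('\n'.join(vectors.index_to_key))

    # Per-category and overall accuracy on the analogy questions.
    overall, sections = vectors.evaluate_word_analogies('questions-words.txt')
    for section in sections:
        correct, incorrect = len(section['correct']), len(section['incorrect'])
        if correct + incorrect:
            print(f"{section['section']}: {correct / (correct + incorrect):.2%}")
    print(f"Overall accuracy: {overall:.2%}")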
Applying Word2Vec to Wikipedia database dump
- Uses gensim's Wikipedia parsing (WikiCorpus) to extract and tokenize the Wikipedia database dump compressed in bz2.
- Runs gensim's Word2Vec to train the model (gensim does not have GPU support); the trained model is saved as word2vec.model.
- Performs a sanity check with the file specified in analogy_path (questions-words.txt); a minimal sketch of this workflow follows the command below.
$ python word2vec.py
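- A sketch of the word2vec.py workflow under the gensim 4.x API; the hyperparameters shown are assumptions, not the script's actual settings, and the dump path is a placeholder to replace with your own.

    from gensim.corpora import WikiCorpus
    from gensim.models import Word2Vec

    analogy_path = 'questions-words.txt'

    class WikiSentences:
        """Restartable iterator over tokenized Wikipedia articles."""
        def __init__(self, dump_path):
            # Passing dictionary={} skips building a gensim dictionary.
            self.corpus = WikiCorpus(dump_path, dictionary={})
        def __iter__(self):
            return self.corpus.get_texts()

    # Replace with the path to your .xml.bz2 database dump.
    sentences = WikiSentences('(your-database-dump).xml.bz2')

    # Train on CPU (gensim has no GPU support) and save the model.
    model = Word2Vec(sentences=sentences, vector_size=100, window=5,
                     min_count=5, workers=4)
    model.save('word2vec.model')

    # Sanity check against the analogy file.
    score, _ = model.wv.evaluate_word_analogies(analogy_path)
    print(f'Overall analogy accuracy: {score:.2%}')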