Department of Linguistics, UC Davis
Expanding linguistic resources for speech-enabled systems
Wikipedia database dump tokenizer
- First, download a Wikipedia database dump (available from https://dumps.wikimedia.org)
- Use WikiExtractor to extract and clean text from the Wikipedia database dump
$ git clone https://github.com/attardi/wikiextractor.git
- Install the script by running
$ (sudo) python setup.py install
- Apply the script to the Wikipedia database dump
$ WikiExtractor.py (your-database-dump).xml.bz2
- Specify the database path in the script and run tokenizer.py
$ python tokenizer.py
- After running the script, tokens are stored in tokenizer_tokens.txt, unless otherwise specified. The script also outputs tokenizer_tokens2.txt, in which infrequent tokens (those appearing <= 10 times) are replaced with an 'UNK' token.
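- The infrequent-token replacement can be sketched as follows. This is a minimal sketch assuming one whitespace-tokenized sentence per line and the default output file names; the exact logic of tokenizer.py may differ.

    from collections import Counter

    # Count how often each token occurs in the tokenizer output.
    counts = Counter()
    with open('tokenizer_tokens.txt', encoding='utf-8') as f:
        for line in f:
            counts.update(line.split())

    # Rewrite the corpus, replacing tokens that occur <= 10 times with 'UNK'.
    with open('tokenizer_tokens.txt', encoding='utf-8') as f, \
         open('tokenizer_tokens2.txt', 'w', encoding='utf-8') as out:
        for line in f:
            tokens = ['UNK' if counts[t] <= 10 else t for t in line.split()]
            out.write(' '.join(tokens) + '\n')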
Inspect Wikipedia database dump model with Word2Vec and create vocabulary files
- Build fastText using the following commands
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make
- Learn word vectors for the Wikipedia database dump articles
$ ./fasttext skipgram -input results.txt -output model
- Running the command above will save two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters.
- Download questions-words.txt. This file contains approximately 19,500 analogies, divided into several categories, that will be used to perform a sanity check of the Wikipedia database dump model.
- Running the following commands will train the Wikipedia database dump model and store the vocabulary words in the vocabulary directory. They will also perform a sanity check of the model using questions-words.txt, printing the model's prediction accuracy for each analogy category as well as its overall accuracy (a minimal sketch of this step follows the commands below).
$ mkdir vocabulary
$ python inspect_words.py
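- A minimal sketch of the inspection step, loading the fastText vectors with gensim and scoring the analogy set; inspect_words.py may differ, and the vocabulary file name words.txt is an assumption.

    from gensim.models import KeyedVectors

    # Load the text-format vectors produced by fastText (model.vec).
    vectors = KeyedVectors.load_word2vec_format('model.vec', binary=False)

    # Write the vocabulary, one word per line, into the vocabulary directory
    # created above (gensim 4.x API; the file name is an assumption).
    with open('vocabulary/words.txt', 'w', encoding='utf-8') as f:
        f.write('\n'.join(vectors.index_to_key))

    # Per-category and overall accuracy on the analogy questions.
    overall, sections = vectors.evaluate_word_analogies('questions-words.txt')
    for section in sections:
        correct, incorrect = len(section['correct']), len(section['incorrect'])
        if correct + incorrect:
            print(f"{section['section']}: {correct / (correct + incorrect):.2%}")
    print(f"Overall accuracy: {overall:.2%}")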
Applying Word2Vec to Wikipedia database dump
- Uses gensim's Wikipedia parsing (WikiCorpus) to extract and tokenize the Wikipedia database dump compressed in bz2.
- Runs gensim's Word2Vec to train the model (gensim does not have GPU support); the trained model is saved as word2vec.model.
- Performs a sanity check with the file specified in analogy_path (questions-words.txt); a minimal sketch of this workflow follows the command below.
$ python word2vec.py
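- A sketch of the word2vec.py workflow under the gensim 4.x API; the hyperparameters shown are assumptions, not the script's actual settings, and the dump path is a placeholder to replace with your own.

    from gensim.corpora import WikiCorpus
    from gensim.models import Word2Vec

    analogy_path = 'questions-words.txt'

    class WikiSentences:
        """Restartable iterator over tokenized Wikipedia articles."""
        def __init__(self, dump_path):
            # Passing dictionary={} skips building a gensim dictionary.
            self.corpus = WikiCorpus(dump_path, dictionary={})
        def __iter__(self):
            return self.corpus.get_texts()

    # Replace with the path to your .xml.bz2 database dump.
    sentences = WikiSentences('(your-database-dump).xml.bz2')

    # Train on CPU (gensim has no GPU support) and save the model.
    model = Word2Vec(sentences=sentences, vector_size=100, window=5,
                     min_count=5, workers=4)
    model.save('word2vec.model')

    # Sanity check against the analogy file.
    score, _ = model.wv.evaluate_word_analogies(analogy_path)
    print(f'Overall analogy accuracy: {score:.2%}')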