# DiMo. Distributional Models Evaluation


DiMo is a collection of scripts for my bachelor's thesis *Comparison and Evaluation of Models for Distributional Semantics*.

Take a look at the notebooks on the thesis's official website to see how these scripts can be used.

**Notice!** Parts of the code require Sketch Engine's internal packages `manatee` and `wmap`.

Other required packages are:

- `numpy`
- `scipy`
- `gensim`
- `sklearn`

The code runs on Python 2.7.

## Models

### Sketch Engine Thesaurus (SkEThes)

Unlike the original implementation, the one in this project operates directly on a co-occurrence matrix.

If you have a corpus with compiled word sketches (say it is called `bnc2`), use the `wm2thes.py` script to create such a matrix:

```shell
python wm2thes.py bnc2 bnc2-matrix
```

This creates four files representing a sparse word × (relation, word) matrix:

```
bnc2-matrix-target2i.pickle  # dictionary: words to indices
bnc2-matrix-rows.npy         # row indices
bnc2-matrix-cols.npy         # column indices
bnc2-matrix-vals.npy         # values
```
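The four files can be read back into a `scipy.sparse` matrix roughly like this (a sketch with a hypothetical `load_cooc_matrix` helper; only the file suffixes come from the listing above):

```python
import pickle

import numpy as np
from scipy.sparse import coo_matrix


def load_cooc_matrix(prefix):
    """Load the four files written by wm2thes.py into a word dictionary
    and a CSR co-occurrence matrix. Hypothetical helper, not part of DiMo."""
    with open(prefix + "-target2i.pickle", "rb") as f:
        target2i = pickle.load(f)  # word -> row index
    rows = np.load(prefix + "-rows.npy")
    cols = np.load(prefix + "-cols.npy")
    vals = np.load(prefix + "-vals.npy")
    # COO triplets -> CSR for fast row slicing
    matrix = coo_matrix((vals, (rows, cols))).tocsr()
    return target2i, matrix
```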

Now that you have the matrix, you may decide which similarity measure to use.

```python
from models import SkEThesSKE, SkEThesCOS

model_ske = SkEThesSKE("bnc2-matrix")
model_cos = SkEThesCOS("bnc2-matrix")
```

Now you can call functions like `similarity`, `similarities`, `most_similar` or `eval_analogy` to evaluate the models on datasets of analogy queries.
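For intuition, a COS-style measure presumably reduces to the cosine between two words' rows of the sparse matrix. A minimal sketch (the `row_cosine` helper is hypothetical, not DiMo's implementation):

```python
import numpy as np
from scipy.sparse import csr_matrix


def row_cosine(matrix, i, j):
    """Cosine similarity between rows i and j of a sparse matrix."""
    a = matrix.getrow(i)
    b = matrix.getrow(j)
    denom = np.sqrt(a.multiply(a).sum()) * np.sqrt(b.multiply(b).sum())
    if denom == 0:
        return 0.0
    return a.multiply(b).sum() / denom
```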

There is also a wrapper for the original implementation in `oskethes.py`, but its interface is a bit different: it is just a collection of precomputed word similarities, so there is no co-occurrence matrix, and similarities below 0.05 are discarded.

### Word-Word Co-Occurrence Matrix

If you have a corpus in a plain-text file (one line per sentence), you may create a similar model with linear contexts (a weighted symmetric context window):

```shell
python coocs.py plain-bnc.txt plain-bnc-matrix 20 5
```

- `20` is the minimum word frequency
- `5` is the context window size
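In outline, counting with a weighted symmetric window looks like the sketch below. The 1/distance context weighting here is an assumption; `coocs.py` may weight contexts differently.

```python
from collections import defaultdict


def count_coocs(sentences, window=5, min_freq=1):
    """Weighted co-occurrence counts over a symmetric context window.

    sentences: iterable of token lists (one list per sentence).
    Returns a dict (target, context) -> weighted count.
    Sketch only; the context weight 1/distance is an assumption.
    """
    freq = defaultdict(int)
    for sent in sentences:
        for w in sent:
            freq[w] += 1
    vocab = {w for w, f in freq.items() if f >= min_freq}

    counts = defaultdict(float)
    for sent in sentences:
        for i, target in enumerate(sent):
            if target not in vocab:
                continue
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j == i or sent[j] not in vocab:
                    continue
                counts[(target, sent[j])] += 1.0 / abs(i - j)
    return counts
```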

The matrix will contain raw co-occurrence counts, so you may consider applying a weighting scheme.

```python
from models import SkEThesSKE
from weightings import ppmi

model_ske = SkEThesSKE("plain-bnc-matrix", weighting=ppmi)
```
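PPMI (positive pointwise mutual information) is a standard choice here. A dense-matrix sketch of the idea (DiMo's `weightings.ppmi` operates on the sparse format, so this is illustrative only):

```python
import numpy as np


def ppmi(counts):
    """Positive PMI weighting of a dense co-occurrence count matrix.

    PMI(w, c) = log( P(w, c) / (P(w) * P(c)) ); negative values and
    undefined (zero-count) cells are clipped to 0. Sketch only.
    """
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)  # marginal counts per word
    col = counts.sum(axis=0, keepdims=True)  # marginal counts per context
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts * total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts -> log(0) -> clip
    return np.maximum(pmi, 0.0)
```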

### Word2Vec

For Word2Vec models, this project wraps the gensim package. Everything that you can open with:

```python
from gensim.models import Word2Vec

model = Word2Vec.load(model_name)
```

... you can open also with:

```python
from models import Word2Vec

model = Word2Vec.load(model_name)
```

The interface as well as the evaluation script is the same as for the SkEThesXXX models.

## Evaluation

```python
evaluation = model.eval_analogy(dataset)
```

The dataset is a dictionary mapping a category to a list of queries. Each query should be a tuple like:

```python
("paris", "france", "london", {"england", "britain", "uk"})
```
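That is, "paris is to france as london is to ?", with a set of acceptable answers. The default evaluation presumably ranks candidates with the additive offset formula (3CosAdd); a toy sketch with a hypothetical `answer_analogy` helper:

```python
import numpy as np


def answer_analogy(vectors, a, b, c, topn=1, exclude=True):
    """Rank answers to 'a : b :: c : ?' by cosine to (b - a + c).

    vectors: dict mapping word -> unit-length numpy vector.
    Sketch of 3CosAdd; names and defaults are assumptions, not DiMo's API.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    scores = []
    for word, vec in vectors.items():
        if exclude and word in (a, b, c):  # the "exclusion trick"
            continue
        scores.append((float(np.dot(target, vec)), word))
    scores.sort(reverse=True)
    return [w for _, w in scores[:topn]]
```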

You may configure the evaluation in various ways:

```python
from formulas import mul

my_mul = lambda a, b, aa: mul(a, b, aa, coeff=0.05)
evaluation = model.eval_analogy(dataset, topn=5, exclusion_trick=False, formula=my_mul)
```
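The `mul` formula presumably corresponds to 3CosMul (Levy and Goldberg), where `coeff` would play the role of the smoothing epsilon. A standalone sketch of the scoring function (the mapping of `coeff` onto epsilon is an assumption):

```python
import numpy as np


def mul_score(d, a, b, c, coeff=0.001):
    """3CosMul score of candidate d for the query 'a : b :: c : ?'.

    Assumes unit vectors; cosines are shifted into [0, 1] so the
    product/quotient stays well defined. Sketch only.
    """
    cos = lambda x, y: (np.dot(x, y) + 1.0) / 2.0  # shift cosine to [0, 1]
    return cos(d, b) * cos(d, c) / (cos(d, a) + coeff)
```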

And see the results:

```python
evaluation[category]["acc"]       # 0.0--1.0
evaluation[category]["acc_top1"]  # 0.0--1.0
evaluation[category]["oov"]       # number of queries containing an OOV word
evaluation[category]["oovs"]      # set of OOV words
evaluation[category]["queries"]   # list of queries and their candidate answers (excluding queries with OOV words)
```