Word2Vec & Doc2Vec Wishlist
See also general gensim Ideas & Feature proposals.
### Implement 'Translation Matrix' of 'Exploiting similarities among languages for machine translation'
Section 4 of Mikolov, Le, & Sutskever's paper on word2vec for machine translation describes a way to map words between two separate vector models, as in the example of word vectors induced for two different natural languages.
Section 2.2 of 'Skip-Thought Vectors' uses a similar technique to expand the model's vocabulary, by mapping in words from a pre-existing, larger word2vec model.
The same technique could be valuable for adapting to drifting word representations when training on large datasets over long timeframes. Specifically: as new information introduces extra words and newer examples of word usage, older words may (and probably should) relocate so that the model continues to perform optimally on the training task for more recent text. (In a sense, words should rearrange to 'make room' for the new words and examples.) As these changes accumulate, older representations (or cached byproducts) may no longer be directly comparable to the latest representations, unless a translation-matrix-like adjustment is made. (The specifics of the translation may also indicate areas of interest, where usage or meanings are changing rapidly.)
Implementation work by Georgiana Dinu, linked from the word2vec homepage, may be relevant if license-compatible. (Update: In correspondence, Dinu has given approval to re-use that code in gensim, if it's helpful.)
Jason of the jxieeducation.com blog has also run an experiment suggesting that this approach is useful, in that case using sklearn's `LinearRegression` to learn the projection.
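As a rough illustration of the idea (not the paper's or Dinu's implementation), the projection between two vector spaces can be sketched as an ordinary least-squares fit over a set of anchor word pairs. The paper optimizes the same squared-error objective with gradient descent; this sketch swaps in NumPy's closed-form `np.linalg.lstsq` for brevity. The function names `learn_translation_matrix` and `nearest_targets`, and the synthetic data, are hypothetical and exist only for this example.

```python
import numpy as np


def learn_translation_matrix(source_vecs, target_vecs):
    """Return W minimizing ||source_vecs @ W - target_vecs||^2.

    source_vecs: (n_pairs, d_source) vectors of anchor words in the source model.
    target_vecs: (n_pairs, d_target) vectors of the same words in the target model.
    """
    W, _residuals, _rank, _singular = np.linalg.lstsq(
        source_vecs, target_vecs, rcond=None
    )
    return W


def nearest_targets(query_vec, W, target_matrix, topn=1):
    """Project a source vector and rank target rows by cosine similarity."""
    projected = query_vec @ W
    norms = np.linalg.norm(target_matrix, axis=1) * np.linalg.norm(projected)
    sims = (target_matrix @ projected) / np.maximum(norms, 1e-12)
    return np.argsort(-sims)[:topn]


# Synthetic stand-in for two models: the target space is an exact linear
# image of the source space (an assumption made only for this demo).
rng = np.random.default_rng(0)
true_map = rng.normal(size=(5, 4))
source_space = rng.normal(size=(50, 5))
target_space = source_space @ true_map

# Fit on the first 30 "anchor" words, then look up a held-out word.
W = learn_translation_matrix(source_space[:30], target_space[:30])
held_out_hit = nearest_targets(source_space[40], W, target_space)[0]
```

With real models, the rows would come from the two models' vectors for a shared or dictionary-aligned word list, and the fit would only be approximate.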
### Add 'Adagrad' Gradient-Descent Option
Some Word2Vec/Doc2Vec papers and projects suggest they've used 'Adagrad' to speed up gradient descent. It would be nice to have it as an option for Word2Vec/Doc2Vec (for comparative evaluation), and then possibly as the default if it proves to be a clear speed win.
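For reference, the core Adagrad rule keeps a running sum of squared gradients per parameter and divides each update by its square root, so frequently updated parameters take smaller steps over time. Below is a minimal NumPy sketch on a toy quadratic loss. It is not gensim code, and the name `adagrad_step` and the hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np


def adagrad_step(params, grad, grad_sq_sum, lr=0.1, eps=1e-8):
    """Apply one in-place Adagrad update.

    grad_sq_sum accumulates squared gradients per parameter; each parameter's
    effective learning rate is lr / (sqrt(grad_sq_sum) + eps).
    """
    grad_sq_sum += grad ** 2
    params -= lr * grad / (np.sqrt(grad_sq_sum) + eps)
    return params, grad_sq_sum


# Toy problem (assumed for the demo): minimize 0.5 * ||params - target||^2.
target = np.array([1.0, -2.0, 0.5])
params = np.zeros(3)
grad_sq_sum = np.zeros(3)
initial_loss = 0.5 * np.sum((params - target) ** 2)

for _ in range(500):
    grad = params - target  # gradient of the toy loss
    params, grad_sq_sum = adagrad_step(params, grad, grad_sq_sum)

final_loss = 0.5 * np.sum((params - target) ** 2)
```

In a Word2Vec/Doc2Vec setting, the accumulator would presumably be kept alongside each trained weight array and touched only for the rows updated by a given training example, which adds memory proportional to the model's size.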