- documentation changes in word2vec_similarity regarding the use of braces
- Drop C++11 specification in Makevars
- Building a word2vec model is now possible by providing a list of tokenised sentences (issue #14)
- word2vec is now a generic function with 2 implemented methods: word2vec.character and word2vec.list
- The embeddings with the file-based (word2vec.character) and list-based approach (word2vec.list) are proven to be the same if the tokenisation is the same and the hyperparameters of the model are the same
- In order to make sure the embeddings are the same the vocabulary had to be sorted according to the number of times it appears in the corpus as well as the token itself in case the number of times the 2 tokens occur is the same. This has as a consequence that the embeddings generated with version 0.4.0 will be slightly different as the ones obtained with package version < 0.4.0 due to a possible ordering difference in the vocabulary
- examples provided in the help of ?word2vec and in the README
- writing text data to files before training for the file-based approach (word2vec.character) now uses useBytes = TRUE (see issue #7)
- Remove LazyData from DESCRIPTION as there is no data to be lazy about
- Add option type to word2vec_similarity to allow both 'dot' similarity which is the default as 'cosine' similarity (requested in issue #5)
- Allow doc2vec also to be used on word2vec_trained
- Add txt_clean_word2vec
- Make example conditionally on availability of udpipe
- word2vec gains argument encoding
- Add doc2vec
- Fix R CMD check warning message on Fedora clang
- Extended predict.w2v with nearest if you pass on a vector or matrix. This allows to perform word2vec analogies or extract other similarities.
- Added word2vec_similarity
- Change classes returned by word2vec to 'word2vec_trained' and read.word2vec to 'word2vec'
- Add detailed docs of predict.word2vec and as.matrix.word2vec
- Added normalize option in read.word2vec usefull when wanting to extract the raw embedding (e.g. trained with other software)
- By default models trained with version 0.2 of this R package do normalization upfront before saving the model. For version 0.1 of this package this was not the case so load these in with option normalize set to TRUE
- Use Rcpp::runif as initialiser of embeddings instead of std::mt19937_64
- Functionalities default usage assumes UTF-8 encoding and predict.w2v now returns character text instead of factors
- Added read.wordvectors
- Initial package based on https://github.com/maxoodf/word2vec commit ad08b14ba6b554a10284c59c473ee81cb7f3af34