This repository has been archived by the owner on Feb 3, 2021. It is now read-only.

Source2vec

This repository contains source code embeddings for various programming languages.

Source code files are preprocessed using standard tokenization, which may not be the ideal approach for source code; this is a work in progress. We are also working on enhancing the embeddings with ASTs and PDGs.

Java

Created from 1720 million tokens
Window (context) size 5
Minimum number of occurrences 10
Vocabulary http://dizp.fufygen.eu/embeddings/java/java_vocab.txt.zip
Sample visualisation (FastText): [image: java fasttext vis]

Word2vec

http://dizp.fufygen.eu/embeddings/java/word2vec/java_word2vec.zip

FastText

http://dizp.fufygen.eu/embeddings/java/fasttext/java_fasttext_model.bin.zip
http://dizp.fufygen.eu/embeddings/java/fasttext/java_fasttext_model.vec.zip

GloVe

http://dizp.fufygen.eu/embeddings/java/glove/java_glove_vectors.bin.zip
http://dizp.fufygen.eu/embeddings/java/glove/java_glove_vectors.txt.zip

Python

Created from 838 million tokens
Window (context) size 5
Minimum number of occurrences 10
Vocabulary http://dizp.fufygen.eu/embeddings/python/python_vocab.txt.zip
Sample visualisation (FastText, vector size 128): [image: python fasttext vis]

Word2vec

http://dizp.fufygen.eu/embeddings/python/word2vec/python_word2vec.zip

FastText

http://dizp.fufygen.eu/embeddings/python/fasttext/python_fasttext_model.bin.zip
http://dizp.fufygen.eu/embeddings/python/fasttext/python_fasttext_model.vec.zip
http://dizp.fufygen.eu/embeddings/python/fasttext/python_fasttext_model_128.bin.zip (vector size 128)
http://dizp.fufygen.eu/embeddings/python/fasttext/python_fasttext_model_128.vec.zip (vector size 128)

GloVe

http://dizp.fufygen.eu/embeddings/python/glove/python_glove_vectors.bin.zip
http://dizp.fufygen.eu/embeddings/python/glove/python_glove_vectors.txt.zip

C

Created from 6589 million tokens
Window (context) size 7
Minimum number of occurrences 20
Vocabulary http://dizp.fufygen.eu/embeddings/c/c_vocab.txt.zip
Sample visualisation (FastText): [image: c fasttext vis]

Word2vec

http://dizp.fufygen.eu/embeddings/c/word2vec/c_word2vec.zip

FastText

http://dizp.fufygen.eu/embeddings/c/fasttext/c_fasttext_model.bin.zip
http://dizp.fufygen.eu/embeddings/c/fasttext/c_fasttext_model.vec.zip

GloVe

http://dizp.fufygen.eu/embeddings/c/glove/c_glove_vectors.bin.zip
http://dizp.fufygen.eu/embeddings/c/glove/c_glove_vectors.txt.zip

...if you need other languages or different parameters, feel free to open an issue.

Common parameters

Vector size 64
Word2vec vectors are created using the skip-gram method
FastText vectors are created using character n-grams of length 2 to 6
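
To illustrate what "2 to 6 character ngrams" means for a source code token, here is a minimal sketch of subword extraction in the style FastText is known to use, where the token is wrapped in `<` and `>` boundary markers before slicing. The function name `char_ngrams` is hypothetical and this is not the code used to build these embeddings; real FastText also hashes the n-grams into buckets and keeps the whole token as a separate unit.

```python
def char_ngrams(token, min_n=2, max_n=6):
    """Return all character n-grams of `token` with min_n <= n <= max_n.

    The token is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams in the middle of the token.
    """
    wrapped = "<" + token + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped token.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("for"))
# ['<f', 'fo', 'or', 'r>', '<fo', 'for', 'or>', '<for', 'for>', '<for>']
```

Because identifiers such as `getName` and `setName` share many of these n-grams, subword models can produce related vectors for them, including for tokens below the minimum occurrence threshold.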

How to load embeddings (Python)

see load_embeddings.ipynb
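
As a rough alternative to the notebook, the sketch below parses the plain-text vector format with the standard library only. It assumes the unzipped `.vec` file follows the usual FastText/word2vec text layout (a header line with vocabulary size and vector size, then one token per line followed by its values); the helper name `load_vec` and the file name in the comment are assumptions, not taken from the notebook.

```python
import io


def load_vec(lines):
    """Parse text-format embeddings from an iterable of lines.

    Returns (vectors, dim), where vectors maps token -> list of floats.
    """
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])  # header is "<vocab size> <vector size>"
    vectors = {}
    for line in it:
        # Split from the right so only the last `dim` fields are values.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


# Tiny in-memory sample standing in for a real file.
sample = io.StringIO("2 3\nint 0.1 0.2 0.3 \nfloat 0.2 0.1 0.0 \n")
vectors, dim = load_vec(sample)
print(dim, sorted(vectors))

# With a real file (name assumed after unzipping):
# with open("java_fasttext_model.vec", encoding="utf-8") as f:
#     vectors, dim = load_vec(f)
```

For the published models `dim` should match the vector size stated under Common parameters (64, or 128 for the larger Python FastText model). The `.bin` files are binary and need the corresponding library to read.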

Dimensionality reduction and visualisation of embeddings

see visualise_embeddings.ipynb

Paper coming out soon.