bigcode-embeddings

NOTE: data must be generated with bigcode-ast-tools before this tool can be used.

bigcode-embeddings generates and visualizes embeddings for AST nodes.

Install

This project requires Python 3.

To install the package, either run

pip install bigcode-embeddings

or clone the repository and run

cd bigcode-embeddings
pip install -r requirements.txt
python setup.py install

NOTE: tensorflow needs to be installed separately.
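Since tensorflow is not pulled in by the steps above, it can help to check that it is importable before training. The following is a minimal sketch using only the Python standard library; `tensorflow_available` is a hypothetical helper name and not part of bigcode-embeddings.

```python
import importlib.util


def tensorflow_available():
    """Return True if a top-level module named 'tensorflow' can be located."""
    # find_spec returns None when the module cannot be found on sys.path.
    return importlib.util.find_spec("tensorflow") is not None


if __name__ == "__main__":
    if tensorflow_available():
        print("tensorflow found")
    else:
        print("tensorflow is missing; install it separately before training")
```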

Usage

Training embeddings

Training data can be generated using bigcode-ast-tools.

Given a data.txt.gz generated with a vocabulary of size 30000, 100-dimensional embeddings can be trained using

./bin/bigcode-embeddings train -o embeddings/ --vocab-size 30000 --emb-size 100 --l2-value 0.05 --learning-rate 0.01 data.txt.gz
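When training is launched from a script, for example to try several embedding sizes, the command above can be assembled programmatically. This is a minimal sketch that assumes it is run from the repository root; `build_train_command` is a hypothetical helper, not part of bigcode-embeddings, and it only uses the flags shown in the command above.

```python
import subprocess


def build_train_command(data_path, output_dir, vocab_size=30000, emb_size=100,
                        l2_value=0.05, learning_rate=0.01):
    """Build the argument list for the train subcommand shown above."""
    return [
        "./bin/bigcode-embeddings", "train",
        "-o", output_dir,
        "--vocab-size", str(vocab_size),
        "--emb-size", str(emb_size),
        "--l2-value", str(l2_value),
        "--learning-rate", str(learning_rate),
        data_path,
    ]


def run_training(data_path, output_dir, **options):
    """Run the training command and raise if it exits with an error."""
    subprocess.run(build_train_command(data_path, output_dir, **options), check=True)
```

Calling `run_training("data.txt.gz", "embeddings/")` would run the same command as above. Passing the arguments as a list avoids shell quoting issues.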

TensorBoard can be used to monitor training progress

tensorboard --logdir embeddings/

After the first epoch, the embeddings visualization becomes available in TensorBoard. The vocabulary TSV file generated by bigcode-ast-tools can be loaded to label the embeddings.

Visualizing the embeddings

Trained embeddings can be visualized using the visualize subcommand. If the generated vocabulary file is vocab.tsv, the embeddings trained above can be visualized with the following command

./bin/data-explorer visualize clusters -m embeddings/embeddings.bin-STEP -l vocab.tsv

where STEP should be the largest value found in the embeddings/ directory.
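Finding the largest STEP by hand can be error-prone when many checkpoints exist. The sketch below scans a directory for it, assuming the saved files are named `embeddings.bin-STEP`, optionally followed by a suffix such as `.index` or `.meta`; that naming is inferred from the command above, and `latest_step` is a hypothetical helper rather than part of the tool.

```python
import re
from pathlib import Path


def latest_step(directory, prefix="embeddings.bin"):
    """Return the largest STEP among files named '<prefix>-STEP[.suffix]'."""
    # Match the prefix, a dash, the digits of STEP, then a dot or end of name.
    pattern = re.compile(re.escape(prefix) + r"-(\d+)(?:\.|$)")
    steps = set()
    for path in Path(directory).iterdir():
        match = pattern.match(path.name)
        if match:
            steps.add(int(match.group(1)))
    if not steps:
        raise FileNotFoundError(f"no '{prefix}-STEP' files in {directory}")
    return max(steps)
```

The value passed to `-m` can then be built as `f"embeddings/embeddings.bin-{latest_step('embeddings')}"`. Steps are compared as integers, so 10000 ranks above 9000, which a plain alphabetical sort would get wrong.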

The -i flag can be passed to generate an interactive plot.