NOTE: data must be generated with bigcode-ast-tools before this tool can be used.
bigcode-embeddings generates and visualizes embeddings for AST nodes.
This project should be used with Python 3.
To install the package, either run
pip install bigcode-embeddings
or clone the repository and run
cd bigcode-embeddings
pip install -r requirements.txt
python setup.py install
NOTE: tensorflow needs to be installed separately.
Training data can be generated using bigcode-ast-tools.
Given a data.txt.gz generated from a vocabulary of size 30000, 100D embeddings can be trained using
./bin/bigcode-embeddings train -o embeddings/ --vocab-size 30000 --emb-size 100 --l2-value 0.05 --learning-rate 0.01 data.txt.gz
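When training is launched from a script, the command above can be assembled from its parameters so that the vocabulary size stays in one place. The following is only a sketch: `build_train_command` is a hypothetical helper, not part of bigcode-embeddings, and it uses only the flags shown in the command above.

```python
import shlex


def build_train_command(data_path, output_dir, vocab_size, emb_size=100,
                        l2_value=0.05, learning_rate=0.01,
                        executable="./bin/bigcode-embeddings"):
    """Return the argument list for the train subcommand.

    Hypothetical helper: it only mirrors the flags shown in this README
    and does not validate them against the real CLI.
    """
    return [
        executable, "train",
        "-o", output_dir,
        "--vocab-size", str(vocab_size),
        "--emb-size", str(emb_size),
        "--l2-value", str(l2_value),
        "--learning-rate", str(learning_rate),
        data_path,
    ]


if __name__ == "__main__":
    cmd = build_train_command("data.txt.gz", "embeddings/", vocab_size=30000)
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.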
TensorBoard can be used to monitor training progress
tensorboard --logdir embeddings/
After the first epoch, the embeddings visualization becomes available in
TensorBoard. The vocabulary TSV file generated by bigcode-ast-tools can be
loaded to label the embeddings.
Trained embeddings can be visualized using the visualize subcommand.
If the generated vocabulary file is vocab.tsv, the above embeddings
can be visualized with the following command
./bin/data-explorer visualize clusters -m embeddings/embeddings.bin-STEP -l vocab.tsv
where STEP should be the largest value found in the embeddings/ directory.
The -i flag can be passed to generate an interactive plot.
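Picking the largest STEP by hand is error-prone because file names sort lexicographically (900 sorts after 10000). The sketch below finds it numerically. `latest_step` is a hypothetical helper, and it assumes the files in the output directory start with `embeddings.bin-STEP`, optionally followed by a suffix such as `.index`; adjust the pattern if the actual file names differ.

```python
import os
import re

# Assumption: saved files are named embeddings.bin-STEP, optionally
# followed by a suffix (for example ".index" or ".meta").
CHECKPOINT_RE = re.compile(r"^embeddings\.bin-(\d+)")


def latest_step(directory):
    """Return the largest STEP among embeddings.bin-STEP* files."""
    steps = []
    for name in os.listdir(directory):
        match = CHECKPOINT_RE.match(name)
        if match:
            # Compare steps as integers, not as strings.
            steps.append(int(match.group(1)))
    if not steps:
        raise FileNotFoundError(
            "no embeddings.bin-STEP files found in " + directory)
    return max(steps)


if __name__ == "__main__":
    step = latest_step("embeddings/")
    print("embeddings/embeddings.bin-{}".format(step))
```

The printed path is the value to pass to the -m flag of the visualize command.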