Integrate the neural token splitter #402
We should extend it. Besides, we need to code a benchmark to measure the performance of heuristics vs. advanced ML.
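A minimal sketch of what such a benchmark could look like: it times a heuristic baseline splitter over a list of identifiers. The `heuristic_split` and `benchmark` helpers are illustrative, not from the repo; the real ML model would be plugged in as another splitter function.

```python
import re
import time

def heuristic_split(token):
    """Heuristic baseline: split on underscores and camelCase boundaries."""
    parts = token.replace("_", " ")
    # Insert a space between a lowercase letter/digit and an uppercase letter.
    parts = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", parts)
    return parts.lower().split()

def benchmark(splitter, tokens, repeats=100):
    """Average wall-clock seconds per pass over `tokens` for `splitter`."""
    start = time.perf_counter()
    for _ in range(repeats):
        for tok in tokens:
            splitter(tok)
    return (time.perf_counter() - start) / repeats

tokens = ["fooBar", "read_file_async", "HTTPServerError"]
print(heuristic_split("fooBar"))           # -> ['foo', 'bar']
print(benchmark(heuristic_split, tokens))  # seconds per pass, for comparison
```

The same `benchmark` call would then be run with the neural splitter to compare throughput on identical inputs.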
The model from the paper: https://drive.google.com/file/d/1-vTJ1Ib-WVETNdmnzMqSW3PaFYlI2gvu/view?usp=sharing
@warenlg I mean the code. I remember that you coded one.
The code to train the model is here: https://github.com/src-d/ml/blob/master/sourced/ml/cmd/train_id_split.py
Some insights about how it is going:
@vmarkovtsev @zurk I was able to make the model work with the …
We should be able to reproduce the model as precisely as possible, bit for bit in the ideal case. That is why we fix all package versions and all random seeds, sort arrays, etc., so that the whole training process is deterministic. If you see that the model differs from one training run to another, let's find out why.
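The seeding part can be sketched as a small helper (hypothetical, not from the repo): pinning the stdlib RNG and the hash seed, with the framework-specific seeds noted in comments.

```python
import os
import random

def make_deterministic(seed=42):
    """Pin the sources of randomness we control so two runs match.

    Hypothetical helper: the real pipeline would also seed numpy
    (numpy.random.seed) and the DL framework (e.g. TensorFlow's graph-level
    seed), and pin all package versions to get bit-identical models.
    """
    os.environ["PYTHONHASHSEED"] = str(seed)  # stable hashing across runs
    random.seed(seed)                         # stdlib RNG

# Reseeding reproduces the exact same random sequence:
make_deterministic(42)
a = [random.random() for _ in range(3)]
make_deterministic(42)
b = [random.random() for _ in range(3)]
assert a == b
```

Note that seeding alone is not enough on GPUs: some parallel reductions are nondeterministic, which is one reason training-run differences can still appear.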
First, I would say, let's reproduce the model with the current training pipeline and the parameters from the paper, on the same resources, i.e. on 2 GPUs (it should take about half a day). If we get the same precision and recall on the overall dataset, we update the model in modelforge.
Based on the paper and the existing code, we should be able to parse identifiers with ML instead of heuristics.
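One way to frame the ML side, as a sketch: the model emits a per-character split probability, and a small decoder turns those probabilities into subtokens. The `decode_splits` helper and its threshold are assumptions for illustration; only the decoding step is shown, not the network itself.

```python
def decode_splits(token, split_probs, threshold=0.5):
    """Turn per-character split probabilities into subtokens.

    split_probs[i] is the (model-emitted) probability that a split occurs
    right before token[i]. Hypothetical decoder for illustration.
    """
    parts, current = [], ""
    for ch, p in zip(token, split_probs):
        if p >= threshold and current:
            parts.append(current)
            current = ""
        current += ch
    if current:
        parts.append(current)
    return [p.lower() for p in parts]

# Probabilities from a hypothetical model: a confident split before 'B'.
print(decode_splits("fooBar", [0.0, 0.0, 0.0, 0.9, 0.1, 0.0]))  # -> ['foo', 'bar']
```

Unlike the heuristics, this handles identifiers with no case or underscore cues (e.g. all-lowercase concatenations), which is where the ML model is expected to win.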
The model has been partially written by @warenlg. I don't remember where it is; Waren, can you please find it?
The splitting should be batched for performance reasons.
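Batching could look like the sketch below: identifiers are grouped into fixed-size chunks so the model runs one large inference per chunk instead of one call per token. `model_predict` is a stand-in for the real batched inference call.

```python
def split_batched(tokens, model_predict, batch_size=256):
    """Run the splitter over identifiers in batches for throughput.

    `model_predict` takes a list of tokens and returns a list of results;
    it stands in for the real neural model's batched inference.
    """
    results = []
    for i in range(0, len(tokens), batch_size):
        batch = tokens[i:i + batch_size]
        results.extend(model_predict(batch))
    return results

# Usage with a fake predictor standing in for the model:
fake_predict = lambda batch: [t.lower() for t in batch]
print(split_batched(["FooBar", "BazQux"], fake_predict, batch_size=1))
# -> ['foobar', 'bazqux']
```

Keeping the batch size configurable lets the caller trade latency for GPU utilization.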