Integrate the neural token splitter #402
We should extend it. Besides, we need to code a benchmark to measure the performance of heuristics vs. advanced ML.
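A minimal sketch of what such a benchmark could look like: it times a heuristic baseline splitter over a list of identifiers. The `heuristic_split` and `benchmark` helpers are illustrative, not from the repo; the real ML model would be plugged in as another splitter function.

```python
import re
import time

def heuristic_split(token):
    """Heuristic baseline: split on underscores and camelCase boundaries."""
    parts = token.replace("_", " ")
    # Insert a space between a lowercase letter/digit and an uppercase letter.
    parts = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", parts)
    return parts.lower().split()

def benchmark(splitter, tokens, repeats=100):
    """Average wall-clock seconds per pass over `tokens` for `splitter`."""
    start = time.perf_counter()
    for _ in range(repeats):
        for tok in tokens:
            splitter(tok)
    return (time.perf_counter() - start) / repeats

tokens = ["fooBar", "read_file_async", "HTTPServerError"]
print(heuristic_split("fooBar"))           # -> ['foo', 'bar']
print(benchmark(heuristic_split, tokens))  # seconds per pass, for comparison
```

The same `benchmark` call would then be run with the neural splitter to compare throughput on identical inputs.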
The model from the paper: https://drive.google.com/file/d/1-vTJ1Ib-WVETNdmnzMqSW3PaFYlI2gvu/view?usp=sharing
@warenlg I mean the code. I remember that you coded one.
The code to train the model is here: https://github.com/src-d/ml/blob/master/sourced/ml/cmd/train_id_split.py
Some insights about how it is going:
@vmarkovtsev @zurk I was able to make the model work with the …
We should be able to reproduce the model as precisely as possible, bit for bit in the ideal case. That is why we fix all package versions and all random seeds, sort arrays, etc., so that the whole training process is deterministic. If you see that the model differs from one training run to another, let's find out why.
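The seeding part can be sketched as a small helper (hypothetical, not from the repo): pinning the stdlib RNG and the hash seed, with the framework-specific seeds noted in comments.

```python
import os
import random

def make_deterministic(seed=42):
    """Pin the sources of randomness we control so two runs match.

    Hypothetical helper: the real pipeline would also seed numpy
    (numpy.random.seed) and the DL framework (e.g. TensorFlow's graph-level
    seed), and pin all package versions to get bit-identical models.
    """
    os.environ["PYTHONHASHSEED"] = str(seed)  # stable hashing across runs
    random.seed(seed)                         # stdlib RNG

# Reseeding reproduces the exact same random sequence:
make_deterministic(42)
a = [random.random() for _ in range(3)]
make_deterministic(42)
b = [random.random() for _ in range(3)]
assert a == b
```

Note that seeding alone is not enough on GPUs: some parallel reductions are nondeterministic, which is one reason training-run differences can still appear.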
First, I would say, let's reproduce the model with the current training pipeline and the parameters from the paper, on the same resources, i.e. on 2 GPUs (it should take about half a day). If we get the same precision and recall on the overall dataset, we update the model in modelforge.
Based on the paper and the existing code, we should be able to parse identifiers with ML instead of heuristics.
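One way to frame the ML side, as a sketch: the model emits a per-character split probability, and a small decoder turns those probabilities into subtokens. The `decode_splits` helper and its threshold are assumptions for illustration; only the decoding step is shown, not the network itself.

```python
def decode_splits(token, split_probs, threshold=0.5):
    """Turn per-character split probabilities into subtokens.

    split_probs[i] is the (model-emitted) probability that a split occurs
    right before token[i]. Hypothetical decoder for illustration.
    """
    parts, current = [], ""
    for ch, p in zip(token, split_probs):
        if p >= threshold and current:
            parts.append(current)
            current = ""
        current += ch
    if current:
        parts.append(current)
    return [p.lower() for p in parts]

# Probabilities from a hypothetical model: a confident split before 'B'.
print(decode_splits("fooBar", [0.0, 0.0, 0.0, 0.9, 0.1, 0.0]))  # -> ['foo', 'bar']
```

Unlike the heuristics, this handles identifiers with no case or underscore cues (e.g. all-lowercase concatenations), which is where the ML model is expected to win.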
The model has been partially written by @warenlg. I don't remember where it is; Waren, can you please find it?
The splitting should be batched for performance reasons.
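Batching could look like the sketch below: identifiers are grouped into fixed-size chunks so the model runs one large inference per chunk instead of one call per token. `model_predict` is a stand-in for the real batched inference call.

```python
def split_batched(tokens, model_predict, batch_size=256):
    """Run the splitter over identifiers in batches for throughput.

    `model_predict` takes a list of tokens and returns a list of results;
    it stands in for the real neural model's batched inference.
    """
    results = []
    for i in range(0, len(tokens), batch_size):
        batch = tokens[i:i + batch_size]
        results.extend(model_predict(batch))
    return results

# Usage with a fake predictor standing in for the model:
fake_predict = lambda batch: [t.lower() for t in batch]
print(split_batched(["FooBar", "BazQux"], fake_predict, batch_size=1))
# -> ['foobar', 'bazqux']
```

Keeping the batch size configurable lets the caller trade latency for GPU utilization.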