This repository has been archived by the owner on May 22, 2019. It is now read-only.

Integrate the neural token splitter #402

Open
vmarkovtsev opened this issue Apr 10, 2019 · 8 comments

@vmarkovtsev (Collaborator)

Based on the paper and the existing code, we should be able to split identifiers with ML rather than with heuristics.

The model has been partially written by @warenlg. I don't remember where it is; can you please check with Waren?

The splitting should be batched for performance reasons.
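
A minimal sketch of what a batched interface could look like, assuming a Keras model that emits a per-character split probability; the class name, character encoding, and 0.5 threshold are illustrative assumptions, not the actual code:

```python
import numpy as np
from tensorflow.keras.models import load_model

MAXLEN = 40  # RNN sequence length used in the paper


class NeuralTokenSplitter:
    """Hypothetical batched wrapper around the trained character BiLSTM."""

    def __init__(self, path):
        self.model = load_model(path)

    def split_batch(self, identifiers):
        # Encode each identifier as a padded row of character codes
        # (a stand-in for the real vocabulary used during training).
        batch = np.zeros((len(identifiers), MAXLEN), dtype=np.int32)
        for i, ident in enumerate(identifiers):
            for j, ch in enumerate(ident[:MAXLEN]):
                batch[i, j] = (ord(ch.lower()) - ord("a") + 1) if ch.isalpha() else 0
        # One forward pass for the whole batch; we assume the model outputs,
        # per character, the probability that a new token starts there.
        probs = self.model.predict(batch, batch_size=len(batch)).squeeze(-1)
        result = []
        for ident, p in zip(identifiers, probs):
            tokens, start = [], 0
            for j in range(1, min(len(ident), MAXLEN)):
                if p[j] > 0.5:
                    tokens.append(ident[start:j])
                    start = j
            tokens.append(ident[start:])
            result.append(tokens)
        return result
```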

@vmarkovtsev (Collaborator, Author)

We should extend TokenParser.

We also need to write a benchmark that measures the performance of the heuristics against the ML model.
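
For the benchmark, a hedged sketch along these lines could work, assuming TokenParser is importable from sourced.ml.algorithms as in the current repository layout, and reusing the hypothetical batched splitter sketched above:

```python
import time

from sourced.ml.algorithms import TokenParser  # heuristic splitter


def benchmark(identifiers, neural_splitter):
    """Measure heuristic vs. neural splitting throughput on one list of
    identifiers; `neural_splitter` is the hypothetical wrapper above."""
    parser = TokenParser()
    start = time.perf_counter()
    heuristic = [list(parser.split(i)) for i in identifiers]
    t_heur = time.perf_counter() - start
    start = time.perf_counter()
    neural = neural_splitter.split_batch(identifiers)
    t_ml = time.perf_counter() - start
    print("heuristics: %.3fs, ML: %.3fs over %d identifiers"
          % (t_heur, t_ml, len(identifiers)))
    return heuristic, neural
```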

@warenlg (Contributor) commented Apr 10, 2019

@vmarkovtsev (Collaborator, Author)

@warenlg I mean the code. I remember that you coded one.

@warenlg (Contributor) commented Apr 10, 2019

The code to train the model is here: https://github.com/src-d/ml/blob/master/sourced/ml/cmd/train_id_split.py
As for the snippet that loads the model and demos identifier splitting (split.py), yes, I already gave it to Tristan by DM on Slack.
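
For reference, the demo part might look roughly like this (a sketch only; the real split.py may differ, and the model path is a placeholder):

```python
# Hypothetical demo in the spirit of split.py: load the trained model
# and split a few identifiers. The wrapper class is the sketch above.
splitter = NeuralTokenSplitter("id_split_bilstm.h5")
for ident in ("fooBarBaz", "readfile", "httpresponse"):
    print(ident, "->", splitter.split_batch([ident])[0])
```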

@glimow commented Apr 11, 2019

An update on how things are going:

  • Tried to train the model on my laptop + eGPU. It failed miserably due to memory usage. I suspect the dataset is being replicated in memory: the original identifiers dataset is 2.3 GB, so training should not use ~28 GB of RAM. I will investigate the memory usage during training on science-3 (I just got my credentials); see the streaming sketch at the end of this comment.
  • Adding the model to modelforge is ongoing. The class itself is almost finished. I am not entirely familiar with modelforge, so it is a bit slower than it should be, but I am starting to understand how it works.
  • @warenlg gave me his old model weights, so I can test my integration into TokenParser while training on science-3 at the same time.

@vmarkovtsev
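
Regarding the memory blow-up, one possible mitigation, assuming the training pipeline uses Keras and the encoded dataset is stored as uncompressed .npy files (both unverified assumptions): stream batches from memory-mapped arrays instead of materializing everything in RAM.

```python
import numpy as np
from tensorflow.keras.utils import Sequence


class IdentifierSequence(Sequence):
    """Stream padded batches from memory-mapped arrays so the full
    2.3 GB dataset (and any intermediate copies) never has to live
    in RAM at once. The file names are placeholders."""

    def __init__(self, inputs_path, labels_path, batch_size=512):
        self.inputs = np.load(inputs_path, mmap_mode="r")
        self.labels = np.load(labels_path, mmap_mode="r")
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.inputs) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        # Copy just one batch into memory per training step.
        return np.array(self.inputs[sl]), np.array(self.labels[sl])

# model.fit_generator(IdentifierSequence("ids.npy", "labels.npy"), epochs=10)
```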

@glimow commented Apr 12, 2019

@vmarkovtsev @zurk I was able to make the model work with the modelforge API, along with ASDF saving and loading support.
I have a question regarding tests: since this is an ML model, i.e. not deterministic across different training runs, should we really compare results identifier by identifier, or rather evaluate the overall metrics of the model? E.g. predict ~100 identifiers and ensure precision > 80%.
@warenlg this concerns you as well, since this is your model.
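
A sketch of the aggregate-metric variant, with hypothetical `load_splitter` and `FIXTURES` helpers; token-level precision is just one reasonable choice of metric here:

```python
import unittest

# Hypothetical helpers: load_splitter() returns the trained splitter,
# FIXTURES is ~100 (identifier, expected_tokens) pairs.
from test_fixtures import load_splitter, FIXTURES


class NeuralSplitterTest(unittest.TestCase):
    def test_precision_threshold(self):
        splitter = load_splitter()
        correct = total = 0
        for identifier, expected in FIXTURES:
            predicted = splitter.split_batch([identifier])[0]
            total += len(predicted)
            correct += sum(1 for token in predicted if token in expected)
        # Aggregate check instead of identifier-by-identifier equality.
        self.assertGreater(correct / total, 0.8)
```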

@zurk (Contributor) commented Apr 12, 2019

We should be able to reproduce the model as precisely as possible, bit for bit in the ideal case. That is why we fix all package versions and random seeds, sort arrays, etc., to make the whole training process deterministic. If you see that the model differs from one training run to another, let's find out why.
Given that, it is totally fine to compare results identifier by identifier. It also helps us see whether new changes affect model performance.
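
For completeness, pinning the randomness in a 2019-era TensorFlow/Keras pipeline usually looks like this (a sketch; the training code may already do the equivalent):

```python
import os
import random

import numpy as np
import tensorflow as tf


def fix_seeds(seed=7):
    """Pin every common source of randomness so that two training
    runs produce the same model, per the policy described above."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.set_random_seed(seed)  # tf.random.set_seed(seed) on TF 2.x
```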

@warenlg (Contributor) commented Apr 12, 2019

First, I would say, let's reproduce the model with the current training pipeline and the parameters from the paper, using the same resources, i.e. 2 GPUs (it should take about half a day):

  • epochs: 10
  • RNN seq len: 40
  • batch size: 512
  • optimizer: Adam
  • learning rate: 0.001

And if we get the same precision and recall on the overall dataset, we update the model in modelforge.
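
In Keras terms, that training configuration would look roughly like the following; `build_bilstm`, `x_train`, and `y_train` are hypothetical stand-ins for the actual pipeline:

```python
from tensorflow.keras.optimizers import Adam

# Hyperparameters from the paper, as listed above.
EPOCHS = 10
SEQ_LEN = 40       # RNN sequence length
BATCH_SIZE = 512
LEARNING_RATE = 0.001

model = build_bilstm(seq_len=SEQ_LEN)  # hypothetical model constructor
model.compile(optimizer=Adam(lr=LEARNING_RATE),
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE)
```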
