Fine-tune crosslingual model for language detection #30
Comments
I'd get started on this |
Awesome, I've assigned you to this project. Let's keep track of progress here. |
Alright, sure :) |
Hi Artit @artitw I need your help, please; I'm facing some roadblocks. I decided to start with the second approach you suggested, which is fine-tuning the cross-lingual translator with a softmax output. My thought process for this:
Also, does this sound like the right track for approach number 2? |
Very much appreciate the updates on this. The dataset you cite looks appropriate; I suggest filtering for the languages which the pretrained model supports for tokenization. Your ideas on the second approach seem fine so far. Yes, you are correct that you would have to write another module to finetune with a softmax output. I expect this second approach to be more challenging for this reason. If it helps, consider taking the first approach to get things working and then come back to the second approach to get better performance. |
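For readers following along, the "softmax output" mentioned for the second approach can be sketched roughly as below. This is only an illustration under stated assumptions, not the module discussed in this thread: the names `softmax`, `classify`, and `cross_entropy`, the embedding size of 8, and the three language classes are all hypothetical, and the encoder being fine-tuned is replaced by random vectors.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def classify(embeddings, weights, bias):
    # Linear layer followed by softmax: one probability per language.
    return softmax(embeddings @ weights + bias)


def cross_entropy(probs, labels):
    # Mean negative log-probability assigned to the correct language.
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))  # stand-in for encoder outputs (assumption)
labels = np.array([0, 1, 2, 0])       # hypothetical language ids
weights = np.zeros((8, 3))            # untrained head: every class equally likely
bias = np.zeros(3)

probs = classify(embeddings, weights, bias)
loss = cross_entropy(probs, labels)
```

In actual fine-tuning, this loss would be backpropagated through both the head and the pretrained encoder, which is the extra module work mentioned above.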
Alright then, I'll get started with approach 1 |
Hi Art @artitw Here is the link to the notebook: https://colab.research.google.com/drive/1VxRRURRAaXBZFsYsXC5hSTkSc-4TGdOj?usp=sharing Based on the last discussion:
I would add more embeddings and retrain the models to see how performance improves. What do you think about it? Thank you :) |
Hi @Mofetoluwa Thanks for all the work and the summary. It looks like the MLP model is best performing. Would you be able to add it to the repo? It would be great if language classification is available for use while we continue improving it with finetuning and other methods. |
Hi @artitw Alright then :) So just to clarify, how would we want to use the model on the repo? Aside from pushing the saved model, are we also creating a module for it to be called e.g |
It would be awesome to create an |
Alright, I'll add the model to the repo first. In which of the folders should I put it? |
Can we store the model in somewhere like Google Drive and only download it when the Identifier is used? This approach would follow the existing convention to keep the core library lightweight. |
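The download-on-first-use idea can be sketched as follows. This is a minimal sketch and not the repository's actual code: the class name `LazyModelFile`, the URL, and the cache path are hypothetical, and a direct-download URL is assumed (a Google Drive share link may need extra handling).

```python
import os
import urllib.request


class LazyModelFile:
    """Fetches a model file the first time it is needed, then reuses the local copy."""

    def __init__(self, url, cache_path, fetch=urllib.request.urlretrieve):
        self.url = url                # hypothetical hosting location
        self.cache_path = cache_path  # where the file is kept locally
        self.fetch = fetch            # injectable so the download can be stubbed in tests

    def path(self):
        # Download only if no cached copy exists yet.
        if not os.path.exists(self.cache_path):
            os.makedirs(os.path.dirname(self.cache_path) or ".", exist_ok=True)
            self.fetch(self.url, self.cache_path)
        return self.cache_path


# Usage (nothing is downloaded until path() is called):
# model_file = LazyModelFile("https://example.com/identifier.bin",
#                            os.path.expanduser("~/.cache/demo/identifier.bin"))
# local_path = model_file.path()
```

Because nothing happens at import time, the core library stays lightweight and only users of the identifier pay the download cost.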
Alright then :) |
Hi @artitw My sincere apologies that the updates are only coming in now; I wanted to have done some work on the
Thank you :) |
@Mofetoluwa thanks for the updates and the pull request. I added some comments there. With regards to the third point you raise, when I tested the model, it returned "hy" for "hello" and "ja" for "你好!". Is this consistent with your testing as well? |
Yeah, it's a problem I noticed with most languages. I believe approach 2 would resolve this? Another thing could be to generate shorter texts for this approach. What do you think? |
I think training with shorter texts and approach 2 would address the issue. Another approach is to use 2D embeddings. Currently we are using 1D embeddings, which are calculated by averaging the last layer outputs, but we can use the last layer outputs directly as 2D embeddings. I also just realized from adding the 2D embeddings option in the latest release that the last layer averaging could be improved by removing the padding from the calculations. In other words, I think it might be helpful to re-train the MLP identification model on the latest release. |
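To illustrate the padding point, here is a rough sketch of averaging last-layer outputs with and without padded positions. It is not the library's implementation: the function name `mean_pool`, the array shapes, and the toy values are assumptions made for the example.

```python
import numpy as np


def mean_pool(last_hidden, attention_mask, ignore_padding=True):
    """Average token vectors into one vector per text.

    last_hidden:    (batch, seq_len, hidden) last-layer outputs (the "2D" form per text)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    last_hidden = np.asarray(last_hidden, dtype=float)
    if not ignore_padding:
        # Naive average: padded positions are counted like real tokens.
        return last_hidden.mean(axis=1)
    mask = np.asarray(attention_mask, dtype=float)[:, :, None]
    summed = (last_hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


# One text with two real tokens followed by two padded positions (toy values).
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]])
mask = np.array([[1, 1, 0, 0]])

naive = mean_pool(hidden, mask, ignore_padding=False)
masked = mean_pool(hidden, mask)
```

With the naive average, a short text padded to a long sequence gets an embedding dominated by padding, which is one plausible reason very short inputs such as "hello" are misclassified.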
Hi @artitw Oh that sounds great. So how can the 2D embeddings be gotten? Is it still by using the vectorize() function? |
@Mofetoluwa yes, we can do Also note that the default 1D output should be improved now compared to the version you used most recently. |
@artitw Oh alright. So should we do a comparison of both? Also, adding shorter texts did not really improve the performance of the model. The F1 score and accuracy dropped to ~0.66. |
Yes, a comparison of both would be useful. Thanks so much for checking the shorter texts. It will help to confirm the fix for the way 1D embeddings are calculated. |
Hi Mofe,
|
Hi Art,
|
Could we also add the |
@Mofetoluwa, what do you think about using the TFIDF embeddings to perform the language prediction? I think that might be better than the neural embeddings currently used, as it won't have the length dependency. |
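As a rough illustration of the TFIDF idea, the sketch below builds character-bigram TF-IDF vectors in plain Python and picks the language whose reference text is most similar. It is not the library's TFIDF embedding: the helper names, the two toy reference texts, and the choice of bigrams with cosine similarity are all assumptions. The vectors are L2-normalized, which is one way such features can avoid depending on text length.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    # Pad with spaces so word boundaries become part of the n-grams.
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_idf(docs, n=2):
    # Smoothed inverse document frequency over the reference texts.
    df = Counter()
    for doc in docs:
        df.update(set(char_ngrams(doc, n)))
    total = len(docs)
    return {g: math.log((1 + total) / (1 + c)) + 1.0 for g, c in df.items()}


def tfidf(text, idf, n=2):
    counts = Counter(g for g in char_ngrams(text, n) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}  # unit length


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(v * b.get(g, 0.0) for g, v in a.items())


# Toy reference texts (assumption); real training data would be far larger.
reference = {
    "en": "hello how are you today my friend",
    "fr": "bonjour comment allez vous aujourd'hui mon ami",
}
idf = fit_idf(list(reference.values()))
profiles = {lang: tfidf(text, idf) for lang, text in reference.items()}


def predict(text):
    vec = tfidf(text, idf)
    return max(profiles, key=lambda lang: cosine(vec, profiles[lang]))
```

Calling `predict("hello")` or `predict("bonjour")` compares a single short word against each profile without any padding or averaging step.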
Hi Art @artitw sure that should work actually... I'll try it out and let you know how it goes. I hope you're doing great :) |
great, thanks so much Mofe. Really looking forward to it |
Hi Mofe, in the latest release I fixed an issue with TFIDF embeddings so that they now output a consistent embedding size. Hope this helps |
Hi Art, Alright that's cool :)... I'll work with it and let you know how it goes soon |
Two approaches to try: