This repository contains state-of-the-art language models and a classifier for the Tamil language, which is spoken in India, Sri Lanka, Malaysia and Singapore.
The models trained here have been used in the Natural Language Toolkit for Indic Languages (iNLTK).
- iNLTK Headlines Corpus - Tamil: uses the Tamil News Dataset prepared above.
Architecture/Dataset | Tamil Wikipedia Articles | Vocab size |
---|---|---|
ULMFiT | 19.80 | 8k |
TransformerXL | 18.91 | 8k |
TransformerXL | 17.22 | 16k |

Dataset | Accuracy | MCC | Notebook to Reproduce results |
---|---|---|---|
iNLTK Headlines Corpus - Tamil | 95.22 | 92.70 | Link |

Architecture | Vocab Size | Visualization |
---|---|---|
ULMFiT | 8k | Embeddings projection |
TransformerXL | 8k | Embeddings projection |
TransformerXL | 16k | Embeddings projection |

Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
---|---|---|---|---|
iNLTK Headlines Corpus - Tamil | (5346, 669, 669) | 95.22 | 92.70 | Link |

Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
---|---|---|---|---|
iNLTK Headlines Corpus - Tamil | (267, 669, 669) | 86.25 | 79.42 | Link |

Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
---|---|---|---|---|
iNLTK Headlines Corpus - Tamil | (267, 669, 669) | 89.84 | 84.63 | Link |
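
The Accuracy and MCC columns in the tables above can be computed from a classifier's predictions. The following is a minimal sketch, not the code from the linked notebooks: it assumes MCC stands for the multiclass Matthews correlation coefficient and that both metrics are reported multiplied by 100. The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def matthews_corrcoef(y_true, y_pred):
    """Multiclass Matthews correlation coefficient (Gorodkin's R_K form)."""
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly classified
    t_counts = Counter(y_true)                        # occurrences of each true class
    p_counts = Counter(y_pred)                        # occurrences of each predicted class
    labels = set(t_counts) | set(p_counts)
    cov_tp = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    cov_tt = s * s - sum(v * v for v in t_counts.values())
    cov_pp = s * s - sum(v * v for v in p_counts.values())
    denom = math.sqrt(cov_tt * cov_pp)
    # Return 0.0 when the coefficient is undefined (only one class present).
    return cov_tp / denom if denom else 0.0


# Toy labels for illustration only; they are not taken from the corpus.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

print(f"Accuracy: {100 * accuracy(y_true, y_pred):.2f}")
print(f"MCC: {100 * matthews_corrcoef(y_true, y_pred):.2f}")
```

MCC takes the full confusion matrix into account, so it tends to be lower than accuracy when errors are concentrated in particular classes, which is consistent with the gap between the two columns above.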
- Download the pretrained ULMFiT LM with 8k vocab from here
- Download the pretrained TransformerXL LM with 8k vocab from here
- Download the pretrained TransformerXL LM with 16k vocab from here

The tokenizer was trained using Google's sentencepiece.
Download the trained model and vocabulary from here
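
As a rough illustration of how a sentencepiece tokenizer of this kind could be trained and used, here is a minimal sketch. It is not the script used in this repository: the file names, the helper functions, and the choice to leave all other trainer options at their defaults are assumptions. The vocabulary sizes 8000 and 16000 correspond to the 8k and 16k variants listed above. The `sentencepiece` package is third-party and must be installed separately.

```python
def build_train_args(corpus_path, model_prefix, vocab_size):
    """Build the flag string passed to SentencePieceTrainer.Train."""
    return (
        f"--input={corpus_path} "
        f"--model_prefix={model_prefix} "
        f"--vocab_size={vocab_size}"
    )


def train_tokenizer(corpus_path, model_prefix, vocab_size):
    """Train a tokenizer; writes <model_prefix>.model and <model_prefix>.vocab."""
    import sentencepiece as spm  # third-party: pip install sentencepiece

    spm.SentencePieceTrainer.Train(
        build_train_args(corpus_path, model_prefix, vocab_size)
    )


def tokenize(model_path, text):
    """Split text into subword pieces with a trained model file."""
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.Load(model_path)
    return sp.EncodeAsPieces(text)


# Example usage with hypothetical paths (one sentence per line in the corpus):
#   train_tokenizer("tamil_corpus.txt", "tamil_lm_8k", 8000)
#   pieces = tokenize("tamil_lm_8k.model", "தமிழ் ஒரு மொழி")
```

The `tokenize` helper should also work with the downloaded model file in place of a freshly trained one, provided the path points to the `.model` file.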