Sparten-Ashvinee/LangID

Indian Multilanguage classification

This work is part of the research project titled “An Architecture of Machine Translation for Text Analysis and Speech to Sign Language”.

The objective was to build a machine learning classifier that categorizes text across multiple languages spoken in India. Given India’s linguistic diversity, the task involves handling several Indian languages, each with its own script, grammar, vocabulary, and cultural context.

Dataset

The dataset was collected from scratch and includes text in the following languages:

  • Hindi
  • Bengali
  • Tamil
  • Telugu
  • Urdu

Indian multilanguage classification presents unique challenges due to the variety of scripts used by different languages (e.g., Devanagari for Hindi and Tamil script for Tamil).
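To make the script variety concrete, here is a minimal sketch that guesses the dominant script of a string from Unicode character names. It is not taken from this repository: the function name `dominant_script` and the `SCRIPT_TO_LANGUAGE` mapping are illustrative assumptions, and a script is only a hint about the language (Devanagari and the Arabic script are each shared by several languages).

```python
import unicodedata
from collections import Counter

# Assumed mapping from Unicode script prefix to the dataset language that
# uses it. Illustrative only: other languages share some of these scripts.
SCRIPT_TO_LANGUAGE = {
    "DEVANAGARI": "Hindi",
    "BENGALI": "Bengali",
    "TAMIL": "Tamil",
    "TELUGU": "Telugu",
    "ARABIC": "Urdu",
}


def dominant_script(text):
    """Return the most frequent known script prefix in text, or None."""
    counts = Counter()
    for ch in text:
        # Unicode names look like "DEVANAGARI LETTER NA"; the first word
        # names the script. Unnamed characters yield an empty string.
        script = unicodedata.name(ch, "").split(" ")[0]
        if script in SCRIPT_TO_LANGUAGE:
            counts[script] += 1
    return counts.most_common(1)[0][0] if counts else None


for sample in ["नमस्ते", "வணக்கம்", "سلام", "hello"]:
    script = dominant_script(sample)
    print(sample, script, SCRIPT_TO_LANGUAGE.get(script))
```

A check like this could serve as a cheap preprocessing feature, but it cannot separate languages that share a script, which is where a trained classifier is needed.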

Model

We experimented with the following models:

  • K-Nearest Neighbors (KNN)
  • Support Vector Machine (SVM)
  • Naive Bayes (NB)

An LSTM model was also explored but was not included in the comparison.
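As an illustration of the Naive Bayes approach, the following is a minimal sketch of a multinomial Naive Bayes classifier over character bigrams, written with the standard library only. The source does not describe the features, labels, or libraries actually used, so everything here is an assumption: the class name `CharNgramNaiveBayes`, the bigram features, the Laplace smoothing, the use of language names as labels, and the toy sentences, which were made up for this example and are not from the dataset.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams, padded with spaces."""
    padded = f" {text.strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n                      # n-gram length
        self.alpha = alpha              # Laplace smoothing constant
        self.class_counts = Counter()   # documents seen per label
        self.feature_counts = defaultdict(Counter)  # n-gram counts per label
        self.vocab = set()              # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.class_counts[label] += 1
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            total = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for gram in char_ngrams(text, self.n):
                score += math.log(
                    (counts[gram] + self.alpha)
                    / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data invented for this sketch (not the project's dataset).
TEXTS = [
    "यह एक किताब है", "मेरा नाम राम है",
    "இது ஒரு புத்தகம்", "என் பெயர் ராம்",
    "এটি একটি বই", "আমার নাম রাম",
]
LABELS = ["Hindi", "Hindi", "Tamil", "Tamil", "Bengali", "Bengali"]

model = CharNgramNaiveBayes().fit(TEXTS, LABELS)
print(model.predict("यह मेरा घर है"))
```

Character n-grams are a common choice for multilingual text because they need no language-specific tokenizer. A KNN or SVM model could be trained on the same features for a like-for-like comparison.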

Results

The table below compares the accuracy of the models in classifying Indian multilingual text. Naive Bayes achieved the highest accuracy (87.18%), ahead of KNN (78.65%) and SVM (78.03%).

| Model | Accuracy |
|---|---|
| Naive Bayes (NB) | 87.18% |
| Support Vector Machine (SVM) | 78.03% |
| K-Nearest Neighbors (KNN) | 78.65% |
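For reference, the sketch below shows how an accuracy figure of this kind is typically computed. It assumes the reported numbers are plain classification accuracy (correct predictions divided by total predictions), which the source does not state explicitly. The `accuracy` helper and the sample labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels, for illustration only.
y_true = ["Hindi", "Tamil", "Urdu", "Telugu"]
y_pred = ["Hindi", "Tamil", "Hindi", "Telugu"]
print(f"{accuracy(y_true, y_pred):.2%}")
```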

Deployment

We are developing a web inference system for testing, available at Ashvinee.xyz. It uses Flask as the web framework and MLflow to manage the machine learning lifecycle and simplify deployment.
