- amazon_ml_preprocessing.ipynb : code to preprocess text
- amazon_ml_translation_csv.ipynb : code to translate non-English text
- amazon_ml_mode.ipynb : code to create a submission file from multiple submission files using the mode technique
- amazon_ml_training.ipynb : code to train on embeddings and predict on test embeddings
- amazon_ml_embeddings.ipynb : code to generate embeddings from the csv file
- submission_top-score.csv : submission file with the top score [Accuracy : 66.85]
- Competition : Multi-Class Text Classification
- Host : HackerEarth
- Metric : Accuracy
- Duration of Competition : 2 days 23 hrs 59 min
- Check out the competition here
- Key column – PRODUCT_ID
- Input features – TITLE, DESCRIPTION, BULLET_POINTS, BRAND
- Target column – BROWSE_NODE_ID
- Train dataset size – 2,903,024
- Number of classes in Train – 9,919
- Overall Test dataset size – 110,775
- re
- langdetect
- deep-translator
- Removed special characters and emojis using re.
- Translated non-English text to English using langdetect and deep-translator.
- Removed stop-words.
- Decontracted (expanded) some of the contracted words, e.g. "can't" → "can not".
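The cleaning steps above can be sketched in pure Python with `re`; the contraction map and stop-word set below are small illustrative subsets, not the full lists used in the notebooks, and the translation step (langdetect + deep-translator) is omitted because it requires network access.

```python
import re

# Illustrative subset of the contraction map (assumption, not the notebook's full list)
CONTRACTIONS = {"won't": "will not", "can't": "can not", "n't": " not",
                "'re": " are", "'s": " is", "'ll": " will", "'ve": " have"}

# Illustrative subset of the stop-word list
STOP_WORDS = {"a", "an", "the", "is", "and", "of", "to", "in"}

def clean_text(text: str) -> str:
    text = text.lower()
    # Expand contractions before stripping punctuation, so apostrophes are still present
    for pattern, replacement in CONTRACTIONS.items():
        text = text.replace(pattern, replacement)
    # Remove special characters and emojis: keep only letters, digits, whitespace
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Drop stop-words and collapse whitespace
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)
```

For example, `clean_text("Can't stop the Music! 🎵")` yields `"can not stop music"`.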
- sentence-transformers
- RAPIDS
- First the text is converted to embeddings using pre-trained models such as paraphrase-mpnet-base-v2, paraphrase-MiniLM-L6-v2, and paraphrase-MiniLM-L3-v2.
- Dimension of Embeddings : 384
- Embeddings of training data are fed into the KNNClassifier present in the cuML library.
- The trained KNNClassifier is then used to predict on test embeddings.
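A minimal sketch of the embed-then-classify step. scikit-learn's `KNeighborsClassifier` is used here as a CPU stand-in for cuML's GPU classifier (the `fit`/`predict` API is near-identical), and random vectors stand in for the real 384-dim sentence-transformer embeddings.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier  # stand-in for cuml.neighbors.KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-ins for the 384-dim embeddings of the training texts and toy BROWSE_NODE_ID labels
X_train = rng.normal(size=(1000, 384)).astype("float32")
y_train = rng.integers(0, 10, size=1000)

# Stand-in for test embeddings (a few training rows reused for illustration)
X_test = X_train[:5]

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
preds = knn.predict(X_test)  # one predicted class id per test row
```

On the real data the same two calls run on GPU via cuML, with `X_train`/`X_test` produced by `SentenceTransformer.encode` over the preprocessed text.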
- We also used the mode technique, i.e. taking the most frequently predicted label across different experiments.
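The mode technique can be sketched with pandas; the toy Series below stand in for BROWSE_NODE_ID predictions read from three different submission files.

```python
import pandas as pd

# Stand-ins for predictions from three experiments (same row order in each file)
subs = [
    pd.Series([1, 2, 3, 4]),
    pd.Series([1, 2, 5, 4]),
    pd.Series([1, 7, 3, 9]),
]

# One column per experiment, one row per test product
stacked = pd.concat(subs, axis=1)

# mode(axis=1) returns a DataFrame of per-row modes; column 0 is the first
# (smallest) mode, which also breaks ties deterministically
final = stacked.mode(axis=1)[0].astype(int)  # → [1, 2, 3, 4]
```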
- Cross-validation is also used to train the KNNClassifier.
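A minimal cross-validation sketch on toy data, again with scikit-learn's `KNeighborsClassifier` standing in for cuML's; `StratifiedKFold` keeps the class balance similar in every fold, which matters with many imbalanced classes.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16)).astype("float32")  # toy embeddings
y = rng.integers(0, 4, size=200)                  # toy class labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in skf.split(X, y):
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_idx], y[train_idx])
    # Fold accuracy on the held-out split
    scores.append(knn.score(X[val_idx], y[val_idx]))
```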
- Used NearestNeighbour, SVM, and RandomForest classifiers, but the results were not better than the KNNClassifier's.
- Tried different embedding sizes (384, 768).
- Combined TITLE + DESCRIPTION as well as TITLE + DESCRIPTION + BULLET_POINTS as the input text.
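Combining the input columns into a single text field can be sketched with pandas (the rows below are made-up examples; `fillna("")` guards against the missing DESCRIPTION/BULLET_POINTS values in the real data):

```python
import pandas as pd

df = pd.DataFrame({
    "TITLE": ["Wireless Mouse", "Steel Bottle"],
    "DESCRIPTION": ["2.4 GHz optical mouse", "1 L insulated flask"],
    "BULLET_POINTS": ["ergonomic; silent clicks", "leak proof; BPA free"],
})

# TITLE + DESCRIPTION + BULLET_POINTS variant; drop the last term for the
# TITLE + DESCRIPTION variant
df["text"] = (df["TITLE"].fillna("") + " "
              + df["DESCRIPTION"].fillna("") + " "
              + df["BULLET_POINTS"].fillna("")).str.strip()
```

The combined `text` column is what gets cleaned and encoded into embeddings.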