This research project explores movie sentiment analysis using three distinct models: XGBoost, LightGBM, and BERT. The objective is to evaluate their performance in terms of accuracy, F1 Score, and the area under the ROC curve, shedding light on the most effective approach for sentiment classification.
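As a point of reference, the three evaluation metrics named above can be computed from true labels and predicted scores as in the minimal sketch below. It assumes binary labels (1 = positive, 0 = negative) and a 0.5 decision threshold; the function names and toy data are illustrative and are not taken from the project's code or results.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of reviews whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Probability that a random positive review outscores a random negative one.

    Ties count as half a win; this quantity equals the area under the ROC curve.
    The pairwise loop is quadratic, which is fine for a sketch but slow at scale.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only, not results from the project.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
y_pred = [int(s >= 0.5) for s in y_score]  # assumed 0.5 threshold

print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), roc_auc(y_true, y_score))
```

Accuracy and F1 depend on the chosen threshold, while the area under the ROC curve is computed from the raw scores and summarizes performance across all thresholds.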
The project addresses the challenge of accurately classifying sentiment in movie reviews, aiming to discern the strengths and weaknesses of XGBoost, LightGBM, and BERT models in this context.
The project uses the IMDB Dataset, a collection of 50K movie reviews, for training and testing the sentiment analysis models. Before training, the reviews are preprocessed, lemmatized, and vectorized to improve model performance. The dataset can be found here: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
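The preparation steps can be illustrated with a standard-library sketch of cleaning and bag-of-words vectorizing. This is an assumption about what such a pipeline might look like, not the project's actual code: the stop-word list, function names, and sample reviews are hypothetical, and lemmatization is left out because it needs an external library (for example NLTK's `WordNetLemmatizer`).

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to"}


def clean(review: str) -> List[str]:
    """Strip HTML tags, lowercase, and keep alphabetic tokens that are not stop words."""
    no_tags = re.sub(r"<[^>]+>", " ", review)  # raw reviews may contain tags such as <br />
    tokens = re.findall(r"[a-z]+", no_tags.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocabulary(token_lists: Iterable[List[str]]) -> Dict[str, int]:
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    return {token: i for i, token in enumerate(vocab)}


def vectorize(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Return a bag-of-words count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(tokens)
    return [counts.get(token, 0) for token in vocab]


# Hypothetical sample reviews, written for this example.
reviews = [
    "A great movie.<br /><br />Great acting!",
    "This was a boring, boring film.",
]
cleaned = [clean(r) for r in reviews]
vocab = build_vocabulary(cleaned)
vectors = [vectorize(tokens, vocab) for tokens in cleaned]

print(vocab)
print(vectors)
```

Count vectors like these are the kind of input tree-based models such as XGBoost and LightGBM consume; BERT instead uses its own subword tokenizer on the text.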
The study employs three models: XGBoost and LightGBM, both gradient-boosted decision tree libraries, and BERT, a pretrained transformer language model. By comparing their outcomes, the project seeks to identify the model that best discerns sentiment in movie reviews.
The models were trained on Google Colab, whose hosted Python environment and available GPUs were sufficient for training.
Three kinds of visuals summarize the results. Confusion matrices give a detailed breakdown of classification outcomes, and plots of the ROC curve and the area under it provide a comprehensive view of each model's ability to discriminate between sentiments. Additionally, training and validation loss charts illustrate the learning dynamics for movie sentiment analysis on the IMDB Dataset.
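For readers unfamiliar with confusion matrices, the sketch below shows how the four cells are counted for a binary sentiment task. It assumes labels 0 (negative) and 1 (positive), with rows as true labels and columns as predictions; the helper names and toy labels are illustrative and do not come from the project.

```python
from collections import Counter
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int]) -> List[List[int]]:
    """Return [[TN, FP], [FN, TP]]: rows are true labels, columns are predictions."""
    counts = Counter(zip(y_true, y_pred))
    return [
        [counts[(0, 0)], counts[(0, 1)]],
        [counts[(1, 0)], counts[(1, 1)]],
    ]


def format_matrix(matrix: List[List[int]]) -> str:
    """Render the matrix as a small text table for quick inspection."""
    header = f"{'':>14}{'pred negative':>15}{'pred positive':>15}"
    rows = [
        f"{name:>14}{row[0]:>15}{row[1]:>15}"
        for name, row in zip(("true negative", "true positive"), matrix)
    ]
    return "\n".join([header, *rows])


# Toy labels for illustration only, not results from the project.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0]

matrix = confusion_matrix(y_true, y_pred)
print(format_matrix(matrix))
```

The off-diagonal cells separate the two error types: negative reviews predicted as positive, and positive reviews predicted as negative.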
- Ahmad Bajwa
- Atharva Biyani
- Aditya Kumar
- Jack Le
- Pranava Ravindran
- Anish Nyalakonda (Research Lead)
- Dr. Doug Degroot (Faculty Advisor)