Here are some of my notable projects:
-
Clustering Spotify Podcasts with NLP-Driven Insights
- Scraped $\approx$ 284,481 episode details from 818 podcasts using a Selenium and Spotify API pipeline.
- Preprocessed and tokenized podcast descriptions with NLTK, including lemmatization and stopword removal.
- Developed metrics to quantify directional, overlap, and diversity similarities, and engineered a recommendation system.
- Deployed a Dash app for podcast clustering and personalized recommendations.
-
Predicting Flight Delays and Cancellations: An Integrated Analysis of Airport Data and Weather Data
- Automated scraping of 23 GB of airport data and 30 GB of weather data using Selenium.
- Joined the datasets using reverse geocoding, Haversine-based station matching, and UTC-normalized timestamp alignment.
- Trained random forest models, achieving $\approx$ 25 min test RMSE for delays and $\approx$ 98% test accuracy for cancellations.
- Developed scalable workflows on Google Cloud and deployed an interactive web app using Dash.
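
The Haversine-based matching mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the station tuples, and the coordinates are all hypothetical, and the Earth radius is an assumed mean value.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(airport, stations):
    """Return the (name, lat, lon) station closest to the airport's (lat, lon)."""
    return min(stations, key=lambda s: haversine_km(airport[0], airport[1], s[1], s[2]))


# Hypothetical example: names and coordinates are illustrative only.
stations = [("station_a", 43.14, -89.35), ("station_b", 41.98, -87.90)]
print(nearest_station((43.07, -89.40), stations)[0])  # station_a
```

A linear scan like this is fine for a handful of stations; for many airports and stations, a spatial index would likely be a better fit.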
-
Modeling and Forecasting Walmart Stock Prices: A Comparative Analysis of ARMA and GARCH Approaches
- Leveraged ARIMA and GARCH models using tseries and fGarch in R to analyze Walmart stock price volatility.
- Developed and validated ensemble models through residual diagnostics and forecast evaluation.
- Achieved RMSE of $\approx$ 0.01 for log-returns and $\approx$ \$1.17 for closing prices, comparing 10-day forecasts against unseen actual prices.
-
Statistical Modeling and Deployment of Body Fat Percentage Prediction System
- Implemented an anomaly detection and imputation strategy using a prior body fat estimation model.
- Constructed a stepwise regression model with goodness-of-fit and Holm-Bonferroni-corrected F-tests to control Type I errors.
- Developed a multiple linear regression model (R-squared 0.6592 and RMSE 4.38) with residual diagnostics.
- Deployed an interactive Dash app with comprehensive explanations, detailed visuals, and predictions.
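
For readers unfamiliar with the Holm-Bonferroni correction mentioned above, here is a minimal sketch of the step-down procedure. It is a generic illustration with made-up p-values, not the analysis code from this project; the function name `holm_bonferroni` is hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Step-down threshold: alpha / (m - rank), with rank 0 for the smallest p-value.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Illustrative p-values only.
print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))  # [True, False, False, True]
```

The procedure controls the family-wise Type I error rate at `alpha` while being less conservative than a plain Bonferroni correction.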
-
Outlier Detection and De-noising for Audio-Based Neural Network Language Classification
- Developed an ensemble outlier detector using Isolation Forest and Local Outlier Factor in Scikit-Learn.
- Designed a speech detection and spectral gating algorithm with Librosa and NoiseReduce for noise suppression.
- Processed 1320 parallel jobs for anomaly detection and de-noising using Linux Bash and HTCondor.
- Trained a preliminary CNN on a random sample in TensorFlow, improving test accuracy by 2% and AUC by 4%.
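
An ensemble of Isolation Forest and Local Outlier Factor could be combined along these lines. This is a sketch under stated assumptions: the agreement rule (flag a point only when both detectors flag it), the default hyper-parameters, and the synthetic 2-D data are all assumptions for illustration, and the actual project worked on audio features rather than random points.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor


def ensemble_outliers(X, random_state=0):
    """Flag rows that both detectors label as outliers (assumed agreement rule)."""
    iso_labels = IsolationForest(random_state=random_state).fit_predict(X)
    lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
    # Both estimators return -1 for outliers and 1 for inliers.
    return (iso_labels == -1) & (lof_labels == -1)


# Synthetic data: a Gaussian cloud plus one obviously distant point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[10.0, 10.0]]])
flags = ensemble_outliers(X)
print(flags[-1], int(flags.sum()))
```

Requiring agreement trades recall for precision; a union of the two detectors would be the more aggressive alternative.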
-
Finding Lyman Break Galaxy cB58 Resemblances Using High-Throughput Computing
- Identified noisy spectra matching Lyman Break Galaxy cB58 from Sloan Digital Sky Survey (SDSS) datasets.
- Computed similarity between spectra using distance metrics implemented in R.
- Processed 2459 parallel jobs over 281 GB of data using Linux Bash and HTCondor.
-
Supervised Machine Learning Model to Predict Gender Based on First Names
- Performed feature pre-processing and one-hot alphabet encoding using NumPy and Pandas.
- Implemented an ensemble gradient boosting model with fine-tuned hyper-parameters using Scikit-Learn.
- Achieved approximately 80% in accuracy, precision, recall, and AUC metrics.
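
One way to read "one-hot alphabet encoding" is a per-position one-hot vector over the 26 letters, flattened into a fixed-length feature row. The sketch below assumes that reading; the `max_len` cutoff, the lowercase a-z alphabet, and the function name `one_hot_name` are hypothetical choices, not necessarily what the project used.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_name(name, max_len=10):
    """Encode a name as a flattened (max_len x 26) one-hot matrix."""
    encoded = np.zeros((max_len, len(ALPHABET)), dtype=np.uint8)
    for pos, ch in enumerate(name.lower()[:max_len]):
        idx = CHAR_TO_INDEX.get(ch)
        if idx is not None:  # characters outside a-z leave an all-zero row
            encoded[pos, idx] = 1
    return encoded.ravel()


# Illustrative names only.
features = np.vstack([one_hot_name(n) for n in ["Ana", "Bob"]])
print(features.shape)  # (2, 260)
```

The resulting matrix can be passed directly to a Scikit-Learn estimator such as a gradient boosting classifier.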