Merge pull request #868 from Varunshiyam/fixes-864

Playtime Classification with Logistic Regression
UppuluriKalyani · Nov 10, 2024 · 9c5be4c
2 parents af97f29 + aeae294
commit 9c5be4c
Showing 2 changed files with 299 additions and 0 deletions.
diff --git a/Prediction Models/Playtime_prediction/Readme.md b/Prediction Models/Playtime_prediction/Readme.md
@@ -0,0 +1,31 @@
+# Playtime Classification with Logistic Regression
+
+## Project Overview
+
+This project employs a logistic regression model to classify playtime data into predefined categories. By analyzing features such as session length, frequency of play, and user demographics, the model helps predict user engagement levels. This insight can be valuable for game developers and analysts to enhance user retention and engagement strategies.
+
+## Problem Statement
+
+Understanding player engagement patterns is crucial in the gaming industry to personalize experiences and optimize retention. This project addresses the need for a predictive model that can classify users based on playtime data, supporting data-driven decisions in game development and marketing.
+
+## Features
+
+The dataset contains several key attributes:
+- **Session Length**: Duration of each gaming session.
+- **Frequency of Play**: How often users play.
+- **Demographic Details**: Information such as age and region.
+
+## Methodology
+
+1. **Data Preprocessing**: Clean and preprocess data, including handling missing values and feature scaling.
+2. **Feature Engineering**: Select and engineer relevant features for the model.
+3. **Model Training**: Train a logistic regression model on the dataset.
+4. **Evaluation**: Evaluate model performance using accuracy, precision, recall, and F1-score.
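
The four metrics named in the Evaluation step can all be derived from the counts of a binary confusion matrix. The following is a minimal sketch in plain Python (the runtime this README lists), with no third-party dependencies. The function name `classification_metrics`, the choice of positive class, and the toy label vectors are illustrative assumptions and are not taken from the project; only the class names `Above5Years` and `Less5Years` come from the notebook in this PR.

```python
def classification_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, made up for illustration only.
y_true = ["Above5Years", "Above5Years", "Less5Years", "Less5Years", "Above5Years"]
y_pred = ["Above5Years", "Less5Years", "Less5Years", "Above5Years", "Above5Years"]

metrics = classification_metrics(y_true, y_pred, positive="Above5Years")
print(metrics)
```

Precision and recall depend on which class is treated as positive, so it helps to state that choice explicitly whenever these numbers are reported.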
+
+## Requirements
+
+- Python 3.x
+- pandas
+- numpy
+- scikit-learn
+
diff --git a/Prediction Models/Playtime_prediction/playtime-classification-with-logit-model.ipynb b/Prediction Models/Playtime_prediction/playtime-classification-with-logit-model.ipynb
@@ -0,0 +1,268 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "_execution_state": "idle",
+    "_uuid": "051d70d956493feee0c6d64651c6a088724dca2a"
+   },
+   "outputs": [],
+   "source": [
+    "library(tidyverse) # metapackage of all tidyverse packages\n",
+    "library(ggcorrplot)\n",
+    "library(grid) # package for arranging plots into a grid\n",
+    "library(gridExtra)\n",
+    "library(caret) # package for confusion matrix\n",
+    "library(boot) # package for K-fold CV and bootstrap"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset = read_csv(\"../input/performance-prediction/summary.csv\")\n",
+    "head(dataset,10)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "summary(dataset)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "str(dataset)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# EDA\n",
+    "\n",
+    "Let's use exploratory data analysis (EDA) to find the most representative variables first, and then construct a statistical model."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Data Wrangling"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "I will remove FreeThrowPercent, 3PointPercent, and FieldGoalPercent, since each can be computed from other columns of the dataset. I will also drop the Name column."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset$Target = factor(ifelse(dataset$Target==1,\"Above5Years\",\"Less5Years\")) # factor the Target column\n",
+    "dataset = dataset %>% select(-FreeThrowPercent,-`3PointPercent`,-FieldGoalPercent,-Name)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Boxplot\n",
+    "\n",
+    "Since these variables are on different scales, I will split them across several subplots instead of applying a log transformation to the y-axis."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "library(grid)\n",
+    "options(repr.plot.width = 20, repr.plot.height = 20)\n",
+    "plot_boxplot = function(columns=vector()){ #simple function to produce plots\n",
+    "    dataset %>% select(names(dataset[,columns]),Target) %>% \n",
+    "    gather(key = k1,value = VariableValue,-\"Target\") %>%\n",
+    "    ggplot(aes(y = VariableValue,x = Target,fill = Target)) +\n",
+    "    stat_boxplot(aes(fill = Target)) + facet_grid(.~k1) +\n",
+    "    theme(axis.text.x = element_blank(),\n",
+    "        axis.ticks.x = element_blank(),\n",
+    "        axis.title.x = element_blank(),\n",
+    "        strip.text.x = element_text(size = 10, colour = \"black\", angle = 0)) +\n",
+    "    scale_fill_manual(values = c('#e7298a','#66a61e'))\n",
+    "}\n",
+    "\n",
+    "box1 = plot_boxplot(columns = c(11:17))\n",
+    "box2 = plot_boxplot(columns = c(2,3))\n",
+    "box3 = plot_boxplot(columns = c(4,5))\n",
+    "box4 = plot_boxplot(columns = c(6:10))\n",
+    "\n",
+    "grid.arrange(arrangeGrob(box1,box2,box3,box4,ncol=2,nrow=2))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As common sense would suggest, older players apparently do perform better than younger ones on average."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Correlogram"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "options(repr.plot.width = 10, repr.plot.height = 10)\n",
+    "cnr = cor(dataset%>% select(-Target))\n",
+    "p.values = cor_pmat(dataset%>% select(-Target)) # p-values matrix\n",
+    "ggcorrplot(cnr, hc.order = TRUE, type = \"lower\",\n",
+    "   outline.col = \"black\",\n",
+    "   ggtheme = ggplot2::theme_gray,\n",
+    "   colors = c(\"#6D9EC1\", \"white\", \"#E46726\"),p.mat=p.values,lab = TRUE)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As you can see, some pairs of variables are not significantly correlated because their p-values are too high, as marked by the X. With that in mind, I'm going to remove 3PointMade and 3PointAttempt, since they are not highly correlated with any other variables."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset = dataset %>% select(-`3PointMade`,-`3PointAttempt`)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Model"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Basic Logit Model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train.rows = sample(nrow(dataset), size = floor(nrow(dataset)*0.9)) # random 90% of the row indices\n",
+    "dataset_train = dataset[train.rows,]\n",
+    "dataset_test = dataset[-train.rows,]\n",
+    "\n",
+    "logit.fit = glm(Target~., data=dataset_train,family='binomial')\n",
+    "summary(logit.fit)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "logit.probs = predict(logit.fit,newdata = dataset_test,type = 'response') # P(Less5Years): glm models the second factor level\n",
+    "class.pred = factor(ifelse(logit.probs>0.5,\"Less5Years\",\"Above5Years\"),levels = levels(dataset_test$Target))\n",
+    "confusionMatrix(class.pred,dataset_test$Target)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "So the model is not that good and could use some improvement."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## K-fold Cross-Validation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train.control = trainControl(method = \"cv\", number = 15)\n",
+    "logit.fitKfold = train(Target ~., data = dataset_train, method = \"glm\",\n",
+    "                       trControl = train.control)\n",
+    "logit.fitKfold"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class.Kfold = predict(logit.fitKfold, newdata = dataset_test, type = \"raw\") # type = \"raw\" returns class labels\n",
+    "confusionMatrix(class.Kfold,dataset_test$Target)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "It appears that the model has improved on the test set.\n",
+    "If there are any errors, please let me know."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "R",
+   "language": "R",
+   "name": "ir"
+  },
+  "language_info": {
+   "codemirror_mode": "r",
+   "file_extension": ".r",
+   "mimetype": "text/x-r-source",
+   "name": "R",
+   "pygments_lexer": "r"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
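
The 90/10 hold-out split in the notebook's Basic Logit Model cell relies on drawing a random subset of row indices for training and keeping the remaining rows for testing. The idea can be sketched in plain Python as follows. The function name `train_test_split_indices`, the seed, and the row count of 20 are hypothetical and chosen only for illustration; this is not the notebook's R code.

```python
import random


def train_test_split_indices(n_rows, train_fraction=0.9, seed=None):
    """Return two disjoint lists of row indices: (train, test)."""
    rng = random.Random(seed)      # local generator, so the global state is untouched
    indices = list(range(n_rows))
    rng.shuffle(indices)           # shuffle ALL rows before cutting
    n_train = int(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]


# Hypothetical dataset of 20 rows.
train_idx, test_idx = train_test_split_indices(20, train_fraction=0.9, seed=42)
print(len(train_idx), len(test_idx))
```

Shuffling every index before cutting matters: permuting only the first 90% of positions would leave the test set fixed as the last 10% of rows, which can bias the evaluation if the file is sorted.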