Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #868 from Varunshiyam/fixes-864
Playtime Classification with Logistic Regression
Showing 2 changed files with 299 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# Playtime Classification with Logistic Regression | ||
|
||
## Project Overview | ||
|
||
This project employs a logistic regression model to classify playtime data into predefined categories. By analyzing features such as session length, frequency of play, and user demographics, the model helps predict user engagement levels. This insight can be valuable for game developers and analysts to enhance user retention and engagement strategies. | ||
|
||
## Problem Statement | ||
|
||
Understanding player engagement patterns is crucial in the gaming industry to personalize experiences and optimize retention. This project addresses the need for a predictive model that can classify users based on playtime data, supporting data-driven decisions in game development and marketing. | ||
|
||
## Features | ||
|
||
The dataset contains several key attributes: | ||
- **Session Length**: Duration of each gaming session. | ||
- **Frequency of Play**: How often users play. | ||
- **Demographic Details**: Information such as age and region. | ||
|
||
## Methodology | ||
|
||
1. **Data Preprocessing**: Clean and preprocess data, including handling missing values and feature scaling. | ||
2. **Feature Engineering**: Select and engineer relevant features for the model. | ||
3. **Model Training**: Train a logistic regression model on the dataset. | ||
4. **Evaluation**: Evaluate model performance using accuracy, precision, recall, and F1-score. | ||
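
The preprocessing, training, and evaluation steps above can be sketched in plain Python. This is only an illustrative sketch, not the project's implementation: the toy data, the function names (`standardize`, `train_logistic`, `classification_metrics`), the learning rate, and the epoch count are all assumptions made for the example, and the model is fit with simple batch gradient descent rather than a library solver.

```python
import math


def standardize(rows):
    """Step 1: scale every feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(X, weights, bias):
    """Probability of the positive class (1) for every row of X."""
    return [sigmoid(sum(w * x for w, x in zip(weights, row)) + bias) for row in X]


def train_logistic(X, y, learning_rate=0.5, epochs=2000):
    """Step 3: fit weights and bias by batch gradient descent on the log-loss."""
    weights, bias, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        # Errors are computed once per epoch, before any parameter is updated.
        errors = [p - t for p, t in zip(predict_proba(X, weights, bias), y)]
        for j in range(len(weights)):
            grad_j = sum(e * row[j] for e, row in zip(errors, X)) / n
            weights[j] -= learning_rate * grad_j
        bias -= learning_rate * sum(errors) / n
    return weights, bias


def classification_metrics(y_true, y_pred):
    """Step 4: accuracy, precision, recall and F1 for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical toy data: [session length in minutes, sessions per week].
X_raw = [[10, 1], [15, 2], [20, 1], [25, 2], [60, 5], [75, 6], [90, 7], [120, 6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = high engagement (the label is an assumption)

X = standardize(X_raw)
weights, bias = train_logistic(X, y)
y_pred = [1 if p >= 0.5 else 0 for p in predict_proba(X, weights, bias)]
print(classification_metrics(y, y_pred))
```

In a real project the hand-written pieces would more likely be replaced by scikit-learn's `StandardScaler` and `LogisticRegression`, and the metrics would be computed on held-out data rather than on the training rows used here.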
|
||
## Requirements | ||
|
||
- Python 3.x | ||
- pandas | ||
- numpy | ||
- scikit-learn | ||
|
268 changes: 268 additions & 0 deletions
Prediction Models/Playtime_prediction/playtime-classification-with-logit-model.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,268 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"_execution_state": "idle", | ||
"_uuid": "051d70d956493feee0c6d64651c6a088724dca2a" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"library(tidyverse) # metapackage of all tidyverse packages\n", | ||
"library(ggcorrplot) # package for correlograms\n", | ||
"library(grid) # package for arranging plots into a grid\n", | ||
"library(gridExtra) # provides grid.arrange and arrangeGrob\n", | ||
"library(caret) # package for confusion matrix\n", | ||
"library(boot) # package for K-fold CV and bootstrap" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"dataset = read_csv(\"../input/performance-prediction/summary.csv\")\n", | ||
"head(dataset,10)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"summary(dataset)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"str(dataset)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# EDA\n", | ||
"\n", | ||
"Let's find the most representative variables through exploratory data analysis (EDA) first and then build a statistical model." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Data Wrangling" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"I will remove FreeThrowPercent, 3PointPercent, and FieldGoalPercent, since they are derived from other columns of the dataset, and also the Name column." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"dataset$Target = factor(ifelse(dataset$Target==1,\"Above5Years\",\"Less5Years\")) # factor the Target column\n", | ||
"dataset = dataset %>% select(-FreeThrowPercent,-`3PointPercent`,-FieldGoalPercent,-Name)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Boxplot\n", | ||
"\n", | ||
"Since these variables are on different scales, I will split them across several subplots instead of applying a log transformation to the y-axis." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"library(grid)\n", | ||
"options(repr.plot.width = 20, repr.plot.height = 20)\n", | ||
"plot_boxplot = function(columns=vector()){ #simple function to produce plots\n", | ||
" dataset %>% select(names(dataset[,columns]),Target) %>% \n", | ||
" gather(key = k1,value = VariableValue,-\"Target\") %>%\n", | ||
" ggplot(aes(y = VariableValue,x = Target,fill = Target)) +\n", | ||
" stat_boxplot(aes(fill = Target)) + facet_grid(.~k1) +\n", | ||
" theme(axis.text.x = element_blank(),\n", | ||
" axis.ticks.x = element_blank(),\n", | ||
" axis.title.x = element_blank(),\n", | ||
" strip.text.x = element_text(size = 10, colour = \"black\", angle = 0)) +\n", | ||
" scale_fill_manual(values = c('#e7298a','#66a61e'))\n", | ||
"}\n", | ||
"\n", | ||
"box1 = plot_boxplot(columns = c(11:17))\n", | ||
"box2 = plot_boxplot(columns = c(2,3))\n", | ||
"box3 = plot_boxplot(columns = c(4,5))\n", | ||
"box4 = plot_boxplot(columns = c(6:10))\n", | ||
"\n", | ||
"grid.arrange(arrangeGrob(box1,box2,box3,box4,ncol=2,nrow=2))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"As common sense would suggest, older players appear to perform better than younger ones on average." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Correlogram" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"options(repr.plot.width = 10, repr.plot.height = 10)\n", | ||
"cnr = cor(dataset%>% select(-Target))\n", | ||
"p.values = cor_pmat(dataset%>% select(-Target)) # p-values matrix\n", | ||
"ggcorrplot(cnr, hc.order = TRUE, type = \"lower\",\n", | ||
" outline.col = \"black\",\n", | ||
" ggtheme = ggplot2::theme_gray,\n", | ||
" colors = c(\"#6D9EC1\", \"white\", \"#E46726\"),p.mat=p.values,lab = TRUE)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"As you can see, some variables are not significantly correlated with others because their p-values are too high, as marked by the X. With that in mind, I'm going to remove 3PointMade and 3PointAttempt, since they are also not highly correlated with any other variable." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"dataset = dataset %>% select(-`3PointMade`,-`3PointAttempt`)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Model" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Basic Logit Model" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"shuffel.rows = sample(nrow(dataset), size = floor(nrow(dataset)*0.9)) # randomly pick 90% of the row indices for training\n", | ||
"dataset_train = dataset[shuffel.rows,]\n", | ||
"dataset_test = dataset[-shuffel.rows,]\n", | ||
"\n", | ||
"logit.fit = glm(Target~., data=dataset_train,family='binomial')\n", | ||
"summary(logit.fit)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"logit.probs = predict(logit.fit,newdata = dataset_test,type = 'response') # glm models P(second factor level), i.e. P(Less5Years)\n", | ||
"class.pred = factor(ifelse(logit.probs>0.5,\"Less5Years\",\"Above5Years\"), levels = levels(dataset_test$Target))\n", | ||
"confusionMatrix(class.pred,dataset_test$Target)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"So the model is not that good and could use some improvement." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## K-fold Cross-Validation" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"train.control = trainControl(method = \"cv\", number = 15)\n", | ||
"logit.fitKfold = train(Target ~., data = dataset_train, method = \"glm\",\n", | ||
" trControl = train.control)\n", | ||
"logit.fitKfold" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"probs.Kfold = predict(logit.fitKfold, newdata = dataset_test, type = \"raw\")\n", | ||
"confusionMatrix(probs.Kfold,dataset_test$Target)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"It appears that the model has improved on the test set.\n", | ||
"If there are any errors, please let me know." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.7" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
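
The notebook holds out 10% of the rows for testing and then runs 15-fold cross-validation on the remaining rows. The index bookkeeping behind those two steps can be sketched in plain Python. This is a hedged illustration only: `holdout_split` and `k_fold_indices` are hypothetical helper names, the seed and row count are arbitrary, and the sketch does not reproduce how `caret` assigns folds internally.

```python
import random


def holdout_split(n_rows, train_fraction=0.9, seed=0):
    """Return (train_idx, test_idx): a random, disjoint partition of range(n_rows)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(indices, k):
    """Yield (fit_idx, validation_idx) pairs; every index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


# Illustrative row count; the real dataset size is not assumed here.
train_idx, test_idx = holdout_split(100)
for fit_idx, validation_idx in k_fold_indices(train_idx, k=15):
    # A model would be fit on fit_idx and scored on validation_idx here.
    pass
```

Sampling the training indices from the full row range, rather than taking the first 90% of rows, keeps the test set from depending on the order in which the data happens to be stored.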