NBA Conference Bias

Statistical modeling of NBA conference bias due to differences in travel and schedule.

Harvard University
Class: STAT 109 — Introduction to Statistical Modeling
Deliverables: Project Report with R code listings

Project Summary

Problem Statement

The National Basketball Association is composed of 30 teams distributed across the United States and equally split between two conferences: the West and the East. We sought to investigate whether there is a potential bias in the NBA that grants teams in one conference a relatively easier path to success due to the differences in travel and schedule.

Data Collection

Several web scrapers were written to collect the required raw data from Baseball-Reference.com on:

The past NBA games, their schedule and outcome, as well as the resulting team performance
All-NBA awards for the best individual players as a reflection of their strength
The names of the venues where each basketball game took place
The home arenas for the teams in each year, that served as the starting point for traveling at the beginning of the respective season

The resulting dataset contains information on a total of 24,462 games, 8,535 of which took place between one Western and one Eastern Conference team.

Feature Engineering

We recreated an itinerary for every team for the past 18 years, similar to the one below for the Boston Celtics during the 2018-19 season:

Google Maps APIs were used:

To geocode the venues into an exact address and its geographic coordinates (latitude and longitude)
To identify the time zones associated with each geographical location

Travel-related predictors: We estimated the amount of travel by each team to a particular location using the geodesic distance. We also computed the number of time zones traveled, the direction of travel (eastward vs westward), and the change in longitude.

Schedule-related variables: We take into consideration several aspects tightly interlinked with travel, which can contribute to a build-up of fatigue:

the number of days of rest a team had prior to the game
the number of games a team had to play in a fixed period
the number of games they played while being on the road

Performance metrics: Finally, we control for the strongest predictors of team success using multiple performance indicators: the total number of wins for the previous season, the so-called Four Factors, advanced box score statistics, and others.

All of the data mentioned above was summarized with rolling averages and combined in a single dataset for further analysis.

Statistical Modeling

Linear Regression: The analysis was conducted using multiple linear regression. The point differential (the number of points scored by one team minus the number of points scored by the opponent) serves as our response variable. The differences between the indicators for the two competing teams act as explanatory variables.

Diagnostics: Particular importance is given to interpretability. To ensure model validity, we run diagnostics and check if the assumptions of linear regression are satisfied, namely: linearity, independence and normality of errors, homoscedasticity, and multicollinearity.

Resampling: Finally, stepwise regression and bootstrap are used to assess the inclusion frequency of certain predictors of interest in the final model.

Please see the project report for a detailed description of the research and the findings.

Requirements

The analysis was conducted in R using a small number of third party packages, particularly:

tidyverse: for all data handling tasks
car: for computing variance inflation factors and other diagnostics
nortest: for normality tests

To reproduce the results, R version 3.6.0 or higher is required due to a new default method for generating a discrete uniform distribution.

The data collection tools are based on the Scrapy framework in Python. Assuming that the layout of the source website stays the same, the crawlers can be run with:

$ scrapy crawl <spider name> -t csv -o <output.csv>

Additional data present in the data/ repo directory may have been gathered manually.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
R		R
data		data
img		img
scrapy		scrapy
.gitattributes		.gitattributes
Makefile		Makefile
README.md		README.md
report.pdf		report.pdf
report.tex		report.tex

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NBA Conference Bias

Table of Contents

Project Summary

Problem Statement

Data Collection

Feature Engineering

Statistical Modeling

Requirements

About

Languages

dvukolov/nba-conference

Folders and files

Latest commit

History

Repository files navigation

NBA Conference Bias

Table of Contents

Project Summary

Problem Statement

Data Collection

Feature Engineering

Statistical Modeling

Requirements

About

Resources

Stars

Watchers

Forks

Languages