Horse Racing: An analysis to predict 2020 outcomes

This project uses scikit-learn, pandas, and Google Cloud SQL to analyze existing horse racing data and build a classification model that predicts future outcomes.

Background

The sport of horse racing is worth around $100 billion
$103 is the largest payout for 1st place in horse races
Many interested parties are investing in ML models
The amount of available data is controversial, e.g. Equibase
Research: articles on AI and ML attempts

Data Sources

Horse Racing Datasets: http://horseracingdatasets.com/kentucky-derby/
Data.World: https://data.world/sya/horses-for-courses/workspace/query?filename=runners.csv%2Frunners.csv&newQueryType=SQL&selectedTable=runners&tempId=1583168958109

Other Sources

Relationships between race earnings and horse age, sex, gait, track surface and number of race starts for Thoroughbred and Standardbred racehorses in North America: https://beva.onlinelibrary.wiley.com/doi/abs/10.1111/j.2042-3306.2010.00032.x

Analysis

Findings

Research shows "All independent variables (age, breed, sex, gait, track surface and total number of starts) had a significant impact on total earnings." (1)
Horse racing is completely random

BigQuery

Data size: 27k
Columns used: horse weight, jockey, finish place
Finish place -> boolean winner column: place = 1 (true), place >= 2 (false)
Data Studio visualization built on BigQuery with SQL
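
The finish-place-to-winner conversion described above can be sketched in pandas. This is a minimal illustration, not the project's actual query: the column names (`horse_weight`, `jockey`, `finish_place`, `winner`) and the sample rows are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumptions, not the real schema.
races = pd.DataFrame(
    {
        "horse_weight": [1100, 1085, 1120, 1095],
        "jockey": ["A", "B", "C", "D"],
        "finish_place": [1, 2, 3, 1],
    }
)

# True only for first place; any finish of 2 or worse becomes False.
races["winner"] = races["finish_place"] == 1

print(races[["finish_place", "winner"]])
```

A boolean target like this turns the problem into binary classification, which is what Random Forest and Logistic Regression classifiers expect.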

Overall

Data size: 44k rows
Random Forest: dam id, sire id, trainer id, rider id, prize money, handicap weight, age, sex id
Logistic Regression
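
As a rough sketch of how the two classifiers listed above might be fit with scikit-learn, the example below trains both on synthetic data. The column names mirror the fields named in this section, but the values and the random labels are made up for illustration, so the printed accuracies say nothing about real races. Scaling before Logistic Regression is an added choice, not something the project is known to have done.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for the real dataset; names and values are assumptions.
X = pd.DataFrame(
    {
        "dam_id": rng.integers(0, 50, n),
        "sire_id": rng.integers(0, 50, n),
        "trainer_id": rng.integers(0, 30, n),
        "rider_id": rng.integers(0, 30, n),
        "prize_money": rng.uniform(1_000, 100_000, n),
        "handicap_weight": rng.uniform(50, 60, n),
        "age": rng.integers(2, 9, n),
        "sex_id": rng.integers(0, 4, n),
    }
)
y = rng.integers(0, 2, n)  # random labels: 1 = winner, 0 = not a winner

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Random Forest needs no feature scaling.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Logistic Regression converges more reliably on standardized features.
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logit.fit(X_train, y_train)

forest_acc = forest.score(X_test, y_test)
logit_acc = logit.score(X_test, y_test)
print(f"Random Forest accuracy: {forest_acc:.2f}")
print(f"Logistic Regression accuracy: {logit_acc:.2f}")

# Relative importance of each input column in the fitted forest.
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Note that feeding identifier columns to a model as plain numbers treats them as ordered quantities; one-hot or target encoding may be worth considering for the id fields.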

Kentucky Derby

Data size: 235 rows, Kentucky Derby data 2008-2019
Manually aggregated
Random Forest: Starters, Dosage Index, Finish Position (Place_Bins)
Dosage index: a mathematical figure used to quantify a horse's ability to handle various distances, based on the appearance of influential sires in the bloodline. It is calculated from an analysis of the horse's pedigree.
Linear & Logistic Regression
predict_proba: identifying the horse with the highest predicted probability of winning
Pickled model
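
The Kentucky Derby steps above (binning finish position, fitting a Random Forest, ranking horses with `predict_proba`, and pickling the model) could look roughly like the sketch below. Everything in it is illustrative: the data is synthetic, the bin edges for `place_bins` are a guess at how finish positions might be grouped, and the horse names are placeholders. The notebook's actual binning and features may differ.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 235

# Synthetic stand-in for the 2008-2019 Derby rows; all values are made up.
derby = pd.DataFrame(
    {
        "starters": rng.integers(15, 21, n),
        "dosage_index": rng.uniform(0.5, 5.0, n),
        "finish_position": rng.integers(1, 21, n),
    }
)

# Assumed binning: 1 = win, 2-3 = placed, 4 or worse = unplaced.
derby["place_bins"] = pd.cut(
    derby["finish_position"],
    bins=[0, 1, 3, 20],
    labels=["win", "placed", "unplaced"],
)

features = ["starters", "dosage_index"]
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(derby[features], derby["place_bins"].astype(str))

# Hypothetical field for an upcoming race.
field = pd.DataFrame(
    {
        "horse": ["Horse A", "Horse B", "Horse C"],
        "starters": [20, 20, 20],
        "dosage_index": [1.5, 3.0, 4.2],
    }
)

# predict_proba returns one column per class, ordered as in model.classes_.
win_col = list(model.classes_).index("win")
field["win_probability"] = model.predict_proba(field[features])[:, win_col]
top_pick = field.loc[field["win_probability"].idxmax(), "horse"]
print(field[["horse", "win_probability"]])
print("Highest predicted win probability:", top_pick)

# Round-trip through pickle in memory; pickle.dump(model, file) would write
# the same bytes to disk.
restored = pickle.loads(pickle.dumps(model))
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.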

Conclusion:

Obstacles: One-off situations, such as how to use data from horses that won or placed but were later disqualified (as happened in the 2019 Kentucky Derby). The data provided in the datasets, and the definitions of that data, were also an obstacle.
If I had more time: I would look at more statistics, such as odds ("final odds" before the race and "morning odds"), as well as the owner, trainer, track, time, distance, and track surface conditions.
This model does not have predictive value, but there is huge potential to create one that does, keeping in mind that horse racing is pretty random.

