A machine learning project using Scikit-learn, Python Pandas, and Google Cloud SQL to analyze existing horse racing data to create a classification model that will predict future outcomes.

kerrieliz/Predictive-Modeling

Horse Racing: An analysis to predict 2020 outcomes

Using Scikit-learn, Python Pandas, and Google Cloud SQL to analyze existing horse racing data to create a classification model that will predict future outcomes.

Background

  • The sport of horse racing is worth around $100 billion
  • $103 is the largest payout for 1st place in horse races
  • Many interested parties are investing in ML models
  • The amount of available data is controversial (e.g., Equibase)
  • Research: articles on AI and ML attempts

Data Sources

Other Sources

  1. Relationships between race earnings and horse age, sex, gait, track surface and number of race starts for Thoroughbred and Standardbred racehorses in North America: https://beva.onlinelibrary.wiley.com/doi/abs/10.1111/j.2042-3306.2010.00032.x

Analysis

Findings

  • Research shows "All independent variables (age, breed, sex, gait, track surface and total number of starts) had a significant impact on total earnings." (1)
  • Horse racing is completely random

BigQuery

  • Data size: 27k
  • Columns used: horse weight, jockey, finish place
  • Finish place -> boolean winner column: place = 1 (true), place >= 2 (false)
  • Data Studio visualization: BigQuery, SQL
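
The finish-place-to-winner conversion described above can be sketched in Pandas as follows. This is a minimal illustration, not the project's actual code: the sample rows and the column names (`horse_weight`, `jockey`, `finish_place`, `winner`) are assumptions.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumptions, not the project's schema.
races = pd.DataFrame({
    "horse_weight": [1100, 1085, 1120, 1095],
    "jockey": ["A", "B", "C", "D"],
    "finish_place": [1, 2, 3, 1],
})

# place == 1 -> True (winner); place >= 2 -> False
races["winner"] = races["finish_place"] == 1

print(races[["finish_place", "winner"]])
```

A boolean target like this turns the problem into binary classification, at the cost of discarding the difference between, say, a 2nd-place and a 10th-place finish.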

Overall

  • Data size: 44k rows
  • Random Forest: dam id, sire id, trainer id, rider id, prize money, handicap weight, age, sex id
  • Logistic Regression
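
As a rough sketch of how the two models above might be fit with Scikit-learn on the listed features: the data below is randomly generated, so the column names, value ranges, and the `winner` target are all assumptions, and the resulting scores carry no meaning. Scaling before logistic regression is a choice made here for convergence, not something the project states.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Feature names mirror the list above; the exact column names are assumed.
FEATURES = ["dam_id", "sire_id", "trainer_id", "rider_id",
            "prize_money", "handicap_weight", "age", "sex_id"]

# Synthetic stand-in for the 44k-row dataset (values are made up).
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({
    "dam_id": rng.integers(0, 50, n),
    "sire_id": rng.integers(0, 50, n),
    "trainer_id": rng.integers(0, 30, n),
    "rider_id": rng.integers(0, 30, n),
    "prize_money": rng.uniform(1_000, 100_000, n),
    "handicap_weight": rng.uniform(50, 62, n),
    "age": rng.integers(2, 10, n),
    "sex_id": rng.integers(0, 3, n),
})
data["winner"] = rng.random(n) < 0.1  # random labels, roughly 10% winners

X_train, X_test, y_train, y_test = train_test_split(
    data[FEATURES], data["winner"],
    test_size=0.25, random_state=0, stratify=data["winner"],
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Scale features first so logistic regression converges cleanly.
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logit.fit(X_train, y_train)

forest_acc = forest.score(X_test, y_test)
logit_acc = logit.score(X_test, y_test)
print(f"random forest accuracy: {forest_acc:.3f}")
print(f"logistic regression accuracy: {logit_acc:.3f}")
```

Because winners are rare, accuracy alone can look high for a model that never predicts a win; precision and recall on the winner class are usually more informative. Feeding raw ID columns in as numbers is also a simplification, since the IDs are categories rather than quantities.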

Kentucky Derby

  • Data size: 235 rows of Kentucky Derby data, 2008-2019
  • Manually aggregated
  • Random Forest: Starters, Dosage Index, Finish Position (Place_Bins)
  • Dosage index: mathematical figure used to quantify a horse’s ability to handle various distances based on the appearance of influential sires in the bloodline. Calculated based on an analysis of the horse's pedigree.
  • Linear & Logistic Regression
  • predict_proba: finding the horse with the highest predicted probability of winning
  • Pickled model
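
The `predict_proba` and pickling steps above might look roughly like this. It is a sketch under stated assumptions: the training rows are synthetic, the target is simplified to a binary `won` column rather than the Place_Bins used in the project, and the horse names and feature values are hypothetical.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["starters", "dosage_index"]  # assumed column names

# Synthetic stand-in for the 235 manually aggregated rows.
rng = np.random.default_rng(1)
n = 235
history = pd.DataFrame({
    "starters": rng.integers(14, 21, n),
    "dosage_index": rng.uniform(0.5, 5.0, n),
})
# Simplified binary target (assumption): mark every 12th row as a winner.
history["won"] = np.arange(n) % 12 == 0

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(history[FEATURES], history["won"])

# Hypothetical field for an upcoming race.
field = pd.DataFrame({
    "horse": ["Horse A", "Horse B", "Horse C"],
    "starters": [20, 20, 20],
    "dosage_index": [1.8, 3.0, 2.4],
})

# predict_proba returns one column per class; pick the column for won == True.
win_col = list(model.classes_).index(True)
field["win_prob"] = model.predict_proba(field[FEATURES])[:, win_col]
favorite = field.loc[field["win_prob"].idxmax(), "horse"]
print(field[["horse", "win_prob"]])
print("highest predicted win probability:", favorite)

# Pickle round trip; pickle.dump(model, open_file) would write to disk instead.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
```

Each probability is estimated per horse independently, so the values across a field do not sum to 1. Ranking by probability, as done here, sidesteps that. A pickled model should only be loaded from a trusted source and with a compatible Scikit-learn version.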

Conclusion

  • Obstacles: One-off situations, such as how to use data from horses that won or placed but were later disqualified (as happened in the last Kentucky Derby). The data, and the definitions of the data, provided in the datasets.
  • If I had more time: Look at more statistics such as odds ("final odds" before the race, "morning odds") as well as the owner, trainer, track, time, distance, and track surface conditions.
  • This model does not have predictive value, but there is huge potential to create one, keeping in mind that horse racing is pretty random
