statcast-era-pitches

pybaseball is a great tool for downloading baseball data. Even though the library is optimized and scrapes this data in parallel, downloading every pitch can still be time-consuming.

The point of this repository is to use GitHub Actions to scrape new baseball data weekly during the MLB season and update a parquet file hosted as a HuggingFace dataset. Reading the data from that dataset is much faster than scraping it again each time you re-run your code or just want up-to-date statcast pitch data.

The main.py script runs each week during the MLB season and updates the statcast-era-pitches HuggingFace Dataset, so you don't have to re-scrape this data yourself.

You can explore the entire dataset in your browser at this link

Usage

With statcast_pitches package

pip install git+https://github.com/Jensen-holm/statcast-era-pitches.git

Example 1 w/ polars (suggested)

import statcast_pitches
import polars as pl

# load all pitches from 2015-present
pitches_lf = statcast_pitches.load()

# filter to get 2024 bat speed data
bat_speed_24_df = (pitches_lf
                    .filter(pl.col("game_date").dt.year() == 2024)
                    .select("bat_speed", "swing_length")
                    .collect())

Notes

Because statcast_pitches.load() returns a LazyFrame, the load itself is fast, and we can chain operations such as filter and select before 'collecting' the result into memory. If the data were loaded eagerly as a DataFrame, this code would take roughly 30-60 seconds; with the LazyFrame it runs in 2-8 seconds.

Example 2 w/ DuckDB

import statcast_pitches

# get bat tracking data from 2024
params = ("2024",)
query_2024_bat_speed = """
    SELECT bat_speed, swing_length
    FROM pitches
    WHERE
        YEAR(game_date) = ?
        AND bat_speed IS NOT NULL;
    """

if __name__ == "__main__":
    bat_speed_24_df = statcast_pitches.load(
        query=query_2024_bat_speed,
        params=params,
    ).collect()

    print(bat_speed_24_df.head(3))

Output:

	bat_speed	swing_length
0	73.61710	6.92448
1	58.63812	7.56904
2	71.71226	6.46088

Notes:

If no query is specified, all data from 2015-present is returned as a LazyFrame.
The table in your query MUST be called 'pitches', or the query will fail.
Since load() returns a LazyFrame, notice that I had to call pl.LazyFrame.collect() before calling head().
This is slower than the polars approach in Example 1, but sometimes using SQL is fun.

With HuggingFace API

Pandas

import pandas as pd

df = pd.read_parquet("hf://datasets/Jensen-holm/statcast-era-pitches/data/statcast_era_pitches.parquet")

Polars

import polars as pl

df = pl.read_parquet('hf://datasets/Jensen-holm/statcast-era-pitches/data/statcast_era_pitches.parquet')

DuckDB

SELECT *
FROM 'hf://datasets/Jensen-holm/statcast-era-pitches/data/statcast_era_pitches.parquet';

HuggingFace Dataset

from datasets import load_dataset

ds = load_dataset("Jensen-holm/statcast-era-pitches")

Tidyverse

library(tidyverse)
library(arrow)  # read_parquet() comes from arrow, not from tidyverse

statcast_pitches <- read_parquet(
    "https://huggingface.co/datasets/Jensen-holm/statcast-era-pitches/resolve/main/data/statcast_era_pitches.parquet"
)

See the dataset on HuggingFace itself for more details.

Eager Benchmarking

Eager Load Time (s)	API
1421.103	pybaseball
26.899	polars
33.093	pandas
68.692	duckdb
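The exact method behind these numbers lives in benchmark.ipynb and is not reproduced here. As a rough sketch of how an eager load time could be measured, the helper below times any zero-argument loader with time.perf_counter; time_load and fake_load are hypothetical names introduced for illustration, and fake_load is a cheap stand-in for a real loader such as a pl.read_parquet call.

```python
import time
from typing import Any, Callable


def time_load(load_fn: Callable[[], Any], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        load_fn()  # the result is discarded; only the elapsed time matters
        best = min(best, time.perf_counter() - start)
    return best


def fake_load() -> list[int]:
    """Cheap stand-in for an eager dataset load (hypothetical)."""
    return list(range(100_000))


elapsed = time_load(fake_load)
print(f"{elapsed:.3f} s")
```

Taking the best of several runs reduces noise from caching and network jitter, though a single cold run may be closer to what a first-time user experiences.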
