GitHub - slevine/pyspark-pandas-vs-pandas: Dataframe Performance Comparison

Simple Dataframe Comparison

The goal of this code is to show that switching from pandas to the pandas API on Spark requires changing only an import. Because the two frameworks do not read multiple files in exactly the same way, a few additional changes were needed to produce the same results.
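As a rough illustration of the import swap described above, the sketch below selects the dataframe module by name and isolates the one place where multi-file reading differs. It is not the repository's actual test code: the helper names `load_backend`, `parquet_paths`, and `read_trips`, the backend keys, and the `data` directory layout are all assumptions made for this example.

```python
import glob
import importlib

# Module paths for the two backends; both expose a pandas-style API.
BACKENDS = {
    "pandas": "pandas",
    "pyspark_pandas": "pyspark.pandas",
}


def load_backend(name):
    """Import and return the chosen dataframe module (hypothetical helper)."""
    return importlib.import_module(BACKENDS[name])


def parquet_paths(data_dir="data"):
    """Return the Parquet files found in data_dir, sorted by name."""
    return sorted(glob.glob(f"{data_dir}/*.parquet"))


def read_trips(pd, name, data_dir="data"):
    """Read every Parquet file in data_dir into a single dataframe."""
    if name == "pyspark_pandas":
        # Assumption: Spark is handed the directory and reads all files in it.
        return pd.read_parquet(data_dir)
    # Plain pandas: read each file separately, then concatenate.
    frames = [pd.read_parquet(path) for path in parquet_paths(data_dir)]
    return pd.concat(frames, ignore_index=True)
```

With the data downloaded, `read_trips(load_backend("pandas"), "pandas")` and `read_trips(load_backend("pyspark_pandas"), "pyspark_pandas")` would return dataframes that the same downstream code can operate on.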

Requirements

python > 3.10
poetry

Setup

Download 2021 and all available 2022 data

mkdir data
cd data
for i in {01..12}; do curl "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-$i.parquet" -O -s & done; wait
for i in {01..12}; do curl "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-$i.parquet" -O -s & done; wait

poetry install
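For readers without a shell that supports the brace-expansion loop, the download step can also be sketched in Python using only the standard library. The function names `trip_data_urls` and `download_all` are hypothetical. Unlike the curl loop, this sketch downloads sequentially and reports any month that cannot be fetched instead of failing silently.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def trip_data_urls(years=(2021, 2022)):
    """Build the monthly yellow-taxi Parquet URLs for the given years."""
    return [
        f"{BASE_URL}/yellow_tripdata_{year}-{month:02d}.parquet"
        for year in years
        for month in range(1, 13)
    ]


def download_all(dest="data"):
    """Download each file into dest, reporting months that are unavailable."""
    Path(dest).mkdir(exist_ok=True)
    for url in trip_data_urls():
        target = Path(dest) / url.rsplit("/", 1)[-1]
        try:
            urlretrieve(url, target)
        except OSError as err:  # URLError and HTTPError are OSError subclasses
            print(f"skipped {url}: {err}")
```

Calling `download_all()` from the repository root would place the files in the same `data` directory that the shell commands create.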

Running

To run the Polars version

python tests/polars_test.py

To run the PySpark pandas version

python tests/pyspark_pandas_test.py

To run the PySpark DataFrame version

python tests/pyspark_dataframe_test.py

To run the plain pandas version

python tests/pandas_test.py

Alternatively, after running poetry install, you can start JupyterLab and run the tests from the TestRuns.ipynb notebook.
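To compare the versions side by side, the test scripts can be timed from one small driver. The following is only a sketch of how that might look. The helpers `time_command` and `time_scripts` are hypothetical and not part of the repository, and wall-clock time for a whole process includes interpreter and Spark startup, so the numbers are a coarse comparison at best.

```python
import subprocess
import sys
import time


def time_command(cmd):
    """Run cmd to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def time_scripts(scripts):
    """Time each script with the current interpreter; return {script: seconds}."""
    return {script: time_command([sys.executable, script]) for script in scripts}
```

For example, `time_scripts(["tests/polars_test.py", "tests/pyspark_pandas_test.py"])` returns a dictionary mapping each script path to its run time in seconds.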

Additional Details

Accompanying Blog Post

Change Log

Late 2023

Added additional 2023 Parquet data to the test
Added PySpark DataFrame version
Updated most libraries to the latest available versions
Updated Python to 3.11

Early 2023

Added a notebook showing the test runs
Upgraded pyspark to 3.3.1
Upgraded pandas to 1.5.3
Added 2022 dataset
Changed file type to Parquet
Added polars test
