The goal of this code is to show that the only change needed to move from Pandas to the Pandas API on Spark is the import. Since the frameworks do not read in multiple files in exactly the same way, a few additional changes were required to achieve the same results.
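The repo's test scripts are not reproduced here, but the import swap described above can be sketched roughly like this (the data and column names are illustrative, not from the actual tests):

```python
# Plain pandas:
import pandas as pd

df = pd.DataFrame({"fare": [10.0, 20.0, 30.0]})
print(df["fare"].mean())  # 20.0

# Pandas API on Spark: in principle only the import changes and the
# rest of the code stays the same (requires a running Spark session):
# import pyspark.pandas as pd
```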
- python > 3.10
- poetry
- Download the 2021 and all available 2022 data:
```shell
mkdir data
cd data
# download each month in parallel, then wait for all downloads to finish
for i in {01..12}; do curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-$i.parquet -O -s & done; wait
for i in {01..12}; do curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-$i.parquet -O -s & done; wait
```
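Once the monthly files are downloaded, the multi-file read difference mentioned in the intro looks roughly like this. This is a hedged sketch: the glob pattern and the exact read approach are assumptions, not the repo's actual code:

```python
from glob import glob

import pandas as pd

# Plain pandas has no built-in multi-file reader, so the monthly files
# are read one by one and concatenated into a single DataFrame:
files = sorted(glob("data/yellow_tripdata_202*.parquet"))
# df = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)

# Spark-based readers accept a glob pattern directly:
# df = spark.read.parquet("data/yellow_tripdata_202*.parquet")
# or, with the Pandas API on Spark:
# import pyspark.pandas as ps
# df = ps.read_parquet("data/yellow_tripdata_202*.parquet")
```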
```shell
poetry install
```
- To run the `polars` version: `python tests/polars_test.py`
- To run the `pyspark pandas` version: `python tests/pyspark_pandas_test.py`
- To run the `pyspark dataframe` version: `python tests/pyspark_dataframe_test.py`
- To run the plain `pandas` version: `python tests/pandas_test.py`
Alternatively, you can use this notebook to run the tests by starting `jupyter-lab` after running `poetry install`.
- Added additional 2023 parquet files to the test
- Added `pyspark` DataFrame version
- Updated most libraries to the latest available versions
- Updated python to 3.11
- Added a notebook showing the test runs
- Upgraded `pyspark` to 3.3.1
- Upgraded `pandas` to 1.5.3
- Added 2022 dataset
- Changed file type to Parquet
- Added `polars` test