Utilizing GPUs for end-to-end data science projects (including ETL and data processing) can drastically decrease computation time.
Because resources are freed up much sooner, multiple iterations of a project (or several different projects) can be completed in the time it would take to run a single iteration on the CPU.
You can find another similar benchmark series here: CPU vs GPU Benchmarks Series
UPDATE: An updated version of this experiment can be found in GPU_vs_CPU_v2.
It uses a newer (more powerful) CPU and runs the CPU experiment on multiple cores/threads.
- A short 9-minute video presentation can be found here. Please start by watching this.
- For ETL, we see a more than 85% reduction in time
- For Machine Learning, we see a more than 98% reduction in time
- End-to-end, we see a more than 95% reduction in time
- All datasets had 10 columns
- We tested datasets with 500k rows, 1 million rows, 2 million rows, 4 million rows, and 20 million rows
(8 million rows was not tested, but the dataset is available).
Each dataset tested went through the following ETL steps (a rough code sketch follows the list):
- read the csv file
- wrote the csv file
- described the dataframe
- set the index of the dataframe on each column
- concatenated 3 dataframes, each 1/3 the size of the data
- performed a groupby aggregation on each categorical column to find the mean
- fit a label encoder on each categorical column
- encoded the categorical columns
- scaled the value columns
- split the data into train, val, and test partitions
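The sketch below shows roughly what these steps look like on the CPU side with pandas and scikit-learn; the GPU notebook follows the same flow with the RAPIDS drop-in replacements (cudf for pandas, cuml.preprocessing for the encoder and scaler). The file name, column selection, and split ratios here are illustrative assumptions, not the exact code from the notebooks.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# csv read / write
df = pd.read_csv("sample_data/sample_data_500k.csv")
df.to_csv("sample_data/sample_data_500k_copy.csv", index=False)

# describe the dataframe, then set the index on each column in turn
df.describe()
for col in df.columns:
    df.set_index(col)

# concatenate 3 dataframes, each 1/3 the size of the data
thirds = [df.iloc[i::3] for i in range(3)]
df = pd.concat(thirds)

# groupby aggregation on each categorical column to find the mean
cat_cols = df.select_dtypes(include="object").columns
num_cols = df.select_dtypes(include="number").columns.drop("label", errors="ignore")  # "label" is an assumed target name
for col in cat_cols:
    df.groupby(col).mean(numeric_only=True)

# fit a label encoder on each categorical column and encode it, then scale the value columns
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col])
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# split into train, val, and test partitions (70/15/15 is an assumed ratio)
train, rest = train_test_split(df, test_size=0.3, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)
```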
For each dataset, the following models were fitted after ETL (see the sketch after this list):
- OLS Regression
- Logistic Regression
- K-Means
- Random Forest
- Gradient Boosting
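Continuing from the ETL sketch above, the model-fitting step might look like the following on the CPU side with scikit-learn. The GPU notebook uses the cuML counterparts (e.g. cuml.linear_model, cuml.cluster, cuml.ensemble); gradient boosting on the GPU side is commonly handled by XGBoost rather than cuML. The "label" target column and the hyperparameters are assumptions.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# `train` comes from the ETL sketch above; "label" is an assumed target column name
X_train, y_train = train.drop(columns="label"), train["label"]

models = {
    "OLS Regression": LinearRegression(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Means": KMeans(n_clusters=2, n_init=10),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Gradient Boosting": GradientBoostingClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)  # KMeans ignores y_train (unsupervised)
```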
The dataset used is a fake dataset that I generated.
It is a binary classification problem for detecting counterfeit drugs.
The files are available for public download via Google Drive here:
- sample_data_500k.csv
- sample_data_1m.csv
- sample_data_2m.csv
- sample_data_4m.csv
- sample_data_8m.csv
- sample_data_20m.csv
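A quick way to sanity-check one of the downloaded files (the path assumes it has been moved into the sample_data folder as described in the steps below, and the exact column names depend on the generated schema):

```python
import pandas as pd

df = pd.read_csv("sample_data/sample_data_500k.csv")
print(df.shape)   # expected: (500000, 10) per the description above
print(df.dtypes)  # a mix of categorical and numeric value columns plus the binary target
```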
The following packages were used for the CPU experiment:
The following packages were used for the GPU experiment:
The experiment was run on a Lenovo Legion laptop with the following:
- NVIDIA GeForce RTX 3080 Laptop GPU with 16GB of video memory
- AMD Ryzen 9 5900HX with 8 cores (16 threads) and 32GB of RAM
The following requirements must be met:
- Jupyter Lab
- A conda environment with the rapids.ai suite installed (a quick environment check is sketched at the end of this list)
- Clone the GitHub repository
- Download the data files located in the Files section above
- Move the downloaded files into the sample_data folder
- Choose any file, and update the filenames in BOTH of the following notebooks:
- CPU_demo.ipynb
- GPU_demo.ipynb
- Run both of the notebooks (CPU_demo.ipynb and GPU_demo.ipynb) or...
- Alternatively, you can run the time_benchmark.ipynb notebook
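Before running the notebooks, a cell like the one below can confirm that the conda environment has both the RAPIDS suite and the CPU libraries available (a minimal check; exact versions depend on your install):

```python
import cudf
import cuml
import pandas as pd
import sklearn

# print the installed versions to confirm both the GPU (RAPIDS) and CPU stacks are present
print("cuDF:", cudf.__version__)
print("cuML:", cuml.__version__)
print("pandas:", pd.__version__)
print("scikit-learn:", sklearn.__version__)
```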