jonathancosme/GPU_vs_CPU


Description

Utilizing GPUs for end-to-end data science projects (including ETL and data processing) can drastically decrease computation time.

Because resources are freed up much sooner, multiple iterations of a project (or several different projects) can be completed in the time it would have taken to run a single iteration using the CPU.

You can find a similar benchmark series here: CPU vs GPU Benchmarks Series

UPDATE: An updated version of this experiment can be found in GPU_vs_CPU_v2.
We use a newer (more powerful) CPU, and run the CPU experiment using multiple cores/threads.

Results

Video

Slides

Article

Summary

  • For ETL, we see a more than 85% reduction in time
  • For machine learning, we see a more than 98% reduction in time
  • End-to-end, we see a more than 95% reduction in time

Methodology

The Data

  • All datasets had 10 columns
  • We tested datasets with 500k rows, 1 million rows, 2 million rows, 4 million rows, and 20 million rows
    (8 million rows was not tested, but the dataset is available).

ETL functionality

Each dataset tested went through the following ETL steps:

  1. read the csv file
  2. wrote the csv file
  3. described the dataframe
  4. set the index of the dataframe on each column
  5. concatenated 3 dataframes, each 1/3 the size of the data
  6. performed a groupby aggregation on each categorical column to find the mean
  7. fit a label encoder on each categorical column
  8. encoded the categorical columns
  9. scaled the value columns
  10. split the data into train, val, and test partitions
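The steps above can be sketched with pandas on the CPU; in the GPU experiment the same calls go through cuDF, whose API largely mirrors pandas. The column names, data, and split ratios below are illustrative assumptions, not values taken from the repository:

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# A small illustrative frame with categorical and value columns (assumed layout).
df = pd.DataFrame({
    "cat_a": ["x", "y", "x", "z"] * 25,
    "cat_b": ["p", "q", "q", "p"] * 25,
    "val_1": range(100),
    "val_2": range(100, 200),
    "label": [0, 1] * 50,
})

# 1-2. csv write then read (an in-memory buffer stands in for a file here)
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df = pd.read_csv(buf)

# 3. describe the dataframe
summary = df.describe()

# 4. set the index of the dataframe on each column
for col in df.columns:
    _ = df.set_index(col)

# 5. concatenate 3 dataframes, each 1/3 the size of the data
thirds = [df.iloc[i::3] for i in range(3)]
df = pd.concat(thirds).sort_index()

# 6. groupby aggregation on each categorical column to find the mean
for col in ["cat_a", "cat_b"]:
    _ = df.groupby(col).mean(numeric_only=True)

# 7-8. fit a label encoder on each categorical column, then encode it
for col in ["cat_a", "cat_b"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# 9. scale the value columns
df[["val_1", "val_2"]] = StandardScaler().fit_transform(df[["val_1", "val_2"]])

# 10. split into train, val, and test partitions (60 / 20 / 20 here)
train, rest = train_test_split(df, test_size=0.4, random_state=0)
val, test = train_test_split(rest, test_size=0.5, random_state=0)
```

Swapping `import pandas as pd` for `import cudf as pd` (and scikit-learn for cuML's preprocessing) is the essence of the GPU version of this pipeline.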

Machine learning models

For each dataset, the following models were fitted after ETL:

  1. OLS Regression
  2. Logistic Regression
  3. K-Means
  4. Random Forest
  5. Gradient Boosting
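On the CPU these five model types correspond to scikit-learn estimators (cuML provides GPU counterparts with a matching interface). A minimal sketch, with synthetic data and default-ish hyperparameters that are my assumptions rather than the experiment's actual settings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic stand-in data: 10 feature columns, as in the benchmark datasets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # binary target

models = [
    LinearRegression(),                       # 1. OLS Regression
    LogisticRegression(max_iter=1000),        # 2. Logistic Regression
    KMeans(n_clusters=2, n_init=10),          # 3. K-Means (unsupervised; ignores y)
    RandomForestClassifier(n_estimators=50),  # 4. Random Forest
    GradientBoostingClassifier(),             # 5. Gradient Boosting
]

# Fit every model on the post-ETL data.
for model in models:
    model.fit(X, y)
```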

Files

The dataset is a synthetic (fake) dataset that I generated.
It poses a binary classification problem: detecting counterfeit drugs.
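The actual generator is not reproduced here; purely as an illustration (the column names, row count, and distributions below are my assumptions, not the repository's), a 10-column binary-classification frame of this shape could be produced like so:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 10_000  # kept small here; the real files range from 500k to 20 million rows

# 9 hypothetical feature columns plus a binary target = 10 columns total
df = pd.DataFrame(rng.normal(size=(n_rows, 9)),
                  columns=[f"feature_{i}" for i in range(9)])
df["is_counterfeit"] = rng.integers(0, 2, size=n_rows)  # 1 = counterfeit drug

df.to_csv("fake_drugs.csv", index=False)  # hypothetical filename
```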

The files are available for public download via Google Drive here:

Python Packages

The following packages were used for the CPU experiment:

The following packages were used for the GPU experiment:

Hardware

The experiment was run on a Lenovo Legion laptop with the following:

  • NVIDIA GeForce RTX 3080 Laptop GPU with 16GB of video memory
  • AMD Ryzen 9 5900HX with 8 cores (16 threads) and 32GB of RAM

Instructions

The following requirements must be met:

  • Jupyter Lab
  • A conda environment with the rapids.ai suite installed

Then:

  1. Clone the GitHub repository
  2. Download the data files linked in the Files section above
  3. Move the downloaded files into the sample_data folder
  4. Choose any file, and update the filenames in BOTH of the following notebooks:
     • CPU_demo.ipynb
     • GPU_demo.ipynb
  5. Run both notebooks (CPU_demo.ipynb and GPU_demo.ipynb), or alternatively run the time_benchmark.ipynb notebook

Individual Task Figures

ETL

Machine Learning
