Utilizing GPUs for end-to-end data science projects (including ETL and data processing) can drastically decrease computation time.
Because resources are freed up much sooner, multiple iterations of a project (or several different projects) can be completed in the time it would take to run a single iteration on the CPU.
You can find another similar benchmark series here: CPU vs GPU Benchmarks Series
UPDATE: An updated version of this experiment can be found in GPU_vs_CPU_v2.
It uses a newer (more powerful) CPU and runs the CPU experiment on multiple cores/threads.
- A short 9-minute video presentation can be found here. Please start by watching this.
- For ETL, we see a more than 85% reduction in time
- For Machine Learning, we see a more than 98% reduction in time
- End-to-end, we see a more than 95% reduction in time
- All datasets had 10 columns
- We tested datasets with 500k rows, 1 million rows, 2 million rows, 4 million rows, and 20 million rows
(8 million rows was not tested, but the dataset is available).
Each dataset tested went through the following ETL steps (a rough code sketch follows the list):
- read the csv file
- wrote the csv file
- described the dataframe
- set the index of the dataframe on each column
- concatenated 3 dataframes, each 1/3 the size of the data
- performed a groupby aggregation on each categorical column to find the mean
- fit a label encoder on each categorical column
- encoded the categorical columns
- scaled the value columns
- split the data into train, val, and test partitions
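The sketch below shows roughly what these steps look like on the CPU side with pandas and scikit-learn; the GPU notebook follows the same flow with the RAPIDS drop-in replacements (cudf for pandas, cuml.preprocessing for the encoder and scaler). The file name, column selection, and split ratios here are illustrative assumptions, not the exact code from the notebooks.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# csv read / write
df = pd.read_csv("sample_data/sample_data_500k.csv")
df.to_csv("sample_data/sample_data_500k_copy.csv", index=False)

# describe the dataframe, then set the index on each column in turn
df.describe()
for col in df.columns:
    df.set_index(col)

# concatenate 3 dataframes, each 1/3 the size of the data
thirds = [df.iloc[i::3] for i in range(3)]
df = pd.concat(thirds)

# groupby aggregation on each categorical column to find the mean
cat_cols = df.select_dtypes(include="object").columns
num_cols = df.select_dtypes(include="number").columns.drop("label", errors="ignore")  # "label" is an assumed target name
for col in cat_cols:
    df.groupby(col).mean(numeric_only=True)

# fit a label encoder on each categorical column and encode it, then scale the value columns
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col])
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# split into train, val, and test partitions (70/15/15 is an assumed ratio)
train, rest = train_test_split(df, test_size=0.3, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)
```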
For each dataset, the following models were fitted after ETL (see the sketch after this list):
- OLS Regression
- Logistic Regression
- K-Means
- Random Forest
- Gradient Boosting
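Continuing from the ETL sketch above, the model-fitting step might look like the following on the CPU side with scikit-learn. The GPU notebook uses the cuML counterparts (e.g. cuml.linear_model, cuml.cluster, cuml.ensemble); gradient boosting on the GPU side is commonly handled by XGBoost rather than cuML. The "label" target column and the hyperparameters are assumptions.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# `train` comes from the ETL sketch above; "label" is an assumed target column name
X_train, y_train = train.drop(columns="label"), train["label"]

models = {
    "OLS Regression": LinearRegression(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Means": KMeans(n_clusters=2, n_init=10),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Gradient Boosting": GradientBoostingClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)  # KMeans ignores y_train (unsupervised)
```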
The dataset used is a fake dataset that I generated.
It is a binary classification problem for detecting counterfeit drugs.
The files are available for public download via Google Drive here:
- sample_data_500k.csv
- sample_data_1m.csv
- sample_data_2m.csv
- sample_data_4m.csv
- sample_data_8m.csv
- sample_data_20m.csv
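A quick way to sanity-check one of the downloaded files (the path assumes it has been moved into the sample_data folder as described in the steps below, and the exact column names depend on the generated schema):

```python
import pandas as pd

df = pd.read_csv("sample_data/sample_data_500k.csv")
print(df.shape)   # expected: (500000, 10) per the description above
print(df.dtypes)  # a mix of categorical and numeric value columns plus the binary target
```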
The following packages were used for the CPU experiment:
The following packages were used for the GPU experiment:
The experiment was run on a Lenovo Legion laptop with the following:
- NVIDIA GeForce RTX 3080 Laptop GPU with 16GB of video memory
- AMD Ryzen 9 5900HX with 8 cores (16 threads) and 32GB of RAM
The following requirements must be met:
- Jupyter Lab
- A conda environment with the rapids.ai suite installed (a quick environment check is sketched at the end of this list)
- Clone the GitHub repository
- Download the data files located in the Files section above
- Move the downloaded files into the sample_data folder
- Choose any file, and update the filenames in BOTH of the following notebooks:
- CPU_demo.ipynb
- GPU_demo.ipynb
- Run both of the notebooks (CPU_demo.ipynb and GPU_demo.ipynb) or...
- Alternatively, you can run the time_benchmark.ipynb notebook
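Before running the notebooks, a cell like the one below can confirm that the conda environment has both the RAPIDS suite and the CPU libraries available (a minimal check; exact versions depend on your install):

```python
import cudf
import cuml
import pandas as pd
import sklearn

# print the installed versions to confirm both the GPU (RAPIDS) and CPU stacks are present
print("cuDF:", cudf.__version__)
print("cuML:", cuml.__version__)
print("pandas:", pd.__version__)
print("scikit-learn:", sklearn.__version__)
```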