TPC-H and TPC-DS performance benchmark for Spark with Delta Lake

Based on: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores

1. Clone submodules

git submodule init
git submodule update
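
As a hedged alternative, the two commands above can be combined into a single `git submodule update --init --recursive`, which also fetches nested submodules if there are any. The sketch below wraps that call in a hypothetical helper, `init_submodules`, which refuses to run outside a checkout that has a `.gitmodules` file.

```bash
# Initialise and fetch all submodules of the repository at REPO_DIR.
# init_submodules is a hypothetical helper name, not part of this repository.
init_submodules() {
  local repo_dir="${1:-.}"
  if [ ! -f "$repo_dir/.gitmodules" ]; then
    echo "no .gitmodules in $repo_dir; run this from the repository root" >&2
    return 1
  fi
  # --init registers the submodules, --recursive also handles nested ones.
  git -C "$repo_dir" submodule update --init --recursive
}
```

From the repository root, `init_submodules .` would be the typical call.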

2. Prerequisites

2.1 Install Dependencies

sudo apt-get install unzip zip gcc make flex bison byacc git build-essential -y
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java 8.0.302-open
sdk install sbt 0.13.18
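
Before moving on, it may help to confirm that the tools installed above are actually on `PATH`. The following is a minimal sketch; `missing_tools` is a hypothetical helper, and the tool list passed to it is only a guess at which commands the later build steps call.

```bash
# Print every tool from the argument list that is NOT found on PATH,
# one per line. Returns 0 when nothing is missing, 1 otherwise.
missing_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `missing_tools unzip zip gcc make flex bison git java sbt` prints nothing when everything is installed.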

2.2 Download spark

wget https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
mkdir spark3
tar -xzf spark-3.2.0-bin-hadoop3.2.tgz --strip 1 -C spark3
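
A quick sanity check after unpacking is to look for `bin/spark-submit` inside the target directory. The sketch below assumes the `spark3` directory name used above; `check_spark_home` is a hypothetical helper.

```bash
# Succeed only if DIR looks like an unpacked Spark distribution,
# i.e. it contains an executable bin/spark-submit.
check_spark_home() {
  local dir="${1:-spark3}"
  if [ -x "$dir/bin/spark-submit" ]; then
    echo "Spark found in $dir"
  else
    echo "bin/spark-submit not found under $dir" >&2
    return 1
  fi
}
```

One possible use is `check_spark_home spark3 && export SPARK_HOME="$PWD/spark3"`. Whether the benchmark scripts read `SPARK_HOME` is an assumption; check the `.sh` files before relying on it.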

NOTE: Deploy Spark in standalone cluster mode.
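
For orientation, a standalone deployment is usually started with the `sbin/start-master.sh` and `sbin/start-worker.sh` scripts that ship with Spark 3.2. The sketch below only prints the commands rather than running them; `standalone_commands` is a hypothetical helper, `master-ip` is a placeholder host, and 7077 is the default standalone master port.

```bash
# Print (do not run) the commands that start a standalone master and
# one worker. Run the first on the master node, the second on each worker.
standalone_commands() {
  local spark_home="$1" master_host="$2"
  echo "$spark_home/sbin/start-master.sh"
  echo "$spark_home/sbin/start-worker.sh spark://$master_host:7077"
}

# Placeholder values; replace them with your own path and master host.
standalone_commands "$PWD/spark3" "master-ip"
```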

2.3 Download sbt launcher

wget -P /tmp https://github.com/sbt/sbt/releases/download/v0.13.18/sbt-0.13.18.tgz
tar -xf /tmp/sbt-0.13.18.tgz  -C /tmp

3. Build

3.1 spark-sql-perf

cd spark-sql-perf
cp /tmp/sbt/bin/sbt-launch.jar build/sbt-launch-0.13.18.jar
bin/run
sbt +package

3.2 tpcds-kit

cd ../tpcds-kit/tools
make OS=LINUX

3.3 tpch-dbgen

cd ../../tpch-dbgen
git checkout 0469309147b42abac8857fa61b4cf69a6d3128a8 -- bm_utils.c
make

NOTE: Install this repository at the same path on every cluster node, and build tpcds-kit and tpch-dbgen on each node.
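
The note above can be sketched as two small helpers: one that checks the generator binaries exist after the build (`dsdgen` and `dsqgen` from tpcds-kit, `dbgen` and `qgen` from tpch-dbgen), and one that prints an `rsync` command per worker so the checkout lands at the same absolute path. Both function names and the worker host names are hypothetical, and copying binaries only works if all nodes run a compatible OS; otherwise rebuild on each node.

```bash
# Check that the generator binaries produced by the build steps exist
# under ROOT (the repository checkout).
check_generators() {
  local root="$1" bin status=0
  for bin in tpcds-kit/tools/dsdgen tpcds-kit/tools/dsqgen \
             tpch-dbgen/dbgen tpch-dbgen/qgen; do
    if [ ! -x "$root/$bin" ]; then
      echo "missing: $root/$bin" >&2
      status=1
    fi
  done
  return "$status"
}

# Print (do not run) one rsync command per host that copies ROOT to the
# same absolute path on that host.
print_sync_commands() {
  local root="$1" host
  shift
  for host in "$@"; do
    echo "rsync -a $root/ $host:$root/"
  done
}
```

For example, `check_generators "$PWD" && print_sync_commands "$PWD" worker1 worker2` lists the copy commands to review before running them.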

4. Performance Benchmarking

NOTE: In the .sh files, change master-ip, executor-memory, num-executors, and executor-cores according to your machine specifications.
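
To find where those settings live, one option is to search the scripts for them. This is a minimal sketch: `show_spark_settings` is a hypothetical helper, and the exact spelling of the options inside the `.sh` files is an assumption based on the names in the note.

```bash
# List every line in DIR/*.sh that mentions one of the settings named in
# the note, prefixed with file name and line number.
show_spark_settings() {
  local dir="${1:-.}"
  grep -nHE -e 'master|executor-memory|num-executors|executor-cores' "$dir"/*.sh
}
```

Running `show_spark_settings tpch` and `show_spark_settings tpcds` from the repository root shows the lines to edit.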

4.1 TPCH

4.1.1 Parquet

cd ../tpch
# Generate ~100GB of Parquet data
./gendata_parquet.sh
# Run all 22 TPC-H queries
./runtpch_parquet.sh

4.1.2 ORC

cd ../tpch
# Generate ~100GB of ORC data
./gendata_orc.sh
# Run all 22 TPC-H queries
./runtpch_orc.sh

4.1.3 CSV

cd ../tpch
# Generate ~100GB of CSV data
./gendata_csv.sh
# Run all 22 TPC-H queries
./runtpch_csv.sh

4.1.4 JSON

cd ../tpch
# Generate ~100GB of JSON data
./gendata_json.sh
# Run all 22 TPC-H queries
./runtpch_json.sh

4.2 TPCDS

4.2.1 Parquet

cd ../tpcds
# Generate ~100GB of Parquet data
./gendata_parquet.sh
# Run all 99 TPC-DS queries
./runtpch_parquet.sh

4.2.2 ORC

cd ../tpcds
# Generate ~100GB of ORC data
./gendata_orc.sh
# Run all 99 TPC-DS queries
./runtpch_orc.sh

4.2.3 CSV

cd ../tpcds
# Generate ~100GB of CSV data
./gendata_csv.sh
# Run all 99 TPC-DS queries
./runtpch_csv.sh

4.2.4 JSON

cd ../tpcds
# Generate ~100GB of JSON data
./gendata_json.sh
# Run all 99 TPC-DS queries
./runtpch_json.sh
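
The per-format steps in this section all follow the same pattern: generate the data, then run the queries. A loop can chain them, as in the hedged sketch below. `run_all_formats` is a hypothetical helper; it assumes the `gendata_<format>.sh` and `runtpch_<format>.sh` naming shown above and skips any format whose scripts it cannot find.

```bash
# For each storage format, run data generation and then the queries
# from inside DIR. Stops at the first failing script.
run_all_formats() {
  local dir="$1" fmt
  shift
  for fmt in "$@"; do
    if [ -x "$dir/gendata_$fmt.sh" ] && [ -x "$dir/runtpch_$fmt.sh" ]; then
      ( cd "$dir" && "./gendata_$fmt.sh" && "./runtpch_$fmt.sh" ) || return 1
    else
      echo "skipping $fmt: scripts not found in $dir" >&2
    fi
  done
}
```

From the repository root, `run_all_formats tpch parquet orc csv json` would cover section 4.1 and `run_all_formats tpcds parquet orc csv json` section 4.2. Each format generates roughly 100GB, so check free disk space first.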

## 5. Reports
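
The report directories described in this section hold their results in `part*.csv` files. When there are several of them, a single combined file can be easier to inspect. The sketch below is one way to merge them; `merge_parts` is a hypothetical helper, and it assumes every part file starts with the same header row. If the files have no header, a plain `cat` of the part files is enough.

```bash
# Concatenate all part*.csv files in REPORT_DIR into OUT_FILE.
# Assumption: each part file begins with the same header row, so the
# header is kept from the first file and dropped from the rest.
merge_parts() {
  local report_dir="$1" out_file="$2"
  awk 'FNR == 1 && NR != 1 { next } { print }' "$report_dir"/part*.csv > "$out_file"
}
```

For example, `merge_parts tpch/tpch_parquet_reports tpch_parquet.csv` writes the combined report to the current directory.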

### 5.1 TPCH reports

```bash
cd tpch/tpch_<parquet,orc,csv>_reports
# Results are written to the part*.csv files
```

### 5.2 TPCDS reports

```bash
cd tpcds/tpcds_<parquet,orc,csv,json>_reports
# Results are written to the part*.csv files
```
