TPC-H and TPC-DS performance benchmark for Spark with DeltaLake

Based on: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores

1. Clone submodules

git submodule init
git submodule update
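
If a later build step fails with missing sources, the submodules were probably not fetched. The sketch below checks that the three submodule directories used in this README (spark-sql-perf, tpcds-kit, tpch-dbgen) exist and are non-empty. `check_submodules` is a hypothetical helper, not part of this repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report submodule directories that are absent or empty.
check_submodules() {
  local root="${1:-.}" missing=0 dir
  for dir in spark-sql-perf tpcds-kit tpch-dbgen; do
    # An uninitialized submodule shows up as an empty (or absent) directory.
    if [ -z "$(ls -A "$root/$dir" 2>/dev/null)" ]; then
      echo "missing or empty: $dir"
      missing=1
    fi
  done
  return "$missing"
}

# Usage, from the repository root:
#   check_submodules . && echo "all submodules present"
```

`git submodule status` gives the same information in Git's own format, including the commit each submodule is pinned to.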

2. Prerequisites

2.1 Install Dependencies

sudo apt-get install unzip zip gcc make flex bison byacc git build-essential -y
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java 8.0.302-open
sdk install sbt 0.13.18

2.2 Download Spark

wget https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
mkdir spark3
tar -xzf spark-3.2.0-bin-hadoop3.2.tgz --strip 1 -C spark3

NOTE: Deploy Spark in standalone cluster mode.
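
A standalone deployment runs one master process plus one or more workers that register with it. The sketch below assumes Spark was unpacked into `./spark3` as above and that the master listens on Spark's default port 7077. `spark_master_url` is a hypothetical helper, and `<master-ip>` is a placeholder for your master's address.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the spark:// URL that workers and
# spark-submit use to reach a standalone master (default port 7077).
spark_master_url() {
  local host="$1" port="${2:-7077}"
  printf 'spark://%s:%s\n' "$host" "$port"
}

# On the master node:
#   ./spark3/sbin/start-master.sh
# On each worker node, replacing <master-ip> with the master's address:
#   ./spark3/sbin/start-worker.sh "$(spark_master_url <master-ip>)"
```

The same URL is what the master-ip setting in the benchmark .sh files should point to.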

2.3 Download sbt launcher

wget -P /tmp https://github.com/sbt/sbt/releases/download/v0.13.18/sbt-0.13.18.tgz
tar -xf /tmp/sbt-0.13.18.tgz  -C /tmp

3. Build

3.1 spark-sql-perf

cd spark-sql-perf
cp /tmp/sbt/bin/sbt-launch.jar build/sbt-launch-0.13.18.jar
bin/run
sbt +package

3.2 tpcds-kit

cd ../tpcds-kit/tools
make OS=LINUX

3.3 tpch-dbgen

cd ../../tpch-dbgen
git checkout 0469309147b42abac8857fa61b4cf69a6d3128a8 -- bm_utils.c
make

NOTE: tpcds-kit and tpch-dbgen must be built and installed at the same location on every cluster node.
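
A quick way to confirm that a node is ready is to check for the data generator binaries the two builds produce: `dsdgen` in tpcds-kit/tools and `dbgen` in tpch-dbgen. The sketch below does that for one checkout. `check_generators` is a hypothetical helper, and the path passed to it is whatever location you chose for the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify that the TPC-DS and TPC-H data generators
# were built under the given repository root.
check_generators() {
  local root="${1:-.}" bin status=0
  for bin in tpcds-kit/tools/dsdgen tpch-dbgen/dbgen; do
    if [ -x "$root/$bin" ]; then
      echo "ok: $bin"
    else
      echo "not built: $bin"
      status=1
    fi
  done
  return "$status"
}

# Usage on one node, from the repository root:
#   check_generators .
```

Running the same check on every node (for example over ssh) catches a node where the build was skipped or placed at a different path.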

4. Performance Benchmarking

NOTE: Edit the .sh files to set master-ip, executor-memory, num-executors, and executor-cores according to your machine specifications.
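
If you are unsure what values to start from, the sketch below derives them from the cluster size. It is only a rule of thumb, not something this repository prescribes: it assumes one core and 1 GB per node are left for the operating system and that five cores per executor is a reasonable default. `suggest_executors` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: suggest executor settings from cluster size.
# Arguments: <nodes> <cores per node> <memory per node in GB> [cores per executor]
suggest_executors() {
  local nodes="$1" cores="$2" mem_gb="$3" per_exec="${4:-5}"
  # Assumption: leave one core per node for the OS and daemons.
  local usable=$(( cores - 1 ))
  [ "$usable" -ge 1 ] || usable=1
  # Never ask for more cores per executor than a node can offer.
  [ "$per_exec" -le "$usable" ] || per_exec="$usable"
  local per_node=$(( usable / per_exec ))
  # Assumption: leave 1 GB per node for the OS, split the rest evenly.
  local exec_mem=$(( (mem_gb - 1) / per_node ))
  echo "num-executors=$(( nodes * per_node )) executor-cores=$per_exec executor-memory=${exec_mem}g"
}

# Usage: 4 worker nodes, each with 16 cores and 64 GB of RAM.
#   suggest_executors 4 16 64
```

Treat the output as a starting point and adjust it after watching a first run in the Spark web UI.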

4.1 TPCH

4.1.1 Parquet

cd ../tpch
# For generating ~100GB parquet data
./gendata_parquet.sh
# For running all 22 TPC-H queries
./runtpch_parquet.sh

4.1.2 ORC

cd ../tpch
# For generating ~100GB orc data
./gendata_orc.sh
# For running all 22 TPC-H queries
./runtpch_orc.sh

4.1.3 CSV

cd ../tpch
# For generating ~100GB csv data
./gendata_csv.sh
# For running all 22 TPC-H queries
./runtpch_csv.sh

4.1.4 JSON

cd ../tpch
# For generating ~100GB json data
./gendata_json.sh
# For running all 22 TPC-H queries
./runtpch_json.sh
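
The four TPC-H runs follow the same two steps, so they can be chained. The sketch below assumes the scripts are named `gendata_<format>.sh` and `runtpch_<format>.sh`, following the pattern of the commands above, and stops at the first failure. `run_tpch_formats` is a hypothetical helper, not part of this repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: generate data and run the queries for each format.
# Arguments: <benchmark directory> <format> [<format> ...]
run_tpch_formats() {
  local dir="$1" fmt
  shift
  for fmt in "$@"; do
    echo "== $fmt: generating data =="
    ( cd "$dir" && ./gendata_"$fmt".sh ) || return 1
    echo "== $fmt: running queries =="
    ( cd "$dir" && ./runtpch_"$fmt".sh ) || return 1
  done
}

# Usage, from the repository root:
#   run_tpch_formats tpch parquet orc csv json
```

Each format generates its own ~100GB data set, so check free disk space before running all four back to back.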

4.2 TPCDS

4.2.1 Parquet

cd ../tpcds
# For generating ~100GB parquet data
./gendata_parquet.sh
# For running all 99 TPC-DS queries
./runtpch_parquet.sh

4.2.2 ORC

cd ../tpcds
# For generating ~100GB orc data
./gendata_orc.sh
# For running all 99 TPC-DS queries
./runtpch_orc.sh

4.2.3 CSV

cd ../tpcds
# For generating ~100GB csv data
./gendata_csv.sh
# For running all 99 TPC-DS queries
./runtpch_csv.sh

4.2.4 JSON

cd ../tpcds
# For generating ~100GB json data
./gendata_json.sh
# For running all 99 TPC-DS queries
./runtpch_json.sh

## 5. Reports

### 5.1 TPCH reports

```bash
cd tpch/tpch_<parquet,orc,csv>_reports
# results are written to the part*.csv files
```

### 5.2 TPCDS reports

```bash
cd tpcds/tpcds_<parquet,orc,csv,json>_reports
# results are written to the part*.csv files
```
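
Spark may split a report across several part files. The sketch below prints every `part*.csv` file in a report directory in name order, and fails if none are found. `show_report` is a hypothetical helper, and the directory in the usage line is one instance of the naming pattern above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print all part*.csv files from one report directory.
show_report() {
  local dir="$1" f found=0
  for f in "$dir"/part*.csv; do
    # When nothing matches, the glob stays literal; skip that entry.
    [ -e "$f" ] || continue
    found=1
    cat "$f"
  done
  if [ "$found" -eq 0 ]; then
    echo "no part*.csv files in $dir" >&2
    return 1
  fi
}

# Usage, from the repository root:
#   show_report tpch/tpch_parquet_reports
```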
