Scripts and AWS results for perf section of super command doc #5506

Merged (6 commits) on Nov 27, 2024

2 changes: 1 addition & 1 deletion .markdownlint.yaml
@@ -11,4 +11,4 @@ whitespace: true
 MD010:
   code_blocks: false # Disallow hard tabs except in code blocks.
 MD033:
-  allowed_elements: ["p"]
+  allowed_elements: ["p","br"]
487 changes: 0 additions & 487 deletions docs/commands/search.sql

This file was deleted.

954 changes: 602 additions & 352 deletions docs/commands/super.md

Large diffs are not rendered by default.

106 changes: 106 additions & 0 deletions scripts/super-cmd-perf/README.md
@@ -0,0 +1,106 @@
# Query Performance From `super` Command Doc

These scripts were used to generate the results in the
[Performance](https://zed.brimdata.io/docs/next/commands/super#performance)
section of the [`super` command doc](https://zed.brimdata.io/docs/next/commands/super).
The scripts have been made available to allow for easy reproduction of the
results under different conditions and/or as tested systems evolve.

# Environments

The scripts were written to be easily run in two different environments.

## AWS

To provide an environment that's available to everyone, the scripts were
developed primarily for use on a "scratch" EC2 instance in [AWS](https://aws.amazon.com/).
Specifically, we chose the [`m6idn.2xlarge`](https://aws.amazon.com/ec2/instance-types/m6i/)
instance type, which has the following specifications:

* 8x vCPU
* 32 GB of RAM
* 474 GB NVMe instance SSD

The instance SSD in particular was seen as important to ensure consistent I/O
performance.

Assuming a freshly-created `m6idn.2xlarge` instance running Ubuntu 24.04, to
start the run:

```
curl -s https://github.com/brimdata/super/blob/main/scripts/super-cmd-perf/benchmark.sh | bash -xv 2>&1 | tee runlog.txt
```

**Contributor Author** commented: The URL referenced in this curl command line won't work until this PR merges.

The run proceeds in three phases:

1. **(AWS only)** The instance SSD is formatted and required tools & data platforms are downloaded/installed
2. Test data is downloaded and loaded into needed storage formats
3. Queries are executed on all data platforms

As the benchmarks may take a long time to run, the use of [`screen`](https://www.gnu.org/software/screen/)
or a similar "detachable" terminal tool is recommended in case your remote
network connection drops during a run.
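
For example, a minimal sketch of this approach (the session name here is arbitrary):

```
# Start a named screen session, then launch the benchmark inside it as shown above
screen -S super-perf

# If your connection drops (or after detaching with Ctrl-a d), log back in and
# reattach to the still-running session
screen -r super-perf
```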

## macOS/other

Whereas on [AWS](#aws) the scripts assume they're running in a "scratch"
environment where they may format the instance SSD for optimal storage and
install required software, on other systems such as macOS they assume the
required data platforms are already installed and skip straight ahead to
downloading/loading test data and then running queries.

On macOS, for instance, the needed software can first be installed via:

```
brew install hyperfine datafusion duckdb clickhouse go
go install github.com/brimdata/super/cmd/super@main
```

Then clone the [super repo](https://github.com/brimdata/super.git) and run the
benchmarks.

```
git clone https://github.com/brimdata/super.git
cd super/scripts/super-cmd-perf
./benchmark.sh
```

**Contributor Author** commented on lines +63 to +65: These commands are forward-looking because these scripts won't exist on main until this PR merges.

All test data will remain in this directory.

# Results

Results from the run will accumulate in a subdirectory named for the date/time
when the run started, e.g., `2024-11-19_01:10:30/`. In this directory, summary
reports will be created in files with `.md` and `.csv` extensions, and details
from each individual step in generating the results will be in files ending in
`.out`. If run on AWS using the [`curl` command line shown above](#aws), a
`runlog.txt` file holding the full console output of the entire run will also
be present.
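
As an illustrative sketch (the timestamp is an example, and the query-related files created by `run-queries.sh` are not shown), a results directory from an AWS run might contain something like:

```
$ ls 2024-11-19_01:10:30/
duckdb-parquet-create.out
duckdb-table-create.out
report_2024-11-19_01:10:30.md
runlog.txt
super-bsup-create.out
```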

An archive of results from our most recent run of the benchmarks on November
26, 2024 can be downloaded [here](https://super-cmd-perf.s3.us-east-2.amazonaws.com/2024-11-26_03-17-25.tgz).

# Debugging

The scripts are configured to exit immediately if a failure occurs during the
run. If you encounter a failure, look in the results directory for the `.out`
file mentioned last in the console output, as this will contain the detailed
error message from the operation that failed.
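
For example (the run directory and failing step shown here are purely illustrative, using one of the `.out` files created by `prep-data.sh`):

```
# Inspect the output of the step that failed to see its detailed error message
cat 2024-11-19_01:10:30/duckdb-table-create.out
```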

A problem we encountered when developing the scripts, and that you may also
encounter, is DuckDB running out of memory. Specifically, this happened when
we tried to run the scripts on an Intel-based MacBook with only 16 GB of
RAM, and this is part of why we used an AWS instance with 32 GB of RAM as the
reference platform. On the MacBooks, we found we could work around the memory
problem by telling DuckDB it could use more memory than its default
[80% heuristic for `memory_limit`](https://duckdb.org/docs/configuration/overview.html).
The scripts support an environment variable that makes it easy to increase this
value, e.g., we found the scripts ran successfully at 16 GB:

```
$ DUCKDB_MEMORY_LIMIT="16GB" ./benchmark.sh
```

Of course, this ultimately caused swapping on our MacBook and a significant
hit to performance, but it at least allowed the scripts to run without
failure.
97 changes: 97 additions & 0 deletions scripts/super-cmd-perf/benchmark.sh
@@ -0,0 +1,97 @@
#!/bin/bash -xv
set -euo pipefail
export RUNNING_ON_AWS_EC2=""

# If we can detect we're running on an AWS EC2 m6idn.2xlarge instance, we'll
# treat it as a scratch host, installing all needed software and using the
# local SSD for best I/O performance.
if command -v dmidecode && [ "$(sudo dmidecode --string system-uuid | cut -c1-3)" == "ec2" ] && [ "$(TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600") && curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-type)" == "m6idn.2xlarge" ]; then

export RUNNING_ON_AWS_EC2=true

sudo apt-get -y update
sudo apt-get -y upgrade
sudo apt-get -y install make gcc unzip hyperfine

# Prepare local SSD for best I/O performance
sudo fdisk -l /dev/nvme1n1
sudo mkfs.ext4 -E discard -F /dev/nvme1n1
sudo mount /dev/nvme1n1 /mnt
sudo chown ubuntu:ubuntu /mnt
sudo chmod 777 /mnt
echo 'export TMPDIR="/mnt/tmpdir"' >> "$HOME"/.profile
mkdir /mnt/tmpdir

# Install ClickHouse
if ! command -v clickhouse-client > /dev/null 2>&1; then
sudo apt-get install -y apt-transport-https ca-certificates curl gnupg
curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' | sudo gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" | sudo tee \
/etc/apt/sources.list.d/clickhouse.list
sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y clickhouse-client
fi

# Install DuckDB
if ! command -v duckdb > /dev/null 2>&1; then
curl -L -O https://github.com/duckdb/duckdb/releases/download/v1.1.3/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
sudo mv duckdb /usr/local/bin
fi

# Install Rust
curl -L -O https://static.rust-lang.org/dist/rust-1.82.0-x86_64-unknown-linux-gnu.tar.xz
tar xf rust-1.82.0-x86_64-unknown-linux-gnu.tar.xz
sudo rust-1.82.0-x86_64-unknown-linux-gnu/install.sh
# shellcheck disable=SC2016
echo 'export PATH="$PATH:$HOME/.cargo/bin"' >> "$HOME"/.profile

# Install DataFusion CLI
if ! command -v datafusion-cli > /dev/null 2>&1; then
cargo install datafusion-cli
fi

# Install Go
if ! command -v go > /dev/null 2>&1; then
curl -L -O https://go.dev/dl/go1.23.3.linux-amd64.tar.gz
sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.23.3.linux-amd64.tar.gz
# shellcheck disable=SC2016
echo 'export PATH="$PATH:/usr/local/go/bin:$HOME/go/bin"' >> "$HOME"/.profile
source "$HOME"/.profile
fi

# Install SuperDB
if ! command -v super > /dev/null 2>&1; then
git clone https://github.com/brimdata/super.git
cd super
make install
fi

cd scripts/super-cmd-perf

fi

rundir="$(date +%F_%T)"
mkdir "$rundir"
report="$rundir/report_$rundir.md"

echo -e "|**Software**|**Version**|\n|-|-|" | tee -a "$report"
for software in super duckdb datafusion-cli clickhouse
do
if ! command -v $software > /dev/null; then
echo "error: \"$software\" not found in PATH"
exit 1
fi
echo "|$software|$($software --version)|" | tee -a "$report"
done
echo >> "$report"

# Prepare the test data
./prep-data.sh "$rundir"

# Run the queries and generate the summary report
./run-queries.sh "$rundir"

if [ -n "$RUNNING_ON_AWS_EC2" ]; then
mv "$HOME/runlog.txt" "$rundir"
fi
58 changes: 58 additions & 0 deletions scripts/super-cmd-perf/prep-data.sh
@@ -0,0 +1,58 @@
#!/bin/bash -xv
set -euo pipefail
pushd "$(cd "$(dirname "$0")" && pwd)"

if [ "$#" -ne 1 ]; then
echo "Specify results directory string"
exit 1
fi
rundir="$(pwd)/$1"
mkdir -p "$rundir"

RUNNING_ON_AWS_EC2="${RUNNING_ON_AWS_EC2:-}"
if [ -n "$RUNNING_ON_AWS_EC2" ]; then
cd /mnt
fi

function run_cmd {
outputfile="$1"
shift
{ hyperfine \
--show-output \
--warmup 0 \
--runs 1 \
--time-unit second \
"$@" ;
} \
> "$outputfile" \
2>&1
}

mkdir gharchive_gz
cd gharchive_gz
for num in $(seq 0 23)
do
curl -L -O "https://data.gharchive.org/2023-02-08-${num}.json.gz"
done
cd ..

DUCKDB_MEMORY_LIMIT="${DUCKDB_MEMORY_LIMIT:-}"
if [ -n "$DUCKDB_MEMORY_LIMIT" ]; then
increase_duckdb_memory_limit='SET memory_limit = '\'"${DUCKDB_MEMORY_LIMIT}"\''; '
else
increase_duckdb_memory_limit=""
fi

run_cmd \
"$rundir/duckdb-table-create.out" \
"duckdb gha.db -c \"${increase_duckdb_memory_limit}CREATE TABLE gha AS FROM read_json('gharchive_gz/*.json.gz', union_by_name=true)\""

run_cmd \
"$rundir/duckdb-parquet-create.out" \
"duckdb gha.db -c \"${increase_duckdb_memory_limit}COPY (from gha) TO 'gha.parquet'\""

run_cmd \
"$rundir/super-bsup-create.out" \
"super -o gha.bsup gharchive_gz/*.json.gz"

du -h gha.db gha.parquet gha.bsup gharchive_gz
4 changes: 4 additions & 0 deletions scripts/super-cmd-perf/queries/agg.sql
@@ -0,0 +1,4 @@
SELECT count(),type
FROM '__SOURCE__'
WHERE repo.name='duckdb/duckdb'
GROUP BY type
3 changes: 3 additions & 0 deletions scripts/super-cmd-perf/queries/count.sql
@@ -0,0 +1,3 @@
SELECT count()
FROM '__SOURCE__'
WHERE actor.login='johnbieren'
3 changes: 3 additions & 0 deletions scripts/super-cmd-perf/queries/search+.spq
@@ -0,0 +1,3 @@
SELECT count()
FROM '__SOURCE__'
WHERE grep('in case you have any feedback 😊')