Scripts and AWS results for perf section of super command doc #5506

Merged · +1,562 −840 · 6 commits
- 15e2449 Scripts for running perf queries from super command doc (philrz)
- cc1ea58 Adjust scripts for running on main (philrz)
- e63b936 super command doc points at perf scripts and results on AWS (philrz)
- 83823b1 Remove search.sql from docs area since it's with the scripts (philrz)
- 81542ea Allow HTML br to line break before speed-up factors in results table (philrz)
- cc4595b Clarify early testing of ClickHouse JSON type (philrz)
**scripts/super-cmd-perf/README.md**
# Query Performance From `super` Command Doc

These scripts were used to generate the results in the
[Performance](https://zed.brimdata.io/docs/next/commands/super#performance)
section of the [`super` command doc](https://zed.brimdata.io/docs/next/commands/super).
The scripts have been made available to allow for easy reproduction of the
results under different conditions and/or as tested systems evolve.

# Environments

The scripts were written to be easily run in two different environments.

## AWS

As an environment that's available to everyone, the scripts were developed
primarily for use on a "scratch" EC2 instance in [AWS](https://aws.amazon.com/).
Specifically, we chose the [`m6idn.2xlarge`](https://aws.amazon.com/ec2/instance-types/m6i/)
instance that has the following specifications:

* 8x vCPU
* 32 GB of RAM
* 474 GB NVMe instance SSD

The instance SSD in particular was seen as important to ensure consistent I/O
performance.

Assuming a freshly-created `m6idn.2xlarge` instance running Ubuntu 24.04, to
start the run:
```
curl -s https://raw.githubusercontent.com/brimdata/super/main/scripts/super-cmd-perf/benchmark.sh | bash -xv 2>&1 | tee runlog.txt
```

The run proceeds in three phases:

1. **(AWS only)** The instance SSD is formatted and required tools & data platforms are downloaded/installed
2. Test data is downloaded and loaded into needed storage formats
3. Queries are executed on all data platforms

As the benchmarks may take a long time to run, the use of [`screen`](https://www.gnu.org/software/screen/)
or a similar "detachable" terminal tool is recommended in case your remote
network connection drops during a run.
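
For example, one way to kick off the AWS run inside a detached `screen` session (the session name `perf` is arbitrary):

```
# start a detached session named "perf" that runs the benchmarks
screen -dmS perf bash -c \
  'curl -s https://raw.githubusercontent.com/brimdata/super/main/scripts/super-cmd-perf/benchmark.sh | bash -xv 2>&1 | tee runlog.txt'

# reattach later to check progress (press Ctrl-a d to detach again)
screen -r perf
```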

## macOS/other

Whereas on [AWS](#aws) the scripts assume they're in a "scratch" environment
where they may format the instance SSD for optimal storage and install required
software, on other systems such as macOS it's assumed the required data
platforms are already installed, and the scripts skip ahead to
downloading/loading test data and then running queries.

For instance on macOS, the software needed can be first installed via:

```
brew install hyperfine datafusion duckdb clickhouse go
go install github.com/brimdata/super/cmd/super@main
```

Then clone the [super repo](https://github.com/brimdata/super.git) and run the
benchmarks:
```
git clone https://github.com/brimdata/super.git
cd super/scripts/super-cmd-perf
./benchmark.sh
```

> Review comment on these lines: These commands are forward-looking, since these scripts won't exist on `main` until this PR merges.

All test data will remain in this directory.

# Results

Results from the run will accumulate in a subdirectory named for the date/time
when the run started, e.g., `2024-11-19_01:10:30/`. In this directory, summary
reports will be created in files ending in `.md` and `.csv` extensions, and
details from each individual step in generating the results will be in files
ending in `.out`. If run on AWS using the [`curl` command line shown above](#aws),
a `runlog.txt` file holding the full console output of the entire run will also
be present.
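
As an illustration, a results directory might contain files like these (a partial sketch based on file names appearing in the scripts below; a real run produces additional `.out` and `.csv` files):

```
2024-11-19_01:10:30/
├── report_2024-11-19_01:10:30.md   # summary report started by benchmark.sh
├── duckdb-table-create.out         # per-step detail from prep-data.sh
├── duckdb-parquet-create.out
├── super-bsup-create.out
└── runlog.txt                      # present on AWS runs only
```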

An archive of results from our most recent run of the benchmarks on November
26, 2024 can be downloaded [here](https://super-cmd-perf.s3.us-east-2.amazonaws.com/2024-11-26_03-17-25.tgz).
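
To fetch and unpack that archive locally:

```
curl -L -O https://super-cmd-perf.s3.us-east-2.amazonaws.com/2024-11-26_03-17-25.tgz
tar xzf 2024-11-26_03-17-25.tgz
```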

# Debugging

The scripts are configured to exit immediately if failures occur during the
run. If you encounter a failure, look in the results directory for the `.out`
file mentioned last in the console output, as this will contain any detailed
error message from the operation that experienced the failure.
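
For instance, a quick way to dump the most recently modified `.out` file in a results directory (the directory name here is just an example):

```
cat "$(ls -t 2024-11-19_01:10:30/*.out | head -1)"
```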

A problem that was encountered when developing the scripts that you may also
encounter is DuckDB running out of memory. Specifically, this happened when
we tried to run the scripts on an Intel-based MacBook with only 16 GB of
RAM, and this is part of why we used an AWS instance with 32 GB of RAM as the
reference platform. On the MacBooks, we found we could work around the memory
problem by telling DuckDB it could use more memory than its default
[80% heuristic for `memory_limit`](https://duckdb.org/docs/configuration/overview.html)
would allow. The scripts support an environment variable to make it easy to
increase this value, e.g., we found the scripts ran successfully at 16 GB:

```
$ DUCKDB_MEMORY_LIMIT="16GB" ./benchmark.sh
```

Of course, this ultimately caused swapping on our MacBook and a significant
hit to performance, but it at least allowed the scripts to run without
failure.
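
To confirm what limit DuckDB is actually applying on a given machine, you can query its standard `current_setting()` function, e.g.:

```
duckdb -c "SELECT current_setting('memory_limit')"
```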

**scripts/super-cmd-perf/benchmark.sh**
```
#!/bin/bash -xv
set -euo pipefail
export RUNNING_ON_AWS_EC2=""

# If we can detect we're running on an AWS EC2 m6idn.2xlarge instance, we'll
# treat it as a scratch host, installing all needed software and using the
# local SSD for best I/O performance.
if command -v dmidecode && [ "$(sudo dmidecode --string system-uuid | cut -c1-3)" == "ec2" ] && [ "$(TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600") && curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-type)" == "m6idn.2xlarge" ]; then

  export RUNNING_ON_AWS_EC2=true

  sudo apt-get -y update
  sudo apt-get -y upgrade
  sudo apt-get -y install make gcc unzip hyperfine

  # Prepare local SSD for best I/O performance
  sudo fdisk -l /dev/nvme1n1
  sudo mkfs.ext4 -E discard -F /dev/nvme1n1
  sudo mount /dev/nvme1n1 /mnt
  sudo chown ubuntu:ubuntu /mnt
  sudo chmod 777 /mnt
  echo 'export TMPDIR="/mnt/tmpdir"' >> "$HOME"/.profile
  mkdir /mnt/tmpdir

  # Install ClickHouse
  if ! command -v clickhouse-client > /dev/null 2>&1; then
    sudo apt-get install -y apt-transport-https ca-certificates curl gnupg
    curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' | sudo gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg
    echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" | sudo tee \
      /etc/apt/sources.list.d/clickhouse.list
    sudo apt-get update
    sudo DEBIAN_FRONTEND=noninteractive apt-get install -y clickhouse-client
  fi

  # Install DuckDB
  if ! command -v duckdb > /dev/null 2>&1; then
    curl -L -O https://github.com/duckdb/duckdb/releases/download/v1.1.3/duckdb_cli-linux-amd64.zip
    unzip duckdb_cli-linux-amd64.zip
    sudo mv duckdb /usr/local/bin
  fi

  # Install Rust
  curl -L -O https://static.rust-lang.org/dist/rust-1.82.0-x86_64-unknown-linux-gnu.tar.xz
  tar xf rust-1.82.0-x86_64-unknown-linux-gnu.tar.xz
  sudo rust-1.82.0-x86_64-unknown-linux-gnu/install.sh
  # shellcheck disable=SC2016
  echo 'export PATH="$PATH:$HOME/.cargo/bin"' >> "$HOME"/.profile

  # Install DataFusion CLI
  if ! command -v datafusion-cli > /dev/null 2>&1; then
    cargo install datafusion-cli
  fi

  # Install Go
  if ! command -v go > /dev/null 2>&1; then
    curl -L -O https://go.dev/dl/go1.23.3.linux-amd64.tar.gz
    sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.23.3.linux-amd64.tar.gz
    # shellcheck disable=SC2016
    echo 'export PATH="$PATH:/usr/local/go/bin:$HOME/go/bin"' >> "$HOME"/.profile
    source "$HOME"/.profile
  fi

  # Install SuperDB
  if ! command -v super > /dev/null 2>&1; then
    git clone https://github.com/brimdata/super.git
    cd super
    make install
  fi

  cd scripts/super-cmd-perf

fi

rundir="$(date +%F_%T)"
mkdir "$rundir"
report="$rundir/report_$rundir.md"

echo -e "|**Software**|**Version**|\n|-|-|" | tee -a "$report"
for software in super duckdb datafusion-cli clickhouse
do
  if ! command -v "$software" > /dev/null; then
    echo "error: \"$software\" not found in PATH"
    exit 1
  fi
  echo "|$software|$($software --version)|" | tee -a "$report"
done
echo >> "$report"

# Prepare the test data
./prep-data.sh "$rundir"

# Run the queries and generate the summary report
./run-queries.sh "$rundir"

if [ -n "$RUNNING_ON_AWS_EC2" ]; then
  mv "$HOME/runlog.txt" "$rundir"
fi
```
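
The dense detection condition at the top of the script packs three checks into one line. Unpacked for readability, the logic is roughly this sketch (not part of the script itself):

```
# 1. dmidecode must be present (i.e., we're on a Linux host with DMI data)
command -v dmidecode

# 2. The DMI system UUID starts with "ec2", the marker of an EC2 instance
sudo dmidecode --string system-uuid | cut -c1-3   # expect "ec2"

# 3. Via IMDSv2, fetch a metadata session token, then ask the instance
#    metadata service for the instance type; it must be "m6idn.2xlarge"
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-type
```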

**scripts/super-cmd-perf/prep-data.sh**
```
#!/bin/bash -xv
set -euo pipefail
pushd "$(cd "$(dirname "$0")" && pwd)"

if [ "$#" -ne 1 ]; then
  echo "Specify results directory string"
  exit 1
fi
rundir="$(pwd)/$1"
mkdir -p "$rundir"

RUNNING_ON_AWS_EC2="${RUNNING_ON_AWS_EC2:-}"
if [ -n "$RUNNING_ON_AWS_EC2" ]; then
  cd /mnt
fi

function run_cmd {
  outputfile="$1"
  shift
  { hyperfine \
      --show-output \
      --warmup 0 \
      --runs 1 \
      --time-unit second \
      "$@" ;
  } \
    > "$outputfile" \
    2>&1
}

mkdir gharchive_gz
cd gharchive_gz
for num in $(seq 0 23)
do
  curl -L -O "https://data.gharchive.org/2023-02-08-${num}.json.gz"
done
cd ..

DUCKDB_MEMORY_LIMIT="${DUCKDB_MEMORY_LIMIT:-}"
if [ -n "$DUCKDB_MEMORY_LIMIT" ]; then
  increase_duckdb_memory_limit='SET memory_limit = '\'"${DUCKDB_MEMORY_LIMIT}"\''; '
else
  increase_duckdb_memory_limit=""
fi

run_cmd \
  "$rundir/duckdb-table-create.out" \
  "duckdb gha.db -c \"${increase_duckdb_memory_limit}CREATE TABLE gha AS FROM read_json('gharchive_gz/*.json.gz', union_by_name=true)\""

run_cmd \
  "$rundir/duckdb-parquet-create.out" \
  "duckdb gha.db -c \"${increase_duckdb_memory_limit}COPY (from gha) TO 'gha.parquet'\""

run_cmd \
  "$rundir/super-bsup-create.out" \
  "super -o gha.bsup gharchive_gz/*.json.gz"

du -h gha.db gha.parquet gha.bsup gharchive_gz
```
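
To make the quoting in `increase_duckdb_memory_limit` concrete: with `DUCKDB_MEMORY_LIMIT="16GB"` in the environment, the first `run_cmd` above ends up timing a command equivalent to:

```
duckdb gha.db -c "SET memory_limit = '16GB'; CREATE TABLE gha AS FROM read_json('gharchive_gz/*.json.gz', union_by_name=true)"
```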

**SQL query: count events by type for the `duckdb/duckdb` repo**
```
SELECT count(),type
FROM '__SOURCE__'
WHERE repo.name='duckdb/duckdb'
GROUP BY type
```

**SQL query: count events for a single GitHub user**
```
SELECT count()
FROM '__SOURCE__'
WHERE actor.login='johnbieren'
```

**SQL query: text search for a phrase across event data**
```
SELECT count()
FROM '__SOURCE__'
WHERE grep('in case you have any feedback 😊')
```
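
Each query reads from a `__SOURCE__` placeholder that the run script substitutes with the data source under test. `run-queries.sh` itself is not shown in this rendering of the diff, so as a sketch only, such a substitution could be done sed-style (the file name `query.sql` is hypothetical):

```
# hypothetical: render a query against one of the prepared sources
source="gha.parquet"
sed "s|__SOURCE__|$source|" query.sql
```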

> Review comment on the README's `curl` command line: the URL referenced won't work until this PR merges.