Commit

[rapids] Test dask rapids (#105)
* [rapids] Tested to produce images with dask rapids pre-installed

* [rapids] all cuda and rapids images are built successfully

custom_image_utils/shell_script_generator.py
* added retry function
* reduced debug output when -x is passed

examples/secure-boot/build-current-images.sh
* using the same timestamp for all images
* added perl code to disk_usage to extract maximum disk usage during installation
* making request for image and instance list once here instead of each
  time an image is about to be created

examples/secure-boot/install_gpu_driver.sh
* add is_debuntu to simplify common is_debian || is_ubuntu use case
* prepend clean to execute_with_retries to reduce maximum image size
* downloading assets to ram disk to reduce maximum image size
* added sync to key points of the script to allow accurate measurement
  of maximum disk usage
* moved initialization steps to new function prepare_to_install
* created exit_handler function which gets executed before script exits

examples/secure-boot/pre-init.screenrc
* re-ordered versions alphabetically

examples/secure-boot/pre-init.sh
* re-using cache of image and instance list from parent script
* increased machine type to accommodate ram disk
* updated disk image sizes based on disk usage analysis
* removed dask intermediate image

examples/secure-boot/rapids.sh
* removed spark-related logic (see spark-rapids)
* remove dask packages if they were installed
* moved initialization steps to new function prepare_to_install
* created exit_handler function which gets executed before script exits

* updated disk sizes from latest successful run

* merged rapids.sh and dask.sh and refactored ; removed spark code paths (see spark-rapids)

* reducing noise a little more

* declaring default value for num_src_certs

* corrected alphabetical ordering

* using correct df command ; using greater or equal to rapids version ; correctly capturing retval of installer program

* dask>=2024.7
cjac authored Oct 28, 2024
1 parent f372ad2 commit cab430a
Showing 8 changed files with 1,109 additions and 459 deletions.
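
The commit message describes moving initialization into prepare_to_install and adding an exit_handler that runs before the script exits. A minimal sketch of that pattern follows, assuming the handler is registered with `trap ... EXIT`. The relevant parts of install_gpu_driver.sh and rapids.sh are not shown on this page, so the function bodies and the `workdir` variable are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: only the names exit_handler and prepare_to_install
# come from the commit message; the bodies and workdir are assumptions.

exit_handler() {
  # Runs on every exit path of the shell that registered it.
  echo "exit_handler: removing ${workdir}"
  rm -rf "${workdir}"
}

prepare_to_install() {
  # Create scratch space, then register cleanup so it cannot be skipped.
  workdir="$(mktemp -d)"
  trap exit_handler EXIT
}

# The body is a subshell, so the EXIT trap fires when main finishes.
main() (
  prepare_to_install
  echo "installing with scratch space in ${workdir}"
)

main
```

Registering the trap inside prepare_to_install, after the scratch directory exists, means the handler never runs against an unset `workdir`.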
47 changes: 28 additions & 19 deletions custom_image_utils/shell_script_generator.py
@@ -30,16 +30,27 @@
base_obj_type="images"
function execute_with_retries() (
set +x
local -r cmd="$*"
for ((i = 0; i < 3; i++)); do
if eval "$cmd"; then return 0 ; fi
sleep 5
done
return 1
)
function exit_handler() {{
echo 'Cleaning up before exiting.'
if [[ -f /tmp/{run_id}/vm_created ]]; then
echo 'Deleting VM instance.'
gcloud compute instances delete {image_name}-install \
execute_with_retries gcloud compute instances delete {image_name}-install \
--project={project_id} --zone={zone} -q
elif [[ -f /tmp/{run_id}/disk_created ]]; then
echo 'Deleting disk.'
gcloud compute ${{base_obj_type}} delete {image_name}-install --project={project_id} --zone={zone} -q
execute_with_retries gcloud compute ${{base_obj_type}} delete {image_name}-install --project={project_id} --zone={zone} -q
fi
echo 'Uploading local logs to GCS bucket.'
@@ -99,6 +110,7 @@
done
local cert_args=""
local num_src_certs="0"
if [[ -n '{trusted_cert}' ]] && [[ -f '{trusted_cert}' ]]; then
# build tls/ directory from variables defined near the header of
# the examples/secure-boot/create-key-pair.sh file
@@ -124,9 +136,9 @@
local -a src_img_modulus_md5sums=()
mapfile -t src_img_modulus_md5sums < <(print_img_dbs_modulus_md5sums {dataproc_base_image})
local num_src_certs="${{#src_img_modulus_md5sums[@]}}"
num_src_certs="${{#src_img_modulus_md5sums[@]}}"
echo "${{num_src_certs}} db certificates attached to source image"
if [[ ${{num_src_certs}} -eq 0 ]]; then
if [[ "${{num_src_certs}}" -eq "0" ]]; then
echo "no db certificates in source image"
cert_list=default_cert_list
else
@@ -153,7 +165,7 @@
fi
date
set -x
if [[ -z "${{cert_args}}" && "${{num_src_certs}}" -ne "0" ]]; then
echo 'Re-using base image'
base_obj_type="reuse"
@@ -163,7 +175,7 @@
echo 'Creating image.'
base_obj_type="images"
instance_disk_args='--image-project={project_id} --image={image_name}-install --boot-disk-size={disk_size}G --boot-disk-type=pd-ssd'
time gcloud compute images create {image_name}-install \
time execute_with_retries gcloud compute images create {image_name}-install \
--project={project_id} \
--source-image={dataproc_base_image} \
${{cert_args}} \
@@ -174,20 +186,19 @@
echo 'Creating disk.'
base_obj_type="disks"
instance_disk_args='--disk=auto-delete=yes,boot=yes,mode=rw,name={image_name}-install'
time gcloud compute disks create {image_name}-install \
time execute_with_retries gcloud compute disks create {image_name}-install \
--project={project_id} \
--zone={zone} \
--image={dataproc_base_image} \
--type=pd-ssd \
--size={disk_size}GB
touch "/tmp/{run_id}/disk_created"
fi
set +x
date
echo 'Creating VM instance to run customization script.'
set -x
time gcloud compute instances create {image_name}-install \
( set -x
time execute_with_retries gcloud compute instances create {image_name}-install \
--project={project_id} \
--zone={zone} \
{network_flag} \
@@ -199,24 +210,23 @@
{service_account_flag} \
--scopes=cloud-platform \
{metadata_flag} \
--metadata-from-file startup-script=startup_script/run.sh
set +x
--metadata-from-file startup-script=startup_script/run.sh )
touch /tmp/{run_id}/vm_created
# clean up intermediate install image
if [[ "${{base_obj_type}}" == "images" ]] ; then
gcloud compute images delete -q {image_name}-install --project={project_id}
execute_with_retries gcloud compute images delete -q {image_name}-install --project={project_id}
fi
echo 'Waiting for customization script to finish and VM shutdown.'
gcloud compute instances tail-serial-port-output {image_name}-install \
execute_with_retries gcloud compute instances tail-serial-port-output {image_name}-install \
--project={project_id} \
--zone={zone} \
--port=1 2>&1 \
| grep 'startup-script' \
| sed -e 's/ {image_name}-install.*startup-script://g' \
| dd bs=64 of={log_dir}/startup-script.log \
| dd bs=1 of={log_dir}/startup-script.log \
|| true
echo 'Checking customization script result.'
date
@@ -233,14 +243,13 @@
date
echo 'Creating custom image.'
set -x
time gcloud compute images create {image_name} \
( set -x
time execute_with_retries gcloud compute images create {image_name} \
--project={project_id} \
--source-disk-zone={zone} \
--source-disk={image_name}-install \
{storage_location_flag} \
--family={family}
set +x
--family={family} )
touch /tmp/{run_id}/image_created
}}
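
The execute_with_retries helper added in this file joins its arguments with `"$*"` and re-parses them through `eval`, so an argument containing spaces may be split on the retry path. A possible variant that keeps argument boundaries by running `"$@"` directly is sketched here. The names `retry_cmd`, `RETRY_ATTEMPTS`, and `RETRY_DELAY` are hypothetical and not part of the commit; the defaults of 3 attempts and a 5 second pause mirror the values in the diff.

```shell
#!/usr/bin/env bash
# Hypothetical variant of the retry helper; retry_cmd, RETRY_ATTEMPTS and
# RETRY_DELAY are illustrative names, not part of the commit.

retry_cmd() {
  local attempts="${RETRY_ATTEMPTS:-3}"
  local delay="${RETRY_DELAY:-5}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # "$@" runs the command with its original argument boundaries intact.
    if "$@"; then return 0; fi
    i=$((i + 1))
    # Pause between attempts, but not after the final failure.
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
  done
  return 1
}

# Usage: the quoted argument stays a single word on every attempt.
retry_cmd printf '%s\n' "two words"
```

Unlike the `eval` form, this variant cannot run a string containing pipes or redirections; those would need to be wrapped in a function or `bash -c`.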
8 changes: 4 additions & 4 deletions examples/secure-boot/README.md
@@ -52,8 +52,8 @@ in the file examples/secure-boot/env.json.sample.
```bash
cp examples/secure-boot/env.json.sample env.json
vi env.json
docker build -t dataproc-custom-images:latest .
docker run -it dataproc-custom-images:latest /bin/bash examples/secure-boot/cuda.sh
docker build -t dataproc-cuda-pre-init:latest .
docker run -it dataproc-cuda-pre-init:latest /bin/bash examples/secure-boot/cuda.sh
```

To do the same, but for all dataproc variants including supported
@@ -64,6 +64,6 @@ script can be run in docker:
```bash
cp examples/secure-boot/env.json.sample env.json
vi env.json
docker build -t dataproc-custom-images:latest .
docker run -it dataproc-custom-images:latest /bin/bash examples/secure-boot/build-current-images.sh
docker build -t dataproc-dask-rapids-pre-init:latest .
docker run -it dataproc-dask-rapids-pre-init:latest /bin/bash examples/secure-boot/build-current-images.sh
```
63 changes: 44 additions & 19 deletions examples/secure-boot/build-current-images.sh
@@ -15,7 +15,7 @@
#
# This script creates a custom image pre-loaded with cuda

set -e
set -ex

function configure_service_account() {
# Create service account
@@ -84,25 +84,50 @@ configure_service_account
# screen session name
session_name="build-current-images"

# Run all image generation scripts simultaneously
readonly timestamp="$(date +%F-%H-%M)"
#readonly timestamp="2024-10-24-04-21"
export timestamp

export tmpdir=/tmp/${timestamp};
mkdir ${tmpdir}
export ZONE="$(jq -r .ZONE env.json)"
gcloud compute instances list --zones "${ZONE}" --format json > ${tmpdir}/instances.json
gcloud compute images list --format json > ${tmpdir}/images.json

# Run generation scripts simultaneously for each dataproc image version
screen -US "${session_name}" -c examples/secure-boot/pre-init.screenrc

# tail -n 3 /tmp/custom-image-cuda-pre-init-2-*/logs/workflow.log
# grep -A6 'Filesystem.*Avail' /tmp/custom-image-cuda-pre-init-2-*/logs/startup-script.log | perl -ne 'print $1,$/ if( m:( Filesystem.* Avail.*| /dev/.*/\s*$|^--): )'
# tail -n 3 /tmp/custom-image-*/logs/workflow.log
# tail -n 3 /tmp/custom-image-*/logs/startup-script.log
# tail -n 3 /tmp/custom-image-${PURPOSE}-2-*/logs/workflow.log
function find_disk_usage() {
test -f /tmp/genline.pl || cat > /tmp/genline.pl<<'EOF'
#!/usr/bin/perl -w
use strict;
my $fn = $ARGV[0];
my( $config ) = ( $fn =~ /custom-image-(.*-(debian|rocky|ubuntu)\d+)-\d+/ );
my @raw_lines = <STDIN>;
my( $l ) = grep { m: /dev/.*/\s*$: } @raw_lines;
my( $stats ) = ( $l =~ m:\s*/dev/\S+\s+(.*?)\s*$: );
my( $dp_version ) = ($config =~ /-pre-init-(.+)/);
$dp_version =~ s/-/./;
my($max) = map { / maximum-disk-used: (\d+)/ } @raw_lines;
$max+=3;
my $i_dp_version = sprintf(q{%-15s}, qq{"$dp_version"});
print( qq{ $i_dp_version) disk_size_gb="$max" ;; # $stats # $config}, $/ );
EOF
for f in $(grep -l 'Customization script suc' /tmp/custom-image-*/logs/workflow.log|sed -e 's/workflow.log/startup-script.log/')
do
grep -A20 'Filesystem.*Avail' $f | perl /tmp/genline.pl $f
done
}

revoke_bindings
# sleep 8m ; grep 'Customization script' /tmp/custom-image-*/logs/workflow.log
# grep maximum-disk-used /tmp/custom-image-*/logs/startup-script.log

#
# disk size - 20241009
#
# Filesystem Size Used Avail Use% Mounted on

# /dev/sda1 40G 29G 9.1G 76% / # 2.0-debian10
# /dev/sda2 33G 30G 3.4G 90% / # 2.0-rocky8
# /dev/sda1 36G 29G 7.0G 81% / # 2.0-ubuntu18
# /dev/sda1 40G 35G 2.7G 93% / # 2.1-debian11
# /dev/sda2 36G 33G 3.4G 91% / # 2.1-rocky8
# /dev/root 36G 34G 2.1G 95% / # 2.1-ubuntu20
# /dev/sda1 40G 37G 1.1G 98% / # 2.2-debian12
# /dev/sda2 54G 34G 21G 63% / # 2.2-rocky9
# /dev/root 39G 37G 2.4G 94% / # 2.2-ubuntu22
revoke_bindings
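
The find_disk_usage helper above reads `maximum-disk-used: N` lines from each startup-script.log and adds 3 to derive a disk_size_gb value. A rough shell-only sketch of that sizing step is shown below. It assumes the log line format implied by the Perl regex and, unlike the Perl code (which uses the first match), takes the largest value reported. The function name `suggest_disk_size` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; suggest_disk_size is an illustrative name.
# Reads log text on stdin and prints the largest "maximum-disk-used"
# value plus a headroom (default 3, matching the "+3" in the Perl code).

suggest_disk_size() {
  local headroom="${1:-3}"
  awk -v headroom="$headroom" '
    match($0, /maximum-disk-used: [0-9]+/) {
      # The label "maximum-disk-used: " is 19 characters long.
      n = substr($0, RSTART + 19, RLENGTH - 19) + 0
      if (n > max) max = n
      found = 1
    }
    END { if (found) print max + headroom; else exit 1 }
  '
}

# Usage with made-up sample lines (not real measurements):
printf 'a maximum-disk-used: 29\nb maximum-disk-used: 37\n' | suggest_disk_size
```

The function returns a non-zero status when no matching line is present, so a caller can tell a missing measurement apart from a small one.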