[rapids] removed spark tests, updated to a more recent rapids release #1219

Merged
merged 96 commits into from
Oct 26, 2024
Commits
260c707
[gpu] clean-up of sources.list and keyring file assertion
cjac Sep 30, 2024
c248aaf
merge from master
cjac Oct 4, 2024
f942492
merged from custom-images/examples/secure-boot/install_gpu_driver.sh
cjac Oct 10, 2024
d6a86cb
added comments for difficult to understand functions
cjac Oct 10, 2024
3e8007e
tested with 24.06 ; using conda for cuda 12
cjac Aug 8, 2024
c385546
tested with 24.06 ; using conda for cuda 12
cjac Aug 8, 2024
4bf628a
removed os check functions and the use of them
cjac Aug 9, 2024
e370b80
capturing runtime of mamba install
cjac Aug 9, 2024
cecf837
retry failed mamba with conda
cjac Aug 9, 2024
6f91fb1
increase machine type ; reduce disk size ; test 11.8 (12.4 is default)
cjac Aug 9, 2024
48205a8
spark does not yet have 24.08.0
cjac Aug 9, 2024
d085df2
tested with 2.1 and 2.2
cjac Aug 9, 2024
aae3c86
always create environment ; run test scripts with python from envs/da…
cjac Aug 9, 2024
eb95860
skipping dask with yarn runtime tests for now
cjac Aug 10, 2024
9a4d536
added copyright block
cjac Aug 10, 2024
97dd7ad
temporary changes to improve test performance
cjac Aug 10, 2024
86c7671
increasing machine type, attempting 2024.06 again now that I have fix…
cjac Aug 10, 2024
151597f
refactored code a bit
cjac Aug 10, 2024
a1ab571
how did this get in this change?
cjac Aug 11, 2024
62262db
we are seeing an error in this config file ; investigate
cjac Aug 11, 2024
77f9fa0
temporary changes to improve test performance
cjac Aug 10, 2024
8ccbc27
Adding disable shielded boot flag and disk type ssd flag to enhance t…
prince-cs Aug 8, 2024
25f0d96
tested on debian11 w/ cuda11
cjac Aug 12, 2024
c6991e8
added skein tests for dask-yarn
cjac Aug 12, 2024
52f5fec
accidentally using the wrong bigtable.sh in this PR ; checking out ma…
cjac Aug 12, 2024
aad851a
using correct conda env for dask-yarn environment
cjac Aug 12, 2024
e20aa9a
added skein test for dask
cjac Aug 12, 2024
fd9449b
that was the wrong filename
cjac Aug 12, 2024
c69d951
perform the skein tests before skipping the dask ones
cjac Aug 13, 2024
5b23ddb
whitespace changes
cjac Aug 13, 2024
536aef9
removing the excessive logging
cjac Aug 13, 2024
b476bae
taking master hostname from argv ; added array test
cjac Aug 13, 2024
f7aed92
defining two separate services to ease debugging
cjac Aug 13, 2024
c9d41f4
dask service tests are passing
cjac Aug 14, 2024
b6273c8
refactored yarn tests to its own py file ; updated rapids.sh to separ…
cjac Aug 14, 2024
8d18024
tested with debian and rocky
cjac Aug 14, 2024
f88df7b
added skein test
cjac Aug 14, 2024
d71470f
reduced operations slightly when setting master hostname
cjac Aug 14, 2024
aa68bc8
python operators. amirite?
cjac Aug 14, 2024
facb14b
status fails ; list-units | grep works
cjac Aug 14, 2024
8559fdd
explicitly including cudf
cjac Aug 14, 2024
c3ea723
corrected variable name
cjac Aug 14, 2024
6a14ff1
working with cuda12 + yarn as dask runtime
cjac Aug 14, 2024
8e93293
removed pinning for numba as per jakirkham
cjac Aug 14, 2024
1b82dc1
easing the version constraints some
cjac Aug 14, 2024
7d65472
fully changing the variable name
cjac Aug 15, 2024
7cdf483
removing test_skein.py
cjac Aug 15, 2024
ca74b49
removed extra lines from rebase
cjac Aug 15, 2024
2e7979f
reducing line count
cjac Aug 15, 2024
de965fa
relaxed cuda version to 11.8
cjac Aug 15, 2024
d01e349
disabling rocky9 tests for now
cjac Aug 16, 2024
6aa28a3
skipping the whole test on rocky9 for now
cjac Aug 16, 2024
467ce89
trying 24.08
cjac Aug 16, 2024
33b8d5e
increase max cluster age for rocky9 ; using CUDA_VERSION=11.8 for non…
cjac Aug 16, 2024
2c1c6a0
increase timeout for init actions as well as max-age from previous co…
cjac Aug 16, 2024
f4b6dda
reverted attempt to change a r/o variable
cjac Aug 16, 2024
d72bb06
trying with 24.08
cjac Aug 17, 2024
e22cb45
removing spark from the rapids tests
cjac Aug 17, 2024
973c81b
2.2.20 is known to work
cjac Sep 23, 2024
9963dfb
using new fangled key management path
cjac Sep 23, 2024
5bbb8fc
explicitly specifying path to curl ; also installing curl
cjac Sep 23, 2024
ee13c9a
perform update before install
cjac Sep 23, 2024
c28bb4b
modified to run as a custom-images script
cjac Oct 11, 2024
531a472
remove delta from master for gpu/
cjac Oct 11, 2024
062f087
recently tested to have worked with n1-standard-4 and 54GB
cjac Oct 11, 2024
050f8c4
reduce log noise from Dockerfile
cjac Oct 11, 2024
aa4afb9
removing delta from dask on master
cjac Oct 11, 2024
c75d120
update verify_dask_instance test to use systemd unit defined in dask …
cjac Oct 11, 2024
85ac0ac
removing quotes from systemctl command
cjac Oct 14, 2024
3314334
protecting from empty string state
cjac Oct 14, 2024
c158a55
replacing removed dask-runtime=yarn instance test
cjac Oct 14, 2024
3eda60d
[dask-rapids] merge from custom-images
cjac Oct 24, 2024
dbfa4c0
revert to master
cjac Oct 24, 2024
1c9c7fe
refactored to match dask ; removed all spark code paths (see spark-ra…
cjac Oct 24, 2024
1c7a31d
added some testing helpers and documentation
cjac Oct 25, 2024
caf9307
dask-yarn tests do not work ; disabling until new release of dask-yar…
cjac Oct 25, 2024
7fdda0c
increase max idle time ; print the command to be run
cjac Oct 25, 2024
dd12f02
cleaned up comment positioning and content
cjac Oct 25, 2024
5cd3951
using ram disk for temp files if we have it
cjac Oct 25, 2024
3519fe0
double quotes will allow temp directory variable to be expanded corre…
cjac Oct 25, 2024
12e253d
using else instead of is_rocky
cjac Oct 25, 2024
e8a44fe
corrected release version names
cjac Oct 25, 2024
caab9be
revert to mainline
cjac Oct 25, 2024
6d900bf
simplify and modernize this comment
cjac Oct 25, 2024
13cb723
default to using internal IP ; have not yet renamed rapids to dask-ra…
cjac Oct 25, 2024
aec628d
prepare layout for rename of rapids to dask-rapids
cjac Oct 25, 2024
8c67d21
reduce noise from docker run
cjac Oct 25, 2024
a31f10c
reduce noise in docker build
cjac Oct 25, 2024
a6fa424
removing older GPU from list
cjac Oct 25, 2024
e5b6e3f
removing delta from master
cjac Oct 25, 2024
f0f906a
Merge branch 'GoogleCloudDataproc:master' into rapids-20240806
cjac Oct 25, 2024
5b93e3a
Thread.yield()
cjac Oct 25, 2024
38ba6e3
improved documentation
cjac Oct 25, 2024
91907ae
default to non-private ip ; maybe that is why this last run failed
cjac Oct 25, 2024
6d8c32b
revert dataproc_test_case.py to last known good
cjac Oct 26, 2024
7c8ce57
using correct df command ; using greater or equal to rapids version ;…
cjac Oct 26, 2024
16 changes: 12 additions & 4 deletions cloudbuild/Dockerfile
@@ -1,4 +1,4 @@
-# This Dockerfile spins up a container where presubmit tests are run.
+# This Dockerfile builds the container from which presubmit tests are run
# Cloud Build orchestrates this process.

FROM gcr.io/cloud-builders/gcloud
@@ -9,8 +9,16 @@ COPY --chown=ia-tests:ia-tests . /init-actions

# Install Bazel:
# https://docs.bazel.build/versions/master/install-ubuntu.html
-RUN echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | tee /etc/apt/sources.list.d/bazel.list
-RUN curl https://bazel.build/bazel-release.pub.gpg | apt-key add -
-RUN apt-get update && apt-get install -y openjdk-8-jdk python3-setuptools bazel
+ENV bazel_kr_path=/usr/share/keyrings/bazel-keyring.gpg
+RUN apt-get install -y -qq curl >/dev/null 2>&1 && \
+    apt-get clean
+RUN /usr/bin/curl https://bazel.build/bazel-release.pub.gpg | \
+    gpg --dearmor -o "${bazel_kr_path}"
+RUN echo "deb [arch=amd64 signed-by=${bazel_kr_path}] http://storage.googleapis.com/bazel-apt stable jdk1.8" | \
+    dd of=/etc/apt/sources.list.d/bazel.list status=none && \
+    apt-get update -qq
+RUN apt-get autoremove -y -qq && \
+    apt-get install -y -qq openjdk-8-jdk python3-setuptools bazel >/dev/null 2>&1 && \
+    apt-get clean

USER ia-tests
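
The hunk above swaps `apt-key add` for a dearmored keyring that the repository entry references through `signed-by`. As a rough illustration of the line that ends up in `bazel.list`, here is a small Python sketch. The function name `apt_source_line` is hypothetical and not part of this PR; the keyring path and repository string are copied from the Dockerfile.

```python
def apt_source_line(keyring_path, repo, arch="amd64"):
    """Render a one-line apt source entry tied to a specific keyring file.

    Hypothetical helper for illustration only; the PR builds this string
    with `echo ... | dd of=...` inside the Dockerfile.
    """
    return "deb [arch={} signed-by={}] {}".format(arch, keyring_path, repo)


# Values taken from cloudbuild/Dockerfile in this diff.
bazel_kr_path = "/usr/share/keyrings/bazel-keyring.gpg"
bazel_repo = "http://storage.googleapis.com/bazel-apt stable jdk1.8"

line = apt_source_line(bazel_kr_path, bazel_repo)
print(line)
```

With `signed-by`, the Bazel key is trusted for this one repository only, instead of being added to the system-wide trusted set as `apt-key add` did.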
1 change: 1 addition & 0 deletions cloudbuild/presubmit.sh
@@ -70,6 +70,7 @@ determine_tests_to_run() {
changed_dir="${changed_dir%%/*}/"
# Run all tests if common directories modified
if [[ ${changed_dir} =~ ^(integration_tests|util|cloudbuild)/$ ]]; then
+continue # remove this before squash/merge

Review comment (Member): PTAL

echo "All tests will be run: '${changed_dir}' was changed"
TESTS_TO_RUN=(":DataprocInitActionsTestSuite")
return 0
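
To make the matching rule in this hunk easier to follow, here is a minimal Python sketch of the same check. It assumes `changed_dir` holds a repository-relative path before the `${changed_dir%%/*}/` expansion; the names `top_level_dir` and `touches_common_dir` are hypothetical and do not appear in `presubmit.sh`.

```python
import re

# Same pattern as the [[ ... =~ ... ]] test in determine_tests_to_run().
COMMON_DIRS = re.compile(r"^(integration_tests|util|cloudbuild)/$")


def top_level_dir(path):
    """Mirror "${changed_dir%%/*}/": keep the text before the first slash."""
    return path.split("/", 1)[0] + "/"


def touches_common_dir(path):
    """True when a changed path lives under one of the shared directories."""
    return COMMON_DIRS.match(top_level_dir(path)) is not None


for changed in ("cloudbuild/Dockerfile", "rapids/rapids.sh", "utilities/x.sh"):
    print(changed, touches_common_dir(changed))
```

While the temporary `continue` is in place, a match moves on to the next changed directory, so the lines that select `:DataprocInitActionsTestSuite` are never reached.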
2 changes: 2 additions & 0 deletions cloudbuild/run-presubmit-on-k8s.sh
@@ -12,6 +12,8 @@ gcloud container clusters get-credentials "${CLOUDSDK_CONTAINER_CLUSTER}"

LOGS_SINCE_TIME=$(date --iso-8601=seconds)

+# This kubectl sometimes fails because services have not caught up. Thread.yield()
+sleep 10s
kubectl run "${POD_NAME}" \
--image="${IMAGE}" \
--restart=Never \
18 changes: 11 additions & 7 deletions integration_tests/dataproc_test_case.py
@@ -17,7 +17,7 @@

FLAGS = flags.FLAGS
flags.DEFINE_string('image', None, 'Dataproc image URL')
-flags.DEFINE_string('image_version', None, 'Dataproc image version, e.g. 1.4')
+flags.DEFINE_string('image_version', None, 'Dataproc image version, e.g. 2.2')
flags.DEFINE_boolean('skip_cleanup', False, 'Skip cleanup of test resources')
FLAGS(sys.argv)

@@ -122,9 +122,9 @@ def createCluster(self,
args.append("--public-ip-address")

for i in init_actions:
-if "install_gpu_driver.sh" in i or \
-"mlvm.sh" in i or "rapids.sh" in i or \
-"spark-rapids.sh" in i or "horovod.sh" in i:
+if "install_gpu_driver.sh" in i or "horovod.sh" in i or \
+"dask-rapids.sh" in i or "mlvm.sh" in i or \
+"spark-rapids.sh" in i:
args.append("--no-shielded-secure-boot")

if optional_components:
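
The condition in this hunk is a chain of substring checks against each init action path. A compact way to read it is sketched below; `SECURE_BOOT_EXEMPT` and `needs_no_secure_boot` are hypothetical names used only for illustration, and the bucket paths are made up.

```python
# Script names copied from the new condition in createCluster().
SECURE_BOOT_EXEMPT = (
    "install_gpu_driver.sh",
    "horovod.sh",
    "dask-rapids.sh",
    "mlvm.sh",
    "spark-rapids.sh",
)


def needs_no_secure_boot(init_actions):
    """True if any init action path contains one of the listed script names."""
    return any(
        name in action
        for action in init_actions
        for name in SECURE_BOOT_EXEMPT
    )


# Illustrative paths only; the bucket name is an assumption.
print(needs_no_secure_boot(["gs://my-bucket/gpu/install_gpu_driver.sh"]))
print(needs_no_secure_boot(["gs://my-bucket/rapids/rapids.sh"]))
```

Because these are substring checks, the old `"rapids.sh"` entry also matched `spark-rapids.sh`, whereas the new `"dask-rapids.sh"` entry does not match a path that ends in `rapids/rapids.sh`.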
@@ -178,11 +178,15 @@ def createCluster(self,
args.append("--zone={}".format(self.cluster_zone))

if not FLAGS.skip_cleanup:
-args.append("--max-age=2h")
+args.append("--max-age=60m")
+
+args.append("--max-idle=25m")

cmd = "{} dataproc clusters create {} {}".format(
"gcloud beta" if beta else "gcloud", self.name, " ".join(args))

+print("Running command: [{}]".format(cmd))
+
_, stdout, _ = self.assert_command(
cmd, timeout_in_minutes=timeout_in_minutes or DEFAULT_TIMEOUT)
config = json.loads(stdout).get("config", {})
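
The following sketch shows how the cluster-create command string comes together after this change. It assumes `--max-idle=25m` is appended regardless of `skip_cleanup` (indentation is not visible in this view), and `build_create_command` is a hypothetical standalone function, not a method from `dataproc_test_case.py`.

```python
def build_create_command(name, args, beta=False, skip_cleanup=False):
    """Assemble the gcloud command the way createCluster() formats it."""
    args = list(args)  # copy so the caller's list is not modified
    if not skip_cleanup:
        # Clusters are deleted after an hour unless cleanup is skipped.
        args.append("--max-age=60m")
    # Assumption: the idle timeout applies in both cases.
    args.append("--max-idle=25m")
    return "{} dataproc clusters create {} {}".format(
        "gcloud beta" if beta else "gcloud", name, " ".join(args))


cmd = build_create_command("test-cluster", ["--zone=us-west4-a"])
print("Running command: [{}]".format(cmd))
```

Printing the full command before running it makes a failed cluster creation easier to reproduce by hand.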
@@ -239,7 +243,7 @@ def getClusterName(self):

@staticmethod
def getImageVersion():
-# Get a numeric version from the version flag: '1.5-debian10' -> '1.5'.
+# Get a numeric version from the version flag: '2.2-debian10' -> '2.2'.
# Special case a 'preview' image versions and return a large number
# instead to make it a higher image version in comparisons
version = FLAGS.image_version
@@ -248,7 +252,7 @@ def getImageVersion():

@staticmethod
def getImageOs():
-# Get OS string from the version flag: '1.5-debian10' -> 'debian'.
+# Get OS string from the version flag: '2.2-debian10' -> 'debian'.
# If image version specified without OS suffix ('2.0')
# then return 'debian' by default
version = FLAGS.image_version
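
The bodies of `getImageVersion()` and `getImageOs()` are not shown in this diff, so the following is only a sketch of the behaviour their comments describe. The tuple representation, the `999` placeholder for preview images, and the function names are assumptions, not the repository's implementation.

```python
import re

# Assumption: any value larger than real image versions works as a stand-in.
PREVIEW_VERSION = (999,)


def image_version_key(flag_value):
    """'2.2-debian10' -> (2, 2); 'preview...' -> a large comparable value."""
    numeric = flag_value.split("-", 1)[0]
    if numeric == "preview":
        return PREVIEW_VERSION
    return tuple(int(part) for part in numeric.split("."))


def image_os(flag_value, default="debian"):
    """'2.2-debian10' -> 'debian'; no OS suffix ('2.0') -> the default."""
    _, _, suffix = flag_value.partition("-")
    match = re.match(r"[a-z]+", suffix)
    return match.group(0) if match else default


print(image_version_key("2.2-debian12"), image_os("2.2-debian12"))
print(image_version_key("2.0"), image_os("2.0"))
```

Comparing tuples of integers avoids the string-comparison pitfall where `'2.10'` would sort before `'2.2'`.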
2 changes: 0 additions & 2 deletions rapids/BUILD
@@ -8,8 +8,6 @@ py_test(
srcs = ["test_rapids.py"],
data = [
"rapids.sh",
-"verify_xgboost_spark.scala",
-"//dask:dask.sh",
"//gpu:install_gpu_driver.sh",
],
local = True,
40 changes: 40 additions & 0 deletions rapids/Dockerfile
@@ -0,0 +1,40 @@
# This Dockerfile builds the container from which rapids tests are run
# This process needs to be executed manually from a git clone
#
# See manual-test-runner.sh for instructions

FROM gcr.io/cloud-builders/gcloud

RUN useradd -m -d /home/ia-tests -s /bin/bash ia-tests

RUN apt-get -qq update \
&& apt-get -y -qq install \
apt-transport-https apt-utils \
ca-certificates libmime-base64-perl gnupg \
curl jq less screen > /dev/null 2>&1 && apt-get clean

# Install bazel signing key, repo and package
ENV bazel_kr_path=/usr/share/keyrings/bazel-release.pub.gpg
ENV bazel_repo_data="http://storage.googleapis.com/bazel-apt stable jdk1.8"

RUN /usr/bin/curl -s https://bazel.build/bazel-release.pub.gpg \
| gpg --dearmor -o "${bazel_kr_path}" \
&& echo "deb [arch=amd64 signed-by=${bazel_kr_path}] ${bazel_repo_data}" \
| dd of=/etc/apt/sources.list.d/bazel.list status=none \
&& apt-get update -qq

RUN apt-get autoremove -y -qq && \
apt-get install -y -qq default-jdk python3-setuptools bazel > /dev/null 2>&1 && \
apt-get clean


# Install here any utilities you find useful when troubleshooting
RUN apt-get -y -qq install emacs-nox vim uuid-runtime > /dev/null 2>&1 && apt-get clean

WORKDIR /init-actions

USER ia-tests
COPY --chown=ia-tests:ia-tests . ${WORKDIR}

ENTRYPOINT ["/bin/bash"]
#CMD ["/bin/bash"]
17 changes: 17 additions & 0 deletions rapids/bazel.screenrc
@@ -0,0 +1,17 @@
#

Review comment (Member): Are these files committed by mistake?

# For debugging, uncomment the following line
#

# screen -L -t monitor 0 /bin/bash

screen -L -t 2.0-debian10 1 sh -c '/bin/bash -x rapids/run-bazel-tests.sh 2.0-debian10 ; exec /bin/bash'
#screen -L -t 2.0-rocky8 2 sh -c '/bin/bash -x rapids/run-bazel-tests.sh 2.0-rocky8 ; exec /bin/bash'
#screen -L -t 2.0-ubuntu18 3 sh -c '/bin/bash -x rapids/run-bazel-tests.sh 2.0-ubuntu18 ; exec /bin/bash'

#screen -L -t 2.1-debian11 4 sh -c '/bin/bash -x rapids/run-bazel-tests.sh 2.1-debian11 ; exec /bin/bash'
#screen -L -t 2.1-rocky8 5 sh -c '/bin/bash -x rapids/run-bazel-tests.sh 2.1-rocky8 ; exec /bin/bash'
#screen -L -t 2.1-ubuntu20 6 sh -c '/bin/bash -x rapids/run-bazel-tests.sh 2.1-ubuntu20 ; exec /bin/bash'

#screen -L -t 2.2-debian12 7 sh -c '/bin/bash -x rapids/run-bazel-tests.sh 2.2-debian12 ; exec /bin/bash'
#screen -L -t 2.2-rocky9 8 sh -c '/bin/bash -x rapids/run-bazel-tests.sh 2.2-rocky9 ; exec /bin/bash'
#screen -L -t 2.2-ubuntu22 9 sh -c '/bin/bash -x rapids/run-bazel-tests.sh 2.2-ubuntu22 ; exec /bin/bash'
7 changes: 7 additions & 0 deletions rapids/env.json.sample
@@ -0,0 +1,7 @@
{
"PROJECT_ID":"example-yyyy-nn",
"PURPOSE":"cuda-pre-init",
"BUCKET":"my-bucket-name",
"IMAGE_VERSION":"2.2-debian12",
"ZONE":"us-west4-ñ"
}
77 changes: 77 additions & 0 deletions rapids/manual-test-runner.sh
@@ -0,0 +1,77 @@
#!/bin/bash

# This script sets up the gcloud environment and launches tests in a screen session
#
# To run the script, the following will bootstrap
#
# git clone git@github.com:GoogleCloudDataproc/initialization-actions
# cd initialization-actions
# git checkout rapids-20240806
# cp rapids/env.json.sample env.json
# vi env.json
# docker build -f rapids/Dockerfile -t rapids-init-actions-runner:latest .
# time docker run -it rapids-init-actions-runner:latest rapids/manual-test-runner.sh
#
# The bazel run(s) happen in separate screen windows.
# To see a list of screen windows, press ^a "
# Num Name
#
# 0 monitor
# 1 2.0-debian10
# 2 sh


readonly timestamp="$(date +%F-%H-%M)"
export BUILD_ID="$(uuidgen)"

tmp_dir="/tmp/${BUILD_ID}"
log_dir="${tmp_dir}/logs"
mkdir -p "${log_dir}"

IMAGE_VERSION="$1"
if [[ -z "${IMAGE_VERSION}" ]] ; then
IMAGE_VERSION="$(jq -r .IMAGE_VERSION env.json)" ; fi ; export IMAGE_VERSION
export PROJECT_ID="$(jq -r .PROJECT_ID env.json)"
export REGION="$(jq -r .REGION env.json)"
export BUCKET="$(jq -r .BUCKET env.json)"

gcs_log_dir="gs://${BUCKET}/${BUILD_ID}/logs"

function exit_handler() {
RED='\\e[0;31m'
GREEN='\\e[0;32m'
NC='\\e[0m'
echo 'Cleaning up before exiting.'

# TODO: list clusters which match our BUILD_ID and clean them up
# TODO: remove any test related resources in the project

echo 'Uploading local logs to GCS bucket.'
gsutil -m rsync -r "${log_dir}/" "${gcs_log_dir}/"

if [[ -f "${tmp_dir}/tests_success" ]]; then
echo -e "${GREEN}Workflow succeeded, check logs at ${log_dir}/ or ${gcs_log_dir}/${NC}"
exit 0
else
echo -e "${RED}Workflow failed, check logs at ${log_dir}/ or ${gcs_log_dir}/${NC}"
exit 1
fi
}

trap exit_handler EXIT

# screen session name
session_name="manual-rapids-tests"

gcloud config set project ${PROJECT_ID}
gcloud config set dataproc/region ${REGION}
gcloud auth login
gcloud config set compute/region ${REGION}

export INTERNAL_IP_SSH="true"

# Run tests in screen session so we can monitor the container in another window
screen -US "${session_name}" -c rapids/bazel.screenrc
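
The settings lookup near the top of `manual-test-runner.sh` takes `IMAGE_VERSION` from the first argument and falls back to `env.json`. Here is a minimal Python sketch of that lookup, using the values from `rapids/env.json.sample`; `resolve_settings` is a hypothetical name and the script itself does this with `jq`.

```python
import json


def resolve_settings(env, argv):
    """Prefer argv[1] for IMAGE_VERSION, else fall back to env.json."""
    from_cli = argv[1] if len(argv) > 1 and argv[1] else None
    return {
        "IMAGE_VERSION": from_cli or env.get("IMAGE_VERSION"),
        "PROJECT_ID": env.get("PROJECT_ID"),
        "REGION": env.get("REGION"),
        "BUCKET": env.get("BUCKET"),
    }


# Contents of rapids/env.json.sample from this PR.
sample = json.loads("""
{
  "PROJECT_ID": "example-yyyy-nn",
  "PURPOSE": "cuda-pre-init",
  "BUCKET": "my-bucket-name",
  "IMAGE_VERSION": "2.2-debian12",
  "ZONE": "us-west4-ñ"
}
""")

settings = resolve_settings(sample, ["manual-test-runner.sh"])
print(settings)
```

The sample file defines `ZONE` but no `REGION`, so the lookup returns `None` here; in the shell script, `jq -r .REGION` prints the string `null` for a missing key, so an `env.json` copied from the sample probably needs a `REGION` entry added by hand.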


