[gpu] strict driver and cuda version assignment #1275

cjac · 2024-12-13T06:44:34Z

Resolves Issues

[gpu] versions installed by gpu/install_gpu_driver.sh do not match requested versions #1268

gpu/install_gpu_driver.sh

* Driver version defaults to version in driver .run file if specified
* CUDA version defaults to version in cuda .run file if specified
* exclusively using .run file installation method for cuda and driver installation
  - Installing non-open driver from .run file on rocky8
* build nccl from source, since that is the only mechanism which supports all Dataproc OSs
* wrap expensive functions in completion checks to reduce re-run time when testing manually
* cache build results in GCS
* waiting on apt lock when it exists
* only install build dependencies if build is necessary
* fix problem with ops agent not installing ; using venv
* Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
* mark task completion by creating a file rather than setting a variable
* added functions to check and report secure-boot and os version details
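The completion checks and marker files described above could be sketched roughly as follows. This is a minimal illustration, not the code in gpu/install_gpu_driver.sh: the `run_once` helper, the `COMPLETE_DIR` variable and its default path are hypothetical names chosen for the example.

```shell
#!/usr/bin/env bash

# Hypothetical directory for completion markers (an assumption; the PR only
# says that completion is recorded by creating a file).
COMPLETE_DIR="${COMPLETE_DIR:-/tmp/install-gpu-complete}"

# Run a command only if no marker file for the named task exists yet.
# Usage: run_once <task-name> <command> [args...]
run_once() {
  local task="$1"
  shift
  local marker="${COMPLETE_DIR}/${task}"

  if [[ -f "${marker}" ]]; then
    echo "skipping ${task}: already complete"
    return 0
  fi

  mkdir -p "${COMPLETE_DIR}"
  # Create the marker only after the command succeeds, so that a failed
  # step is attempted again on the next run.
  "$@" && touch "${marker}"
}

# Example call (build_nccl is a hypothetical function name):
#   run_once build-nccl build_nccl
```

Because the marker is a file rather than a shell variable, the "done" state survives across separate invocations of the script, which is what shortens re-runs during manual testing.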

gpu/manual-test-runner.sh

* order commands correctly
* point to origin rather than staging repo

gpu/test_gpu.py

* now marking tests as known to fail in addition to skipping
* clearer test skipping logic

gpu/install_gpu_driver.sh
* exclusively using .run file installation method when available
* build nccl from source
* cache build artifacts from kernel driver and nccl
* Tested more CUDA minor versions
* gathering CUDA and driver version from URLs if passed
* Printing warnings when combination provided is known to fail
* waiting on apt lock when it exists
* wrapping expensive functions in completion checks to reduce re-run time
* fixed a problem with ops agent not installing ; using venv
* Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
* setting better spark defaults
* skipping proxy setup if http-proxy metadata not set
* added function to check secure-boot and os version compatibility

gpu/manual-test-runner.sh
* order commands correctly

gpu/test_gpu.py
* clearer test skipping logic
* added instructions on how to test pyspark
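As an illustration of gathering the driver and CUDA versions from URLs, the sketch below parses them out of the .run file names. It assumes NVIDIA's usual file naming (`NVIDIA-Linux-<arch>-<driver>.run` and `cuda_<cuda>_<driver>_linux.run`). The function names are hypothetical, and the actual script may do this differently.

```shell
#!/usr/bin/env bash

# Print the driver version embedded in a driver .run URL, or return 1.
# Assumed file name shape: NVIDIA-Linux-<arch>-<driver version>.run
driver_version_from_url() {
  local file="${1##*/}"   # strip everything up to the last "/"
  local re='^NVIDIA-Linux-[^-]+-([0-9]+(\.[0-9]+)+)\.run$'
  [[ "${file}" =~ ${re} ]] || return 1
  echo "${BASH_REMATCH[1]}"
}

# Print the CUDA version embedded in a CUDA .run URL, or return 1.
# Assumed file name shape: cuda_<cuda version>_<driver version>_linux.run
cuda_version_from_url() {
  local file="${1##*/}"
  local re='^cuda_([0-9]+\.[0-9]+\.[0-9]+)_([0-9]+(\.[0-9]+)+)_linux[^/]*\.run$'
  [[ "${file}" =~ ${re} ]] || return 1
  echo "${BASH_REMATCH[1]}"
}

# Print the driver version that a CUDA .run file name says it bundles.
bundled_driver_version_from_url() {
  local file="${1##*/}"
  local re='^cuda_([0-9]+\.[0-9]+\.[0-9]+)_([0-9]+(\.[0-9]+)+)_linux[^/]*\.run$'
  [[ "${file}" =~ ${re} ]] || return 1
  echo "${BASH_REMATCH[2]}"
}

# Example (DRIVER_URL and DEFAULT_DRIVER_VERSION are hypothetical variables):
#   DRIVER_VERSION="$(driver_version_from_url "${DRIVER_URL}" \
#     || echo "${DEFAULT_DRIVER_VERSION}")"
```

Comparing the output of `driver_version_from_url` with `bundled_driver_version_from_url` is one way a script could notice that a requested driver and CUDA runfile do not match up.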

cjac · 2024-12-17T18:54:59Z

/gcbrun

cjac · 2024-12-24T04:27:26Z

/gcbrun

jayadeep-jayaraman · 2024-12-24T05:18:17Z

There are too many changes in this single PR. Can the PR be broken into smaller logical chunks so that it can be reviewed better?

cjac · 2024-12-24T07:40:01Z

yeah, there has been a lot of change. I'll re-order the functions a little. This should reduce the delta significantly. Testing my change before commit and push...

cjac · 2024-12-24T18:06:26Z

/gcbrun

cjac · 2024-12-24T18:15:44Z

blocked on #1283

cjac

These changes are ready for review.

SurajAralihalli · 2025-01-07T18:11:01Z

Referencing a related discussion: #1269 (comment).

gpu/test_gpu.py:
* using tests from GoogleCloudDataproc#1275

gpu/verify_pyspark.py:
* new test file ; will probably be moved to mlvm

templates/gpu/install_gpu_driver.sh.in:
* this action template includes only the code unique to this action

cjac added 30 commits December 7, 2024 15:01

correcting driver for cuda 12.4

98c2ab3

correcting cuda subversion. 12.4.0 instead of 12.4.1 so that driver and cuda match up

52e1d14

corrected canonical 11.8 driver version ; removed extra code and comment ; added better description of what is in the runfile

64297d5

skipping most tests ; using 11.7 from the cuda 11 line instead of the less well supported 11.8

13a5ff4

verified that the cuda and driver versions match up

61cfbbd

reducing log capture

a424a48

temporarily increasing machine shape for build caching

acf26aa

64 is too many for a single T4

7e667bd

added a subversion for 11.7

241e8b4

add more tests to the install function

dbd7ebe

only including architectures supported by this version of CUDA

9a41ced

pinning down versions better ; more caching ; more ram disks ; new pytorch and tensorflow test functions

2b3da9c

using maximum from 8.9 series on rocky for 11.7

77f928f

skip full build

afc4e78

pinning to bazel-7.4.0

ba3a1b5

NCCL requires gcc-11 for cuda11

be56ce7

rocky8 is now building from the source in the .run file

a1f4e47

reverting to previous state of only selecting a compiler version on latest releases

c141f52

replaced literal path names with variable values ; indexing builds by the signing key used

af7b58a

moved variable definition to prepare function ; moved driver signing to build phase

fa5e8d2

merged from pre-master

3072eec

test whether variable is defined before checking its value

21fc589

reduce noise in docker build output

28d759d

cache only the bins and logs

adbfe75

build index of kernel modules after unpacking ; remove call to non-existent function

c3fdb27

only build module dependency index once

357559b

skipping CUDA 11 NCCL build on debian12

7ec074c

skip cuda11 on debian12, rocky9

1540f97

renamed verify_pyspark to verify_instance_pyspark

c4c73f4

This was referenced Dec 16, 2024

[gpu][spark-rapids] Consolidate mig.sh Scripts and Sync Driver Installation Steps Across Copies #1259

Open

[gpu] ml-on-gcp repo (gpu metrics dependency) to be archived #1081

Open

updated manual-test-runner.sh instructions

fa80ef5

cjac mentioned this pull request Dec 20, 2024

[gpu][spark-rapids] Fix MIG script #1269

Open

cjac added 2 commits December 23, 2024 15:35

this one generated from template after refactor

f1dc98c

do not point to local rpm pgp key

42c80ad

cjac added 6 commits December 23, 2024 23:58

re-ordering to reduce delta from master

a442058

custom image usage can come later

70f37b6

see GoogleCloudDataproc#1283

be0af99

replaced incorrectly removed presubmit.sh and removed custom image key creation script intended to be removed in 70f37b6

7c05c8e

revert nearly to master

33aa1c2

can include extended test suite later

d634fa1

cjac changed the title ~~[gpu] toward a more consistent driver and CUDA install~~ [gpu] strict driver and cuda version assignmen Dec 24, 2024

cjac added 2 commits December 24, 2024 09:42

order commands correctly

5eca104

placing all completion files in a common directory

7e01287

cjac changed the title ~~[gpu] strict driver and cuda version assignmen~~ [gpu] strict driver and cuda version assignment Dec 24, 2024

cjac commented Dec 25, 2024

cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Jan 3, 2025

using tests from GoogleCloudDataproc#1275

d72050e

cjac mentioned this pull request Jan 3, 2025

[template] create templates for use in generating actions #1282

Open

cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Jan 6, 2025

using tests from GoogleCloudDataproc#1275

e241f7e

cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Jan 9, 2025

using tests from GoogleCloudDataproc#1275

17f0fe8
