[gpu] strict driver and cuda version assignment #1275
base: master
gpu/install_gpu_driver.sh
* exclusively using .run file installation method when available
* build nccl from source
* cache build artifacts from kernel driver and nccl
* tested more CUDA minor versions
* gathering CUDA and driver version from URLs if passed
* printing warnings when the combination provided is known to fail
* waiting on apt lock when it exists
* wrapping expensive functions in completion checks to reduce re-run time
* fixed a problem with ops agent not installing ; using venv
* installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
* setting better spark defaults
* skipping proxy setup if http-proxy metadata not set
* added function to check secure-boot and os version compatibility
gpu/manual-test-runner.sh
* order commands correctly
gpu/test_gpu.py
* clearer test skipping logic
* added instructions on how to test pyspark
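The "gathering CUDA and driver version from URLs" item could look roughly like the sketch below. It is not the code from gpu/install_gpu_driver.sh: the function names, the example.com URLs, and the rule that an explicitly set DRIVER_VERSION wins over the URL are all assumptions made for illustration. It relies only on NVIDIA's usual runfile naming (NVIDIA-Linux-<arch>-<driver>.run and cuda_<cuda>_<driver>_linux.run).

```shell
#!/bin/sh
# Hypothetical helpers: these names are NOT taken from install_gpu_driver.sh.

# Print the driver version embedded in a driver runfile URL,
# e.g. .../NVIDIA-Linux-x86_64-550.54.14.run
# Prints nothing if the file name does not follow that pattern.
driver_version_from_url() {
  printf '%s\n' "${1##*/}" |
    sed -n -E 's/^NVIDIA-Linux-[^-]+-([0-9]+(\.[0-9]+)+)\.run$/\1/p'
}

# Print the CUDA version embedded in a CUDA runfile URL,
# e.g. .../cuda_12.4.0_550.54.14_linux.run
# Prints nothing if the file name does not follow that pattern.
cuda_version_from_url() {
  printf '%s\n' "${1##*/}" |
    sed -n -E 's/^cuda_([0-9]+\.[0-9]+\.[0-9]+)_[0-9.]+_linux.*\.run$/\1/p'
}

# Assumption: an explicitly supplied version takes precedence; otherwise the
# version found in the URL becomes the default.
DRIVER_URL='https://example.com/NVIDIA-Linux-x86_64-550.54.14.run'
CUDA_URL='https://example.com/cuda_12.4.0_550.54.14_linux.run'
DRIVER_VERSION="${DRIVER_VERSION:-$(driver_version_from_url "${DRIVER_URL}")}"
CUDA_VERSION="${CUDA_VERSION:-$(cuda_version_from_url "${CUDA_URL}")}"
echo "driver=${DRIVER_VERSION} cuda=${CUDA_VERSION}"
```

Both helpers print an empty string for a URL that does not match, so a caller can fall back to its own default or print a warning in that case.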
…ment ; added better description of what is in the runfile
… less well supported 11.8
…torch and tensorflow test functions
… the signing key used
/gcbrun
/gcbrun
There are too many changes in this single PR. Can the PR be broken into smaller logical chunks so that it can be reviewed more easily?
Yeah, there has been a lot of change. I'll re-order the functions a little, which should reduce the delta significantly. Testing my change before I commit and push...
…y creation script intended to be removed in 70f37b6
/gcbrun
Blocked on #1283.
These changes are ready for review.
Referencing a related discussion: #1269 (comment).
gpu/test_gpu.py:
* using tests from GoogleCloudDataproc#1275
gpu/verify_pyspark.py:
* new test file ; will probably be moved to mlvm
templates/gpu/install_gpu_driver.sh.in:
* this action template includes only the code unique to this action
Resolves Issues
gpu/install_gpu_driver.sh
* driver version defaults to the version in the driver .run file, if one is specified
* CUDA version defaults to the version in the cuda .run file, if one is specified
* exclusively use the .run file installation method for cuda and driver installation
* build nccl from source, since that is the only mechanism which supports all Dataproc OSs
* wrap expensive functions in completion checks to reduce re-run time when testing manually
* cache build results in GCS
* wait on the apt lock when it exists
* only install build dependencies if a build is necessary
* fix problem with ops agent not installing ; using venv
* install gcc-12 on ubuntu22 to fix kernel driver FTBFS
* mark task completion by creating a file rather than setting a variable
* added functions to check and report secure-boot and os version details
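The completion-check items above (wrapping expensive functions, and marking completion with a file rather than a variable) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in gpu/install_gpu_driver.sh: MARKER_DIR, is_complete, mark_complete, and run_once are hypothetical names. The point of a marker file is that it survives across script invocations, whereas a shell variable is lost when the script exits, so a manual re-run can skip steps that already succeeded.

```shell
#!/bin/sh
# Hypothetical names throughout; the real script may organise this differently.

# Assumed location for marker files; override by exporting MARKER_DIR.
MARKER_DIR="${MARKER_DIR:-${TMPDIR:-/tmp}/gpu-install-complete}"

# Succeeds if the named step has already been marked complete.
is_complete() {
  [ -f "${MARKER_DIR}/$1" ]
}

# Records completion of the named step by creating an empty marker file.
mark_complete() {
  mkdir -p "${MARKER_DIR}" && touch "${MARKER_DIR}/$1"
}

# run_once STEP COMMAND [ARGS...]
# Runs COMMAND only if STEP has no marker yet, and writes the marker only if
# COMMAND succeeds, so a failed step is retried on the next run.
run_once() {
  local step="$1"
  shift
  if is_complete "${step}"; then
    echo "skipping ${step}: already complete"
    return 0
  fi
  "$@" && mark_complete "${step}"
}
```

A caller would wrap each expensive step, for example `run_once build-nccl build_nccl_from_source`, where build_nccl_from_source stands in for whatever function performs the build. Deleting the marker file forces that step to run again.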
gpu/manual-test-runner.sh
gpu/test_gpu.py