[gpu] strict driver and cuda version assignment #1275

Open: cjac wants to merge 83 commits into base: master


Contributor

@cjac cjac commented Dec 13, 2024

Resolves Issues

gpu/install_gpu_driver.sh

  • Driver version defaults to version in driver .run file if specified

  • CUDA version defaults to version in cuda .run file if specified

  • exclusively using .run file installation method for cuda and driver installation

    • Installing non-open driver from .run file on rocky8
  • build nccl from source, since that is the only mechanism which supports all Dataproc OSs

  • wrap expensive functions in completion checks to reduce re-run time when testing manually

  • cache build results in GCS

  • waiting on apt lock when it exists

  • only install build dependencies if build is necessary

  • fix problem with ops agent not installing ; using venv

  • Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS

  • mark task completion by creating a file rather than setting a variable

  • added functions to check and report secure-boot and os version details
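The first two items above (versions defaulting to whatever is embedded in the .run file name) could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names are hypothetical, and it assumes the usual NVIDIA runfile naming, `NVIDIA-Linux-<arch>-<driver>.run` and `cuda_<cuda>_<driver>_linux.run`.

```bash
#!/usr/bin/env bash
# Illustrative sketch only: function names are hypothetical and are not
# taken from gpu/install_gpu_driver.sh.

# Print the driver version embedded in a driver runfile URL or path,
# e.g. .../NVIDIA-Linux-x86_64-550.54.14.run
driver_version_from_url() {
  local file="${1##*/}"   # keep only the file name
  local re='^NVIDIA-Linux-[A-Za-z0-9_]+-([0-9]+(\.[0-9]+)+)\.run$'
  [[ "${file}" =~ ${re} ]] || return 1
  echo "${BASH_REMATCH[1]}"
}

# Print the CUDA version embedded in a CUDA runfile URL or path,
# e.g. .../cuda_12.4.0_550.54.14_linux.run
cuda_version_from_url() {
  local file="${1##*/}"
  local re='^cuda_([0-9]+\.[0-9]+\.[0-9]+)_[0-9]+(\.[0-9]+)+_linux.*\.run$'
  [[ "${file}" =~ ${re} ]] || return 1
  echo "${BASH_REMATCH[1]}"
}

# Use the version parsed from the URL when one is given and parses;
# otherwise fall back to the supplied default.
version_or_default() {
  local parser="$1" url="$2" default="$3" parsed
  if [[ -n "${url}" ]] && parsed="$("${parser}" "${url}")"; then
    echo "${parsed}"
  else
    echo "${default}"
  fi
}
```

For example, `version_or_default cuda_version_from_url "${url}" "${fallback}"` prints the URL-derived version when `url` is set and matches the pattern, and the fallback otherwise; both variable names here are placeholders.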

gpu/manual-test-runner.sh

  • order commands correctly
  • point to origin rather than staging repo

gpu/test_gpu.py

  • now marking tests as known to fail in addition to skipping
  • clearer test skipping logic

cjac added 30 commits December 7, 2024 15:01
gpu/install_gpu_driver.sh
  * exclusively using .run file installation method when available
  * build nccl from source
  * cache build artifacts from kernel driver and nccl
  * Tested more CUDA minor versions
  * gathering CUDA and driver version from URLs if passed
  * Printing warnings when combination provided is known to fail
  * waiting on apt lock when it exists
  * wrapping expensive functions in completion checks to reduce re-run time
  * fixed a problem with ops agent not installing ; using venv
  * Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
  * setting better spark defaults
  * skipping proxy setup if http-proxy metadata not set
  * added function to check secure-boot and os version compatibility
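
The completion-check idea in the list above (skip an expensive step on re-run, with completion recorded as a file rather than a variable) might look something like this. It is a hedged sketch: `run_once` and `SENTINEL_DIR` are hypothetical names, and the marker directory is an assumption rather than the path the script uses.

```bash
#!/usr/bin/env bash
# Illustrative sketch only: run_once and SENTINEL_DIR are hypothetical
# names, not identifiers from gpu/install_gpu_driver.sh.
SENTINEL_DIR="${SENTINEL_DIR:-/tmp/gpu-install-complete}"

# run_once NAME CMD [ARGS...]
# Runs CMD unless a marker file for NAME already exists. The marker is
# created only when CMD succeeds, so a failed step is retried next time.
run_once() {
  local name="$1"; shift
  local marker="${SENTINEL_DIR}/${name}"
  if [[ -f "${marker}" ]]; then
    echo "skipping ${name}: already complete"
    return 0
  fi
  mkdir -p "${SENTINEL_DIR}"
  "$@" && touch "${marker}"
}
```

A call such as `run_once build_nccl some_build_function` (both names are placeholders) would then do the work on the first run and return immediately on later runs, which is what shortens manual re-test cycles.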

gpu/manual-test-runner.sh
  * order commands correctly

gpu/test_gpu.py
  * clearer test skipping logic
  * added instructions on how to test pyspark
…ment ; added better description of what is in the runfile
@cjac
Contributor Author

cjac commented Dec 17, 2024

/gcbrun

@cjac
Contributor Author

cjac commented Dec 24, 2024

/gcbrun

@jayadeep-jayaraman
Collaborator

There are too many changes in this single PR. Can the PR be broken into smaller logical chunks so that it can be reviewed better?

@cjac
Contributor Author

cjac commented Dec 24, 2024

yeah, there has been a lot of change. I'll re-order the functions a little. This should reduce the delta significantly. Testing my change before commit and push...

@cjac cjac changed the title [gpu] toward a more consistent driver and CUDA install [gpu] strict driver and cuda version assignmen Dec 24, 2024
@cjac
Contributor Author

cjac commented Dec 24, 2024

/gcbrun

@cjac
Contributor Author

cjac commented Dec 24, 2024

blocked on #1283

@cjac cjac changed the title [gpu] strict driver and cuda version assignmen [gpu] strict driver and cuda version assignment Dec 24, 2024
Contributor Author

@cjac cjac left a comment

These changes are ready for review.

cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Jan 3, 2025
cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Jan 6, 2025
@SurajAralihalli
Contributor

Referencing a related discussion: #1269 (comment).

cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Jan 9, 2025
cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Jan 10, 2025
gpu/test_gpu.py:
* using tests from GoogleCloudDataproc#1275

gpu/verify_pyspark.py:
* new test file ; will probably be moved to mlvm

templates/gpu/install_gpu_driver.sh.in:
* this action template includes only the code unique to this action
cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Jan 10, 2025
gpu/test_gpu.py:
* using tests from GoogleCloudDataproc#1275

gpu/verify_pyspark.py:
* new test file ; will probably be moved to mlvm

templates/gpu/install_gpu_driver.sh.in:
* this action template includes only the code unique to this action
3 participants