fix: GPU bootstrap, refresh driver versions and list of supported GPU VM SKUs #587

Merged
tallaxes merged 8 commits into main from tallaxes/gpu-fix-update on Dec 1, 2024

Conversation

@tallaxes (Collaborator) commented Nov 28, 2024

Fixes #579, #517

Description

  • Accommodate required bootstrap changes in recent VHD images (set GPU_DRIVER_TYPE)
  • Update NVIDIA driver versions
  • Update (and externalize) list of supported GPU VM SKUs
  • Add (and fix) tests for bootstrap GPU settings
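
To illustrate the first and third bullets, here is a minimal sketch of how a supported-SKU lookup could gate the GPU bootstrap setting. It is not the code from this PR: the function names, the two sample SKU entries, and the "cuda" value are all assumptions made for illustration; only the GPU_DRIVER_TYPE variable name comes from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// supportedGPUSKUs is a hypothetical, deliberately tiny stand-in for an
// externalized SKU list. The two entries are illustrative only.
var supportedGPUSKUs = map[string]struct{}{
	"standard_nc6s_v3":         {},
	"standard_nc24ads_a100_v4": {},
}

// normalizeSKU makes lookups insensitive to case and surrounding whitespace.
func normalizeSKU(vmSize string) string {
	return strings.ToLower(strings.TrimSpace(vmSize))
}

// isSupportedGPUSKU reports whether vmSize appears in the sample list.
func isSupportedGPUSKU(vmSize string) bool {
	_, ok := supportedGPUSKUs[normalizeSKU(vmSize)]
	return ok
}

// gpuBootstrapVars returns the extra bootstrap variables for a VM size.
// Assumption: "cuda" is a placeholder value for GPU_DRIVER_TYPE.
func gpuBootstrapVars(vmSize string) map[string]string {
	vars := map[string]string{}
	if isSupportedGPUSKU(vmSize) {
		vars["GPU_DRIVER_TYPE"] = "cuda"
	}
	return vars
}

func main() {
	for _, size := range []string{"Standard_NC6s_v3", "Standard_D2s_v3"} {
		fmt.Printf("%s -> %v\n", size, gpuBootstrapVars(size))
	}
}
```

Keeping the list in one data structure, rather than scattered across conditionals, is what makes it straightforward to externalize and to cover with table-style tests.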

How was this change tested?

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

Release Note


@coveralls commented Nov 28, 2024

Pull Request Test Coverage Report for Build 12092148753

Details

  • 36 of 37 (97.3%) changed or added relevant lines in 6 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.004%) to 94.23%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/utils/gpu.go | 29 | 30 | 96.67% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/cache/unavailableofferings.go | 2 | 95.45% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 12038877567 | -0.004% |
| Covered Lines | 37171 |
| Relevant Lines | 39447 |

💛 - Coveralls

@tallaxes self-assigned this Nov 29, 2024
@tallaxes added the area/gpu (Issues or PRs related to GPUs) and area/bootstrap (Issues or PRs related to bootstrap) labels Nov 29, 2024
@tallaxes marked this pull request as ready for review November 29, 2024 03:00
@ganeshkumarashok

GPUNeedsFabricManager and related changes can be split into a separate PR since karpenter doesn't have MIG support today right? https://learn.microsoft.com/en-us/azure/aks/gpu-multi-instance?tabs=azure-cli

@Bryce-Soghigian (Collaborator) left a comment

LGTM. Let's wait on removing NC1 etc., or do it in a separate PR.

@Bryce-Soghigian (Collaborator)

> GPUNeedsFabricManager and related changes can be split into a separate PR since karpenter doesn't have MIG support today right? https://learn.microsoft.com/en-us/azure/aks/gpu-multi-instance?tabs=azure-cli

Yeah, it seems weird to add this support now.

@tallaxes linked an issue Nov 30, 2024 that may be closed by this pull request
@tallaxes mentioned this pull request Nov 30, 2024 (3 tasks)
@tallaxes changed the title from "fix: GPU bootstrap (and component refresh)" to "fix: GPU bootstrap, refresh driver versions and list of supported GPU VM SKUs" Nov 30, 2024
@tallaxes merged commit b4ae5e1 into main Dec 1, 2024
17 of 18 checks passed
@tallaxes deleted the tallaxes/gpu-fix-update branch December 1, 2024 21:40
Labels
  • area/bootstrap: Issues or PRs related to bootstrap
  • area/gpu: Issues or PRs related to GPUs
Development

Successfully merging this pull request may close these issues:

  • [BUG] GPU nodes fail to join the cluster
  • H100 support
4 participants