
fix: add tensorflow and pytorch CUDA version tests for GPU image build #452

Merged: 8 commits merged on Jul 10, 2024


@TRNWWZ (Contributor) commented Jun 27, 2024

Issue #, if available:
In the 1.8.0 GPU image, the CPU-only tensorflow-cpu package was installed by mistake.

Description of changes:
To prevent this kind of issue, add unit tests that verify the CUDA-enabled builds of TensorFlow and PyTorch are installed in the GPU image.
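A minimal sketch of what the TensorFlow check could look like, assuming it relies on `tf.test.is_built_with_cuda()` (a real TensorFlow API; the PR's actual test code is not shown on this page). `verify_tensorflow_cuda` is a hypothetical name, and the module is passed in as a parameter so the logic can be exercised without TensorFlow installed.

```python
def verify_tensorflow_cuda(tf_module) -> None:
    """Raise if the given TensorFlow module was built without CUDA.

    `tf_module` is normally the result of `import tensorflow as tf`; taking it
    as a parameter is a hypothetical design choice that allows stubbing.
    """
    # is_built_with_cuda() returns False for CPU-only builds such as tensorflow-cpu
    if not tf_module.test.is_built_with_cuda():
        raise Exception("TensorFlow is installed without CUDA support for GPU image build.")
    print("TensorFlow is built with CUDA support")
```

Inside the image test this would be called as `verify_tensorflow_cuda(tensorflow)` after `import tensorflow`. Note that this only checks how the package was built, so it can run on a CPU-only build host.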

Test:

  1. Build v1.9.0 locally.
  2. Run the test against the GPU image: it succeeded.
  3. Run the TensorFlow test against the CPU image: the test failed with
     Exception: TensorFlow is installed without CUDA support for GPU image build.
  4. Run the first PyTorch validation test against the CPU image: the test failed with
     Exception: Pytorch is installed without CUDA support for GPU image build.
  5. Run the second PyTorch validation test against the CPU image: the test failed with
     Exception: Pytorch CUDA is not working in current environment. Make sure to execute this test case in GPU environment if you are not
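The two PyTorch failures listed above correspond to two separate checks: whether the installed build includes CUDA, and whether CUDA actually works at runtime. A minimal sketch, assuming `torch.version.cuda` (a version string for CUDA builds, `None` for CPU-only builds) and `torch.cuda.is_available()`. This is a swapped-in technique: the PR's first check inspects a package build string instead. Both function names are hypothetical.

```python
def verify_torch_built_with_cuda(torch_module) -> None:
    """First check: the installed PyTorch build must include CUDA."""
    # torch.version.cuda is e.g. "12.1" for CUDA builds and None for CPU-only builds
    if torch_module.version.cuda is None:
        raise Exception("Pytorch is installed without CUDA support for GPU image build.")


def verify_torch_cuda_available(torch_module) -> None:
    """Second check: CUDA must be usable at runtime (needs a GPU and driver)."""
    # is_available() returns False on a CPU-only host even for a CUDA build
    if not torch_module.cuda.is_available():
        raise Exception("Pytorch CUDA is not working in current environment.")
```

Keeping the checks separate explains why only the second one depends on the host: a CUDA build of PyTorch passes the first check on any machine but fails the second unless a GPU is present.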

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@TRNWWZ changed the title "fix: verify tensorflow CUDA version when building GPU image" to "fix: verify tensorflow and pytorch CUDA version when building GPU image" Jun 28, 2024
@TRNWWZ changed the title "fix: verify tensorflow and pytorch CUDA version when building GPU image" to "fix: add tensorflow and pytorch CUDA version tests for GPU image build" Jun 29, 2024
balajisankar15 previously approved these changes Jul 9, 2024
# Raise exception if CUDA is not detected
if 'cuda' not in package_build:
    raise Exception("Pytorch is installed without CUDA support for GPU image build.")
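The snippet tests a `package_build` value whose origin is not shown in this excerpt. One plausible source, assumed here, is the `build_string` field in the JSON printed by `conda list --json`. The sketch below parses that JSON and applies the same `'cuda' not in ...` test. `get_build_string` and `verify_pytorch_cuda_build` are hypothetical names, and the build strings in any sample data are illustrative, not taken from the image.

```python
import json


def get_build_string(conda_list_json: str, package_name: str) -> str:
    """Return the build string of `package_name` from `conda list --json` output.

    Assumes the JSON is a list of objects with "name" and "build_string" keys.
    """
    for pkg in json.loads(conda_list_json):
        if pkg.get("name") == package_name:
            return pkg.get("build_string", "")
    raise Exception(f"{package_name} is not installed.")


def verify_pytorch_cuda_build(conda_list_json: str) -> str:
    """Apply the snippet's check to the build string of the `pytorch` package."""
    package_build = get_build_string(conda_list_json, "pytorch")
    # Raise exception if CUDA is not detected
    if 'cuda' not in package_build:
        raise Exception("Pytorch is installed without CUDA support for GPU image build.")
    return package_build
```

In a real test the JSON could be captured with `subprocess.run(["conda", "list", "--json"], capture_output=True, text=True).stdout`. A substring match is simple but relies on the package's build-string naming convention.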


Maybe also do a print here "Pytorch is built with CUDA support"

Contributor Author


> Maybe also do a print here "Pytorch is built with CUDA support"

Good point, updated the exception message.

ARG SAGEMAKER_DISTRIBUTION_IMAGE
FROM $SAGEMAKER_DISTRIBUTION_IMAGE

ARG MAMBA_DOCKERFILE_ACTIVATE=1

What is this ARG for?

Contributor Author

@TRNWWZ commented Jul 9, 2024


AFAIK, these three lines activate the test environment; other tests use them too: https://github.com/aws/sagemaker-distribution/blob/main/test/test_artifacts/v1/autogluon.test.Dockerfile#L1-L4
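For context on the Dockerfile under discussion: `SAGEMAKER_DISTRIBUTION_IMAGE` is declared with `ARG` before `FROM`, so the image under test has to be supplied at build time. Assuming the base image follows the mambaorg/micromamba convention, `ARG MAMBA_DOCKERFILE_ACTIVATE=1` makes later `RUN` steps execute inside the activated environment. A minimal sketch of assembling that build command in Python; `build_test_image_command`, the Dockerfile path, and the image tag are all hypothetical.

```python
import shlex


def build_test_image_command(dockerfile: str, base_image: str, context_dir: str = ".") -> list[str]:
    """Assemble a `docker build` argv that injects the image under test.

    The value is consumed by `ARG SAGEMAKER_DISTRIBUTION_IMAGE` and
    `FROM $SAGEMAKER_DISTRIBUTION_IMAGE` in the test Dockerfile.
    """
    return [
        "docker", "build",
        "-f", dockerfile,
        "--build-arg", f"SAGEMAKER_DISTRIBUTION_IMAGE={base_image}",
        context_dir,
    ]


cmd = build_test_image_command(
    "test/test_artifacts/v1/pytorch.test.Dockerfile",  # hypothetical path
    "my-local-gpu-image:1.9.0",                        # hypothetical tag
)
print(shlex.join(cmd))
```

Returning an argv list rather than a shell string lets a caller pass it directly to `subprocess.run(cmd, check=True)` without quoting issues.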

JunLyu previously approved these changes Jul 9, 2024
balajisankar15 previously approved these changes Jul 9, 2024
@balajisankar15 balajisankar15 merged commit 497d59f into aws:main Jul 10, 2024
1 check passed

4 participants