
WIP: Cuda again #137

Closed
wants to merge 18 commits into from

Conversation

hmaarrfk
Contributor

@hmaarrfk hmaarrfk commented Oct 10, 2021

Just testing cuda recipes on CIs

My TODO list:

  • Add cuda dependencies to recipe (follow CUDA build #120)
  • Add cuda migrations
  • Double check python dependencies
  • Double check that cuda dependencies have all been followed through in the many sub recipes
  • Update build files to add cuda flags.
  • Many other steps I am forgetting
  • Build cuda packages that timeout locally.

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

@conda-forge-linter

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

recipe/build.sh Outdated
fi

# cuda builds don't work with custom_toolchain, instead we hard-code arguments, mostly copied
# from https://github.com/AnacondaRecipes/tensorflow_recipes/tree/master/tensorflow-base-gpu

This is part of the approach that is likely to be rejected based on the discussion in #134

Note also that if you just want a working cuda version for Linux, I have that already in https://github.com/izahn/tensorflow-feedstock/tree/cuda_2.6.0 , https://github.com/izahn/tensorflow-feedstock/tree/cuda_2.5.1 , and https://github.com/izahn/tensorflow-feedstock/tree/cuda_2.4.3 . Python 3.9 packages are available at https://anaconda.org/izahn/repo. It looks like these won't make it into conda-forge for the reasons discussed in #134, but they work fine for me.

Member

I consider using different BUILD_OPTS (up to & including the custom toolchain) to be fair game. Perhaps I misread #134 at the time (if so, I apologise), but this doesn't look too invasive.


But if you are using the custom toolchain, most of this is redundant, or overrides the overrides specified in the custom toolchain.

@@ -106,7 +115,11 @@ outputs:
build:
- {{ compiler('c') }}
- {{ compiler('cxx') }}
- {{ compiler('cuda') }} # [cuda_compiler_version != "None"]
Member

This needs to be added to the top level build requirements in addition to all the outputs.

Contributor Author

Thanks!

Contributor Author

There is a way to avoid duplicating dependencies. I think you have to name the top level package the same as the "first package".

I'm not sure if this is a feature or bug, but I use it in:
https://github.com/conda-forge/opencv-feedstock/blob/master/recipe/meta.yaml#L56

Would it be acceptable to use it here too?

@izahn

izahn commented Oct 10, 2021

An updated version for 2.6.0 is available at https://github.com/izahn/tensorflow-feedstock/tree/cuda_2.6.0

@hmaarrfk
Contributor Author

Every time I feel like I understand programming a little better, new frameworks leave me thinking that it is all jargon again.

While I feel like I should be able to understand what this error is about, I just can't make sense of it:
https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=390026&view=logs&j=d0d954b5-f111-5dc4-4d76-03b6c9d0cf7e&t=841356e0-85bb-57d8-dbbc-852e683d1642&l=796

@wolfv
Member

wolfv commented Oct 27, 2021

Hm, we (@DerThorsten and I) were looking at this again, and we might have gotten a tiny bit closer.

So basically we got it to compile to a similar point as what you have, where it fails at a GPU target. It seems to try to invoke the GCC compiler where it should use nvcc.

In Tensorflow (the generated bazelrc) they seem to reference the following toolchain for linux-nvcc compilation: https://github.com/tensorflow/addons/blob/master/build_deps/toolchains/gcc7_manylinux2010-nvcc-cuda11/cc_toolchain_config.bzl

This toolchain also comes with a magic compiler wrapper in Python: https://github.com/tensorflow/addons/blob/334cd7ca8fb944aab38164a13d7d2203d7c39605/build_deps/toolchains/gcc7_manylinux2010-nvcc-cuda11/clang/bin/crosstool_wrapper_driver_is_not_gcc#L258-L278

That wrapper seems to either invoke nvcc or the host compiler.
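The dispatch idea can be sketched roughly like this. This is a simplified illustration, not the actual TensorFlow wrapper linked above; the compiler paths are made-up placeholders, and the only assumption carried over from the linked script is that Bazel marks CUDA compilations with `-x cuda`:

```python
#!/usr/bin/env python
"""Illustrative sketch of a crosstool compiler wrapper that dispatches
between nvcc and the host compiler (NOT the real TensorFlow wrapper)."""
import sys

# Hypothetical paths; the real wrapper hard-codes or templates these.
HOST_COMPILER = "/usr/bin/gcc"
NVCC = "/usr/local/cuda/bin/nvcc"


def pick_compiler(argv):
    """Return the compiler binary to invoke for this command line.

    CUDA compilations are assumed to be marked with `-x cuda`;
    everything else is forwarded to the host compiler unchanged.
    """
    for i, arg in enumerate(argv):
        if arg == "-x" and i + 1 < len(argv) and argv[i + 1] == "cuda":
            return NVCC
    return HOST_COMPILER


if __name__ == "__main__":
    compiler = pick_compiler(sys.argv[1:])
    # A real wrapper would also translate gcc-style flags into nvcc
    # flags (e.g. wrapping host flags in -Xcompiler) before exec'ing,
    # something like: os.execv(compiler, [compiler] + translated_args)
    print(compiler)
```

So when the toolchain routes every compile action through such a wrapper, `.cu` targets end up at nvcc while ordinary C/C++ targets still hit the host gcc.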

I am not sure (and not a bazel expert at all!) but maybe we're overruling this toolchain with our custom toolchain (https://github.com/conda-forge/tensorflow-feedstock/tree/master/recipe/custom_toolchain) ??

Maybe @xhochy knows a bit about this?

One idea would be to try the compiler wrapper thing that is in the other toolchain and see if we can log something interesting.

@wolfv
Member

wolfv commented Oct 27, 2021

We also found it's quite convenient to reproduce the error by compiling just the single target and skipping the purge:

# bazel clean --expunge
# bazel shutdown
./configure

bazel ${BAZEL_OPTS} build ${BUILD_OPTS} //tensorflow/core/kernels:quantize_and_dequantize_op_gpu

exit 0

And lastly, I've written a very simple script to enter the docker container from a build_locally.py build:

first commit the container using docker commit <hash> tf_snapshot and then run

FEEDSTOCK_ROOT="$(pwd)"
RECIPE_ROOT="${FEEDSTOCK_ROOT}/recipe"
DOCKER_IMAGE=tf_snapshot

echo "PWD: ${FEEDSTOCK_ROOT} :: ${RECIPE_ROOT}"

docker run -it \
           -v "${RECIPE_ROOT}":/home/conda/recipe_root:rw,z,delegated \
           -v "${FEEDSTOCK_ROOT}":/home/conda/feedstock_root:rw,z,delegated \
           "${DOCKER_IMAGE}" \
           /bin/bash

And that gets a proper interactive session going. I think automating this might be an interesting addition to conda-smithy :)

@wolfv
Member

wolfv commented Oct 27, 2021

Sooo ... I think I got ... something to work!

I copied the crosstool_wrapper_driver_is_not_gcc into our custom toolchain, and edited the toolchain to call it instead of gcc directly here:

I also hard coded the path to GCC in the crosstool_wrapper_driver_is_not_gcc tool (https://github.com/tensorflow/addons/blob/334cd7ca8fb944aab38164a13d7d2203d7c39605/build_deps/toolchains/gcc7_manylinux2010-nvcc-cuda11/clang/bin/crosstool_wrapper_driver_is_not_gcc#L41-L42)

I will try a full build now, let's see what happens :)

@xhochy @hmaarrfk just trying to get a sanity check if you think this makes sense at all? I am really no expert in bazel...

@hmaarrfk
Contributor Author

I think you are on the right track.

I figured it was going to be something related to diving into their hard-coded paths.

A solution might involve symlinking, or patching.

I'm very excited to get this working!

@wolfv
Member

wolfv commented Oct 27, 2021

I can find this file also in the tensorflow sources: https://github.com/tensorflow/tensorflow/blob/master/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl

We might want to somehow wire it up from there ...

@wolfv
Member

wolfv commented Oct 27, 2021

PS. Building the package now ... let's see how far it goes :)

@xhochy
Member

xhochy commented Oct 28, 2021

> Sooo ... I think I got ... something to work!
>
> I copied the crosstool_wrapper_driver_is_not_gcc into our custom toolchain, and edited the toolchain to call it instead of gcc directly here:
>
> I also hard coded the path to GCC in the crosstool_wrapper_driver_is_not_gcc tool (https://github.com/tensorflow/addons/blob/334cd7ca8fb944aab38164a13d7d2203d7c39605/build_deps/toolchains/gcc7_manylinux2010-nvcc-cuda11/clang/bin/crosstool_wrapper_driver_is_not_gcc#L41-L42)
>
> I will try a full build now, let's see what happens :)
>
> @xhochy @hmaarrfk just trying to get a sanity check if you think this makes sense at all? I am really no expert in bazel...

This makes sense. Bazel only supports a single compiler for C/C++/CUDA/... Thus, if you want to use different compilers for different languages, you need a wrapper script that makes the decision.

@wolfv
Member

wolfv commented Oct 28, 2021

@xhochy I think #157 works :)

@hmaarrfk hmaarrfk closed this Oct 31, 2021
7 participants