CUDA build #120
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR ( |
Have a look at the bazel toolchain and the scripts that generate it. This was the groundbreaking part for me to get the CPU builds working.
It turns out that the |
I think I'm making progress, but hit another snag I don't understand:
Any ideas on this one? Full log at https://gist.github.com/izahn/47c950b53ffca4e8f68818b67538d495 in case that helps. |
Still having problems with this:
Full log at https://gist.github.com/izahn/47c950b53ffca4e8f68818b67538d495. bazel is really killing me here; I still don't understand it.
recipe/build.sh
export TF_DOWNLOAD_CLANG=0
export TF_NEED_TENSORRT=0
export TF_NCCL_VERSION=""
BUILD_OPTS="${BUILD_OPTS} --config=cuda --linkopt=-L${PREFIX}/lib --define=LIBDIR=${PREFIX}/lib"
- BUILD_OPTS="${BUILD_OPTS} --config=cuda --linkopt=-L${PREFIX}/lib --define=LIBDIR=${PREFIX}/lib"
+ BUILD_OPTS="${BUILD_OPTS} -s --config=cuda --linkopt=-L${PREFIX}/lib --define=LIBDIR=${PREFIX}/lib"
Apparently -s makes it print out the compilation commands.
you might even want to try changing --linkopt to --copt or even --host_linkopt
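As a rough sketch of the three variants suggested above: the snippet below only assembles and prints the option strings and does not invoke bazel. The variable names (LIB_FLAG, COMMON_OPTS, OPTS_*) and the fallback value for PREFIX are placeholders for illustration and are not taken from the recipe. As far as I understand the Bazel options, --linkopt and --copt affect actions in the target configuration, while --host_linkopt affects linking of tools built for the host.

```shell
#!/usr/bin/env bash
# Sketch only: builds option strings and prints them; it does not run bazel.
# PREFIX is normally provided by conda-build; the fallback path is a placeholder.
PREFIX="${PREFIX:-/path/to/host/env}"
LIB_FLAG="-L${PREFIX}/lib"
COMMON_OPTS="-s --config=cuda --define=LIBDIR=${PREFIX}/lib"

# Variant 1: pass the flag at link time for target-configuration actions.
OPTS_LINKOPT="${COMMON_OPTS} --linkopt=${LIB_FLAG}"
# Variant 2: pass the flag on compile command lines for target-configuration actions.
OPTS_COPT="${COMMON_OPTS} --copt=${LIB_FLAG}"
# Variant 3: pass the flag at link time for tools built for the host.
OPTS_HOST_LINKOPT="${COMMON_OPTS} --host_linkopt=${LIB_FLAG}"

printf '%s\n' "${OPTS_LINKOPT}" "${OPTS_COPT}" "${OPTS_HOST_LINKOPT}"
```

Whichever variant is tried, the resulting string would take the place of the options appended to BUILD_OPTS in recipe/build.sh.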
Thanks, yeah that is very verbose, but didn't really give any more information about what goes wrong.
how long does it take to get to failure?
Time to failure varies, because the order isn't always the same. Can be anywhere from 30 minutes to several hours.
sigh, this makes it harder to debug.
you might even want to try changing --linkopt to --copt or even --host_linkopt
I think --copt=-L${PREFIX}/lib might have done the trick! I'm almost afraid to get my hopes up at this point, but we are safely past the place where the build failed last time. Fingers crossed!
Nope, it fails later with
[12,897 / 29,765] Compiling tensorflow/lite/toco/graph_transformations/resolve_multiply_by_zero.cc [for host]; 3s local ... (4 actions running)
ERROR: /home/conda/feedstock_root/build_artifacts/tensorflow-split_1622642338317/work/tensorflow/tools/proto_text/BUILD:31:10: Linking of rule '//tensorflow/tools/proto_text:gen_proto_text_functions' failed (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command
(cd /home/conda/.cache/bazel/_bazel_conda/fc5b6f8a245ea6c7bfa068d002f44f78/execroot/org_tensorflow && \
exec env - \
PATH=/home/conda/feedstock_root/build_artifacts/tensorflow-split_1622642338317/work:/home/conda/feedstock_root/build_artifacts/tensorflow-split_1622642338317/_build_env/bin:/home/conda/feedstock_root/build_artifacts/tensorflow-split_1622642338317/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac/bin:/opt/conda/condabin:/home/conda/feedstock_root/build_artifacts/tensorflow-split_1622642338317/_build_env:/home/conda/feedstock_root/build_artifacts/tensorflow-split_1622642338317/_build_env/bin:/home/conda/feedstock_root/build_artifacts/tensorflow-split_1622642338317/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac:/home/conda/feedstock_root/build_artifacts/tensorflow-split_1622642338317/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac/bin:/opt/conda/bin:/opt/conda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/conda/bin:/usr/local/cuda/bin \
PWD=/proc/self/cwd \
external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc @bazel-out/host/bin/tensorflow/tools/proto_text/gen_proto_text_functions-2.params)
Execution platform: @local_execution_config_platform//:platform
/usr/bin/ld: cannot find -lprotobuf
/usr/bin/ld: cannot find -lsnappy
/usr/bin/ld: cannot find -lprotobuf
collect2: error: ld returned 1 exit status
ERROR: /home/conda/feedstock_root/build_artifacts/tensorflow-split_1622642338317/work/tensorflow/core/framework/BUILD:1261:31 Linking of rule '//tensorflow/tools/proto_text:gen_proto_text_functions' failed (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command
(cd /home/conda/.cache/bazel/_bazel_conda/fc5b6f8a245ea6c7bfa068d002f44f78/execroot/org_tensorflow && \
exec env - \
PATH=/home/conda/feedstock_root/build_artifacts/tensorflow-split_1622642338317/work:/home/conda/feedstock_root/build_artifacts/tensorflow-split_1622642338317/_build_env/bin:/home/conda/feedstock_root/build_artifacts/tensorflow-split_1622642338317/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac/bin:/opt/conda/condabin:/home/conda/feedstock_root/build_artifacts/tensorflow-split_1622642338317/_build_env:/home/conda/feedstock_root/build_artifacts/tensorflow-split_1622642338317/_build_env/bin:/home/conda/feedstock_root/build_artifacts/tensorflow-split_1622642338317/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac:/home/conda/feedstock_root/build_artifacts/tensorflow-split_1622642338317/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac/bin:/opt/conda/bin:/opt/conda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/conda/bin:/usr/local/cuda/bin \
PWD=/proc/self/cwd \
external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc @bazel-out/host/bin/tensorflow/tools/proto_text/gen_proto_text_functions-2.params)
Execution platform: @local_execution_config_platform//:platform
INFO: Elapsed time: 13126.052s, Critical Path: 444.51s
INFO: 13232 processes: 8181 internal, 5051 local.
FAILED: Build did NOT complete successfully
full log at https://gist.github.com/izahn/42712a8437361cb40aa5aed129f05155
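The log above shows ld failing on -lprotobuf and -lsnappy while linking a tool that bazel builds "[for host]". A small helper along the following lines could narrow this down by listing the libraries ld could not find and checking whether a matching file exists in a given directory such as ${PREFIX}/lib. The function names missing_libs and check_libs_in_dir are made up for this sketch, and the check only looks for files named lib<name>.*, which is an assumption about how the libraries are installed.

```shell
#!/usr/bin/env bash
# Print each distinct library name that ld reported as missing in a log file.
missing_libs() {
    grep -o 'cannot find -l[^ ]*' "$1" | sed 's/^cannot find -l//' | sort -u
}

# For each missing library, report whether a file named lib<name>.* exists
# in the given directory (hypothetical check; adjust the pattern as needed).
check_libs_in_dir() {
    log_file="$1"
    lib_dir="$2"
    for lib in $(missing_libs "${log_file}"); do
        if ls "${lib_dir}/lib${lib}".* >/dev/null 2>&1; then
            echo "${lib}: present in ${lib_dir}"
        else
            echo "${lib}: not found in ${lib_dir}"
        fi
    done
}
```

With the build output saved to a file, it could be called as check_libs_in_dir build.log "${PREFIX}/lib". If the libraries are present there, the search path is probably not reaching this particular link command, rather than the packages being missing.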
I notice
in the logs, and suspect that might be a problem, if not the problem, causing my builds to fail. |
I've taken this as far as I can for the time being. I've pushed my latest changes, and I think it's in pretty good shape except for the linker issue. If someone has time and interest in picking it up that would be great, otherwise I'll try to come back to this at some later date. Thanks to everyone who offered their time and advice, I appreciate your help!
I'm finding it strange that there are references to the use of compilers in For one, the It should be using the conda-forge I'm expecting it to be something like
whoa! did you find it? |
Kind of! I gave up on getting The log from my most recent completed build is at https://gist.github.com/izahn/b194405de3595d460199cbc3bbce0c74, and the packages are at https://anaconda.org/izahn/repo. One of the All in all I think I'm getting very close to something that I would be comfortable merging here. It needs investigation and fix for the failing |
The linux builds are working, both with and without Several build logs are available for review, and the corresponding artifacts are available at https://anaconda.org/izahn/repo Given the complexity of this build, and the fact that I had to resort to using a totally different build process for |
We can have a |
Sorry for chiming in so late. This just came back on my radar. I think we should avoid having two branches. I think that we can likely try to see if the OSX builds are just broken as is, or if we did something special in this recipe to break them. I think I can try to build on the CIs the OSX part to see what broke. |
If we go the direction of this PR we effectively have two completely different builds. The cuda enabled recipe in this PR is basically just the Anaconda one, while the non-cuda build has diverged considerably from the anaconda recipe. Because of this I think a separate branch makes sense. Alternatively we can keep trying to get a cuda enabled build based on the conda-forge recipe here rather than based on the Anaconda recipe. I tried pretty hard to do that but never could get it working. The only thing that came from that effort was that I learned to hate bazel. |
Thanks for pushing this forward! I don't like the divergence in recipes TBH, but I understand the realities of the situation (and your hate for bazel). I haven't looked in detail, but is there an argument against using the CUDA-enabled recipe to also build the non-CUDA version? Also, in the hope of enriching our understanding of the different approaches - have you seen the work from open-ce on packaging tensorflow for conda? At the time I commented:
and @jayfurmanek responded:
|
Not really, as the CUDA-enabled one uses the system compilers and not the conda-forge ones. We need the c-f ones on OSX independent of the target, and for cross-compiling on Linux. The latter is not used yet for Tensorflow but has been working really well with other bazel-based projects using the I haven't looked into any issues here, as I don't actually need CUDA for my current use cases, but I would expect that we could get it working with a limited amount of patching. The main issue here is not Bazel but Tensorflow's extreme habit of doing everything in its own weird way. Bazel isn't a straightforward tool to use and requires a lot of upfront knowledge, but the issues here cannot really be blamed on it 😉
I had a brief look at that when doing the last iteration on building Tensorflow and it also helped and gave me some helpful pointers but as I didn't enable CUDA at that point, I didn't use anything from there. |
@xhochy are you planning on working on this? I'd love to help out as we need a CUDA-enabled build. |
Not really, I would have a look from time to time into the errors that come up but I don't need CUDA currently. |
The basic problem I've had when trying to make a cuda-enabled build based on the conda-forge recipe is that bazel stubbornly refuses to use the host system cuda headers. The error usually looks like
@xhochy if you have any ideas about that it would be appreciated! |
One option is to just copy the headers to |
I'll be more than happy if someone wants to try this, but I've already spent more time than I care to admit working on this. I have an alternative that works (see #134 ) and I don't plan to spend more time on this approach. I'm closing this PR, but I'll leave my branch up in case someone else wants to give it a shot. |
This is a cleaned up PR that adds a CUDA build. It is based on the work and feedback in #118
The main issue is that I couldn't get a CUDA version to build with system libraries. I finally gave up and unset TF_SYSTEM_LIBS for the CUDA builds only (other builds should still use system libraries as before). It would be even more awesome if we could get CUDA builds and use conda system libraries. I tried hard but couldn't figure out how to do it, and I think a CUDA build will be valuable, even if we can't get system libraries working.
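A minimal sketch of what unsetting TF_SYSTEM_LIBS for the CUDA builds only might look like in a build script. This is not the code from the PR: the function name configure_system_libs and the SYSTEM_LIBS_LIST placeholder are invented for illustration, and it assumes the build environment exposes cuda_compiler_version with the value "None" for CPU-only variants.

```shell
#!/usr/bin/env bash
# Placeholder list; the real recipe's list of system libraries is not shown here.
SYSTEM_LIBS_LIST="${SYSTEM_LIBS_LIST:-placeholder_lib_a,placeholder_lib_b}"

# Set TF_SYSTEM_LIBS for CPU-only variants and unset it for CUDA variants.
# $1 is the CUDA compiler version, or "None" for a CPU-only build (assumption).
configure_system_libs() {
    cuda_version="$1"
    if [ "${cuda_version}" != "None" ]; then
        # CUDA variant: fall back to the libraries bundled with the source tree.
        unset TF_SYSTEM_LIBS
    else
        # CPU-only variant: keep using system libraries as before.
        export TF_SYSTEM_LIBS="${SYSTEM_LIBS_LIST}"
    fi
}

configure_system_libs "${cuda_compiler_version:-None}"
echo "TF_SYSTEM_LIBS=${TF_SYSTEM_LIBS:-<unset>}"
```

Keeping the switch in one function leaves the CPU-only path unchanged, which matches the intent stated above that other builds should still use system libraries.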