Cannot build TensorFlow with --config=dbg #48919
Comments
I am unable to reproduce by running:
yes '' | TF_NEED_CUDA=1 ./configure
bazel build --config=dbg --config=cuda --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/tools/pip_package:build_pip_package
I ran at commit a3d9f9b on Ubuntu 18.04. The compute capability I compiled with defaulted to 7.0. Can you give more information, such as the commit and OS? Perhaps the issue was fixed at a later commit.
You can also try compiling only a subset of files with debugging symbols, which I highly recommend as it reduces build time and gdb startup time. This might help, since the "overflow" in the error message might go away with fewer debugging symbols, although admittedly I have no idea how debugging symbols work. I use the following command to compile a subset of files:
Still, we should fix this even if the above command works for you.
I'm at commit 9275e30.
root@7fe23091cb5b:/home/baarts/tensorflow-GH# cat /etc/issue
I can reproduce on Ubuntu 20.04. I am not familiar with how debugging symbols work, but based on what you said, it seems they are exceeding 2 GB, which is causing the issue? You mention /CC @chsigg any ideas on what to do here? If it's impossible to support compiling all files with debug symbols, perhaps we should provide a config option and instructions on only building a subset of files with debugging symbols.
@bas-aarts
@reedwm It would be really nice to have a practical solution for C++ contributors.
See also #13295
Could you please confirm whether the issue still persists? Thanks.
Confirmed.
All external developers have a need for this. It would be great if Google could document and support a way for external developers to build TF with debug symbols.
@sanjoy, I tried adding --copt=-O1 as well as --copt=-O2, which does not address the problem.
I think the best solution here is to only include debugging information for certain files when On Ubuntu 18.04,
I suggest that with
@mihaimaruseac, @bas-aarts does adding the flags above to |
This sounds good to me; we can go with this approach for now.
With the above diff, not all is well yet when compiling with --config=dbg. For mark_for_compilation_pass.cc, I see the following command line:
For me, it is compiled with debug info without optimizations. After adding the two lines in my previous post, when I run: bazel build -s --config=dbg --config=cuda //tensorflow/compiler/jit:compilation_passes I get the command line:
There is a |
My command line shows up when building //tensorflow/tools/pip_package:build_pip_package
Before, the build would fail with errors such as: "relocation truncated to fit: R_X86_64_32 against `.debug_info'". The issue was that the debug info was too large. I believe the issue was occurring because offsets into the .debug_info section are stored as 32-bit integers, and so that section cannot exceed 4GiB.

To fix, debug info is only included for files under tensorflow/, excluding kernels. This brings the size of the .debug_info section down to about 1.4GiB, well under the 4GiB limit.

Unfortunately, TF kernels and TF dependencies do not have debugging info anymore, but I suspect these are rarely debugged. Debugging info for specific kernels/dependencies can still be explicitly included by the user, e.g. by passing the bazel flags:
--config=dbg --per_file_copt=+tensorflow/core/kernels/identity_op.*@-g

See #48919 for more context.

PiperOrigin-RevId: 378910826
Change-Id: I4b94e3d53bb3ca00c30d5c83d2a57e4bd390c5a8
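The size argument in the commit message can be checked with a little arithmetic. The sketch below is illustrative and not part of the commit. It assumes, as the message does, that the offsets in question are unsigned 32-bit values, and it treats the reported 1.4GiB as an approximate figure. The name `fits_in_32bit_offsets` is made up for this example.

```python
GIB = 1024 ** 3  # bytes in one GiB

# An unsigned 32-bit field can address offsets 0 .. 2**32 - 1,
# so a section may be at most 2**32 bytes (4 GiB) long.
MAX_SECTION_BYTES = 2 ** 32


def fits_in_32bit_offsets(section_bytes: int) -> bool:
    """Return True if every offset into the section fits in 32 bits."""
    return section_bytes <= MAX_SECTION_BYTES


# Approximate .debug_info size after the fix, as reported in the commit message.
reduced_debug_info = int(1.4 * GIB)

print(MAX_SECTION_BYTES // GIB)                   # 4
print(fits_in_32bit_offsets(reduced_debug_info))  # True
print(fits_in_32bit_offsets(5 * GIB))             # False
```

This only models the limit described in the message; the real section size of a given build would have to be read from the linked binary.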
I submitted d3bbd2f, which makes the changes to
Your command is slightly different, but it should still work.
Trying now.
Actually the build might be failing right now for an unrelated reason. So trying at d3bbd2f itself, with or without debugging info, might not work.
Building is working again, as of bf36815. I ran the command you tried:
The subcommands for mark_for_compilation_pass.cc outputted by bazel are here. TensorFlow actually builds |
I just came to the same conclusion. Many files are compiled twice. What is 'host' used for? Is it required?
Debugging works fine. Fix looks good to me.
The "Build configurations and cross-compilation" section of this page has details on the "host" vs "target" configuration (I used the word "platform" before but I think the right word is "configuration"). There is probably a genrule somewhere that uses |
You can use |
FWIW, building all of tensorflow/... (i.e. including tensorflow/core/kernels) with debug info still works as well.
I thought this was precisely what wasn't working, right? Or did I misunderstand the issue?
This is still not everything, just the tensorflow directory. All external bits are still optimized.
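For files outside the default set, the commit message above names `--per_file_copt` as the way to opt specific files back into debug info. A minimal sketch of assembling such a command line follows. The helper `dbg_build_args` and its parameters are hypothetical; only the flags `--config=dbg` and `--per_file_copt=+<regex>@-g`, the example kernel pattern, and the build target come from this thread.

```python
import shlex


def dbg_build_args(target, extra_debug_patterns=()):
    """Build a bazel argument list for a --config=dbg build.

    extra_debug_patterns are file regexes that should additionally be
    compiled with -g, using the --per_file_copt=+<regex>@-g form quoted
    in the commit message. This helper is illustrative only.
    """
    args = ["bazel", "build", "--config=dbg"]
    for pattern in extra_debug_patterns:
        args.append(f"--per_file_copt=+{pattern}@-g")
    args.append(target)
    return args


args = dbg_build_args(
    "//tensorflow/tools/pip_package:build_pip_package",
    extra_debug_patterns=["tensorflow/core/kernels/identity_op.*"],
)
# shlex.join quotes the pattern so the shell does not expand the '*'.
print(shlex.join(args))
```

Other flags used earlier in the thread, such as `--config=cuda`, could be appended to the list in the same way.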
The |
So I added --distinct_host_configuration=false to the bazel build just to see. The build was much faster, as only the build is done. So far, it seems like things are working.
Is this related to the debug build, or will it also affect regular builds?
While I have not verified it, this should not be debug-related.
This could be interesting /cc @angerson @perfinion
@reedwm, just following up regarding the documentation part of this bug. Any updates?
No update yet, will try to do this this week.
Resolves tensorflow/tensorflow#48919. PiperOrigin-RevId: 380915403
This is fixed and documented. But for some reason, using
I would have thought passing |
When building open-source TensorFlow with
bazel build --config=dbg --config=cuda --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/tools/pip_package:build_pip_package
(for SM 7.0 only)
The build dies at link time with:
ERROR: /home/baarts/tensorflow-GH/tensorflow/python/BUILD:3373:24: Linking of rule '//tensorflow/python:_pywrap_tensorflow_internal.so' failed (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc @bazel-out/k8-dbg/bin/tensorflow/python/_pywrap_tensorflow_internal.so-2.params
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(AnnotationRemarks.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(BDCE.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(CallSiteSplitting.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(ConstantHoisting.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(ConstraintElimination.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(CorrelatedValuePropagation.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(DCE.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(DeadStoreElimination.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(DivRemPairs.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(EarlyCSE.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(FlattenCFGPass.pic.o):(.debug_aranges+0x6): additional relocation overflows omitted from the output
collect2: error: ld returned 1 exit status
Adding -mcmodel=large makes no difference, as the overflow is in a debug section.
I tried -gdwarf64, which is not supported by gcc.
Some platform info: