XLA CPU CNN compilation SegFault #48016

Open
seanmor5 opened this issue Mar 23, 2021 · 3 comments
Labels
comp:xla, stat:awaiting tensorflower, TF 2.5, type:bug

Comments

@seanmor5 (Contributor) commented Mar 23, 2021

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04 / macOS 10.14.6
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 3da7b09
  • Python version: n/a
  • Bazel version (if compiling from source): 3.7.2
  • GCC/Compiler version (if compiling from source): 9.3.0
  • CUDA/cuDNN version: n/a
  • GPU model and memory: n/a

Describe the current behavior

We're working on XLA bindings over at Nx, and we're currently building a high-level API for writing neural networks. We have a CNN that looks roughly like this:

input({32, 3, 32, 32})
|> conv(32, kernel_size: {3, 3}, activation: :relu)
|> batch_norm()
|> avg_pool(kernel_size: {2, 2})
|> conv(64, kernel_size: {3, 3}, activation: :relu)
|> batch_norm()
|> avg_pool(kernel_size: {2, 2})
|> conv(64, kernel_size: {3, 3}, activation: :relu)
|> batch_norm()
|> flatten()
|> dense(64, activation: :relu)
|> dropout()
|> dense(10, activation: :log_softmax)

Unfortunately, the program segfaults during compilation when using XLA CPU. We have previously compiled and run the same network successfully using XLA GPU. A GDB backtrace indicates the crash happens somewhere in LLVM. I can provide HLO dumps as well.
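
For reference, here is a rough TensorFlow/Keras approximation of the network above, sketched as a possible standalone reproduction. This is a hypothetical translation for illustration only, not our actual Elixir/Nx code, and tf.function(jit_compile=True) goes through tf2xla rather than EXLA's XLA client, so it may not exercise exactly the same compilation path:

# Rough Keras approximation of the Axon network above, compiled with XLA.
# Hypothetical translation for illustration; the original is Elixir/Nx and
# feeds NCHW inputs of shape {32, 3, 32, 32}, whereas Keras defaults to NHWC.
import numpy as np
import tensorflow as tf

def build_model():
    inputs = tf.keras.Input(shape=(32, 32, 3))
    x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.AveragePooling2D(2)(x)
    x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.AveragePooling2D(2)(x)
    x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    x = tf.keras.layers.Dense(10, activation=tf.nn.log_softmax)(x)
    return tf.keras.Model(inputs, x)

model = build_model()

# Force XLA compilation of the forward pass (run on CPU to match the report).
@tf.function(jit_compile=True)
def forward(batch):
    return model(batch, training=True)

forward(np.zeros((32, 32, 32, 3), dtype=np.float32))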

Other info / logs

GDB Backtrace
#0  0x00007f68bb04ee09 in llvm::MemorySSA::getOrCreateAccessList(llvm::BasicBlock const*) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#1  0x00007f68bb04f4af in llvm::MemorySSA::insertIntoListsForBlock(llvm::MemoryAccess*, llvm::BasicBlock const*, llvm::MemorySSA::InsertionPlace) ()
   from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#2  0x00007f68bb050033 in llvm::MemorySSA::createMemoryPhi(llvm::BasicBlock*) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#3  0x00007f68bb061876 in llvm::MemorySSAUpdater::getPreviousDefRecursive(llvm::BasicBlock*, llvm::DenseMap, llvm::DenseMapInfo, llvm::detail::DenseMapPair > >&) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#4  0x00007f68bb0628f6 in llvm::MemorySSAUpdater::getPreviousDefFromEnd(llvm::BasicBlock*, llvm::DenseMap, llvm::DenseMapInfo, llvm::detail::DenseMapPair > >&) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#5  0x00007f68bb0617db in llvm::MemorySSAUpdater::getPreviousDefRecursive(llvm::BasicBlock*, llvm::DenseMap, llvm::DenseMapInfo, llvm::detail::DenseMapPair > >&) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#6  0x00007f68bb0628f6 in llvm::MemorySSAUpdater::getPreviousDefFromEnd(llvm::BasicBlock*, llvm::DenseMap, llvm::DenseMapInfo, llvm::detail::DenseMapPair > >&) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#7  0x00007f68bb0617db in llvm::MemorySSAUpdater::getPreviousDefRecursive(llvm::BasicBlock*, llvm::DenseMap, llvm::DenseMapInfo, llvm::detail::DenseMapPair > >&) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#8  0x00007f68bb0628f6 in llvm::MemorySSAUpdater::getPreviousDefFromEnd(llvm::BasicBlock*, llvm::DenseMap, llvm::DenseMapInfo, llvm::detail::DenseMapPair > >&) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#9  0x00007f68bb0619ab in llvm::MemorySSAUpdater::getPreviousDefRecursive(llvm::BasicBlock*, llvm::DenseMap, llvm::DenseMapInfo, llvm::detail::DenseMapPair > >&) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#10 0x00007f68bb0628f6 in llvm::MemorySSAUpdater::getPreviousDefFromEnd(llvm::BasicBlock*, llvm::DenseMap, llvm::DenseMapInfo, llvm::detail::DenseMapPair > >&) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
@Saduf2019 (Contributor) commented:

@seanmor5 Please share simple standalone code with which we can reproduce the reported issue, or a Colab gist with the code and the error for us to analyse.

Saduf2019 added the stat:awaiting response (Status - Awaiting response from author) label on Mar 24, 2021
@seanmor5 (Contributor, Author) commented:

Hi @Saduf2019, thank you for your response. Here is a gist; however, it requires installing Elixir and Erlang/OTP as well as building our XLA client from source. I can try to make this process a little easier by providing a Dockerfile. I was hoping the HLO dumps might be useful here for reproducing the compilation segfault. Please let me know what else I can do to make this easier to debug.
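
For reference, HLO dumps like the ones mentioned above can be produced by pointing XLA at a dump directory through the XLA_FLAGS environment variable. Below is a minimal Python/TensorFlow sketch of the idea; EXLA should honour the same variable, since the flags are parsed inside XLA itself, but I have only shown the Python side here:

# Minimal sketch: ask XLA to dump HLO modules as text. The variable must be
# set before the XLA-backed library is loaded; /tmp/hlo_dumps is an example
# path, not a required location.
import os
os.environ["XLA_FLAGS"] = "--xla_dump_to=/tmp/hlo_dumps --xla_dump_hlo_as_text"

import tensorflow as tf  # imported after XLA_FLAGS is set

@tf.function(jit_compile=True)
def f(x):
    return tf.nn.relu(x)

f(tf.ones((2, 2)))  # dumped HLO modules appear under /tmp/hlo_dumps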

Saduf2019 added the TF 2.5 (Issues related to TF 2.5) label and removed the stat:awaiting response (Status - Awaiting response from author) label on Mar 24, 2021
@seanmor5 (Contributor, Author) commented:

Just to provide some (hopefully) more helpful information, I ran everything through Valgrind:

==1948032== 
==1948032== Process terminating with default action of signal 11 (SIGSEGV)
==1948032==  Bad permissions for mapped region at address 0x48B5AFF8
==1948032==    at 0x6A0E407B: llvm::DenseMapBase<llvm::DenseMap<llvm::BasicBlock*, std::unique_ptr<llvm::DomTreeNodeBase<llvm::BasicBlock>, std::default_delete<llvm::DomTreeNodeBase<llvm::BasicBlock> > >, llvm::DenseMapInfo<llvm::BasicBlock*>, llvm::detail::DenseMapPair<llvm::BasicBlock*, std::unique_ptr<llvm::DomTreeNodeBase<llvm::BasicBlock>, std::default_delete<llvm::DomTreeNodeBase<llvm::BasicBlock> > > > >, llvm::BasicBlock*, std::unique_ptr<llvm::DomTreeNodeBase<llvm::BasicBlock>, std::default_delete<llvm::DomTreeNodeBase<llvm::BasicBlock> > >, llvm::DenseMapInfo<llvm::BasicBlock*>, llvm::detail::DenseMapPair<llvm::BasicBlock*, std::unique_ptr<llvm::DomTreeNodeBase<llvm::BasicBlock>, std::default_delete<llvm::DomTreeNodeBase<llvm::BasicBlock> > > > >::find(llvm::BasicBlock const*) const (in /home/sean/projects/axon/deps/exla/exla/priv/libexla.so)
==1948032== 
==1948032== Process terminating with default action of signal 11 (SIGSEGV)
==1948032==  Bad permissions for mapped region at address 0x48B5AFF0
==1948032==    at 0x4831134: _vgnU_freeres (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_core-amd64-linux.so)
==1948032== 
==1948032== HEAP SUMMARY:
==1948032==     in use at exit: 9,331,477,455 bytes in 1,335,643 blocks
==1948032==   total heap usage: 17,604,332 allocs, 16,268,689 frees, 606,006,962,608 bytes allocated
==1948032== 
==1948032== LEAK SUMMARY:
==1948032==    definitely lost: 1,136 bytes in 3 blocks
==1948032==    indirectly lost: 7,936 bytes in 2 blocks
==1948032==      possibly lost: 8,601,228 bytes in 152,054 blocks
==1948032==    still reachable: 9,322,436,784 bytes in 1,183,258 blocks
==1948032==                       of which reachable via heuristic:
==1948032==                         newarray           : 8,920 bytes in 5 blocks
==1948032==                         multipleinheritance: 6,776 bytes in 43 blocks
==1948032==         suppressed: 430,371 bytes in 326 blocks
==1948032== Rerun with --leak-check=full to see details of leaked memory
==1948032== 
==1948032== For lists of detected and suppressed errors, rerun with: -s
==1948032== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Segmentation fault

Additionally, the segmentation fault no longer occurs when running with XLA_FLAGS=--xla_backend_optimization_level=1. When it does occur, it is always after a call to createMemoryPhi in LLVM.
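
For completeness, a hypothetical Python/TensorFlow sketch of applying the same workaround flag (assuming XLA_FLAGS is read the same way when going through tf2xla):

# Lower the XLA backend optimization level as a workaround; the variable must
# be set before the XLA-backed library is loaded. Placeholder computation only.
import os
os.environ["XLA_FLAGS"] = "--xla_backend_optimization_level=1"

import tensorflow as tf

@tf.function(jit_compile=True)
def forward(x):
    return tf.nn.relu(x)  # stand-in for the real model's forward pass

forward(tf.ones((2, 2)))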

jvishnuvardhan added the stat:awaiting tensorflower (Status - Awaiting response from tensorflower) label on Mar 24, 2021