XLA CPU CNN compilation SegFault #48016

Open
seanmor5 opened this issue Mar 23, 2021 · 3 comments
Labels
comp:xla, stat:awaiting tensorflower, TF 2.5, type:bug

Comments

@seanmor5 (Contributor) commented Mar 23, 2021

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04 / macOS 10.14.6
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 3da7b09
  • Python version: n/a
  • Bazel version (if compiling from source): 3.7.2
  • GCC/Compiler version (if compiling from source): 9.3.0
  • CUDA/cuDNN version: n/a
  • GPU model and memory: n/a

Describe the current behavior

We're working on XLA bindings over at Nx, and we're currently building a high-level API for writing neural networks. We have a CNN that looks roughly like this:

input({32, 3, 32, 32})
|> conv(32, kernel_size: {3, 3}, activation: :relu)
|> batch_norm()
|> avg_pool(kernel_size: {2, 2})
|> conv(64, kernel_size: {3, 3}, activation: :relu)
|> batch_norm()
|> avg_pool(kernel_size: {2, 2})
|> conv(64, kernel_size: {3, 3}, activation: :relu)
|> batch_norm()
|> flatten()
|> dense(64, activation: :relu)
|> dropout()
|> dense(10, activation: :log_softmax)

Unfortunately, the program segfaults during compilation when using XLA CPU. We have previously compiled and run the same network successfully using XLA GPU. A GDB backtrace indicates the crash happens somewhere in LLVM. I can provide HLO dumps as well.
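
For reference, here is a rough TensorFlow/Keras approximation of the network above, sketched as a possible standalone reproduction. This is a hypothetical translation for illustration only, not our actual Elixir/Nx code, and tf.function(jit_compile=True) goes through tf2xla rather than EXLA's XLA client, so it may not exercise exactly the same compilation path:

# Rough Keras approximation of the Axon network above, compiled with XLA.
# Hypothetical translation for illustration; the original is Elixir/Nx and
# feeds NCHW inputs of shape {32, 3, 32, 32}, whereas Keras defaults to NHWC.
import numpy as np
import tensorflow as tf

def build_model():
    inputs = tf.keras.Input(shape=(32, 32, 3))
    x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.AveragePooling2D(2)(x)
    x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.AveragePooling2D(2)(x)
    x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    x = tf.keras.layers.Dense(10, activation=tf.nn.log_softmax)(x)
    return tf.keras.Model(inputs, x)

model = build_model()

# Force XLA compilation of the forward pass (run on CPU to match the report).
@tf.function(jit_compile=True)
def forward(batch):
    return model(batch, training=True)

forward(np.zeros((32, 32, 32, 3), dtype=np.float32))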

Other info / logs

GDB Backtrace
#0  0x00007f68bb04ee09 in llvm::MemorySSA::getOrCreateAccessList(llvm::BasicBlock const*) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#1  0x00007f68bb04f4af in llvm::MemorySSA::insertIntoListsForBlock(llvm::MemoryAccess*, llvm::BasicBlock const*, llvm::MemorySSA::InsertionPlace) ()
   from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#2  0x00007f68bb050033 in llvm::MemorySSA::createMemoryPhi(llvm::BasicBlock*) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#3  0x00007f68bb061876 in llvm::MemorySSAUpdater::getPreviousDefRecursive(llvm::BasicBlock*, llvm::DenseMap, llvm::DenseMapInfo, llvm::detail::DenseMapPair > >&) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#4  0x00007f68bb0628f6 in llvm::MemorySSAUpdater::getPreviousDefFromEnd(llvm::BasicBlock*, llvm::DenseMap, llvm::DenseMapInfo, llvm::detail::DenseMapPair > >&) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#5  0x00007f68bb0617db in llvm::MemorySSAUpdater::getPreviousDefRecursive(llvm::BasicBlock*, llvm::DenseMap, llvm::DenseMapInfo, llvm::detail::DenseMapPair > >&) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#6  0x00007f68bb0628f6 in llvm::MemorySSAUpdater::getPreviousDefFromEnd(llvm::BasicBlock*, llvm::DenseMap, llvm::DenseMapInfo, llvm::detail::DenseMapPair > >&) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#7  0x00007f68bb0617db in llvm::MemorySSAUpdater::getPreviousDefRecursive(llvm::BasicBlock*, llvm::DenseMap, llvm::DenseMapInfo, llvm::detail::DenseMapPair > >&) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#8  0x00007f68bb0628f6 in llvm::MemorySSAUpdater::getPreviousDefFromEnd(llvm::BasicBlock*, llvm::DenseMap, llvm::DenseMapInfo, llvm::detail::DenseMapPair > >&) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#9  0x00007f68bb0619ab in llvm::MemorySSAUpdater::getPreviousDefRecursive(llvm::BasicBlock*, llvm::DenseMap, llvm::DenseMapInfo, llvm::detail::DenseMapPair > >&) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
#10 0x00007f68bb0628f6 in llvm::MemorySSAUpdater::getPreviousDefFromEnd(llvm::BasicBlock*, llvm::DenseMap, llvm::DenseMapInfo, llvm::detail::DenseMapPair > >&) () from /home/sean/projects/axon/_build/dev/lib/exla/priv/libexla.so
@Saduf2019 (Contributor) commented:

@seanmor5 Please share simple standalone code with which we can reproduce the reported issue, or a Colab gist with the code and the error for us to analyse.

Saduf2019 added the stat:awaiting response (Status - Awaiting response from author) label on Mar 24, 2021
@seanmor5 (Contributor, Author) commented:

Hi @Saduf2019, thank you for your response. Here is a gist; however, it requires installing Elixir and Erlang/OTP as well as building our XLA client from source. I can try to make this process a little easier by providing a Dockerfile. I was hoping the HLO dumps might be useful here for reproducing the compilation segfault. Please let me know what else I can do to make this easier to debug.
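
For reference, HLO dumps like the ones mentioned above can be produced by pointing XLA at a dump directory through the XLA_FLAGS environment variable. Below is a minimal Python/TensorFlow sketch of the idea; EXLA should honour the same variable, since the flags are parsed inside XLA itself, but I have only shown the Python side here:

# Minimal sketch: ask XLA to dump HLO modules as text. The variable must be
# set before the XLA-backed library is loaded; /tmp/hlo_dumps is an example
# path, not a required location.
import os
os.environ["XLA_FLAGS"] = "--xla_dump_to=/tmp/hlo_dumps --xla_dump_hlo_as_text"

import tensorflow as tf  # imported after XLA_FLAGS is set

@tf.function(jit_compile=True)
def f(x):
    return tf.nn.relu(x)

f(tf.ones((2, 2)))  # dumped HLO modules appear under /tmp/hlo_dumps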

Saduf2019 added the TF 2.5 (Issues related to TF 2.5) label and removed the stat:awaiting response (Status - Awaiting response from author) label on Mar 24, 2021
@seanmor5 (Contributor, Author) commented:

Just to provide some (hopefully) more helpful information, I ran everything through Valgrind:

==1948032== 
==1948032== Process terminating with default action of signal 11 (SIGSEGV)
==1948032==  Bad permissions for mapped region at address 0x48B5AFF8
==1948032==    at 0x6A0E407B: llvm::DenseMapBase<llvm::DenseMap<llvm::BasicBlock*, std::unique_ptr<llvm::DomTreeNodeBase<llvm::BasicBlock>, std::default_delete<llvm::DomTreeNodeBase<llvm::BasicBlock> > >, llvm::DenseMapInfo<llvm::BasicBlock*>, llvm::detail::DenseMapPair<llvm::BasicBlock*, std::unique_ptr<llvm::DomTreeNodeBase<llvm::BasicBlock>, std::default_delete<llvm::DomTreeNodeBase<llvm::BasicBlock> > > > >, llvm::BasicBlock*, std::unique_ptr<llvm::DomTreeNodeBase<llvm::BasicBlock>, std::default_delete<llvm::DomTreeNodeBase<llvm::BasicBlock> > >, llvm::DenseMapInfo<llvm::BasicBlock*>, llvm::detail::DenseMapPair<llvm::BasicBlock*, std::unique_ptr<llvm::DomTreeNodeBase<llvm::BasicBlock>, std::default_delete<llvm::DomTreeNodeBase<llvm::BasicBlock> > > > >::find(llvm::BasicBlock const*) const (in /home/sean/projects/axon/deps/exla/exla/priv/libexla.so)
==1948032== 
==1948032== Process terminating with default action of signal 11 (SIGSEGV)
==1948032==  Bad permissions for mapped region at address 0x48B5AFF0
==1948032==    at 0x4831134: _vgnU_freeres (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_core-amd64-linux.so)
==1948032== 
==1948032== HEAP SUMMARY:
==1948032==     in use at exit: 9,331,477,455 bytes in 1,335,643 blocks
==1948032==   total heap usage: 17,604,332 allocs, 16,268,689 frees, 606,006,962,608 bytes allocated
==1948032== 
==1948032== LEAK SUMMARY:
==1948032==    definitely lost: 1,136 bytes in 3 blocks
==1948032==    indirectly lost: 7,936 bytes in 2 blocks
==1948032==      possibly lost: 8,601,228 bytes in 152,054 blocks
==1948032==    still reachable: 9,322,436,784 bytes in 1,183,258 blocks
==1948032==                       of which reachable via heuristic:
==1948032==                         newarray           : 8,920 bytes in 5 blocks
==1948032==                         multipleinheritance: 6,776 bytes in 43 blocks
==1948032==         suppressed: 430,371 bytes in 326 blocks
==1948032== Rerun with --leak-check=full to see details of leaked memory
==1948032== 
==1948032== For lists of detected and suppressed errors, rerun with: -s
==1948032== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Segmentation fault

Additionally, the segmentation fault no longer occurs when running with XLA_FLAGS=--xla_backend_optimization_level=1. When it does occur, it is always after a call to createMemoryPhi in LLVM.
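
For completeness, a hypothetical Python/TensorFlow sketch of applying the same workaround flag (assuming XLA_FLAGS is read the same way when going through tf2xla):

# Lower the XLA backend optimization level as a workaround; the variable must
# be set before the XLA-backed library is loaded. Placeholder computation only.
import os
os.environ["XLA_FLAGS"] = "--xla_backend_optimization_level=1"

import tensorflow as tf

@tf.function(jit_compile=True)
def forward(x):
    return tf.nn.relu(x)  # stand-in for the real model's forward pass

forward(tf.ones((2, 2)))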

jvishnuvardhan added the stat:awaiting tensorflower (Status - Awaiting response from tensorflower) label on Mar 24, 2021