
Failing Tensorflow unit tests for BF16 hardware #65988

Open
christinaburge opened this issue Apr 18, 2024 · 7 comments
Labels

comp:ops (OPs related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), type:bug (Bug), type:build/install (Build and install issues)

Comments

@christinaburge
christinaburge commented Apr 18, 2024

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

tf 2.17.0

Custom code

No

OS platform and distribution

Linux Ubuntu 22.04

Mobile device

No response

Python version

No response

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

We are seeing the following unit test failures in TensorFlow (GitHub head / nightly):

//tensorflow/compiler/tests:conv3d_test_cpu
//tensorflow/compiler/tests:conv3d_test_cpu_mlir_bridge_test
//tensorflow/compiler/tests:stateful_random_ops_test_cpu
//tensorflow/compiler/tests:stateless_random_ops_test_cpu
//tensorflow/compiler/tests:stateless_random_ops_test_cpu_mlir_bridge_test
//tensorflow/compiler/tests:stateful_random_ops_test_cpu_mlir_bridge_test
//tensorflow/compiler/tests:stochastic_cast_op_test_cpu

On investigation, the first commit where this issue is present is a4d7e97; the tests pass with the commit immediately prior to it.

The tests do not fail in the upstream CI because it runs on N1 cores, which have no bf16 hardware.
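
As a quick check that a given machine actually has the bf16 extension (and is therefore expected to hit these failures), the CPU feature flags can be inspected. This is a minimal sketch for Linux on aarch64 and is not part of the original report:

grep -m1 -o 'bf16' /proc/cpuinfo || echo 'no bf16 support reported'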

Standalone code to reproduce the issue

From the directory tensorflow/ci/official, to reproduce the failure for e.g. //tensorflow/compiler/tests:conv3d_test_cpu:

1. Open any.sh and remove the line cd "$(dirname "$0")/../../"  # tensorflow/
2. Run:

TFCI=py311,linux_arm64 TF_ANY_MODE=test TF_ANY_TARGETS=//tensorflow/compiler/tests:conv3d_test_cpu ./any.sh
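
The other failing targets reproduce in the same way by swapping TF_ANY_TARGETS; for example, the following sketch (assuming the same any.sh edit as above) points the run at the stochastic_cast target whose log is excerpted below:

TFCI=py311,linux_arm64 TF_ANY_MODE=test TF_ANY_TARGETS=//tensorflow/compiler/tests:stochastic_cast_op_test_cpu ./any.sh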

Relevant log output

An example error from the stochastic_cast test:

FAIL: //tensorflow/compiler/tests:stochastic_cast_op_test_cpu (shard 13 of 20) (see /root/.cache/bazel/_bazel_root/574657b8af23672198530ef061ba4201/execroot/org_tensorflow/bazel-out/aarch64-opt/testlogs/tensorflow/compiler/tests/stochastic_cast_op_test_cpu/shard_13_of_20/test.log)
INFO: From Testing //tensorflow/compiler/tests:stochastic_cast_op_test_cpu (shard 13 of 20):
==================== Test output for //tensorflow/compiler/tests:stochastic_cast_op_test_cpu (shard 13 of 20):
2024-04-15 10:17:41.267151: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Running tests under Python 3.11.6: /root/.cache/bazel/_bazel_root/574657b8af23672198530ef061ba4201/execroot/org_tensorflow/bazel-out/aarch64-opt/bin/tensorflow/compiler/tests/stochastic_cast_op_test_cpu.runfiles/python_aarch64-unknown-linux-gnu/bin/python3
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/574657b8af23672198530ef061ba4201/execroot/org_tensorflow/bazel-out/aarch64-opt/bin/tensorflow/compiler/tests/stochastic_cast_op_test_cpu.runfiles/org_tensorflow/tensorflow/compiler/tests/xla_test.py:106: Context.enable_xla_devices (from tensorflow.python.eager.context) is deprecated and will be removed in a future version.
Instructions for updating:
XLA:CPU and XLA:GPU devices are deprecated
W0415 10:17:44.369471 247748062629904 deprecation.py:50] From /root/.cache/bazel/_bazel_root/574657b8af23672198530ef061ba4201/execroot/org_tensorflow/bazel-out/aarch64-opt/bin/tensorflow/compiler/tests/stochastic_cast_op_test_cpu.runfiles/org_tensorflow/tensorflow/compiler/tests/xla_test.py:106: Context.enable_xla_devices (from tensorflow.python.eager.context) is deprecated and will be removed in a future version.
Instructions for updating:
XLA:CPU and XLA:GPU devices are deprecated
[ RUN      ] StochasticCastOpTest.testStochasticCastOpResultProbability_0.125_from_bfloat16_to_int16
INFO:tensorflow:Start test case: StochasticCastOpTest.testStochasticCastOpResultProbability_0.125_from_bfloat16_to_int16
I0415 10:17:44.370704 247748062629904 xla_test.py:231] Start test case: StochasticCastOpTest.testStochasticCastOpResultProbability_0.125_from_bfloat16_to_int16
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1713176264.392647   68553 service.cc:145] XLA service 0xbd1261c1dc00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1713176264.392683   68553 service.cc:153]   StreamExecutor device (0): Host, Default Version
2024-04-15 10:17:44.398999: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
I0000 00:00:1713176264.411515   73786 xla_device.cc:462] XLA_GPU and XLA_CPU devices are deprecated and will be removed in subsequent releases. Instead, use either @tf.function(jit_compile=True) for must-compile semantics, or run with TF_XLA_FLAGS=--tf_xla_auto_jit=2 for auto-clustering best-effort compilation.
I0000 00:00:1713176264.457087   73787 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
LLVM ERROR: Cannot select: 0xe14f2452b580: v8bf16,ch = masked_load<(load unknown-size from %ir.lsr.iv217, align 2, !alias.scope !4, !noalias !7)> 0xe14f241a0ea0, 0xe14f2452bd60, undef:i64, 0xe14f244faa10, undef:v8bf16
  0xe14f2452bd60: i64,ch = CopyFromReg 0xe14f241a0ea0, Register:i64 %65
    0xe14f2452d220: i64 = Register %65
  0xe14f2452b9e0: i64 = undef
  0xe14f244faa10: v8i16 = AArch64ISD::VASHR 0xe14f24529720, Constant:i32<15>
    0xe14f24529720: v8i16 = AArch64ISD::VSHL 0xe14f2454cb00, Constant:i32<15>
      0xe14f2454cb00: v8i16 = any_extend 0xe14f2452b970
        0xe14f2452b970: v8i8,ch = CopyFromReg 0xe14f241a0ea0, Register:v8i8 %66
          0xe14f2452bdd0: v8i8 = Register %66
      0xe14f245602e0: i32 = Constant<15>
    0xe14f245602e0: i32 = Constant<15>
  0xe14f2452be40: v8bf16 = undef
In function: parallel_fusion
Fatal Python error: Aborted
google-ml-butler bot added the type:bug label on Apr 18, 2024
sushreebarsa added the comp:ops label on Apr 23, 2024
@sushreebarsa
Contributor

@christinaburge The issue is present only in the nightly build, which is intended for testing new features and might contain bugs. Could you try the stable version with a minimal example and let us know?
Thank you!

sushreebarsa added the stat:awaiting response label on Apr 23, 2024
@christinaburge
Author

@penpornk @MichaelHudgins

This is an issue in our downstream CI; would you be able to advise, please?

google-ml-butler bot removed the stat:awaiting response label on Apr 23, 2024
@penpornk
Member

@d0k It seems a4d7e97 is still causing unit test failures on aarch64. Could you please help take a look? Thank you very much!

SuryanarayanaY added the type:build/install and stat:awaiting tensorflower labels on Apr 25, 2024
@christinaburge
Author

Hi, just wondering if anything has happened with this, please?

@d0k
Member
d0k commented Jun 4, 2024

Ah sorry, this fell through the cracks. It's a bug in LLVM's lowering to the bf16 hardware, but I don't have access to a machine with that instruction set.

Would you mind running the test with --test_env=XLA_FLAGS=--xla_dump_to=/tmp/some/directory and attaching the .ll files that it produces? From there we can distill it into a bug report against LLVM.
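
For reference, a direct Bazel invocation with that flag might look like the sketch below; it assumes running bazel test on the failing target directly rather than through any.sh, with /tmp/some/directory standing in for the chosen dump location:

bazel test //tensorflow/compiler/tests:stochastic_cast_op_test_cpu --test_env=XLA_FLAGS=--xla_dump_to=/tmp/some/directory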

@christinaburge
Author

No problem, here is the full output; let me know if you need anything else!

bug_files.tar.gz

@d0k
Member
d0k commented Jun 10, 2024

Thanks, I've created a reduced reproducer at llvm/llvm-project#94951.

I'm not exactly sure it's the same issue, as this one requires SVE and your original error message had no SVE types in it.
