
Failing Tensorflow unit tests for BF16 hardware #65988

Open
christinaburge opened this issue Apr 18, 2024 · 7 comments
Labels

comp:ops (OPs related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), type:bug (Bug), type:build/install (Build and install issues)

Comments

@christinaburge
christinaburge commented Apr 18, 2024

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

tf 2.17.0

Custom code

No

OS platform and distribution

Linux Ubuntu 22.04

Mobile device

No response

Python version

No response

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

We are seeing the following unit test failures in TensorFlow (GitHub head / nightly):

//tensorflow/compiler/tests:conv3d_test_cpu
//tensorflow/compiler/tests:conv3d_test_cpu_mlir_bridge_test
//tensorflow/compiler/tests:stateful_random_ops_test_cpu
//tensorflow/compiler/tests:stateless_random_ops_test_cpu
//tensorflow/compiler/tests:stateless_random_ops_test_cpu_mlir_bridge_test
//tensorflow/compiler/tests:stateful_random_ops_test_cpu_mlir_bridge_test
//tensorflow/compiler/tests:stochastic_cast_op_test_cpu

On investigation, the first commit where this issue is present is a4d7e97; the tests pass with the commit immediately prior to it.

The tests do not fail in the upstream CI because it runs on N1 cores, which have no bf16 hardware.
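
As a quick check that a given machine actually has the bf16 extension (and is therefore expected to hit these failures), the CPU feature flags can be inspected. This is a minimal sketch for Linux on aarch64 and is not part of the original report:

grep -m1 -o 'bf16' /proc/cpuinfo || echo 'no bf16 support reported'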

Standalone code to reproduce the issue

From the directory tensorflow/ci/official, to reproduce the failure for e.g. //tensorflow/compiler/tests:conv3d_test_cpu:

1. Open any.sh and remove the line cd "$(dirname "$0")/../../"  # tensorflow/
2. Run:

TFCI=py311,linux_arm64 TF_ANY_MODE=test TF_ANY_TARGETS=//tensorflow/compiler/tests:conv3d_test_cpu ./any.sh
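
The other failing targets reproduce in the same way by swapping TF_ANY_TARGETS; for example, the following sketch (assuming the same any.sh edit as above) points the run at the stochastic_cast target whose log is excerpted below:

TFCI=py311,linux_arm64 TF_ANY_MODE=test TF_ANY_TARGETS=//tensorflow/compiler/tests:stochastic_cast_op_test_cpu ./any.sh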

Relevant log output

An example error from the stochastic_cast test:

FAIL: //tensorflow/compiler/tests:stochastic_cast_op_test_cpu (shard 13 of 20) (see /root/.cache/bazel/_bazel_root/574657b8af23672198530ef061ba4201/execroot/org_tensorflow/bazel-out/aarch64-opt/testlogs/tensorflow/compiler/tests/stochastic_cast_op_test_cpu/shard_13_of_20/test.log)
INFO: From Testing //tensorflow/compiler/tests:stochastic_cast_op_test_cpu (shard 13 of 20):
==================== Test output for //tensorflow/compiler/tests:stochastic_cast_op_test_cpu (shard 13 of 20):
2024-04-15 10:17:41.267151: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Running tests under Python 3.11.6: /root/.cache/bazel/_bazel_root/574657b8af23672198530ef061ba4201/execroot/org_tensorflow/bazel-out/aarch64-opt/bin/tensorflow/compiler/tests/stochastic_cast_op_test_cpu.runfiles/python_aarch64-unknown-linux-gnu/bin/python3
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/574657b8af23672198530ef061ba4201/execroot/org_tensorflow/bazel-out/aarch64-opt/bin/tensorflow/compiler/tests/stochastic_cast_op_test_cpu.runfiles/org_tensorflow/tensorflow/compiler/tests/xla_test.py:106: Context.enable_xla_devices (from tensorflow.python.eager.context) is deprecated and will be removed in a future version.
Instructions for updating:
XLA:CPU and XLA:GPU devices are deprecated
W0415 10:17:44.369471 247748062629904 deprecation.py:50] From /root/.cache/bazel/_bazel_root/574657b8af23672198530ef061ba4201/execroot/org_tensorflow/bazel-out/aarch64-opt/bin/tensorflow/compiler/tests/stochastic_cast_op_test_cpu.runfiles/org_tensorflow/tensorflow/compiler/tests/xla_test.py:106: Context.enable_xla_devices (from tensorflow.python.eager.context) is deprecated and will be removed in a future version.
Instructions for updating:
XLA:CPU and XLA:GPU devices are deprecated
[ RUN      ] StochasticCastOpTest.testStochasticCastOpResultProbability_0.125_from_bfloat16_to_int16
INFO:tensorflow:Start test case: StochasticCastOpTest.testStochasticCastOpResultProbability_0.125_from_bfloat16_to_int16
I0415 10:17:44.370704 247748062629904 xla_test.py:231] Start test case: StochasticCastOpTest.testStochasticCastOpResultProbability_0.125_from_bfloat16_to_int16
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1713176264.392647   68553 service.cc:145] XLA service 0xbd1261c1dc00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1713176264.392683   68553 service.cc:153]   StreamExecutor device (0): Host, Default Version
2024-04-15 10:17:44.398999: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
I0000 00:00:1713176264.411515   73786 xla_device.cc:462] XLA_GPU and XLA_CPU devices are deprecated and will be removed in subsequent releases. Instead, use either @tf.function(jit_compile=True) for must-compile semantics, or run with TF_XLA_FLAGS=--tf_xla_auto_jit=2 for auto-clustering best-effort compilation.
I0000 00:00:1713176264.457087   73787 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
LLVM ERROR: Cannot select: 0xe14f2452b580: v8bf16,ch = masked_load<(load unknown-size from %ir.lsr.iv217, align 2, !alias.scope !4, !noalias !7)> 0xe14f241a0ea0, 0xe14f2452bd60, undef:i64, 0xe14f244faa10, undef:v8bf16
  0xe14f2452bd60: i64,ch = CopyFromReg 0xe14f241a0ea0, Register:i64 %65
    0xe14f2452d220: i64 = Register %65
  0xe14f2452b9e0: i64 = undef
  0xe14f244faa10: v8i16 = AArch64ISD::VASHR 0xe14f24529720, Constant:i32<15>
    0xe14f24529720: v8i16 = AArch64ISD::VSHL 0xe14f2454cb00, Constant:i32<15>
      0xe14f2454cb00: v8i16 = any_extend 0xe14f2452b970
        0xe14f2452b970: v8i8,ch = CopyFromReg 0xe14f241a0ea0, Register:v8i8 %66
          0xe14f2452bdd0: v8i8 = Register %66
      0xe14f245602e0: i32 = Constant<15>
    0xe14f245602e0: i32 = Constant<15>
  0xe14f2452be40: v8bf16 = undef
In function: parallel_fusion
Fatal Python error: Aborted
google-ml-butler bot added the type:bug label on Apr 18, 2024
sushreebarsa added the comp:ops label on Apr 23, 2024
@sushreebarsa
Contributor

@christinaburge The issue is present only in the nightly build, which is intended for testing new features and might contain bugs. Could you try the stable version with a minimal example and let us know?
Thank you!

sushreebarsa added the stat:awaiting response label on Apr 23, 2024
@christinaburge
Author

@penpornk @MichaelHudgins

This is an issue in our downstream CI; would you be able to advise, please?

google-ml-butler bot removed the stat:awaiting response label on Apr 23, 2024
@penpornk
Member

@d0k It seems a4d7e97 is still causing unit test failures on aarch64. Could you please help take a look? Thank you very much!

SuryanarayanaY added the type:build/install and stat:awaiting tensorflower labels on Apr 25, 2024
@christinaburge
Author

Hi, just wondering if anything has happened with this, please?

@d0k
Member
d0k commented Jun 4, 2024

Ah sorry, this fell through the cracks. It's a bug in LLVM's lowering to the bf16 hardware, but I don't have access to a machine with that instruction set.

Would you mind running the test with --test_env=XLA_FLAGS=--xla_dump_to=/tmp/some/directory and attaching the .ll files that it produces? From there we can distill it into a bug report against LLVM.
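
For reference, a direct Bazel invocation with that flag might look like the sketch below; it assumes running bazel test on the failing target directly rather than through any.sh, with /tmp/some/directory standing in for the chosen dump location:

bazel test //tensorflow/compiler/tests:stochastic_cast_op_test_cpu --test_env=XLA_FLAGS=--xla_dump_to=/tmp/some/directory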

@christinaburge
Author

No problem, here is the full output; let me know if you need anything else!

bug_files.tar.gz

@d0k
Member
d0k commented Jun 10, 2024

Thanks, I've created a reduced reproducer at llvm/llvm-project#94951.

I'm not exactly sure it's the same issue, as this one requires SVE and your original error message had no SVE types in it.
