
Test TensorFloat32 with conv2d #46168

Open
WangTuoxyty opened this issue Jan 5, 2021 · 5 comments
Assignees
Labels
comp:apis Highlevel API related issues comp:gpu GPU related issues type:performance Performance Issue

Comments

@WangTuoxyty

Please make sure that this is an issue related to performance of TensorFlow. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):

import tensorflow as tf
import numpy as np

tf.config.experimental.enable_tensor_float_32_execution(False)

x_in = np.array([[[[2], [1], [2], [0], [1]],
                  [[1], [3], [2], [2], [3]],
                  [[1], [1], [3], [3], [0]],
                  [[2], [2], [0], [1], [1]],
                  [[0], [0], [3], [1], [2]]]])  # shape [1, 5, 5, 1]
kernel_in = np.array([[[[2, 0.1]], [[3, 0.2]]],
                      [[[0, 0.3]], [[1, 0.4]]]])  # shape [2, 2, 1, 2]
x = tf.constant(x_in, dtype=tf.float32)
kernel = tf.constant(kernel_in, dtype=tf.float32)
out = tf.nn.conv2d(x, kernel, strides=[1, 1, 1, 1], padding='VALID')
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS Linux release 7.4.1708 (Core)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): pip3 install tensorflow-gpu
  • Python version: 3.6.8
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: cuda11.1/cudnn8.0.5
  • GPU model and memory: GeForce RTX 3090

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:

  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior
I ran this on an RTX 3090 under Nsight Systems. Compared with tf.config.experimental.enable_tensor_float_32_execution(False), the conv2d kernels are no faster with tf.config.experimental.enable_tensor_float_32_execution(True).

Describe the expected behavior
With tf.config.experimental.enable_tensor_float_32_execution(True), the conv2d kernels should have higher performance.
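For the speedup to show up, though, the convolution has to be large enough and matmul-heavy enough for cuDNN to pick a tensor-core kernel. A minimal timing sketch of the comparison (the shapes, iteration count, and helper name are my own assumptions, not from this report; it needs an Ampere-or-newer GPU, and tensorflow is imported lazily so the snippet loads even where TF is absent):

```python
import time

def bench_conv2d(tf32_enabled, iters=50):
    """Time a matmul-heavy conv2d with TF32 toggled (a sketch, not a rigorous benchmark)."""
    import tensorflow as tf  # lazy import: the file stays loadable without TF installed
    tf.config.experimental.enable_tensor_float_32_execution(tf32_enabled)
    # Tensor cores need real channel counts; a single-channel 5x5 input
    # gives cuDNN nothing to map onto them.
    x = tf.random.normal([8, 128, 128, 64])
    k = tf.random.normal([3, 3, 64, 128])
    conv = lambda: tf.nn.conv2d(x, k, strides=1, padding='SAME')
    conv()  # warm-up: autotuning and kernel selection happen here
    start = time.perf_counter()
    for _ in range(iters):
        out = conv()
    _ = out.numpy()  # block until pending GPU work finishes
    return time.perf_counter() - start
```

On a TF32-capable GPU, bench_conv2d(True) should come in measurably under bench_conv2d(False); on a tiny single-channel input like the repro above, both runs are dominated by launch overhead instead.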

Standalone code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate
the problem. If possible, please share a link to Colab/Jupyter/any notebook.

Other info / logs Include any logs or source code that would be helpful to
diagnose the problem. If including tracebacks, please include the full
traceback. Large logs and files should be attached.

@WangTuoxyty WangTuoxyty added the type:performance Performance Issue label Jan 5, 2021
@ravikyram
Contributor

@WangTuoxyty

I have tried this in Colab with TF-GPU version 2.4 and I did not notice any major performance issue. Please find the gist here.
Please elaborate on the issue with reproducible code if I missed something; it helps us debug faster. Thanks!

@ravikyram ravikyram added comp:apis Highlevel API related issues stat:awaiting response Status - Awaiting response from author labels Jan 5, 2021
@WangTuoxyty
Author

@WangTuoxyty

I have tried this in Colab with TF-GPU version 2.4 and I did not notice any major performance issue. Please find the gist here.
Please elaborate on the issue with reproducible code if I missed something; it helps us debug faster. Thanks!

I saved the code to a file "test_conv.py" and executed "nsys nvprof python3 test_conv.py" in a terminal. This is part of the output:
Generating CUDA Kernel Statistics...
CUDA Kernel Statistics (nanoseconds)

Time(%) Total Time Instances Average Minimum Maximum Name


96.7 1927958 28 68855.6 67072 71072 redzone_checker
0.9 18078 7 2582.6 2560 2592 void cudnn::cnn::conv2d_grouped_direct_kernel<float, float, float, float, float, float, true, false, 0, 1, 0>(cudnnTensorStruct, float const*, cudnnFilterStruct, float const*, cudnnConvolutionStruct, cudnnTensorStruct, float*, float, float, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divis
0.5 9536 4 2384.0 2304 2432 void fft2d_r2c_16x16(float2*, float const*, int, int, int, int, int, int, int, int)
0.4 7200 2 3600.0 3584 3616 void fft2d_r2c_32x32<float, false, 1u, true>(float2*, float const*, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
0.3 6400 2 3200.0 3200 3200 void fft2d_c2r_32x32<float, false, false, 1u, false, false>(float*, float2 const*, int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float*, float*, int2, int, int)
0.3 5759 2 2879.5 2848 2911 void fft2d_r2c_32x32<float, false, 5u, false>(float2*, float const*, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
0.2 4895 2 2447.5 2431 2464 void gemmk1_kernel<float2, 256, 5, false, false, true, false, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float2>(cublasGemmk1Params<float2, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float2, biasType<cublasGemvTensorStridedBatched::value_t
0.2 4704 2 2352.0 2336 2368 void fft2d_c2r_16x16<float, false>(float*, float2*, int, int, int, int, int, int, int, int, int, int, float, float, int, float*, float*)
0.2 3135 2 1567.5 1567 1568 void gemmk1_kernel<float2, 256, 5, true, false, false, false, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float2>(cublasGemmk1Params<float2, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float2, biasType<cublasGemvTensorStridedBatched::value_t
0.2 3040 2 1520.0 1504 1536 void flip_filter<float, float>(float*, float const*, int, int, int, int)
0.1 1568 1 1568.0 1568 1568 void tensorflow::functor::ShuffleInTensor3Simple<float, 2, 1, 0, false>(int, float const*, tensorflow::functor::Dimension<3>, float*)
0.1 1440 1 1440.0 1440 1440 void tensorflow::functor::ShuffleInTensor3Simple<float, 0, 2, 1, false>(int, float const*, tensorflow::functor::Dimension<3>, float*)

@ravikyram ravikyram removed the stat:awaiting response Status - Awaiting response from author label Jan 5, 2021
@ravikyram ravikyram assigned rmothukuru and unassigned ravikyram Jan 5, 2021
@ravikyram ravikyram added the comp:gpu GPU related issues label Jan 5, 2021
@rmothukuru rmothukuru added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jan 5, 2021
@rmothukuru rmothukuru assigned sanjoy and unassigned rmothukuru Jan 5, 2021
@sanjoy
Contributor
sanjoy commented Jan 6, 2021

Hi @WangTuoxyty,

The benchmark you're using is very small, so TF32 or not will not make a big difference. Do you see the same issue when you try larger convolutions?

@WangTuoxyty
Author

Hi @WangTuoxyty,

The benchmark you're using is very small, so TF32 or not will not make a big difference. Do you see the same issue when you try larger convolutions?

I changed the shape of x_in from [1, 5, 5, 1] to [10, 5000, 5000, 1], but the result is the same.

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jan 8, 2021
@HengjiaLi

Any updates on this issue?

Here, I ran into a similar situation.
I tried to test convolution accuracy under TF32 mode on my machine by comparing TF32's and FP32's computation results.

First, I tested a simple MatMul example (as given here: https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_tensor_float_32_execution). TF32's and FP32's results are indeed different, as expected, which indicates my environment enables TF32 mode by default.
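For reference, that MatMul check boils down to running one large matmul twice with TF32 toggled. A sketch following the example on the linked docs page (the helper name is mine; the 1024x1024 fill value 1.0001 is the one the docs use; on pre-Ampere hardware both paths run in FP32 and the difference is zero):

```python
def matmul_tf32_check():
    """Compare a TF32 matmul against FP32; returns the max absolute difference.

    A nonzero return on Ampere-or-newer hardware confirms TF32 kicked in.
    """
    import tensorflow as tf  # lazy import so the sketch loads without TF installed
    x = tf.fill((1024, 1024), 1.0001)  # value from the docs example
    tf.config.experimental.enable_tensor_float_32_execution(True)
    y_tf32 = tf.linalg.matmul(x, x)
    tf.config.experimental.enable_tensor_float_32_execution(False)
    y_fp32 = tf.linalg.matmul(x, x)
    return float(tf.reduce_max(tf.abs(y_tf32 - y_fp32)))
```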

However, when I used tf.nn.conv2d(), fed with randomly generated float32 data, to test convolution accuracy under TF32 mode, the TF32 and FP32 results turned out to be identical. It seems tf.nn.conv2d() failed to activate TF32?
Can someone please help with this?
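The size of the precision gap you are looking for can be estimated on the CPU by emulating TF32's 10-bit mantissa in NumPy. A sketch of the rounding only (it truncates, while real tensor cores round to nearest; accumulation stays in FP32 in both cases):

```python
import numpy as np

def to_tf32(a):
    """Truncate FP32 values to TF32 precision (10-bit mantissa).

    TF32 keeps FP32's 8-bit exponent but drops the low 13 of the
    23 mantissa bits; real tensor cores round rather than truncate.
    """
    a = np.atleast_1d(np.asarray(a, dtype=np.float32))
    bits = a.view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

# 1 + 1e-6 is representable in FP32 but collapses to 1.0 in TF32,
# because 1e-6 is far below one TF32 ulp at 1.0 (2**-10, about 9.8e-4).
print(to_tf32(1.0 + 1e-6)[0])  # -> 1.0
print(to_tf32(np.pi)[0])       # -> 3.140625 (FP32 pi is 3.1415927)

# Rounding the *inputs* before an FP32 dot product mimics what a
# tensor-core matmul does, so the two results should differ:
a = np.full(1024, 1.0001, dtype=np.float32)
print(np.dot(a, a), np.dot(to_tf32(a), to_tf32(a)))
```

If your TF32 and FP32 conv2d outputs match to the last bit on data like this, the TF32 path almost certainly never ran.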
