
Test TensorFloat32 with conv2d #46168

Open
WangTuoxyty opened this issue Jan 5, 2021 · 5 comments
Assignees
Labels
comp:apis Highlevel API related issues comp:gpu GPU related issues type:performance Performance Issue

Comments

@WangTuoxyty

Please make sure that this is an issue related to performance of TensorFlow. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):

import tensorflow as tf
import numpy as np

tf.config.experimental.enable_tensor_float_32_execution(False)

x_in = np.array([[[[2], [1], [2], [0], [1]],
                  [[1], [3], [2], [2], [3]],
                  [[1], [1], [3], [3], [0]],
                  [[2], [2], [0], [1], [1]],
                  [[0], [0], [3], [1], [2]]]])  # shape [1, 5, 5, 1]
kernel_in = np.array([[[[2, 0.1]], [[3, 0.2]]],
                      [[[0, 0.3]], [[1, 0.4]]]])  # shape [2, 2, 1, 2]
x = tf.constant(x_in, dtype=tf.float32)
kernel = tf.constant(kernel_in, dtype=tf.float32)
out = tf.nn.conv2d(x, kernel, strides=[1, 1, 1, 1], padding='VALID')
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS Linux release 7.4.1708 (Core)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): pip3 install tensorflow-gpu
  • Python version: 3.6.8
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: cuda11.1/cudnn8.0.5
  • GPU model and memory: GeForce RTX 3090

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:

  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior
I ran this on an RTX 3090 under Nsight Systems. Compared with tf.config.experimental.enable_tensor_float_32_execution(False), the conv2d kernels are no faster with tf.config.experimental.enable_tensor_float_32_execution(True).

Describe the expected behavior
With tf.config.experimental.enable_tensor_float_32_execution(True), the conv2d kernels should have higher performance.
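For the speedup to show up, though, the convolution has to be large enough and matmul-heavy enough for cuDNN to pick a tensor-core kernel. A minimal timing sketch of the comparison (the shapes, iteration count, and helper name are my own assumptions, not from this report; it needs an Ampere-or-newer GPU, and tensorflow is imported lazily so the snippet loads even where TF is absent):

```python
import time

def bench_conv2d(tf32_enabled, iters=50):
    """Time a matmul-heavy conv2d with TF32 toggled (a sketch, not a rigorous benchmark)."""
    import tensorflow as tf  # lazy import: the file stays loadable without TF installed
    tf.config.experimental.enable_tensor_float_32_execution(tf32_enabled)
    # Tensor cores need real channel counts; a single-channel 5x5 input
    # gives cuDNN nothing to map onto them.
    x = tf.random.normal([8, 128, 128, 64])
    k = tf.random.normal([3, 3, 64, 128])
    conv = lambda: tf.nn.conv2d(x, k, strides=1, padding='SAME')
    conv()  # warm-up: autotuning and kernel selection happen here
    start = time.perf_counter()
    for _ in range(iters):
        out = conv()
    _ = out.numpy()  # block until pending GPU work finishes
    return time.perf_counter() - start
```

On a TF32-capable GPU, bench_conv2d(True) should come in measurably under bench_conv2d(False); on a tiny single-channel input like the repro above, both runs are dominated by launch overhead instead.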

Standalone code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate
the problem. If possible, please share a link to Colab/Jupyter/any notebook.

Other info / logs Include any logs or source code that would be helpful to
diagnose the problem. If including tracebacks, please include the full
traceback. Large logs and files should be attached.

@WangTuoxyty WangTuoxyty added the type:performance Performance Issue label Jan 5, 2021
@ravikyram
Contributor

@WangTuoxyty

I have tried this in Colab with TF-GPU version 2.4 and I did not notice any major performance issue. Please find the gist here.
Please elaborate on the issue with reproducible code if I missed something; it helps us debug faster. Thanks!

@ravikyram ravikyram added comp:apis Highlevel API related issues stat:awaiting response Status - Awaiting response from author labels Jan 5, 2021
@WangTuoxyty
Author

@WangTuoxyty

I have tried this in Colab with TF-GPU version 2.4 and I did not notice any major performance issue. Please find the gist here.
Please elaborate on the issue with reproducible code if I missed something; it helps us debug faster. Thanks!

I saved the code to a file "test_conv.py" and executed "nsys nvprof python3 test_conv.py" in a terminal. This is part of the output:
Generating CUDA Kernel Statistics...
CUDA Kernel Statistics (nanoseconds)

Time(%) Total Time Instances Average Minimum Maximum Name


96.7 1927958 28 68855.6 67072 71072 redzone_checker
0.9 18078 7 2582.6 2560 2592 void cudnn::cnn::conv2d_grouped_direct_kernel<float, float, float, float, float, float, true, false, 0, 1, 0>(cudnnTensorStruct, float const*, cudnnFilterStruct, float const*, cudnnConvolutionStruct, cudnnTensorStruct, float*, float, float, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divis
0.5 9536 4 2384.0 2304 2432 void fft2d_r2c_16x16(float2*, float const*, int, int, int, int, int, int, int, int)
0.4 7200 2 3600.0 3584 3616 void fft2d_r2c_32x32<float, false, 1u, true>(float2*, float const*, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
0.3 6400 2 3200.0 3200 3200 void fft2d_c2r_32x32<float, false, false, 1u, false, false>(float*, float2 const*, int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float*, float*, int2, int, int)
0.3 5759 2 2879.5 2848 2911 void fft2d_r2c_32x32<float, false, 5u, false>(float2*, float const*, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
0.2 4895 2 2447.5 2431 2464 void gemmk1_kernel<float2, 256, 5, false, false, true, false, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float2>(cublasGemmk1Params<float2, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float2, biasType<cublasGemvTensorStridedBatched::value_t
0.2 4704 2 2352.0 2336 2368 void fft2d_c2r_16x16<float, false>(float*, float2*, int, int, int, int, int, int, int, int, int, int, float, float, int, float*, float*)
0.2 3135 2 1567.5 1567 1568 void gemmk1_kernel<float2, 256, 5, true, false, false, false, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float2>(cublasGemmk1Params<float2, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float2, biasType<cublasGemvTensorStridedBatched::value_t
0.2 3040 2 1520.0 1504 1536 void flip_filter<float, float>(float*, float const*, int, int, int, int)
0.1 1568 1 1568.0 1568 1568 void tensorflow::functor::ShuffleInTensor3Simple<float, 2, 1, 0, false>(int, float const*, tensorflow::functor::Dimension<3>, float*)
0.1 1440 1 1440.0 1440 1440 void tensorflow::functor::ShuffleInTensor3Simple<float, 0, 2, 1, false>(int, float const*, tensorflow::functor::Dimension<3>, float*)

@ravikyram ravikyram removed the stat:awaiting response Status - Awaiting response from author label Jan 5, 2021
@ravikyram ravikyram assigned rmothukuru and unassigned ravikyram Jan 5, 2021
@ravikyram ravikyram added the comp:gpu GPU related issues label Jan 5, 2021
@rmothukuru rmothukuru added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jan 5, 2021
@rmothukuru rmothukuru assigned sanjoy and unassigned rmothukuru Jan 5, 2021
@sanjoy
Contributor
sanjoy commented Jan 6, 2021

Hi @WangTuoxyty,

The benchmark you're using is very small, so TF32 or not will not make a big difference. Do you see the same issue when you try larger convolutions?

@WangTuoxyty
Author

Hi @WangTuoxyty,

The benchmark you're using is very small, so TF32 or not will not make a big difference. Do you see the same issue when you try larger convolutions?

I changed the shape of x_in from [1, 5, 5, 1] to [10, 5000, 5000, 1], but the result is the same.

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jan 8, 2021
@HengjiaLi

Any updates on this issue?

Here, I ran into a similar situation.
I tried to test convolution accuracy under TF32 mode on my machine by comparing TF32's and FP32's computation results.

First, I tested a simple MatMul example (as given here: https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_tensor_float_32_execution). TF32's and FP32's results are indeed different, as expected, which indicates my environment enables TF32 mode by default.
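For reference, that MatMul check boils down to running one large matmul twice with TF32 toggled. A sketch following the example on the linked docs page (the helper name is mine; the 1024x1024 fill value 1.0001 is the one the docs use; on pre-Ampere hardware both paths run in FP32 and the difference is zero):

```python
def matmul_tf32_check():
    """Compare a TF32 matmul against FP32; returns the max absolute difference.

    A nonzero return on Ampere-or-newer hardware confirms TF32 kicked in.
    """
    import tensorflow as tf  # lazy import so the sketch loads without TF installed
    x = tf.fill((1024, 1024), 1.0001)  # value from the docs example
    tf.config.experimental.enable_tensor_float_32_execution(True)
    y_tf32 = tf.linalg.matmul(x, x)
    tf.config.experimental.enable_tensor_float_32_execution(False)
    y_fp32 = tf.linalg.matmul(x, x)
    return float(tf.reduce_max(tf.abs(y_tf32 - y_fp32)))
```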

However, when I used tf.nn.conv2d(), fed with randomly generated float32 data, to test convolution accuracy under TF32 mode, the TF32 and FP32 results turned out to be identical. It seems tf.nn.conv2d() failed to activate TF32?
Can someone please help with this?
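The size of the precision gap you are looking for can be estimated on the CPU by emulating TF32's 10-bit mantissa in NumPy. A sketch of the rounding only (it truncates, while real tensor cores round to nearest; accumulation stays in FP32 in both cases):

```python
import numpy as np

def to_tf32(a):
    """Truncate FP32 values to TF32 precision (10-bit mantissa).

    TF32 keeps FP32's 8-bit exponent but drops the low 13 of the
    23 mantissa bits; real tensor cores round rather than truncate.
    """
    a = np.atleast_1d(np.asarray(a, dtype=np.float32))
    bits = a.view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

# 1 + 1e-6 is representable in FP32 but collapses to 1.0 in TF32,
# because 1e-6 is far below one TF32 ulp at 1.0 (2**-10, about 9.8e-4).
print(to_tf32(1.0 + 1e-6)[0])  # -> 1.0
print(to_tf32(np.pi)[0])       # -> 3.140625 (FP32 pi is 3.1415927)

# Rounding the *inputs* before an FP32 dot product mimics what a
# tensor-core matmul does, so the two results should differ:
a = np.full(1024, 1.0001, dtype=np.float32)
print(np.dot(a, a), np.dot(to_tf32(a), to_tf32(a)))
```

If your TF32 and FP32 conv2d outputs match to the last bit on data like this, the TF32 path almost certainly never ran.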
