
GPU performance issue when calling slicing for tensors of types tf.int16, tf.int32 (op StridedSlice) #51428

Open
farotem opened this issue Aug 11, 2021 · 5 comments
Labels: comp:gpu (GPU related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.5 (Issues related to TF 2.5), type:performance (Performance Issue)

Comments

farotem commented Aug 11, 2021

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): binary wheel via PyPI
  • TensorFlow version (use command below): v2.5.0-0-ga4dfb8d1a71 2.5.0
  • Python version: 3.6
  • CUDA/cuDNN version: CUDA 11.2
  • GPU model and memory: V100 32 GB

Describe the current behavior
When slicing tensors of dtype int16 or int32, the operation takes a significant amount of time compared to float16 or float32 (timings below are in milliseconds):


type array: <dtype: 'int16'>
took 35.05420684814453

type array: <dtype: 'int32'>
took 22.861242294311523

type array: <dtype: 'int64'>
took 5.330085754394531

type array: <dtype: 'float16'>
took 1.550912857055664

type array: <dtype: 'float32'>
took 2.5637149810791016

type array: <dtype: 'float64'>
took 5.917549133300781

type array: <dtype: 'bool'>
took 1.6639232635498047

Describe the expected behavior
Slicing int tensors should take around the same time as the float versions.

Contributing

  • Do you want to contribute a PR? (yes/no): no

When running the code with tf.debugging.set_log_device_placement(True), I found that slicing int16 does not use the GPU at all:
type array: <dtype: 'int16'>
Executing op Mul in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0

For int32, I don't understand the reason for such a difference compared to float32.
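A possible workaround (a sketch only, not verified against TF 2.5; it assumes the Cast op itself runs on the GPU for int16) is to cast to a dtype whose StridedSlice does run on the GPU, slice there, and cast back; int16 -> int64 is lossless:

import tensorflow as tf

def slice_via_cast(t, start, end, work_dtype=tf.int64):
    # Hypothetical helper: cast to a dtype whose StridedSlice kernel
    # is available on the GPU, slice there, then cast back.
    # int16 -> int64 is lossless, so the values are preserved.
    return tf.cast(tf.cast(t, work_dtype)[start:end], t.dtype)

with tf.device('/GPU:0'):
    t = tf.ones((52000, 2, 15, 15), dtype=tf.int16)
    s = slice_via_cast(t, 0, 50000)
    print(s.dtype, s.shape)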

Standalone code to reproduce the issue

import tensorflow as tf
# tf.debugging.set_log_device_placement(True)
import numpy as np
import time

def test_slicing():
    with tf.device('/GPU:0'):
        shape = (52000, 2, 15, 15)
        np_array = np.ones(shape, dtype=np.uint16)

        dtypes = [tf.uint16, tf.int16, tf.int32, tf.int64,
                  tf.float16, tf.float32, tf.float64, tf.bool]
        start = 0
        end = 50000
        for dtype in dtypes:
            print("*****************************************************")

            # Multiplication is not defined for bool tensors, so use
            # logical_and there; either way the point is to materialize
            # the tensor on the GPU before timing the slice.
            if dtype != tf.bool:
                tf_array = tf.constant(np_array, dtype=dtype) * 1
            else:
                tf_array = tf.math.logical_and(tf.constant(np_array, dtype=dtype),
                                               tf.constant(np_array, dtype=dtype))
            print(f'type array: {tf_array.dtype}')

            tic = time.time()
            a = tf_array[start:end]
            # Reading one element forces the asynchronous slice to finish.
            a[0][0][0][0].numpy()
            toc = time.time()
            print(f'took {(toc - tic) * 1000}')  # milliseconds

print()
print(tf.version.GIT_VERSION, tf.version.VERSION)
test_slicing()
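As an aside on methodology (not part of the original repro, just a sketch): a slightly more robust measurement would warm up once and average several runs, still using .numpy() to block until the asynchronous GPU work finishes:

import time
import tensorflow as tf

def time_slice(t, start, end, iters=10):
    # Warm-up: the first call can include placement and allocation
    # overhead that we don't want in the measurement.
    _ = t[start:end][0, 0, 0, 0].numpy()
    tic = time.time()
    for _ in range(iters):
        # .numpy() blocks until the slice has actually executed.
        _ = t[start:end][0, 0, 0, 0].numpy()
    return (time.time() - tic) * 1000 / iters  # ms per slice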
@farotem farotem added the type:bug Bug label Aug 11, 2021
@farotem farotem changed the title slicing tf.int16 , tf.int32 have GPU performance impact GPU performance issue when calling slicing tf.int16 , tf.int32 (op StridedSlice) Aug 11, 2021
@farotem farotem changed the title GPU performance issue when calling slicing tf.int16 , tf.int32 (op StridedSlice) GPU performance issue when calling slicing for tensors of types tf.int16, tf.int32 (op StridedSlice) Aug 11, 2021
@sushreebarsa sushreebarsa added comp:ops OPs related issues TF 2.5 Issues related to TF 2.5 labels Aug 12, 2021
sushreebarsa (Contributor) commented

@farotem Could you please take a look at the link and let us know if it helps? Thanks!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Aug 12, 2021
farotem (Author) commented Aug 12, 2021

Thank you for the fast response.
I looked at it before, and I'm not sure how it helps. I think the behavior should be the same for all types; the link to the low-level API doesn't help if only a specific type requires the low-level API.
Any other ideas?
Also, is it possible to add the GPU tag to this issue?

@sushreebarsa sushreebarsa added comp:gpu GPU related issues type:performance Performance Issue and removed type:bug Bug stat:awaiting response Status - Awaiting response from author comp:ops OPs related issues labels Aug 13, 2021
sushreebarsa (Contributor) commented

@sanatmpa1 I was able to replicate the issue on Colab with TF v2.4, v2.5, and tf-nightly; please find the gists attached. Thank you!

sanatmpa1 commented

@farotem,

Please take a look at this gist-2.6 from TF 2.6.0 and gist-nightly from tf-nightly. The time difference between int and float has narrowed significantly in TF 2.6, and in tf-nightly the time taken for int is close to what it takes for float, which indicates that it's already fixed in the nightly build. Let me know if this addresses your issue. Thanks!

@sanatmpa1 sanatmpa1 added the stat:awaiting response Status - Awaiting response from author label Aug 13, 2021
farotem (Author) commented Aug 16, 2021

Thank you for your response.
I looked at both gists (2.6, nightly) and I see some concerning issues:

  • timing

    1. float in TF 2.5: ~1.5 ms - 2.5 ms
    2. float in TF 2.6 / nightly: ~14-16 ms
  • runtime device (enabling tf.debugging.set_log_device_placement(True))

    1. in TF 2.5, the op StridedSlice for float runs on device:GPU:0
    2. in TF 2.6 / nightly, the op StridedSlice for float runs on device:CPU:0

So if I understand correctly, the gap between int and float closed because float performance decreased: the StridedSlice op no longer runs on the GPU, only on the CPU. Is that on purpose?

I had hoped for the opposite: that the int StridedSlice op, which runs on CPU only in TF 2.5, would run on the GPU in later releases just like float, so that the performance for int and float would be the same and improved.
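To make the per-dtype placement explicit without scanning the full placement log, a quick check along these lines (a sketch; it relies on the eager Tensor.device attribute) prints where each slice was executed:

import tensorflow as tf

with tf.device('/GPU:0'):
    for dtype in [tf.int16, tf.int32, tf.int64, tf.float16, tf.float32]:
        t = tf.ones((1000, 2, 15, 15), dtype=dtype)
        s = t[0:500]
        # s.device reports the device that produced the sliced tensor.
        print(dtype.name, '->', s.device)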

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Aug 18, 2021
@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Aug 25, 2021