
GPU performance issue when calling slicing for tensors of types tf.int16, tf.int32 (op StridedSlice) #51428

Open
farotem opened this issue Aug 11, 2021 · 5 comments
Labels: comp:gpu (GPU related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.5 (Issues related to TF 2.5), type:performance (Performance Issue)

Comments

farotem commented Aug 11, 2021

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): binary wheel via PyPI
  • TensorFlow version (use command below): v2.5.0-0-ga4dfb8d1a71 2.5.0
  • Python version: 3.6
  • CUDA/cuDNN version: CUDA 11.2
  • GPU model and memory: V100 32 GB

Describe the current behavior
When slicing tensors of dtype int16 or int32, the operation takes a significant amount of time compared to float16 or float32 (timings below are in milliseconds):


type array: <dtype: 'int16'>
took 35.05420684814453

type array: <dtype: 'int32'>
took 22.861242294311523

type array: <dtype: 'int64'>
took 5.330085754394531

type array: <dtype: 'float16'>
took 1.550912857055664

type array: <dtype: 'float32'>
took 2.5637149810791016

type array: <dtype: 'float64'>
took 5.917549133300781

type array: <dtype: 'bool'>
took 1.6639232635498047

Describe the expected behavior
Slicing int tensors should take around the same time as the float versions.

Contributing

  • Do you want to contribute a PR? (yes/no): no

When running the code with tf.debugging.set_log_device_placement(True), I found that slicing int16 does not use the GPU at all:
type array: <dtype: 'int16'>
Executing op Mul in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0

For int32, I don't understand the reason for such a difference compared to float32.
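A possible workaround (a sketch only, not verified against TF 2.5; it assumes the Cast op itself runs on the GPU for int16) is to cast to a dtype whose StridedSlice does run on the GPU, slice there, and cast back; int16 -> int64 is lossless:

import tensorflow as tf

def slice_via_cast(t, start, end, work_dtype=tf.int64):
    # Hypothetical helper: cast to a dtype whose StridedSlice kernel
    # is available on the GPU, slice there, then cast back.
    # int16 -> int64 is lossless, so the values are preserved.
    return tf.cast(tf.cast(t, work_dtype)[start:end], t.dtype)

with tf.device('/GPU:0'):
    t = tf.ones((52000, 2, 15, 15), dtype=tf.int16)
    s = slice_via_cast(t, 0, 50000)
    print(s.dtype, s.shape)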

Standalone code to reproduce the issue

import tensorflow as tf
# tf.debugging.set_log_device_placement(True)
import numpy as np
import time

def test_slicing():
    with tf.device('/GPU:0'):
        shape = (52000, 2, 15, 15)
        np_array = np.ones(shape, dtype=np.uint16)

        dtypes = [tf.uint16, tf.int16, tf.int32, tf.int64,
                  tf.float16, tf.float32, tf.float64, tf.bool]
        start = 0
        end = 50000
        for dtype in dtypes:
            print("*****************************************************")

            # Multiplication is not defined for bool tensors, so use
            # logical_and there; either way the point is to materialize
            # the tensor on the GPU before timing the slice.
            if dtype != tf.bool:
                tf_array = tf.constant(np_array, dtype=dtype) * 1
            else:
                tf_array = tf.math.logical_and(tf.constant(np_array, dtype=dtype),
                                               tf.constant(np_array, dtype=dtype))
            print(f'type array: {tf_array.dtype}')

            tic = time.time()
            a = tf_array[start:end]
            # Reading one element forces the asynchronous slice to finish.
            a[0][0][0][0].numpy()
            toc = time.time()
            print(f'took {(toc - tic) * 1000}')  # milliseconds

print()
print(tf.version.GIT_VERSION, tf.version.VERSION)
test_slicing()
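As an aside on methodology (not part of the original repro, just a sketch): a slightly more robust measurement would warm up once and average several runs, still using .numpy() to block until the asynchronous GPU work finishes:

import time
import tensorflow as tf

def time_slice(t, start, end, iters=10):
    # Warm-up: the first call can include placement and allocation
    # overhead that we don't want in the measurement.
    _ = t[start:end][0, 0, 0, 0].numpy()
    tic = time.time()
    for _ in range(iters):
        # .numpy() blocks until the slice has actually executed.
        _ = t[start:end][0, 0, 0, 0].numpy()
    return (time.time() - tic) * 1000 / iters  # ms per slice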
@farotem farotem added the type:bug Bug label Aug 11, 2021
@farotem farotem changed the title slicing tf.int16 , tf.int32 have GPU performance impact GPU performance issue when calling slicing tf.int16 , tf.int32 (op StridedSlice) Aug 11, 2021
@farotem farotem changed the title GPU performance issue when calling slicing tf.int16 , tf.int32 (op StridedSlice) GPU performance issue when calling slicing for tensors of types tf.int16, tf.int32 (op StridedSlice) Aug 11, 2021
@sushreebarsa sushreebarsa added comp:ops OPs related issues TF 2.5 Issues related to TF 2.5 labels Aug 12, 2021
sushreebarsa (Contributor) commented

@farotem Could you please take a look at the link and let us know if it helps? Thanks!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Aug 12, 2021
farotem (Author) commented Aug 12, 2021

Thank you for the fast response.
I looked at it before, and I'm not sure how it helps. I think the behavior should be the same for all types; the link to the low-level API doesn't help if only a specific type requires the low-level API.
Any other ideas?
Also, is it possible to add the GPU tag to this issue?

@sushreebarsa sushreebarsa added comp:gpu GPU related issues type:performance Performance Issue and removed type:bug Bug stat:awaiting response Status - Awaiting response from author comp:ops OPs related issues labels Aug 13, 2021
sushreebarsa (Contributor) commented

@sanatmpa1 I was able to replicate the issue on Colab with TF v2.4, v2.5, and tf-nightly; please find the gists attached. Thank you!

sanatmpa1 commented

@farotem,

Please take a look at this gist-2.6 from TF 2.6.0 and gist-nightly from tf-nightly. The time difference between int and float has narrowed significantly in TF 2.6, and in tf-nightly the time taken for int is close to what it takes for float, which indicates that it's already fixed in the nightly build. Let me know if this addresses your issue. Thanks!

@sanatmpa1 sanatmpa1 added the stat:awaiting response Status - Awaiting response from author label Aug 13, 2021
farotem (Author) commented Aug 16, 2021

Thank you for your response.
I looked at both gists (2.6, nightly) and I see some concerning issues:

  • timing

    1. float in TF 2.5: ~1.5 ms - 2.5 ms
    2. float in TF 2.6 / nightly: ~14-16 ms
  • runtime device (enabling tf.debugging.set_log_device_placement(True))

    1. in TF 2.5, the op StridedSlice for float runs on device:GPU:0
    2. in TF 2.6 / nightly, the op StridedSlice for float runs on device:CPU:0

So if I understand correctly, the gap between int and float closed because float performance decreased: the StridedSlice op no longer runs on the GPU, only on the CPU. Is that on purpose?

I had hoped for the opposite: that the int StridedSlice op, which runs on CPU only in TF 2.5, would run on the GPU in later releases just like float, so that the performance for int and float would be the same and improved.
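To make the per-dtype placement explicit without scanning the full placement log, a quick check along these lines (a sketch; it relies on the eager Tensor.device attribute) prints where each slice was executed:

import tensorflow as tf

with tf.device('/GPU:0'):
    for dtype in [tf.int16, tf.int32, tf.int64, tf.float16, tf.float32]:
        t = tf.ones((1000, 2, 15, 15), dtype=dtype)
        s = t[0:500]
        # s.device reports the device that produced the sliced tensor.
        print(dtype.name, '->', s.device)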

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Aug 18, 2021
@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Aug 25, 2021