GPU performance issue when calling slicing for tensors of types tf.int16, tf.int32 (op StridedSlice) #51428
Comments
Thank you for the fast response.
@sanatmpa1 Was able to replicate the issue on Colab with TF v2.4, v2.5, and tf-nightly. Please find the gists attached. Thank you!
Please take a look at this gist-2.6 from |
Thank you for your response.
If I understand correctly, the gap between int and float has closed because float performance has decreased: the StridedSlice op is no longer placed on the GPU and runs only on the CPU. Is that on purpose? I had hoped for the opposite: that the int StridedSlice op, which runs only on the CPU in TF 2.5, would run on the GPU in later releases just like the float one, so that int and float performance would be the same and both improved.
System information
Describe the current behavior
Slicing tensors of dtype int16 or int32 takes significantly longer than slicing float16 or float32 tensors:
type array: <dtype: 'int16'>
took 35.05420684814453
type array: <dtype: 'int32'>
took 22.861242294311523
type array: <dtype: 'int64'>
took 5.330085754394531
type array: <dtype: 'float16'>
took 1.550912857055664
type array: <dtype: 'float32'>
took 2.5637149810791016
type array: <dtype: 'float64'>
took 5.917549133300781
type array: <dtype: 'bool'>
took 1.6639232635498047
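To put the timings above in proportion, here is a small sketch that divides each measurement by the float32 figure. The numbers are copied from the output above. The name `timings` and the choice of float32 as the baseline are illustrative assumptions, and the report does not state the unit of the measurements.

```python
# Timings copied verbatim from the output above (unit not stated in the report).
timings = {
    "int16": 35.05420684814453,
    "int32": 22.861242294311523,
    "int64": 5.330085754394531,
    "float16": 1.550912857055664,
    "float32": 2.5637149810791016,
    "float64": 5.917549133300781,
    "bool": 1.6639232635498047,
}

# Assumed baseline for the comparison: float32.
baseline = timings["float32"]

# Ratio of each dtype's time to the float32 time.
slowdown = {dtype: t / baseline for dtype, t in timings.items()}

# Print the slowest dtypes first.
for dtype, ratio in sorted(slowdown.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{dtype:>8}: {ratio:5.1f}x float32")
```

With these figures, int16 comes out at roughly 14 times and int32 at roughly 9 times the float32 time, while int64 and float64 are close to each other.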
Describe the expected behavior
Slicing int16 and int32 tensors should take around the same time as slicing the float versions.
Contributing
When running the code with tf.debugging.set_log_device_placement(True), I found that slicing int16 does not use the GPU at all:
type array: <dtype: 'int16'>
Executing op Mul in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
For int32, I don't understand the reason for such a difference compared to float32.
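For longer runs, the placement log quoted above can be summarized rather than read line by line. The following is a minimal sketch that uses only the Python standard library and assumes the log lines have exactly the "Executing op ... in device ..." shape shown above. The helper `count_placements` and the regular expression are hypothetical names written for this illustration and are not part of TensorFlow.

```python
import re
from collections import Counter

# Matches the placement lines quoted above, for example:
# "Executing op Mul in device /job:localhost/replica:0/task:0/device:GPU:0"
PLACEMENT_RE = re.compile(r"Executing op (\S+) in device \S*/device:(\w+):(\d+)")


def count_placements(log_text):
    """Return a Counter keyed by (op_name, device_type)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PLACEMENT_RE.search(line)
        if match:
            op_name, device_type, _device_index = match.groups()
            counts[(op_name, device_type)] += 1
    return counts


# The int16 log excerpt from the report above.
sample_log = """\
type array: <dtype: 'int16'>
Executing op Mul in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:CPU:0
"""

placements = count_placements(sample_log)
for (op_name, device_type), n in sorted(placements.items()):
    print(f"{op_name} on {device_type}: {n}")
```

Running the same helper over the log for each dtype would show at a glance which dtypes have StridedSlice placed on the CPU rather than the GPU.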
Standalone code to reproduce the issue