
TFLite GPUv2: ADD(x, 1e-5) results in severely wrong output #67216

Open
gustavla opened this issue May 9, 2024 · 6 comments

Labels: comp:lite (TF Lite related issues), TF 2.16, TFLiteGpuDelegate (TFLite Gpu delegate issue), type:bug (Bug)

gustavla (Contributor) commented May 9, 2024

System information

  • Samsung Galaxy S23 / Android 13 / Snapdragon® 8 Gen 2 | SM8550
  • GPUv2 delegate
  • TFLite 2.16.1

Assets:

  • 67216_post_add_numerical_issues.tflite (model)
  • 67216_post_add_numerical_issues_inputs.npz (example inputs)

Please take a look at two outputs in particular of this network:

  • key = "model_13/featurefusion_network/encoder/query_layer/norm/LayerNormalization/moments/variance" (variance)
  • key2 = "model_13/featurefusion_network/encoder/query_layer/norm/LayerNormalization/batchnorm/add" (add)

The variable variance gets fed into ADD(x, 0.000009999999747378752) and comes out as add.

[Image: graph visualization of the subgraph where the variance tensor (output of MEAN) feeds into the ADD node]

I ran this on the CPU (xnnpack) and the GPU (GPUv2) and got totally different results.

variance looks like this across CPU and GPU (so far consistent):

(Pdb) p cpu[key][0].ravel()
array([0.8386347 , 0.8353483 , 0.83554685, 0.8366282 , 0.8377434 ,
       0.8369055 , 0.8419936 , 0.8433927 , 0.83845955, 0.83644855,
       0.8404068 , 0.8368349 , 0.8335228 , 0.8401757 , 0.83619094,
       0.8386446 ], dtype=float32)
(Pdb) p gpu[key][0].ravel()
array([0.8378906 , 0.83496094, 0.8354492 , 0.8364258 , 0.83691406,
       0.8364258 , 0.8417969 , 0.84375   , 0.83740234, 0.8354492 ,
       0.84033203, 0.83691406, 0.8334961 , 0.83984375, 0.8359375 ,
       0.8383789 ], dtype=float32)

add looks like this across CPU and GPU:

(Pdb) p cpu[key2][0].ravel()
array([0.83864474, 0.8353583 , 0.83555686, 0.8366382 , 0.8377534 ,
       0.8369155 , 0.8420036 , 0.84340274, 0.83846956, 0.83645856,
       0.8404168 , 0.8368449 , 0.8335328 , 0.8401857 , 0.83620095,
       0.83865464], dtype=float32)
(Pdb) p gpu[key2][0].ravel()
array([-0.78222656,  0.20910645, -0.72802734,  0.32421875, -0.78125   ,
        0.20947266, -0.72753906,  0.3244629 , -0.7817383 ,  0.2097168 ,
       -0.7265625 ,  0.3251953 , -0.78125   ,  0.2097168 , -0.72802734,
        0.3244629 ], dtype=float32)

Here, the values on the GPU have gone completely off the rails. They do not look random, though, since there is a periodicity to the output (the error alternates between around 1.6 and 0.6).
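A quick way to see the pattern (a sketch; cpu, gpu, and key2 are as defined in the repro script below):

import numpy as np

# Assuming cpu, gpu, and key2 as defined in the repro script below.
err = cpu[key2][0].ravel() - gpu[key2][0].ravel()
print(err.reshape(-1, 4))  # each column is nearly constant: ~1.62, ~0.63, ~1.56, ~0.51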

Standalone code to reproduce the issue
This should be simple to set up through the benchmark tool or any other way of running GPUv2 directly. I ran it through Qualcomm's AI Hub (https://aihub.qualcomm.com), so I'm attaching the script that I used as a reference. It also shows how the example inputs can be loaded into Python.

import numpy as np
import qai_hub as hub

inputs = np.load("67216_post_add_numerical_issues_inputs.npz")

model = hub.upload_model("67216_post_add_numerical_issues.tflite")
device = hub.Device("Samsung Galaxy S23")
input_data = hub.upload_dataset({
    "image": [inputs["image"]], 
    "feature_template": [inputs["feature_template"]],
    "pos_template": [inputs["pos_template"]],
    "pos_search": [inputs["pos_search"]],
})

job_cpu = hub.submit_inference_job(
    model,
    device=device,
    inputs=input_data,
    options="--compute_unit cpu",
)

job_gpu = hub.submit_inference_job(
    model,
    device=device,
    inputs=input_data,
    options="--compute_unit gpu",
)

cpu = job_cpu.download_output_data()
gpu = job_gpu.download_output_data()

key = "model_13/featurefusion_network/encoder/query_layer/norm/LayerNormalization/moments/variance"
key2 = "model_13/featurefusion_network/encoder/query_layer/norm/LayerNormalization/batchnorm/add"

print(gpu[key2][0].ravel())
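
For anyone reproducing without AI Hub, here is a minimal sketch using the TFLite Python interpreter with the GPUv2 delegate shared library; the library path and the input-name matching are assumptions that depend on your platform and build:

import numpy as np
import tensorflow as tf

inputs = np.load("67216_post_add_numerical_issues_inputs.npz")

# The delegate library path is an assumption; it varies by platform and build.
gpu_delegate = tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")
interpreter = tf.lite.Interpreter(
    model_path="67216_post_add_numerical_issues.tflite",
    experimental_delegates=[gpu_delegate],
)
interpreter.allocate_tensors()

for detail in interpreter.get_input_details():
    # Assumes each input tensor's name contains the matching .npz key
    # (e.g. "image"); adjust to the actual tensor names if not.
    key = next(k for k in inputs.files if k in detail["name"])
    interpreter.set_tensor(detail["index"], inputs[key])
interpreter.invoke()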
@gustavla gustavla added the comp:lite TF Lite related issues label May 9, 2024
@tilakrayal tilakrayal added TFLiteGpuDelegate TFLite Gpu delegate issue TF 2.16 type:bug Bug labels May 9, 2024
sawantkumar commented

Hi @gustavla ,

I replicated your issue using Qualcomm AI Hub, and I got the same results as you. Let me verify the same through an Android app, and I will get back to you.

impjdi (Contributor) commented May 31, 2024
  1. Is this OpenGL or OpenCL?
  2. What is the precision? FP16 or FP32?
  3. The GPU delegate works on 4D tensors of shape [B, H, W, C]. It looks like the network starts with [1, H, W, C], but throughout the network you see tensor shapes like 16x1x256, which will then be auto-expanded to [16, 1, 1, 256]. This is obviously wrong, because B is expected to be 1 (the way I read your network). Can the tensor dimensions be carefully reviewed and made into consistent, well-formed 4D tensors? (See the sketch after this list.)
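
To make point 3 concrete, a small NumPy sketch of the two readings of a 16x1x256 tensor (the expansion rule is paraphrased from the comment above, not taken from the delegate source):

import numpy as np

x = np.zeros((16, 1, 256), dtype=np.float32)  # rank-3 tensor as it appears in the graph

# Reading described above: the leading dimension becomes the batch.
as_delegate_sees_it = x.reshape(16, 1, 1, 256)  # [B=16, H=1, W=1, C=256]

# Reading the model presumably intends: a single batch item.
as_intended = x.reshape(1, 16, 1, 256)  # [B=1, H=16, W=1, C=256]

print(as_delegate_sees_it.shape, as_intended.shape)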

sawantkumar commented

Hi @gustavla ,

I tried reproducing your issue using an Android app, but I kept running into issues with passing the inputs to the TFLite model. If possible, can you please reproduce this error through an Android app?

gustavla (Contributor, Author) commented Jun 4, 2024

@impjdi

  1. OpenCL

  2. FP16. I've attached the GPUv2 configuration below, which will allow FP16 execution. Note that I have confirmed that this is not a precision issue at all. It may be a bug in the FP16 implementation only, but it's not because of the lower precision. I can try running this with the FP32 mode too.

  3. Can the tensor dimensions be carefully reviewed and made consistent, well-formed 4D tensors?

    This is part of a large model that was compiled from Tensorflow to TFLite using the TFLiteConverter. Making sure this model is 4D at the TFLite level is a non-trivial task. I am also under the impression that the current 3D tensors are perfectly well-formed as far as TFLite is concerned (if not, that's a bug in TFLiteConverter then). Yes, GPUv2 may translate them to rank 4 as an implementation detail, but if the rank 3 tensors are valid, then isn't it a bug in GPUv2 if this translation to rank 4 is causing issues?

@sawantkumar I do not have a stand-alone app that reproduces this. However, the original repro did run through an app using TFLite on a real Android device. Feel free to re-save the npy file in another format that you are more used to using for repros like this. For instance, you can do packed row-major as inputs["image"].tofile("image.raw"). How do you usually feed in specific data in situations like this?
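
For example, a short sketch that dumps every array in the .npz to packed row-major .raw files (filenames are illustrative):

import numpy as np

inputs = np.load("67216_post_add_numerical_issues_inputs.npz")
for name in inputs.files:
    # One packed row-major float32 file per input tensor.
    inputs[name].astype(np.float32).tofile(f"{name}.raw")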

GPUv2 configuration:
inference_preference = TFLITE_GPU_INFERENCE_PREFERENCE_SUSTAINED_SPEED
inference_priority1 = TFLITE_GPU_INFERENCE_PRIORITY_MIN_LATENCY
inference_priority2 = TFLITE_GPU_INFERENCE_PRIORITY_MIN_MEMORY_USAGE
inference_priority3 = TFLITE_GPU_INFERENCE_PRIORITY_MAX_PRECISION

sawantkumar commented

Hi @gustavla ,

If possible, can you share the TensorFlow-to-TFLite model conversion script? The issue could lie there as well.

@sawantkumar sawantkumar added the stat:awaiting response Status - Awaiting response from author label Jun 4, 2024
impjdi (Contributor) commented Jun 9, 2024

Sorry for the late turnaround.

Took a look at the network. You are tapping into intermediate tensors, but the GPU delegate reuses intermediate tensor buffers, even when such a tensor is declared a graph output (sorry, that's a limitation that would require some engineering resources to fix, and we never prioritized it). If you really want to tap into an intermediate tensor, you have to add a small no-op. For example, if you want to read what you have named "variance" in the above picture (the output of MEAN), add a small non-zero value, e.g. ADD(x, 1e-6) (not to be confused with the ADD(x, 0.000009999999747378752) you already have), so that the tensor becomes a true terminal tensor.
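
For reference, one way to apply this workaround at conversion time; this is a sketch assuming a Keras source model, and `tap_layer_name` is a hypothetical stand-in for the layer whose output you want to read:

import tensorflow as tf

# `model` is the original Keras model; `tap_layer_name` names the intermediate
# tensor to read back (both are illustrative stand-ins).
tapped = model.get_layer(tap_layer_name).output
tap = tf.keras.layers.Lambda(lambda t: t + 1e-6, name="debug_tap")(tapped)

# Exposing the tapped tensor as an extra model output makes it a true terminal
# tensor, so the GPU delegate will not reuse its buffer.
debug_model = tf.keras.Model(model.inputs, model.outputs + [tap])
tflite_model = tf.lite.TFLiteConverter.from_keras_model(debug_model).convert()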

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jun 9, 2024