
TFLite GPUv2: ADD(x, 1e-5) results in severely wrong output #67216

Open
gustavla opened this issue May 9, 2024 · 6 comments

Labels: comp:lite (TF Lite related issues), TF 2.16, TFLiteGpuDelegate (TFLite Gpu delegate issue), type:bug (Bug)

gustavla (Contributor) commented May 9, 2024

System information

  • Samsung Galaxy S23 / Android 13 / Snapdragon® 8 Gen 2 | SM8550
  • GPUv2 delegate
  • TFLite 2.16.1

Assets:

  • 67216_post_add_numerical_issues.tflite (model)
  • 67216_post_add_numerical_issues_inputs.npz (example inputs)

Please take a look at two outputs in particular of this network:

  • key = "model_13/featurefusion_network/encoder/query_layer/norm/LayerNormalization/moments/variance" (variance)
  • key2 = "model_13/featurefusion_network/encoder/query_layer/norm/LayerNormalization/batchnorm/add" (add)

The variable variance gets fed into ADD(x, 0.000009999999747378752) and comes out as add.

[Image: graph visualization of the subgraph where the variance tensor (output of MEAN) feeds into the ADD node]

I ran this on the CPU (xnnpack) and the GPU (GPUv2) and got totally different results.

variance looks like this across CPU and GPU (so far consistent):

(Pdb) p cpu[key][0].ravel()
array([0.8386347 , 0.8353483 , 0.83554685, 0.8366282 , 0.8377434 ,
       0.8369055 , 0.8419936 , 0.8433927 , 0.83845955, 0.83644855,
       0.8404068 , 0.8368349 , 0.8335228 , 0.8401757 , 0.83619094,
       0.8386446 ], dtype=float32)
(Pdb) p gpu[key][0].ravel()
array([0.8378906 , 0.83496094, 0.8354492 , 0.8364258 , 0.83691406,
       0.8364258 , 0.8417969 , 0.84375   , 0.83740234, 0.8354492 ,
       0.84033203, 0.83691406, 0.8334961 , 0.83984375, 0.8359375 ,
       0.8383789 ], dtype=float32)

add looks like this across CPU and GPU:

(Pdb) p cpu[key2][0].ravel()
array([0.83864474, 0.8353583 , 0.83555686, 0.8366382 , 0.8377534 ,
       0.8369155 , 0.8420036 , 0.84340274, 0.83846956, 0.83645856,
       0.8404168 , 0.8368449 , 0.8335328 , 0.8401857 , 0.83620095,
       0.83865464], dtype=float32)
(Pdb) p gpu[key2][0].ravel()
array([-0.78222656,  0.20910645, -0.72802734,  0.32421875, -0.78125   ,
        0.20947266, -0.72753906,  0.3244629 , -0.7817383 ,  0.2097168 ,
       -0.7265625 ,  0.3251953 , -0.78125   ,  0.2097168 , -0.72802734,
        0.3244629 ], dtype=float32)

Here, the values on the GPU have gone completely off the rails. They do not look random, though, since there is a periodicity to the output (the error alternates between around 1.6 and 0.6).
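A quick way to see the pattern (a sketch; cpu, gpu, and key2 are as defined in the repro script below):

import numpy as np

# Assuming cpu, gpu, and key2 as defined in the repro script below.
err = cpu[key2][0].ravel() - gpu[key2][0].ravel()
print(err.reshape(-1, 4))  # each column is nearly constant: ~1.62, ~0.63, ~1.56, ~0.51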

Standalone code to reproduce the issue
This should be simple to set up through the benchmark tool or any other way of running GPUv2 directly. I ran it through Qualcomm's AI Hub (https://aihub.qualcomm.com), so I'm attaching the script that I used as a reference. It also shows how the example inputs can be loaded into Python.

import numpy as np
import qai_hub as hub

inputs = np.load("67216_post_add_numerical_issues_inputs.npz")

model = hub.upload_model("67216_post_add_numerical_issues.tflite")
device = hub.Device("Samsung Galaxy S23")
input_data = hub.upload_dataset({
    "image": [inputs["image"]], 
    "feature_template": [inputs["feature_template"]],
    "pos_template": [inputs["pos_template"]],
    "pos_search": [inputs["pos_search"]],
})

job_cpu = hub.submit_inference_job(
    model,
    device=device,
    inputs=input_data,
    options="--compute_unit cpu",
)

job_gpu = hub.submit_inference_job(
    model,
    device=device,
    inputs=input_data,
    options="--compute_unit gpu",
)

cpu = job_cpu.download_output_data()
gpu = job_gpu.download_output_data()

key = "model_13/featurefusion_network/encoder/query_layer/norm/LayerNormalization/moments/variance"
key2 = "model_13/featurefusion_network/encoder/query_layer/norm/LayerNormalization/batchnorm/add"

print(gpu[key2][0].ravel())
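
For anyone reproducing without AI Hub, here is a minimal sketch using the TFLite Python interpreter with the GPUv2 delegate shared library; the library path and the input-name matching are assumptions that depend on your platform and build:

import numpy as np
import tensorflow as tf

inputs = np.load("67216_post_add_numerical_issues_inputs.npz")

# The delegate library path is an assumption; it varies by platform and build.
gpu_delegate = tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")
interpreter = tf.lite.Interpreter(
    model_path="67216_post_add_numerical_issues.tflite",
    experimental_delegates=[gpu_delegate],
)
interpreter.allocate_tensors()

for detail in interpreter.get_input_details():
    # Assumes each input tensor's name contains the matching .npz key
    # (e.g. "image"); adjust to the actual tensor names if not.
    key = next(k for k in inputs.files if k in detail["name"])
    interpreter.set_tensor(detail["index"], inputs[key])
interpreter.invoke()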
@gustavla gustavla added the comp:lite TF Lite related issues label May 9, 2024
@tilakrayal tilakrayal added TFLiteGpuDelegate TFLite Gpu delegate issue TF 2.16 type:bug Bug labels May 9, 2024
sawantkumar commented

Hi @gustavla ,

I replicated your issue using Qualcomm AI Hub, and I got the same results as you. Let me verify the same through an Android app, and I will get back to you.

impjdi (Contributor) commented May 31, 2024
  1. Is this OpenGL or OpenCL?
  2. What is the precision? FP16 or FP32?
  3. The GPU delegate works on 4D tensors of shape [B, H, W, C]. It looks like the network starts with [1, H, W, C], but throughout the network you see tensor shapes like 16x1x256, which will then be auto-expanded to [16, 1, 1, 256]. This is obviously wrong, because B is expected to be 1 (the way I read your network). Can the tensor dimensions be carefully reviewed and made into consistent, well-formed 4D tensors? (See the sketch after this list.)
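
To make point 3 concrete, a small NumPy sketch of the two readings of a 16x1x256 tensor (the expansion rule is paraphrased from the comment above, not taken from the delegate source):

import numpy as np

x = np.zeros((16, 1, 256), dtype=np.float32)  # rank-3 tensor as it appears in the graph

# Reading described above: the leading dimension becomes the batch.
as_delegate_sees_it = x.reshape(16, 1, 1, 256)  # [B=16, H=1, W=1, C=256]

# Reading the model presumably intends: a single batch item.
as_intended = x.reshape(1, 16, 1, 256)  # [B=1, H=16, W=1, C=256]

print(as_delegate_sees_it.shape, as_intended.shape)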

sawantkumar commented

Hi @gustavla ,

I tried reproducing your issue using an Android app, but I kept running into issues with passing the inputs to the TFLite model. If possible, can you please reproduce this error through an Android app?

gustavla (Contributor, Author) commented Jun 4, 2024

@impjdi

  1. OpenCL

  2. FP16. I've attached the GPUv2 configuration below, which will allow FP16 execution. Note that I have confirmed that this is not a precision issue at all. It may be a bug in the FP16 implementation only, but it's not because of the lower precision. I can try running this with the FP32 mode too.

  3. Can the tensor dimensions be carefully reviewed and made consistent, well-formed 4D tensors?

    This is part of a large model that was compiled from Tensorflow to TFLite using the TFLiteConverter. Making sure this model is 4D at the TFLite level is a non-trivial task. I am also under the impression that the current 3D tensors are perfectly well-formed as far as TFLite is concerned (if not, that's a bug in TFLiteConverter then). Yes, GPUv2 may translate them to rank 4 as an implementation detail, but if the rank 3 tensors are valid, then isn't it a bug in GPUv2 if this translation to rank 4 is causing issues?

@sawantkumar I do not have a stand-alone app that reproduces this. However, the original repro did run through an app using TFLite on a real Android device. Feel free to re-save the npy file in another format that you are more used to using for repros like this. For instance, you can do packed row-major as inputs["image"].tofile("image.raw"). How do you usually feed in specific data in situations like this?
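
For example, a short sketch that dumps every array in the .npz to packed row-major .raw files (filenames are illustrative):

import numpy as np

inputs = np.load("67216_post_add_numerical_issues_inputs.npz")
for name in inputs.files:
    # One packed row-major float32 file per input tensor.
    inputs[name].astype(np.float32).tofile(f"{name}.raw")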

GPUv2 configuration:
inference_preference = TFLITE_GPU_INFERENCE_PREFERENCE_SUSTAINED_SPEED
inference_priority1 = TFLITE_GPU_INFERENCE_PRIORITY_MIN_LATENCY
inference_priority2 = TFLITE_GPU_INFERENCE_PRIORITY_MIN_MEMORY_USAGE
inference_priority3 = TFLITE_GPU_INFERENCE_PRIORITY_MAX_PRECISION

sawantkumar commented

Hi @gustavla ,

If possible, can you share the TensorFlow-to-TFLite model conversion script? The issue could lie there as well.

@sawantkumar sawantkumar added the stat:awaiting response Status - Awaiting response from author label Jun 4, 2024
impjdi (Contributor) commented Jun 9, 2024

Sorry for the late turnaround.

Took a look at the network. You are tapping into intermediate tensors, but the GPU delegate reuses intermediate tensor buffers, even when such a tensor is declared a graph output (sorry, that's a limitation that would require some engineering resources to fix, and we never prioritized it). If you really want to tap into an intermediate tensor, you have to add a small no-op. For example, if you want to read what you have named "variance" in the above picture (the output of MEAN), add a small non-zero value, e.g. ADD(x, 1e-6) (not to be confused with the ADD(x, 0.000009999999747378752) you already have), so that the tensor becomes a true terminal tensor.
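
For reference, one way to apply this workaround at conversion time; this is a sketch assuming a Keras source model, and `tap_layer_name` is a hypothetical stand-in for the layer whose output you want to read:

import tensorflow as tf

# `model` is the original Keras model; `tap_layer_name` names the intermediate
# tensor to read back (both are illustrative stand-ins).
tapped = model.get_layer(tap_layer_name).output
tap = tf.keras.layers.Lambda(lambda t: t + 1e-6, name="debug_tap")(tapped)

# Exposing the tapped tensor as an extra model output makes it a true terminal
# tensor, so the GPU delegate will not reuse its buffer.
debug_model = tf.keras.Model(model.inputs, model.outputs + [tap])
tflite_model = tf.lite.TFLiteConverter.from_keras_model(debug_model).convert()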

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jun 9, 2024