
TFLite Converter, add possibility to ignore some OPs from quantization #62923

Open · adamp87 opened this issue Feb 8, 2024 · 4 comments
Labels: comp:lite, ModelOptimizationToolkit, stat:awaiting tensorflower, TF 2.13, TFLiteConverter, type:feature

adamp87 commented Feb 8, 2024

Issue type

Feature Request

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

v2.13.0-17-gf841394b1b7

Custom code

No

OS platform and distribution

No response

Mobile device

No response

Python version

3.10.13

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

Quantizing models to full integer works as expected, but because some of the final operations then execute in INT8, a large accuracy drop can be observed in some models.

This ticket is a feature request for the ability to exclude specific operations from quantization and execute them in FP32. OpenVINO supports this via the ignored_scope parameter during quantization (see the OpenVINO quantizer documentation). Considering how the Edge TPU works, the solution should be to let the user set where quantization stops and execute the remaining ops in FP32 on the CPU.
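For reference, the OpenVINO/NNCF feature this request points to looks roughly like the sketch below; the model path, calibration data, and node names are placeholders, not values from this issue:

```python
import numpy as np
import nncf
import openvino as ov

# Placeholder model and calibration data.
model = ov.Core().read_model("yolov8n.xml")
data_loader = [{"images": np.random.rand(1, 3, 640, 640).astype(np.float32)}]
calibration_dataset = nncf.Dataset(data_loader, lambda item: item["images"])

quantized_model = nncf.quantize(
    model,
    calibration_dataset,
    # Nodes/types listed here are excluded from quantization and stay FP32.
    ignored_scope=nncf.IgnoredScope(
        names=["/model.22/dfl/conv/Conv"],  # hypothetical node name
        types=["Sigmoid"],                  # or exclude whole op types
    ),
)
```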

Let's take yolov8n as an example and convert the PyTorch model to TF using onnx2tf, comparing the main branch in full INT8 quantization against a dirty hack that detaches the last operations and executes the model as INT8 + FP32. As a note, Edge TPU-compiled models with inputs larger than 192 pixels already execute the head on the CPU, as some Transpose operations are too large for the TPU.
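For context, the "Full INT8" rows below were produced with the standard full-integer conversion path; a minimal sketch, with the saved-model path and calibration data as placeholders:

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Calibration samples matching the model input; random data for brevity.
    for _ in range(100):
        yield [np.random.rand(1, 640, 640, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("yolov8n_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
```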

| Model (yolov8n)    | mAP50 | mAP50-95 | Note        | Speed on Intel CPU |
|--------------------|-------|----------|-------------|--------------------|
| Baseline FP32      | 52.6  | 37.4     | Main branch | N/A                |
| TFLite Full INT8   | 48.8  | 32.9     | per-tensor  | 162.2 ms           |
| TFLite INT8 + FP32 | 50.3  | 35.2     | per-tensor  | 166.0 ms           |
| TFLite Full INT8   | 49.8  | 33.9     | per-channel | N/A                |
| TFLite INT8 + FP32 | 51.4  | 36.3     | per-channel | N/A                |

Standalone code to reproduce the issue

https://github.com/adamp87/ultralytics/blob/tflite_detach_dirty/yolo8_full_int8_nohead_test.ipynb

Relevant log output

No response

sushreebarsa (Contributor) commented Feb 12, 2024

@adamp87 One workaround could be to quantize the entire model, then fine-tune specific layers with FP32 weights and activations. This approach can be less efficient and might not fully address the accuracy concerns. Thank you!
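For what it's worth, the Model Optimization Toolkit does allow annotating only some layers for quantization-aware training, which is close to this suggestion; a minimal sketch with a toy model (not the YOLO architecture):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer

# Only the annotated layer is prepared for quantization; the final
# layer is left un-annotated and stays FP32.
annotated_model = tf.keras.Sequential([
    quantize_annotate_layer(
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,))),
    tf.keras.layers.Dense(10),
])
quant_aware_model = tfmot.quantization.keras.quantize_apply(annotated_model)
quant_aware_model.compile(optimizer="adam", loss="mse")
# ...fine-tune, then convert with TFLiteConverter as usual.
```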

adamp87 (Author) commented Feb 12, 2024

@sushreebarsa Not sure how that would be possible; is there documentation on how to perform such a task? Currently my workaround is to split the model's call() code, as sketched below. I think a solution could be something similar to OpenVINO's ignored_scope (reference).
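A rough illustration of that call()-splitting workaround, with toy layers standing in for the real yolov8n backbone and head:

```python
import tensorflow as tf

# Toy stand-ins for the real yolov8n layers.
inputs = tf.keras.Input(shape=(640, 640, 3))
feat = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
backbone = tf.keras.Model(inputs, feat)        # converted to full-INT8 TFLite

head_in = tf.keras.Input(shape=feat.shape[1:])
boxes = tf.keras.layers.Conv2D(4, 1)(head_in)  # accuracy-sensitive head ops
head = tf.keras.Model(head_in, boxes)          # converted separately, kept FP32
```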

pkgoogle commented Feb 13, 2024

Hi @abattery, can you please take a look? Thanks.

adamp87 (Author) commented Mar 1, 2024

Hi @abattery,

In the meantime I got a suggestion to experiment with the QuantizationDebugger; please see this ticket: https://github.com/PINTO0309/onnx2tf/issues/578

What is your opinion on it? Is this the right way? I'm still having some issues that could be discussed; a sketch of the approach is below.

Thank you!
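For reference, a minimal sketch of the QuantizationDebugger-based selective quantization discussed in that ticket; `converter` and `representative_dataset` are assumed to be configured as in the full-INT8 sketch earlier in this issue, and the denylisted node name is hypothetical:

```python
import tensorflow as tf

# Nodes listed here are skipped during quantization and remain FP32.
debug_options = tf.lite.experimental.QuantizationDebugOptions(
    denylisted_nodes=["model/head/concat"])  # hypothetical node name
debugger = tf.lite.experimental.QuantizationDebugger(
    converter=converter,
    debug_dataset=representative_dataset,
    debug_options=debug_options)

# All ops except the denylisted nodes are quantized to INT8.
selective_model = debugger.get_nondebug_quantized_model()
```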
