

Failure to convert Gemma 2B models to TFLite #63025

Open
RageshAntonyHM opened this issue Feb 22, 2024 · 59 comments
Labels
comp:lite (TF Lite related issues), TFLiteConverter (For issues related to TFLite converter), type:bug (Bug)

Comments

@RageshAntonyHM commented Feb 22, 2024

I tried converting the Google Gemma 2B model to TFLite, but the conversion ends in failure.

1. System information

  • Ubuntu 22.04
  • TensorFlow installation (installed with keras-nlp) :
  • TensorFlow library (installed with keras-nlp):

2. Code

import os
import time

import numpy as np

# Set credentials and the Keras backend before importing Keras/KerasNLP so the
# backend choice actually takes effect.
os.environ["KAGGLE_USERNAME"] = "rag"
os.environ["KAGGLE_KEY"] = 'e7c'
os.environ["KERAS_BACKEND"] = "tensorflow"  # Or "jax" or "torch".

import keras
import keras_nlp
import tensorflow as tf
import tensorflow_text as tf_text
from tensorflow.lite.python import interpreter

preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    'gemma_2b_en', sequence_length=4096, add_end_token=True
)
generator = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

def run_inference(input, generate_tflite):
  interp = interpreter.InterpreterWithCustomOps(
      model_content=generate_tflite,
      custom_op_registerers=tf_text.tflite_registrar.SELECT_TFTEXT_OPS)
  interp.get_signature_list()

  preprocessor_output = preprocessor.generate_preprocess(
    input, sequence_length=preprocessor.sequence_length
  )
  runner = interp.get_signature_runner('serving_default')
  # The signature runner expects keyword arguments (token_ids, padding_mask).
  output = runner(**preprocessor_output)
  output = preprocessor.generate_postprocess(output["output_0"])
  print("\nGenerated with TFLite:\n", output)

generate_function = generator.make_generate_function()
concrete_func = generate_function.get_concrete_function({
  "token_ids": tf.TensorSpec([None, 4096]),
  "padding_mask": tf.TensorSpec([None, 4096])
})


converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func],
                                                            generator)
converter.target_spec.supported_ops = [
  tf.lite.OpsSet.TFLITE_BUILTINS, # enable TensorFlow Lite ops.
  tf.lite.OpsSet.SELECT_TF_OPS # enable TensorFlow ops.
]
converter.allow_custom_ops = True
converter.target_spec.experimental_select_user_tf_ops = ["UnsortedSegmentJoin", "UpperBound"]
converter._experimental_guarantee_all_funcs_one_use = True
generate_tflite = converter.convert()
run_inference("I'm enjoying a", generate_tflite)

with open('unquantized_mistral.tflite', 'wb') as f:
  f.write(generate_tflite)

3. Failure after conversion

I am getting this error:

tensorflow/core.py":65:1))))))))))))))))))))))))))]): error: missing attribute 'value'
LLVM ERROR: Failed to infer result type(s).
Aborted (core dumped)

5. (optional) Any other info / logs

2024-02-22 06:34:41.094712: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:378] Ignored output_format.
2024-02-22 06:34:41.094742: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:381] Ignored drop_control_dependency.
2024-02-22 06:34:41.095691: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: /tmp/tmp58p378bn
2024-02-22 06:34:41.140303: I tensorflow/cc/saved_model/reader.cc:51] Reading meta graph with tags { serve }
2024-02-22 06:34:41.140329: I tensorflow/cc/saved_model/reader.cc:146] Reading SavedModel debug info (if present) from: /tmp/tmp58p378bn
2024-02-22 06:34:41.233389: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-02-22 06:34:41.264724: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
2024-02-22 06:34:43.697440: I tensorflow/cc/saved_model/loader.cc:217] Running initialization op on SavedModel bundle at path: /tmp/tmp58p378bn
2024-02-22 06:34:44.189111: I tensorflow/cc/saved_model/loader.cc:316] SavedModel load for tags { serve }; Status: success: OK. Took 3093423 microseconds.
2024-02-22 06:34:45.009212: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
loc(fused["ReadVariableOp:", callsite("decoder_block_0_1/attention_1/attention_output_1/Cast/ReadVariableOp@__inference_generate_step_12229"("/workspace/gem.py":38:1) at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_causal_lm.py":258:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_causal_lm.py":235:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_causal_lm.py":212:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_causal_lm.py":214:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/layers/layer.py":816:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/ops/operation.py":42:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":157:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_decoder_block.py":147:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/layers/layer.py":816:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/ops/operation.py":42:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":157:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_attention.py":193:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/layers/layer.py":816:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/ops/operation.py":42:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":157:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/layers/core/einsum_dense.py":218:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/ops/numpy.py":2414:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/numpy.py":90:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/numpy.py":91:1 at "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/core.py":65:1))))))))))))))))))))))))))]): error: missing attribute 'value'
LLVM ERROR: Failed to infer result type(s).
Aborted (core dumped)
@RageshAntonyHM RageshAntonyHM added the TFLiteConverter For issues related to TFLite converter label Feb 22, 2024
@tilakrayal tilakrayal added comp:lite TF Lite related issues type:bug Bug labels Feb 22, 2024
@LakshmiKalaKadali (Contributor)

Hi @RageshAntonyHM,

I am trying to reproduce the issue, but I ran into another error: ModuleNotFoundError: No module named 'keras_nlp.backend'. Could you please confirm the version you are using?

Thank You

@RageshAntonyHM (Author)

@LakshmiKalaKadali

It is Keras 3.0.5, and I installed keras-nlp via pip install git+https://github.com/keras-team/keras-nlp (0.8.1).
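
In case it helps, a quick way to confirm the versions that are actually being picked up (a minimal check, assuming the packages import cleanly):

import keras
import keras_nlp
import tensorflow as tf

print("keras:", keras.__version__)
print("keras_nlp:", keras_nlp.__version__)
print("tensorflow:", tf.__version__)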

@RageshAntonyHM (Author)

@LakshmiKalaKadali

First run pip install git+https://github.com/keras-team/keras-nlp and then update Keras (pip install -U keras).

@RageshAntonyHM (Author)

Then also install tensorflow-datasets, @LakshmiKalaKadali.

@farmaker47

It also crashes in Colab, with or without quantization.

@RageshAntonyHM (Author)

@farmaker47

This conversion pipeline needs a lot of VRAM, at least 24 GB.

@LakshmiKalaKadali any updates on this please?

@urim85 commented Feb 24, 2024

@RageshAntonyHM
I got the same crash on a Colab A100 (40 GB GPU RAM).

@RageshAntonyHM (Author)

@urim85

Yeah. Actually, it is still crashing for me on a 48 GB RTX 6000.

(What I told @farmaker47 was that it will crash prematurely if VRAM is low, but it also crashes at the final step even if you have enough VRAM.)

@farmaker47 commented Feb 25, 2024

I saw that training works OK after first installing the TensorFlow nightly version (2.17.0-dev20240223). @RageshAntonyHM can you try with the nightly version and check the conversion again?

@RageshAntonyHM (Author) commented Feb 25, 2024

@farmaker47

How do I install the TensorFlow nightly version? I tried pip install tf-nightly, but I am getting this error:

File "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/core.py", line 5, in
from tensorflow.compiler.tf2xla.python.xla import dynamic_update_slice
ModuleNotFoundError: No module named 'tensorflow.compiler.tf2xla'

Name: tf-nightly
Version: 2.17.0.dev20240223

@farmaker47

I work with Colab, so it is:

!pip install tf-nightly
!pip install -q --upgrade keras-nlp
!pip install -q -U keras>=3

@RageshAntonyHM (Author)

@farmaker47

Now I am again getting the first error I mentioned.

Could you please share your notebook link?

@farmaker47

The Colab is from this example:

https://ai.google.dev/gemma/docs/lora_tuning

I have changed nothing. So the idea is that if you install tf-nightly, the conversion error disappears? From your previous answer I can't tell whether the error occurred during the tf-nightly installation or during the conversion.

@RageshAntonyHM (Author)

@farmaker47

I suspect some package conflict, like some packages reinstalling the 'stable' version of TensorFlow. Let me check.

@RageshAntonyHM (Author)

@farmaker47

I am able to run inference already. My problem is that I need to create a TFLite model for Gemma 2B; I think there is still some problem in the conversion.

I am very new to AI and even Python.

@farmaker47

> @farmaker47
>
> I am able to run inference already. My problem is that I need to create a TFLite model for Gemma 2B; I think there is still some problem in the conversion.
>
> I am very new to AI and even Python.

Then we have to wait a bit for the TF team to solve this and provide a tf-nightly version we can use to convert it.

@RageshAntonyHM (Author) commented Feb 25, 2024

@LakshmiKalaKadali

The import from keras_nlp.backend import ops is not needed, sorry.

But when using all the nightly versions, I got some "GraphDef" issue.

@freedomtan (Contributor)

A minimal script to reproduce the issue:

import os

import keras
import keras_nlp
import tensorflow as tf

os.environ["KAGGLE_USERNAME"] = '....'
os.environ["KAGGLE_KEY"] = '...'
os.environ["KERAS_BACKEND"] = "tensorflow" 

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
tf.saved_model.save(gemma_lm.backbone, '/tmp/gemma_saved_model/')

f = tf.lite.TFLiteConverter.from_saved_model('/tmp/gemma_saved_model/').convert()

I tested with tf 2.15, 2.16, and the 2.17 nightly, with their corresponding packages. None of them work.

@LakshmiKalaKadali (Contributor)

Hi @pkgoogle,

I have reproduced the issue in Colab with TF 2.15; the session crashed at the step generator = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en"). Please take a look.

Thank You

@ymodak (Contributor) commented Feb 27, 2024

Adding @advaitjain and @paulinesho for visibility.

@pkgoogle pkgoogle added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Feb 27, 2024
@nyadla-sys (Member) commented Mar 1, 2024

Use the code snippet below to generate a quantized Gemma 2B model; it will be around 2.33 GB.

# Convert the model
converter = tf.lite.TFLiteConverter.from_saved_model('gemma_2/')
converter.target_spec.supported_ops = [
  tf.lite.OpsSet.TFLITE_BUILTINS, # enable TensorFlow Lite ops.
  tf.lite.OpsSet.SELECT_TF_OPS # enable TensorFlow ops.
]
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Save the model
with open('gemma2-quantized.tflite', 'wb') as f:
    f.write(tflite_model)
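
For completeness, the snippet above assumes a SavedModel already exists at gemma_2/. A minimal sketch of producing it, mirroring the reproduction script earlier in this thread (the export path is just an example):

import os
os.environ["KERAS_BACKEND"] = "tensorflow"  # set before importing keras_nlp

import keras_nlp
import tensorflow as tf

# Export the Gemma backbone as a SavedModel that the TFLite converter can read.
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
tf.saved_model.save(gemma_lm.backbone, "gemma_2/")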

@nyadla-sys (Member)

Will post the inference results soon

@RageshAntonyHM (Author)

@nyadla-sys
Did you run the inference on Android?

@RageshAntonyHM (Author)

@farmaker47

You said that you were able to run it on Android.

I generated the quantized model as per @nyadla-sys's code and tried running it on Android:

    private lateinit var interpreter: Interpreter
    private val OUTPUT_BUFFER_SIZE = 800
    private val outputBuffer = ByteBuffer.allocateDirect(OUTPUT_BUFFER_SIZE)

    @WorkerThread
    private fun runInterpreterOn(input: String): String {

        interpreter = Interpreter(gemmaTfLiteFile)

        outputBuffer.clear()

        // Run interpreter, which will generate text into outputBuffer
        interpreter.run(input, outputBuffer)

        // Set output buffer limit to current position & position to 0
        outputBuffer.flip()

        // Get bytes from output buffer
        val bytes = ByteArray(outputBuffer.remaining())
        outputBuffer.get(bytes)

        outputBuffer.clear()

        // Return bytes converted to String
        return String(bytes, Charsets.UTF_8)
    }

But I get this error:

Cannot convert between a TensorFlowLite tensor with type FLOAT32 and a Java object of type java.lang.String (which is compatible with the TensorFlowLite type STRING).

How do I fix this?

@RageshAntonyHM (Author) commented Mar 6, 2024

@farmaker47 @nyadla-sys @freedomtan

How do I create the Gemma tokenizer for inputs and outputs on Android? I think interpreter.run(input, outputBuffer) needs Gemma's specific input format.

The input formats:

0 serving_default_inputs_1:0 FLOAT32 [1, 3][-1, 3]
1 serving_default_inputs:0 FLOAT32 [1, 3][-1, 3]

output:
1872 StatefulPartitionedCall_1:0 FLOAT32 [1, 3, 2048][-1, 3, 2048]
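
Not an Android answer, but for reference here is a minimal Python sketch of the same call: the converted model expects FLOAT32 tensors of token ids and a padding mask rather than a java.lang.String, so the prompt has to go through the Gemma tokenizer first. The file name, the input ordering, and the assumption that the sequence dimension is resizable are guesses; check them against get_input_details() for your model.

import numpy as np
import keras_nlp
import tensorflow as tf

# Load the converted model (file name is just an example).
interp = tf.lite.Interpreter(model_path="gemma2-quantized.tflite")
details = interp.get_input_details()
for d in details:
    # Inspect which index holds the token ids and which holds the padding mask.
    print(d["index"], d["name"], d["dtype"], d["shape"])

# Tokenize the prompt with the Gemma tokenizer and feed the ids as FLOAT32.
tokenizer = keras_nlp.models.GemmaTokenizer.from_preset("gemma_2b_en")
ids = np.asarray(tokenizer("I'm enjoying a"), dtype=np.float32)[None, :]
mask = np.ones_like(ids)

# If the sequence dimension is dynamic, resize to the prompt length first;
# otherwise pad/truncate ids and mask to the fixed length instead.
interp.resize_tensor_input(details[0]["index"], ids.shape)
interp.resize_tensor_input(details[1]["index"], mask.shape)
interp.allocate_tensors()
interp.set_tensor(details[0]["index"], ids)   # verify against the printed names
interp.set_tensor(details[1]["index"], mask)
interp.invoke()
hidden_states = interp.get_tensor(interp.get_output_details()[0]["index"])
print(hidden_states.shape)  # the backbone output, e.g. [1, seq_len, 2048]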

@farmaker47

After the Gemma release they also presented this for converting Gemma models:
https://github.com/googlesamples/mediapipe/tree/main/examples/llm_inference/conversion
and the Android app that uses the converted files:
https://github.com/googlesamples/mediapipe/tree/main/examples/llm_inference/android

You can also dig in there for the tokenizer, I suppose.

@jagmohaniiit

@farmaker47 @nyadla-sys @freedomtan
I was able to perform fine-tuning using LoRA and conversion of Gemma 2B to TensorFlow Lite on an A100 GPU, based on comments in this thread. I am interested in performing inference using the exported tflite on Linux (Colab). However, we need to perform tokenization ourselves (for which I used the Gemma tokenizer from keras_nlp), and the backbone takes only fixed-length input, so how to generate a response from a text prompt using the tflite on Linux (Colab) is not clear to me. Any suggestions are welcome.
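
For what it's worth, here is a minimal sketch of that kind of fixed-length inference on Linux, mirroring the run_inference function at the top of this thread. It assumes the model was converted from the generate signature with 4096-length token_ids/padding_mask inputs, that the signature is named serving_default with an output_0 output, and that the flatbuffer was saved as unquantized_mistral.tflite as in the original script; adjust these to your export.

import keras_nlp
import tensorflow as tf
import tensorflow_text as tf_text
from tensorflow.lite.python import interpreter

preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    "gemma_2b_en", sequence_length=4096
)

# Load the converted flatbuffer.
with open("unquantized_mistral.tflite", "rb") as f:
    generate_tflite = f.read()

interp = interpreter.InterpreterWithCustomOps(
    model_content=generate_tflite,
    custom_op_registerers=tf_text.tflite_registrar.SELECT_TFTEXT_OPS)
runner = interp.get_signature_runner("serving_default")

# generate_preprocess pads/truncates the batched prompt to the fixed length.
inputs = preprocessor.generate_preprocess(
    ["Tell me about TensorFlow Lite"], sequence_length=4096
)
# Cast the tensors here if the exported signature expects a different dtype.
outputs = runner(token_ids=inputs["token_ids"],
                 padding_mask=inputs["padding_mask"])
print(preprocessor.generate_postprocess(outputs["output_0"]))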

@pkgoogle

Hi all, @RageshAntonyHM, if the MediaPipe workflow is not ideal for your use case, we have another option: AI-Edge-Torch, our PyTorch conversion library. You can find more information here: googleblog.

We actually have examples of converting and quantizing decoder-only LLMs here: https://github.com/google-ai-edge/ai-edge-torch/tree/main/ai_edge_torch/generative/examples.
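
(For anyone unfamiliar with AI-Edge-Torch, the basic conversion API looks roughly like this generic, non-Gemma sketch adapted from its README; the Gemma examples linked above use their own re-authored model definitions.)

import ai_edge_torch
import torch
import torchvision

# Convert an eval-mode PyTorch model using representative sample inputs.
resnet18 = torchvision.models.resnet18(
    torchvision.models.ResNet18_Weights.IMAGENET1K_V1
).eval()
sample_inputs = (torch.randn(1, 3, 224, 224),)
edge_model = ai_edge_torch.convert(resnet18, sample_inputs)
edge_model.export("resnet18.tflite")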

If the conversion is successful, you can also try visualizing the result in model-explorer.

Please try them out and let us know if this resolves your issue. If you still need further help, feel free to open a new issue at the respective repo. Thanks for your help.

@pkgoogle pkgoogle added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Jun 11, 2024

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Jun 19, 2024
@jigmam commented Jun 23, 2024

Hello everyone, I followed all the steps in this issue to create an LLM on mobile using gemma-2b-it, but I have a problem when I try to run the LLM with MediaPipe.

E0000 00:00:1719161438.366071   20510 calculator_graph.cc:887] INTERNAL: CalculatorGraph::Run() failed: 
                                                                                                    Calculator::Open() for node "odml.infra.TfLitePrefillDecodeRunnerCalculator" failed: ; RET_CHECK failure (external/odml/odml/infra/genai/inference/calculators/tflite_prefill_decode_runner_calculator.cc:157) (prefill_runner_)!=(nullptr)

The steps that I followed were:

  1. Trained the LLM with Keras.
  2. Converted from Keras to TFLite.
  3. Converted the TFLite model to MediaPipe (using Colab).
  4. Loaded the .task in Android with MediaPipe.

Does anybody know anything about this error?

@google-ml-butler google-ml-butler bot removed stale This label marks the issue/pr stale - to be closed automatically if no activity stat:awaiting response Status - Awaiting response from author labels Jun 23, 2024
@pkgoogle

Hi @jigmam, if you have your converted model and Android Studio project, please share them. We need the context around the call that produces the error in order to debug it. Thanks for your help!

@pkgoogle pkgoogle added the stat:awaiting response Status - Awaiting response from author label Jun 24, 2024
@jigmam commented Jun 27, 2024

Thanks for your response.
I am using React Native; this is the repository: https://github.com/jigmam/reactllm.git
The .task is at this link: https://drive.google.com/file/d/1iJW75azdzf1o6LC0-ohAW8g8brohqb7P/view?usp=drive_link

If you need to know how I created the .task, you can see:

  1. https://github.com/jigmam/reactllm/blob/main/Copia_de_lora_tuning.ipynb
  2. https://github.com/jigmam/reactllm/blob/main/convertidor_de_modelos_funciona_conTPU.ipynb
  3. https://colab.research.google.com/github/googlesamples/mediapipe/blob/main/examples/llm_inference/bundling/llm_bundling.ipynb (Note: for the tokenizer.model I downloaded a .model file from https://huggingface.co/google/gemma-2b-it/tree/main; I know it is not the best way, but I didn't know how to get that tokenizer.model.)

I am blocked, I don't know what to do.

If you have any questions or suggestions, please let me know.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jun 27, 2024
@pkgoogle

Hi @jigmam, while there is probably a way to make MediaPipe work with React Native, I suspect we will keep running into different issues if we go that route (as it's not a well traveled road). Would you be open to switching to an Android Studio project? If you are targeting Android: https://ai.google.dev/edge/mediapipe/solutions/setup_android

@pkgoogle pkgoogle added stat:awaiting response Status - Awaiting response from author labels Jun 28, 2024
@jigmam commented Jun 28, 2024

It's okay for me, I just need an MVP to present a prototype. Following your recommendation I created a new project using the example at that link, but the .task is not working. Could it be an issue in the converter from Keras to TFLite?

@pkgoogle

Hi @jigmam, currently Keras 3 does not work well with TFLite, primarily due to SavedModel format issues. What I'm hearing from you is that you were able to follow https://github.com/google-ai-edge/ai-edge-torch/tree/main/ai_edge_torch/generative/examples successfully? If so, great! If not, let us know if you are stuck.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jun 28, 2024
@pkgoogle pkgoogle added the stat:awaiting response Status - Awaiting response from author label Jun 28, 2024
@jigmam commented Jun 28, 2024

Not yet, I am reading the example. Several questions: do I need to do the fine-tuning again? Is that tutorial only for converting from PyTorch to TFLite? Remember, I did the fine-tuning in Keras.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jun 28, 2024