

Failure to convert Gemma 2B models to TFLite #63025

Open
RageshAntonyHM opened this issue Feb 22, 2024 · 59 comments
Labels
comp:lite (TF Lite related issues), TFLiteConverter (For issues related to TFLite converter), type:bug (Bug)

Comments

@RageshAntonyHM commented Feb 22, 2024

I tried converting the Google Gemma 2B model to TFLite, but the conversion ends in failure.

1. System information

  • Ubuntu 22.04
  • TensorFlow installation (installed with keras-nlp) :
  • TensorFlow library (installed with keras-nlp):

2. Code

import os
import time

import numpy as np

# Set credentials and the Keras backend before importing Keras/KerasNLP so the
# backend choice actually takes effect.
os.environ["KAGGLE_USERNAME"] = "rag"
os.environ["KAGGLE_KEY"] = 'e7c'
os.environ["KERAS_BACKEND"] = "tensorflow"  # Or "jax" or "torch".

import keras
import keras_nlp
import tensorflow as tf
import tensorflow_text as tf_text
from tensorflow.lite.python import interpreter

preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    'gemma_2b_en', sequence_length=4096, add_end_token=True
)
generator = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

def run_inference(input, generate_tflite):
  interp = interpreter.InterpreterWithCustomOps(
      model_content=generate_tflite,
      custom_op_registerers=tf_text.tflite_registrar.SELECT_TFTEXT_OPS)
  interp.get_signature_list()

  preprocessor_output = preprocessor.generate_preprocess(
    input, sequence_length=preprocessor.sequence_length
  )
  runner = interp.get_signature_runner('serving_default')
  # The signature runner expects keyword arguments (token_ids, padding_mask).
  output = runner(**preprocessor_output)
  output = preprocessor.generate_postprocess(output["output_0"])
  print("\nGenerated with TFLite:\n", output)

generate_function = generator.make_generate_function()
concrete_func = generate_function.get_concrete_function({
  "token_ids": tf.TensorSpec([None, 4096]),
  "padding_mask": tf.TensorSpec([None, 4096])
})


converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func],
                                                            generator)
converter.target_spec.supported_ops = [
  tf.lite.OpsSet.TFLITE_BUILTINS, # enable TensorFlow Lite ops.
  tf.lite.OpsSet.SELECT_TF_OPS # enable TensorFlow ops.
]
converter.allow_custom_ops = True
converter.target_spec.experimental_select_user_tf_ops = ["UnsortedSegmentJoin", "UpperBound"]
converter._experimental_guarantee_all_funcs_one_use = True
generate_tflite = converter.convert()
run_inference("I'm enjoying a", generate_tflite)

with open('unquantized_mistral.tflite', 'wb') as f:
  f.write(generate_tflite)

3. Failure after conversion

I am getting this error:

tensorflow/core.py":65:1))))))))))))))))))))))))))]): error: missing attribute 'value'
LLVM ERROR: Failed to infer result type(s).
Aborted (core dumped)

5. (optional) Any other info / logs

2024-02-22 06:34:41.094712: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:378] Ignored output_format.
2024-02-22 06:34:41.094742: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:381] Ignored drop_control_dependency.
2024-02-22 06:34:41.095691: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: /tmp/tmp58p378bn
2024-02-22 06:34:41.140303: I tensorflow/cc/saved_model/reader.cc:51] Reading meta graph with tags { serve }
2024-02-22 06:34:41.140329: I tensorflow/cc/saved_model/reader.cc:146] Reading SavedModel debug info (if present) from: /tmp/tmp58p378bn
2024-02-22 06:34:41.233389: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-02-22 06:34:41.264724: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
2024-02-22 06:34:43.697440: I tensorflow/cc/saved_model/loader.cc:217] Running initialization op on SavedModel bundle at path: /tmp/tmp58p378bn
2024-02-22 06:34:44.189111: I tensorflow/cc/saved_model/loader.cc:316] SavedModel load for tags { serve }; Status: success: OK. Took 3093423 microseconds.
2024-02-22 06:34:45.009212: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
loc(fused["ReadVariableOp:", callsite("decoder_block_0_1/attention_1/attention_output_1/Cast/ReadVariableOp@__inference_generate_step_12229"("/workspace/gem.py":38:1) at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_causal_lm.py":258:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_causal_lm.py":235:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_causal_lm.py":212:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_causal_lm.py":214:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/layers/layer.py":816:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/ops/operation.py":42:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":157:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_decoder_block.py":147:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/layers/layer.py":816:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/ops/operation.py":42:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":157:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras_nlp/models/gemma/gemma_attention.py":193:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/layers/layer.py":816:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":118:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/ops/operation.py":42:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py":157:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/layers/core/einsum_dense.py":218:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/ops/numpy.py":2414:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/numpy.py":90:1 at callsite("/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/numpy.py":91:1 at "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/core.py":65:1))))))))))))))))))))))))))]): error: missing attribute 'value'
LLVM ERROR: Failed to infer result type(s).
Aborted (core dumped)
@RageshAntonyHM RageshAntonyHM added the TFLiteConverter For issues related to TFLite converter label Feb 22, 2024
@tilakrayal tilakrayal added comp:lite TF Lite related issues type:bug Bug labels Feb 22, 2024
@LakshmiKalaKadali (Contributor)

Hi @RageshAntonyHM,

I am trying to reproduce the issue, but I ran into another error: ModuleNotFoundError: No module named 'keras_nlp.backend'. Could you please confirm the version you are using?

Thank You

@RageshAntonyHM (Author)

@LakshmiKalaKadali

It is Keras 3.0.5, and I installed keras-nlp via pip install git+https://github.com/keras-team/keras-nlp (0.8.1).
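
In case it helps, a quick way to confirm the versions that are actually being picked up (a minimal check, assuming the packages import cleanly):

import keras
import keras_nlp
import tensorflow as tf

print("keras:", keras.__version__)
print("keras_nlp:", keras_nlp.__version__)
print("tensorflow:", tf.__version__)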

@RageshAntonyHM (Author)

@LakshmiKalaKadali

First run pip install git+https://github.com/keras-team/keras-nlp and then update Keras (pip install -U keras).

@RageshAntonyHM (Author)

Then also install tensorflow-datasets, @LakshmiKalaKadali.

@farmaker47

It also crashes in Colab, with or without quantization.

@RageshAntonyHM (Author)

@farmaker47

This conversion pipeline needs a lot of VRAM, at least 24 GB.

@LakshmiKalaKadali any updates on this please?

@urim85 commented Feb 24, 2024

@RageshAntonyHM
I got the same crash on a Colab A100 (40 GB GPU RAM).

@RageshAntonyHM (Author)

@urim85

Yeah. Actually, it is still crashing for me on a 48 GB RTX 6000.

(What I told @farmaker47 was that it will crash prematurely if VRAM is low, but it also crashes at the final step even if you have enough VRAM.)

@farmaker47 commented Feb 25, 2024

I saw that training works OK after first installing the TensorFlow nightly version (2.17.0-dev20240223). @RageshAntonyHM can you try with the nightly version and check the conversion again?

@RageshAntonyHM (Author) commented Feb 25, 2024

@farmaker47

How do I install the TensorFlow nightly version? I tried pip install tf-nightly, but I am getting this error:

File "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/core.py", line 5, in
from tensorflow.compiler.tf2xla.python.xla import dynamic_update_slice
ModuleNotFoundError: No module named 'tensorflow.compiler.tf2xla'

Name: tf-nightly
Version: 2.17.0.dev20240223

@farmaker47

I work with Colab, so it is:

!pip install tf-nightly
!pip install -q --upgrade keras-nlp
!pip install -q -U keras>=3

@RageshAntonyHM (Author)

@farmaker47

Now I am again getting the first error I mentioned.

Could you please share your notebook link?

@farmaker47

The Colab is from this example:

https://ai.google.dev/gemma/docs/lora_tuning

I have changed nothing. So the idea is that if you install tf-nightly, the conversion error disappears? From your previous answer I can't tell whether the error occurred during the tf-nightly installation or during the conversion.

@RageshAntonyHM (Author)

@farmaker47

I suspect some package conflict, like some packages reinstalling the 'stable' version of TensorFlow. Let me check.

@RageshAntonyHM (Author)

@farmaker47

I am able to run inference already. My problem is that I need to create a TFLite model for Gemma 2B; I think there is still some problem in the conversion.

I am very new to AI and even Python.

@farmaker47

> @farmaker47
>
> I am able to run inference already. My problem is that I need to create a TFLite model for Gemma 2B; I think there is still some problem in the conversion.
>
> I am very new to AI and even Python.

Then we have to wait a bit for the TF team to solve this and provide a tf-nightly version we can use to convert it.

@RageshAntonyHM (Author) commented Feb 25, 2024

@LakshmiKalaKadali

The import from keras_nlp.backend import ops is not needed, sorry.

But when using all the nightly versions, I got some "GraphDef" issue.

@freedomtan (Contributor)

A minimal script to reproduce the issue:

import os

import keras
import keras_nlp
import tensorflow as tf

os.environ["KAGGLE_USERNAME"] = '....'
os.environ["KAGGLE_KEY"] = '...'
os.environ["KERAS_BACKEND"] = "tensorflow" 

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
tf.saved_model.save(gemma_lm.backbone, '/tmp/gemma_saved_model/')

f = tf.lite.TFLiteConverter.from_saved_model('/tmp/gemma_saved_model/').convert()

I tested with tf 2.15, 2.16, and the 2.17 nightly, with their corresponding packages. None of them work.

@LakshmiKalaKadali (Contributor)

Hi @pkgoogle,

I have reproduced the issue in Colab with TF 2.15; the session crashed at the step generator = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en"). Please take a look.

Thank You

@ymodak (Contributor) commented Feb 27, 2024

Adding @advaitjain and @paulinesho for visibility.

@pkgoogle pkgoogle added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Feb 27, 2024
@nyadla-sys (Member) commented Mar 1, 2024

Use the code snippet below to generate a quantized Gemma 2B model; it will be around 2.33 GB.

# Convert the model
converter = tf.lite.TFLiteConverter.from_saved_model('gemma_2/')
converter.target_spec.supported_ops = [
  tf.lite.OpsSet.TFLITE_BUILTINS, # enable TensorFlow Lite ops.
  tf.lite.OpsSet.SELECT_TF_OPS # enable TensorFlow ops.
]
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Save the model
with open('gemma2-quantized.tflite', 'wb') as f:
    f.write(tflite_model)
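
For completeness, the snippet above assumes a SavedModel already exists at gemma_2/. A minimal sketch of producing it, mirroring the reproduction script earlier in this thread (the export path is just an example):

import os
os.environ["KERAS_BACKEND"] = "tensorflow"  # set before importing keras_nlp

import keras_nlp
import tensorflow as tf

# Export the Gemma backbone as a SavedModel that the TFLite converter can read.
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
tf.saved_model.save(gemma_lm.backbone, "gemma_2/")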

@nyadla-sys (Member)

Will post the inference results soon

@RageshAntonyHM (Author)

@nyadla-sys
Did you run the inference on Android?

@RageshAntonyHM (Author)

@farmaker47

You said that you were able to run it on Android.

I generated the quantized model as per @nyadla-sys's code and tried running it on Android:

    private lateinit var interpreter: Interpreter
    private val OUTPUT_BUFFER_SIZE = 800
    private val outputBuffer = ByteBuffer.allocateDirect(OUTPUT_BUFFER_SIZE)

    @WorkerThread
    private fun runInterpreterOn(input: String): String {

        interpreter = Interpreter(gemmaTfLiteFile)

        outputBuffer.clear()

        // Run interpreter, which will generate text into outputBuffer
        interpreter.run(input, outputBuffer)

        // Set output buffer limit to current position & position to 0
        outputBuffer.flip()

        // Get bytes from output buffer
        val bytes = ByteArray(outputBuffer.remaining())
        outputBuffer.get(bytes)

        outputBuffer.clear()

        // Return bytes converted to String
        return String(bytes, Charsets.UTF_8)
    }

But I get this error:

Cannot convert between a TensorFlowLite tensor with type FLOAT32 and a Java object of type java.lang.String (which is compatible with the TensorFlowLite type STRING).

How do I fix this?

@RageshAntonyHM (Author) commented Mar 6, 2024

@farmaker47 @nyadla-sys @freedomtan

How do I create the Gemma tokenizer for inputs and outputs on Android? I think interpreter.run(input, outputBuffer) needs Gemma's specific input format.

The input formats:

0 serving_default_inputs_1:0 FLOAT32 [1, 3][-1, 3]
1 serving_default_inputs:0 FLOAT32 [1, 3][-1, 3]

output:
1872 StatefulPartitionedCall_1:0 FLOAT32 [1, 3, 2048][-1, 3, 2048]
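
Not an Android answer, but for reference here is a minimal Python sketch of the same call: the converted model expects FLOAT32 tensors of token ids and a padding mask rather than a java.lang.String, so the prompt has to go through the Gemma tokenizer first. The file name, the input ordering, and the assumption that the sequence dimension is resizable are guesses; check them against get_input_details() for your model.

import numpy as np
import keras_nlp
import tensorflow as tf

# Load the converted model (file name is just an example).
interp = tf.lite.Interpreter(model_path="gemma2-quantized.tflite")
details = interp.get_input_details()
for d in details:
    # Inspect which index holds the token ids and which holds the padding mask.
    print(d["index"], d["name"], d["dtype"], d["shape"])

# Tokenize the prompt with the Gemma tokenizer and feed the ids as FLOAT32.
tokenizer = keras_nlp.models.GemmaTokenizer.from_preset("gemma_2b_en")
ids = np.asarray(tokenizer("I'm enjoying a"), dtype=np.float32)[None, :]
mask = np.ones_like(ids)

# If the sequence dimension is dynamic, resize to the prompt length first;
# otherwise pad/truncate ids and mask to the fixed length instead.
interp.resize_tensor_input(details[0]["index"], ids.shape)
interp.resize_tensor_input(details[1]["index"], mask.shape)
interp.allocate_tensors()
interp.set_tensor(details[0]["index"], ids)   # verify against the printed names
interp.set_tensor(details[1]["index"], mask)
interp.invoke()
hidden_states = interp.get_tensor(interp.get_output_details()[0]["index"])
print(hidden_states.shape)  # the backbone output, e.g. [1, seq_len, 2048]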

@farmaker47

After the Gemma release they also presented this for converting Gemma models:
https://github.com/googlesamples/mediapipe/tree/main/examples/llm_inference/conversion
and the Android app that uses the converted files:
https://github.com/googlesamples/mediapipe/tree/main/examples/llm_inference/android

You can also dig in there for the tokenizer, I suppose.

@jagmohaniiit

@farmaker47 @nyadla-sys @freedomtan
I was able to perform fine-tuning using LoRA and conversion of Gemma 2B to TensorFlow Lite on an A100 GPU, based on comments in this thread. I am interested in performing inference using the exported tflite on Linux (Colab). However, we need to perform tokenization ourselves (for which I used the Gemma tokenizer from keras_nlp), and the backbone takes only fixed-length input, so how to generate a response from a text prompt using the tflite on Linux (Colab) is not clear to me. Any suggestions are welcome.
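
For what it's worth, here is a minimal sketch of that kind of fixed-length inference on Linux, mirroring the run_inference function at the top of this thread. It assumes the model was converted from the generate signature with 4096-length token_ids/padding_mask inputs, that the signature is named serving_default with an output_0 output, and that the flatbuffer was saved as unquantized_mistral.tflite as in the original script; adjust these to your export.

import keras_nlp
import tensorflow as tf
import tensorflow_text as tf_text
from tensorflow.lite.python import interpreter

preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    "gemma_2b_en", sequence_length=4096
)

# Load the converted flatbuffer.
with open("unquantized_mistral.tflite", "rb") as f:
    generate_tflite = f.read()

interp = interpreter.InterpreterWithCustomOps(
    model_content=generate_tflite,
    custom_op_registerers=tf_text.tflite_registrar.SELECT_TFTEXT_OPS)
runner = interp.get_signature_runner("serving_default")

# generate_preprocess pads/truncates the batched prompt to the fixed length.
inputs = preprocessor.generate_preprocess(
    ["Tell me about TensorFlow Lite"], sequence_length=4096
)
# Cast the tensors here if the exported signature expects a different dtype.
outputs = runner(token_ids=inputs["token_ids"],
                 padding_mask=inputs["padding_mask"])
print(preprocessor.generate_postprocess(outputs["output_0"]))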

@pkgoogle

Hi all, @RageshAntonyHM, if the MediaPipe workflow is not ideal for your use case, we have another option: AI-Edge-Torch, our PyTorch conversion library. You can find more information here: googleblog.

We actually have examples of converting and quantizing decoder-only LLMs here: https://github.com/google-ai-edge/ai-edge-torch/tree/main/ai_edge_torch/generative/examples.
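
(For anyone unfamiliar with AI-Edge-Torch, the basic conversion API looks roughly like this generic, non-Gemma sketch adapted from its README; the Gemma examples linked above use their own re-authored model definitions.)

import ai_edge_torch
import torch
import torchvision

# Convert an eval-mode PyTorch model using representative sample inputs.
resnet18 = torchvision.models.resnet18(
    torchvision.models.ResNet18_Weights.IMAGENET1K_V1
).eval()
sample_inputs = (torch.randn(1, 3, 224, 224),)
edge_model = ai_edge_torch.convert(resnet18, sample_inputs)
edge_model.export("resnet18.tflite")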

If the conversion is successful, you can also try visualizing the result in model-explorer.

Please try them out and let us know if this resolves your issue. If you still need further help, feel free to open a new issue at the respective repo. Thanks for your help.

@pkgoogle pkgoogle added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Jun 11, 2024

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Jun 19, 2024
@jigmam commented Jun 23, 2024

Hello everyone, I followed all the steps in this issue to create an LLM on mobile using gemma-2b-it, but I have a problem when I try to run the LLM with MediaPipe.

E0000 00:00:1719161438.366071   20510 calculator_graph.cc:887] INTERNAL: CalculatorGraph::Run() failed: 
                                                                                                    Calculator::Open() for node "odml.infra.TfLitePrefillDecodeRunnerCalculator" failed: ; RET_CHECK failure (external/odml/odml/infra/genai/inference/calculators/tflite_prefill_decode_runner_calculator.cc:157) (prefill_runner_)!=(nullptr)

The steps that I followed were:

  1. Trained the LLM with Keras.
  2. Converted from Keras to TFLite.
  3. Converted the TFLite model to MediaPipe (using Colab).
  4. Loaded the .task in Android with MediaPipe.

Does anybody know anything about this error?

@google-ml-butler google-ml-butler bot removed stale This label marks the issue/pr stale - to be closed automatically if no activity stat:awaiting response Status - Awaiting response from author labels Jun 23, 2024
@pkgoogle

Hi @jigmam, if you have your converted model and Android Studio project, please share them. We need the context around the call that produces the error in order to debug it. Thanks for your help!

@pkgoogle pkgoogle added the stat:awaiting response Status - Awaiting response from author label Jun 24, 2024
@jigmam commented Jun 27, 2024

Thanks for your response.
I am using React Native; this is the repository: https://github.com/jigmam/reactllm.git
The .task is at this link: https://drive.google.com/file/d/1iJW75azdzf1o6LC0-ohAW8g8brohqb7P/view?usp=drive_link

If you need to know how I created the .task, you can see:

  1. https://github.com/jigmam/reactllm/blob/main/Copia_de_lora_tuning.ipynb
  2. https://github.com/jigmam/reactllm/blob/main/convertidor_de_modelos_funciona_conTPU.ipynb
  3. https://colab.research.google.com/github/googlesamples/mediapipe/blob/main/examples/llm_inference/bundling/llm_bundling.ipynb (Note: for the tokenizer.model I downloaded a .model file from https://huggingface.co/google/gemma-2b-it/tree/main; I know it is not the best way, but I didn't know how to get that tokenizer.model.)

I am blocked, I don't know what to do.

If you have any questions or suggestions, please let me know.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jun 27, 2024
@pkgoogle

Hi @jigmam, while there is probably a way to make MediaPipe work with React Native, I suspect we will keep running into different issues if we go that route (as it's not a well traveled road). Would you be open to switching to an Android Studio project? If you are targeting Android: https://ai.google.dev/edge/mediapipe/solutions/setup_android

@pkgoogle pkgoogle added stat:awaiting response Status - Awaiting response from author labels Jun 28, 2024
@jigmam commented Jun 28, 2024

It's okay for me, I just need an MVP to present a prototype. Following your recommendation I created a new project using the example at that link, but the .task is not working. Could it be an issue in the converter from Keras to TFLite?

@pkgoogle

Hi @jigmam, currently Keras 3 does not work well with TFLite, primarily due to SavedModel format issues. What I'm hearing from you is that you were able to follow https://github.com/google-ai-edge/ai-edge-torch/tree/main/ai_edge_torch/generative/examples successfully? If so, great! If not, let us know if you are stuck.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jun 28, 2024
@pkgoogle pkgoogle added the stat:awaiting response Status - Awaiting response from author label Jun 28, 2024
@jigmam commented Jun 28, 2024

Not yet, I am reading the example. Several questions: do I need to do the fine-tuning again? Is that tutorial only for converting from PyTorch to TFLite? Remember, I did the fine-tuning in Keras.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jun 28, 2024