
Can't use HuggingFace Model for evaluation #743

Open
Kraebs opened this issue May 6, 2024 · 10 comments
Kraebs commented May 6, 2024

When I follow the example on this page:
https://docs.confident-ai.com/docs/metrics-introduction

and try to use Mistral-7B as the evaluation model, I always get the error below when running the exact code from the tutorial. It seems there is a mistake in the code when using HuggingFace models for evaluation instead of ChatGPT.

Error:

JSONDecodeError Traceback (most recent call last)
File ~/.conda/envs/evaluation/lib/python3.12/site-packages/deepeval/metrics/utils.py:58, in trimAndLoadJson(input_string, metric)
57 try:
---> 58 return json.loads(jsonStr)
59 except json.JSONDecodeError:

File ~/.conda/envs/evaluation/lib/python3.12/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
343 if (cls is None and object_hook is None and
344 parse_int is None and parse_float is None and
345 parse_constant is None and object_pairs_hook is None and not kw):
--> 346 return _default_decoder.decode(s)
347 if cls is None:

File ~/.conda/envs/evaluation/lib/python3.12/json/decoder.py:340, in JSONDecoder.decode(self, s, _w)
339 if end != len(s):
--> 340 raise JSONDecodeError("Extra data", s, end)
341 return obj

JSONDecodeError: Extra data: line 4 column 1 (char 110)

During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[4], line 18
...
---> 63 raise ValueError(error_str)
64 except Exception as e:
65 raise Exception(f"An unexpected error occurred: {str(e)}")

ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.

Code:
from transformers import AutoModelForCausalLM, AutoTokenizer
from deepeval.models.base_model import DeepEvalBaseLLM
import asyncio

class Mistral7B(DeepEvalBaseLLM):
    def __init__(
        self,
        model,
        tokenizer
    ):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        model = self.load_model()

        device = "cuda"  # the device to load the model onto

        model_inputs = self.tokenizer([prompt], return_tensors="pt").to(device)
        model.to(device)

        generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
        output = self.tokenizer.batch_decode(generated_ids)[0]
        # result = f"{{ {output} }}"
        return output

    async def a_generate(self, prompt: str) -> str:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.generate, prompt)

    def get_model_name(self):
        return "Mistral 7B"

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

mistral_7b = Mistral7B(model=model, tokenizer=tokenizer)

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model=mistral_7b,
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

Thanks for the help in advance and all the best!
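The "Extra data" part of the traceback means json.loads found a valid JSON value followed by more text, which is common when a base (non-instruct) model keeps generating after the JSON object. One possible workaround (a minimal sketch, assuming the model does emit a JSON object somewhere in its output; extract_first_json is an illustrative helper, not a DeepEval API) is to trim the completion down to the first complete JSON value before generate() returns it:

import json

def extract_first_json(text: str) -> str:
    # Find the first '{' and try to decode one JSON value from there.
    # raw_decode stops at the end of the first valid JSON object, so any
    # trailing text the model generated afterwards is ignored.
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text[start:])
            return json.dumps(obj)
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    return text  # no JSON object found; fall back to the raw output

In the wrapper above, generate() would then return extract_first_json(output) instead of the raw decoded text. This only papers over the symptom: whether the model reliably produces a JSON object at all still depends on the model and the prompt format.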

penguine-ip (Contributor) commented

Hey @Kraebs can you try running the model outside of any metric to see if there are any errors?
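For anyone unsure what that check looks like, a minimal sketch (the prompt below is made up purely for illustration) is to call the wrapper directly, with no DeepEval metric involved:

# Ask the custom wrapper for JSON directly, bypassing every metric.
test_prompt = (
    "Return only a JSON object with a single key 'statements' "
    "whose value is a list of strings. Output nothing else."
)
print(mistral_7b.generate(test_prompt))

If the printed output is not a single clean JSON object, the metric's trimAndLoadJson step will fail in exactly the way shown above.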

hyusterr commented May 8, 2024

I encountered the same problem when using Mistral-7B-Instruct-v0.2.
Also, I'm wondering if I need to add the special tokens like [INST] and [/INST] from the Mistral-Instruct models to the implementation.
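One way to add those tokens without hard-coding them is the tokenizer's chat template. A sketch of what generate() could do instead of tokenizing the raw prompt (untested here, and it assumes the Instruct checkpoint ships a chat template, which the Instruct tokenizers on the Hub normally do):

# Wrap the metric's prompt in the Mistral-Instruct chat format ([INST] ... [/INST])
# rather than passing the raw prompt string straight to the tokenizer.
messages = [{"role": "user", "content": prompt}]
chat_prompt = self.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = self.tokenizer([chat_prompt], return_tensors="pt").to(device)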

TheDominus commented

Same issue for me with another model.

penguine-ip (Contributor) commented

@hyusterr @TheDominus Try using it outside of any metric. If you can't run model.generate() as shown in the docs, you know where the problem is.

nicoeiris11 commented

The same happens to me using SummarizationMetric with default values.

akashlp27 commented

Hi, facing the same issue: outside of the metrics the model is able to generate via model.generate(), but it fails when used with the metrics.

FaizaQamar commented

Facing the same issue: model.generate works but metric.measure doesn't. Somebody provided a solution here, but I couldn't understand it. Does anybody else?

MINJIK01 commented Jul 3, 2024

I also encountered this error. I just followed the instructions on the official website (https://docs.confident-ai.com/docs/metrics-introduction). Has anyone been able to solve it?

@penguine-ip
Copy link
Contributor

@akashlp27 @FaizaQamar @MINJIK01 Can you show the error messages?

MINJIK01 commented Jul 5, 2024

My error is here.

================================ ERRORS ================================
________________________ ERROR collecting test_mistral7b.py ________________________
../../anaconda3/envs/graph_llm/lib/python3.10/site-packages/deepeval/metrics/utils.py:63: in trimAndLoadJson
return json.loads(jsonStr)
../../anaconda3/envs/graph_llm/lib/python3.10/json/__init__.py:346: in loads
return _default_decoder.decode(s)
../../anaconda3/envs/graph_llm/lib/python3.10/json/decoder.py:340: in decode
raise JSONDecodeError("Extra data", s, end)
E json.decoder.JSONDecodeError: Extra data: line 4 column 1 (char 110)

During handling of the above exception, another exception occurred:
test_mistral7b.py:64: in <module>
metric.measure(test_case)
../../anaconda3/envs/graph_llm/lib/python3.10/site-packages/deepeval/metrics/answer_relevancy/answer_relevancy.py:67: in measure
self.statements: List[str] = self._generate_statements(
../../anaconda3/envs/graph_llm/lib/python3.10/site-packages/deepeval/metrics/answer_relevancy/answer_relevancy.py:229: in _generate_statements
data = trimAndLoadJson(res, self)
../../anaconda3/envs/graph_llm/lib/python3.10/site-packages/deepeval/metrics/utils.py:68: in trimAndLoadJson
raise ValueError(error_str)
E ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.
=========================== short test summary info ===========================
ERROR test_mistral7b.py - ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.
!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!
======================== 4 warnings, 1 error in 17.68s ========================
No test cases found, please try again.

(screenshot of the error attached)
