
Can't use HuggingFace Model for evaluation #743

Open
Kraebs opened this issue May 6, 2024 · 10 comments
Kraebs commented May 6, 2024

When I follow the example on this page:
https://docs.confident-ai.com/docs/metrics-introduction

and try to use Mistral-7B as the evaluation model, I always get the error below when running the exact code from the tutorial. It seems there is a mistake in the code when using HuggingFace models for evaluation instead of ChatGPT.

Error:

JSONDecodeError Traceback (most recent call last)
File ~/.conda/envs/evaluation/lib/python3.12/site-packages/deepeval/metrics/utils.py:58, in trimAndLoadJson(input_string, metric)
57 try:
---> 58 return json.loads(jsonStr)
59 except json.JSONDecodeError:

File ~/.conda/envs/evaluation/lib/python3.12/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
343 if (cls is None and object_hook is None and
344 parse_int is None and parse_float is None and
345 parse_constant is None and object_pairs_hook is None and not kw):
--> 346 return _default_decoder.decode(s)
347 if cls is None:

File ~/.conda/envs/evaluation/lib/python3.12/json/decoder.py:340, in JSONDecoder.decode(self, s, _w)
339 if end != len(s):
--> 340 raise JSONDecodeError("Extra data", s, end)
341 return obj

JSONDecodeError: Extra data: line 4 column 1 (char 110)

During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[4], line 18
...
---> 63 raise ValueError(error_str)
64 except Exception as e:
65 raise Exception(f"An unexpected error occurred: {str(e)}")

ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.

Code:
from transformers import AutoModelForCausalLM, AutoTokenizer
from deepeval.models.base_model import DeepEvalBaseLLM
import asyncio

class Mistral7B(DeepEvalBaseLLM):
    def __init__(
        self,
        model,
        tokenizer
    ):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        model = self.load_model()

        device = "cuda"  # the device to load the model onto

        model_inputs = self.tokenizer([prompt], return_tensors="pt").to(device)
        model.to(device)

        generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
        output = self.tokenizer.batch_decode(generated_ids)[0]
        # result = f"{{ {output} }}"
        return output

    async def a_generate(self, prompt: str) -> str:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.generate, prompt)

    def get_model_name(self):
        return "Mistral 7B"

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

mistral_7b = Mistral7B(model=model, tokenizer=tokenizer)

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model=mistral_7b,
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

Thanks for the help in advance and all the best!
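The "Extra data" part of the traceback means json.loads found a valid JSON value followed by more text, which is common when a base (non-instruct) model keeps generating after the JSON object. One possible workaround (a minimal sketch, assuming the model does emit a JSON object somewhere in its output; extract_first_json is an illustrative helper, not a DeepEval API) is to trim the completion down to the first complete JSON value before generate() returns it:

import json

def extract_first_json(text: str) -> str:
    # Find the first '{' and try to decode one JSON value from there.
    # raw_decode stops at the end of the first valid JSON object, so any
    # trailing text the model generated afterwards is ignored.
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text[start:])
            return json.dumps(obj)
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    return text  # no JSON object found; fall back to the raw output

In the wrapper above, generate() would then return extract_first_json(output) instead of the raw decoded text. This only papers over the symptom: whether the model reliably produces a JSON object at all still depends on the model and the prompt format.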

penguine-ip (Contributor) commented

Hey @Kraebs can you try running the model outside of any metric to see if there are any errors?
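For anyone unsure what that check looks like, a minimal sketch (the prompt below is made up purely for illustration) is to call the wrapper directly, with no DeepEval metric involved:

# Ask the custom wrapper for JSON directly, bypassing every metric.
test_prompt = (
    "Return only a JSON object with a single key 'statements' "
    "whose value is a list of strings. Output nothing else."
)
print(mistral_7b.generate(test_prompt))

If the printed output is not a single clean JSON object, the metric's trimAndLoadJson step will fail in exactly the way shown above.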

hyusterr commented May 8, 2024

I encountered the same problem when using Mistral-7B-Instruct-v0.2.
Also, I'm wondering if I need to add the special tokens like [INST] and [/INST] from the Mistral-Instruct models to the implementation.
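One way to add those tokens without hard-coding them is the tokenizer's chat template. A sketch of what generate() could do instead of tokenizing the raw prompt (untested here, and it assumes the Instruct checkpoint ships a chat template, which the Instruct tokenizers on the Hub normally do):

# Wrap the metric's prompt in the Mistral-Instruct chat format ([INST] ... [/INST])
# rather than passing the raw prompt string straight to the tokenizer.
messages = [{"role": "user", "content": prompt}]
chat_prompt = self.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = self.tokenizer([chat_prompt], return_tensors="pt").to(device)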

TheDominus commented

Same issue for me with another model.

penguine-ip (Contributor) commented

@hyusterr @TheDominus Try using it outside of any metric. If you can't run model.generate() as shown in the docs, you know where the problem is.

nicoeiris11 commented

The same happens to me using SummarizationMetric with default values.

akashlp27 commented

Hi, facing the same issue: outside of the metrics the model is able to generate via model.generate(), but it fails when used with the metrics.

FaizaQamar commented

Facing the same issue: model.generate works but metric.measure doesn't. Somebody provided a solution here, but I couldn't understand it. Does anybody else?

MINJIK01 commented Jul 3, 2024

I also encountered this error. I just followed the instructions on the official website (https://docs.confident-ai.com/docs/metrics-introduction). Has anyone been able to solve it?

@penguine-ip
Copy link
Contributor

@akashlp27 @FaizaQamar @MINJIK01 Can you show the error messages?

MINJIK01 commented Jul 5, 2024

My error is here.

================================ ERRORS ================================
________________________ ERROR collecting test_mistral7b.py ________________________
../../anaconda3/envs/graph_llm/lib/python3.10/site-packages/deepeval/metrics/utils.py:63: in trimAndLoadJson
return json.loads(jsonStr)
../../anaconda3/envs/graph_llm/lib/python3.10/json/__init__.py:346: in loads
return _default_decoder.decode(s)
../../anaconda3/envs/graph_llm/lib/python3.10/json/decoder.py:340: in decode
raise JSONDecodeError("Extra data", s, end)
E json.decoder.JSONDecodeError: Extra data: line 4 column 1 (char 110)

During handling of the above exception, another exception occurred:
test_mistral7b.py:64: in <module>
metric.measure(test_case)
../../anaconda3/envs/graph_llm/lib/python3.10/site-packages/deepeval/metrics/answer_relevancy/answer_relevancy.py:67: in measure
self.statements: List[str] = self._generate_statements(
../../anaconda3/envs/graph_llm/lib/python3.10/site-packages/deepeval/metrics/answer_relevancy/answer_relevancy.py:229: in _generate_statements
data = trimAndLoadJson(res, self)
../../anaconda3/envs/graph_llm/lib/python3.10/site-packages/deepeval/metrics/utils.py:68: in trimAndLoadJson
raise ValueError(error_str)
E ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.
=========================== short test summary info ===========================
ERROR test_mistral7b.py - ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.
!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!
======================== 4 warnings, 1 error in 17.68s ========================
No test cases found, please try again.

(screenshot of the error attached)
