I've evaluated other LLMs such as Claude and PaLM. For the evaluation, I used the following code to convert the multi-round messages used for ChatGPT into a single-round prompt:
    prompt = ""
    for turn in messages:
        # Keep system and user turns; turns with any other role
        # (e.g. assistant) are skipped.
        if turn['role'] in ('system', 'user'):
            prompt += f"{turn['content']}\n\n"
    # Cue the model to answer with the ranked passage identifiers.
    prompt += "The ranking results of the 20 passages (only identifiers) is:"
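Since the prompt asks for identifiers only, the model's reply still has to be turned back into a passage order. Below is a minimal sketch of that step. It assumes the reply contains 1-based identifiers such as `[2] > [1] > [3]`, which is not confirmed above; `parse_ranking` is a hypothetical helper, not part of the repository.

```python
import re
from typing import List


def parse_ranking(response: str, num_passages: int) -> List[int]:
    """Turn a reply such as '[2] > [1] > [3]' into 0-based passage indices."""
    seen = set()
    order = []
    for token in re.findall(r"\d+", response):
        idx = int(token) - 1  # assumption: identifiers in the reply are 1-based
        # Ignore out-of-range identifiers and repeated ones.
        if 0 <= idx < num_passages and idx not in seen:
            seen.add(idx)
            order.append(idx)
    # Passages the model left out are appended in their original (BM25) order.
    order.extend(i for i in range(num_passages) if i not in seen)
    return order
```

For example, `parse_ranking("[2] > [1] > [3]", 4)` returns `[1, 0, 2, 3]`: the fourth passage was not mentioned, so it keeps its place at the end. Falling back to the original order for omitted passages is a design choice of this sketch. It guarantees a complete permutation even when a model returns a partial or noisy list.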
I used the models to re-rank the top-20 BM25 passages on TREC DL-19. Here are the results I got, for reference:
| Method | NDCG@1 | NDCG@5 | NDCG@10 |
|---|---|---|---|
| OpenAI text-davinci-003 | 70.54 | 61.90 | 57.24 |
| OpenAI gpt-3.5-turbo-0301 | 75.58 | 66.19 | 60.89 |
| OpenAI gpt-4-0314 | 79.46 | 71.65 | 65.68 |
| Cohere rerank-english-v2.0 | 79.46 | 71.56 | 64.78 |
| Anthropic claude-v1 | 74.81 | 65.47 | 60.25 |
| Anthropic claude-instant-v1 | 69.38 | 63.12 | 57.93 |
| Google chat-bison-001 | 63.95 | 59.59 | 54.51 |
| Google text-bison-001 | 69.77 | 64.46 | 58.67 |
| Google bard-2023.06.07 | 68.60 | 61.62 | 56.27 |
| Google FLAN-T5-XXL (11B) | 52.71 | 51.63 | 50.26 |
| Tsinghua ChatGLM-6B | 54.26 | 52.77 | 50.58 |
| LMSYS Vicuna-13B-v1.1 | 54.26 | 51.55 | 49.08 |
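For anyone comparing their own runs against these numbers, here is a minimal sketch of NDCG@k. It uses one common formulation: linear gain with a log2 rank discount. Whether this matches the exact evaluation tool and settings used for the scores above is an assumption, and the function names are illustrative. The scores above appear to be on a 0-100 scale, while this sketch returns values in [0, 1].

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k items (rank 1 has discount 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains: Sequence[float], all_gains: Sequence[float], k: int) -> float:
    """NDCG@k for one query.

    ranked_gains: relevance labels of the passages in the order the model ranked them.
    all_gains:    relevance labels of every judged passage for the query, in any
                  order; used to build the ideal ranking.
    """
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0
```

A per-query score is averaged over all queries to get the reported figure. Note that the ideal ranking is built from every judged passage, not only from the 20 being re-ranked, so a perfect re-ranking of the top-20 can still score below 1.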
I haven't yet evaluated the latest LLMs like LLaMA-2, Claude-2, etc.
Since the APIs of the different LLMs vary significantly, I ran a separate evaluation for each model. It would be great if the relevant code could be uploaded to make it easier to use LLMs other than ChatGPT :)
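One way to contain the differences between APIs is to hide each provider behind the same "prompt in, text out" function, so the rest of the evaluation does not need to know which model is being called. The sketch below is only an illustration of that idea. `BACKENDS`, `register`, `complete`, and the `echo-stub` backend are all hypothetical names, and no real provider SDK is called.

```python
from typing import Callable, Dict

# A backend is any function that maps a single-round prompt to the model's reply.
Backend = Callable[[str], str]
BACKENDS: Dict[str, Backend] = {}


def register(name: str) -> Callable[[Backend], Backend]:
    """Decorator that stores a backend under a model name."""
    def decorator(fn: Backend) -> Backend:
        BACKENDS[name] = fn
        return fn
    return decorator


@register("echo-stub")
def echo_stub(prompt: str) -> str:
    # Placeholder standing in for a real API call; always returns a fixed ranking.
    return "[2] > [1]"


def complete(model: str, prompt: str) -> str:
    """Send the prompt to the backend registered for `model`."""
    if model not in BACKENDS:
        raise KeyError(f"no backend registered for {model!r}")
    return BACKENDS[model](prompt)
```

With this shape, supporting a new model means writing one function that wraps that provider's client and registering it. The prompt construction and the scoring code stay unchanged.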
Hi,
Is there any reason this doesn't also evaluate other models like Claude, PaLM, Llama 2 (via Replicate), etc.?
Happy to make a PR for the relevant code changes if that's a blocker.