Evaluating Claude / PaLM / etc. #6

krrishdholakia · 2023-08-03T23:48:30Z

Hi,

any reason this doesn't also evaluate other models like Claude, Palm, Llama 2 (via Replicate), etc.?

Happy to make a PR for the relevant code changes if that's a blocker

sunnweiwei · 2023-08-04T07:06:51Z

I've evaluated other LLMs like Claude, PaLM, etc. In evaluation, I used the following code to convert the multi-round message used for ChatGPT into a single round prompt:

prompt = ""
for turn in messages:
    if turn['role'] == 'system':
        prompt += f"{turn['content']}\n\n"
    elif turn['role'] == 'user':
        prompt += f"{turn['content']}\n\n"
prompt += "The ranking results of the 20 passages (only identifiers) is:"

I used the models to re-rank the top-20 BM25 passages on TREC DL-19, here are the results I got for reference:

Method	NDCG@1	NDCG@5	NDCG@10
OpenAI text-davinci-003	70.54	61.90	57.24
OpenAI gpt-3.5-turbo-0301	75.58	66.19	60.89
OpenAI gpt-4-0314	79.46	71.65	65.68
Cohere rerank-english-v2.0	79.46	71.56	64.78
Antropic claude-v1	74.81	65.47	60.25
Antropic claude-instant-v1	69.38	63.12	57.93
Google chat-bison-001	63.95	59.59	54.51
Google text-bison-001	69.77	64.46	58.67
Google bard-2023.06.07	68.60	61.62	56.27
Google FLAN-T5-XXL (11B)	52.71	51.63	50.26
Tsinghua ChatGLM-6B	54.26	52.77	50.58
LMSYS Vicuna-13B-v1.1	54.26	51.55	49.08

I haven't yet evaluated the latest LLMs like LLaMA-2, Claude-2, etc.

Since the APIs of the different LLMs vary significantly, I've conducted individual evaluations for each model. It would be great if the relevant code could be uploaded to help the use of LLMs other than ChatGPT :)

krrishdholakia mentioned this issue Aug 4, 2023

adding support for Claude, Replicate, Cohere, Azure #7

Merged

Albert-Ma closed this as completed Sep 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evaluating Claude / PaLM / etc. #6

Evaluating Claude / PaLM / etc. #6

Evaluating Claude / PaLM / etc. #6

Evaluating Claude / PaLM / etc. #6

Comments