[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Evaluating Claude / PaLM / etc. #6

Closed
krrishdholakia opened this issue Aug 3, 2023 · 1 comment
Closed

Evaluating Claude / PaLM / etc. #6

krrishdholakia opened this issue Aug 3, 2023 · 1 comment

Comments

@krrishdholakia
Copy link
Contributor

Hi,

any reason this doesn't also evaluate other models like Claude, Palm, Llama 2 (via Replicate), etc.?

Happy to make a PR for the relevant code changes if that's a blocker

@sunnweiwei
Copy link
Owner

I've evaluated other LLMs like Claude, PaLM, etc. In evaluation, I used the following code to convert the multi-round message used for ChatGPT into a single round prompt:

prompt = ""
for turn in messages:
    if turn['role'] == 'system':
        prompt += f"{turn['content']}\n\n"
    elif turn['role'] == 'user':
        prompt += f"{turn['content']}\n\n"
prompt += "The ranking results of the 20 passages (only identifiers) is:"

I used the models to re-rank the top-20 BM25 passages on TREC DL-19, here are the results I got for reference:

Method NDCG@1 NDCG@5 NDCG@10
OpenAI text-davinci-003 70.54 61.90 57.24
OpenAI gpt-3.5-turbo-0301 75.58 66.19 60.89
OpenAI gpt-4-0314 79.46 71.65 65.68
Cohere rerank-english-v2.0 79.46 71.56 64.78
Antropic claude-v1 74.81 65.47 60.25
Antropic claude-instant-v1 69.38 63.12 57.93
Google chat-bison-001 63.95 59.59 54.51
Google text-bison-001 69.77 64.46 58.67
Google bard-2023.06.07 68.60 61.62 56.27
Google FLAN-T5-XXL (11B) 52.71 51.63 50.26
Tsinghua ChatGLM-6B 54.26 52.77 50.58
LMSYS Vicuna-13B-v1.1 54.26 51.55 49.08

I haven't yet evaluated the latest LLMs like LLaMA-2, Claude-2, etc.

Since the APIs of the different LLMs vary significantly, I've conducted individual evaluations for each model. It would be great if the relevant code could be uploaded to help the use of LLMs other than ChatGPT :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants