Questions regarding the "A New Test Set – NovelEval" #14

Closed
HolmesShuan opened this issue Dec 17, 2023 · 2 comments

Comments

@HolmesShuan

Hi, nice work! Congratulations on the Outstanding Paper award!

Having thoroughly reviewed the arXiv version, I find myself with a question about the results of "NovelEval". I concur with the idea of creating continuously updated IR test sets to ensure that the questions, passages to be ranked, and relevance annotations haven't been preemptively learned by the latest LLMs. However, without the detailed training information for GPT-4, it becomes challenging to confirm if "NovelEval" truly offers a "fair evaluation".

It's widely known that InstructGPT involves RLHF, which relies on a reward model trained with a ranking loss over human-labeled preference comparisons. But it's unclear whether the same approach is used in the SFT stage or in the undisclosed training strategy of subsequent GPT models. I'm left wondering whether GPT-4's superiority stems from its overall capacity or simply from the use of a ranking loss in its training. If it's the latter, the seemingly magical ranking ability becomes much more straightforward.
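For concreteness, the pairwise ranking loss used to train RLHF reward models can be sketched as below. This is a minimal illustration, not GPT-4's actual training code (which is undisclosed, as discussed here); the function name and the scalar-reward simplification are my own:

```python
import math

def reward_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) ranking loss: -log(sigmoid(r_chosen - r_rejected)).
    The loss shrinks as the reward assigned to the human-preferred response
    exceeds the reward assigned to the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# When the reward model already ranks the preferred answer higher, the loss
# is small; when the ordering is inverted, the loss is large.
low = reward_ranking_loss(2.0, -1.0)   # correct ordering
high = reward_ranking_loss(-1.0, 2.0)  # inverted ordering
print(low < high)
```

If such an objective were part of later GPT training stages, direct exposure to ranking supervision could plausibly explain strong reranking behavior, which is the crux of the question above.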

Looking forward to your reply.

@sunnweiwei
Owner

Hi, thank you for your questions.

Yes, GPT, and even some 'open-weight' LLMs, e.g. LLaMA-Chat, do not disclose their training data and strategies. This lack of transparency leads to uncertainty.

NovelEval aims to assess these models, regardless of their training methods, using uncontaminated (new) data. Compared to a static test set, this approach can better reflect ranking ability in real scenarios.

@HolmesShuan
Author

agreed~
