Questions regarding the "A New Test Set – NovelEval" #14

Closed
HolmesShuan opened this issue Dec 17, 2023 · 2 comments

Comments

@HolmesShuan

Hi, nice work! Congratulations on the Outstanding Paper award!

Having thoroughly reviewed the arXiv version, I find myself with a question about the results of "NovelEval". I concur with the idea of creating continuously updated IR test sets to ensure that the questions, passages to be ranked, and relevance annotations haven't been preemptively learned by the latest LLMs. However, without the detailed training information for GPT-4, it becomes challenging to confirm if "NovelEval" truly offers a "fair evaluation".

It's widely known that InstructGPT involves RLHF, which relies on a reward model trained with a ranking loss over human-labeled preference comparisons. But it's unclear whether the same approach is used in the SFT stage or in the undisclosed training strategy of subsequent GPT models. I'm left wondering whether GPT-4's superiority stems from its overall capacity or simply from the use of a ranking loss in its training. If it's the latter, the seemingly magical ranking ability becomes much more straightforward.
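For concreteness, the pairwise ranking loss used to train RLHF reward models can be sketched as below. This is a minimal illustration, not GPT-4's actual training code (which is undisclosed, as discussed here); the function name and the scalar-reward simplification are my own:

```python
import math

def reward_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) ranking loss: -log(sigmoid(r_chosen - r_rejected)).
    The loss shrinks as the reward assigned to the human-preferred response
    exceeds the reward assigned to the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# When the reward model already ranks the preferred answer higher, the loss
# is small; when the ordering is inverted, the loss is large.
low = reward_ranking_loss(2.0, -1.0)   # correct ordering
high = reward_ranking_loss(-1.0, 2.0)  # inverted ordering
print(low < high)
```

If such an objective were part of later GPT training stages, direct exposure to ranking supervision could plausibly explain strong reranking behavior, which is the crux of the question above.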

Looking forward to your reply.

@sunnweiwei
Owner

Hi, thank you for your questions.

Yes, GPT, and even some 'open-weight' LLMs, e.g. LLaMA-Chat, do not disclose their training data and strategies. This lack of transparency leads to uncertainty.

NovelEval aims to assess these models, regardless of their training methods, using uncontaminated (new) data. Compared to a static test set, this approach can better reflect ranking ability in real scenarios.

@HolmesShuan
Author

agreed~
