Hi, nice work! Congratulations on the Outstanding Paper award!
Having thoroughly reviewed the arXiv version, I have a question about the "NovelEval" results. I agree with the idea of continuously updating IR test sets to ensure that the queries, passages to be ranked, and relevance annotations haven't already been seen by the latest LLMs during training. However, without detailed training information for GPT-4, it's hard to confirm that "NovelEval" truly offers a "fair evaluation".
It's widely known that InstructGPT involves RLHF, which relies on a reward model trained with a pairwise ranking loss on human-labeled comparisons. But it's unclear whether the same approach is used in the SFT stage or in the undisclosed training recipes of subsequent GPT models. I'm left wondering whether GPT-4's superiority stems from its overall capacity, or simply from the use of a ranking loss during training. If it's the latter, the seemingly magical ranking ability becomes much more straightforward.
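To be concrete, by "ranking loss" I mean the pairwise comparison objective used to train InstructGPT's reward model, i.e., `-log σ(r(x, y_w) - r(x, y_l))`. Here is a minimal PyTorch sketch of that objective (the function and tensor names are just illustrative, not from any released code):

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_rewards: torch.Tensor,
                          rejected_rewards: torch.Tensor) -> torch.Tensor:
    # InstructGPT-style reward-model objective:
    # -log(sigmoid(r(x, y_w) - r(x, y_l))), averaged over the batch,
    # where y_w is the human-preferred completion and y_l the dispreferred one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: scalar reward-model scores for three preference pairs.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.1])
print(pairwise_ranking_loss(chosen, rejected))  # a single scalar loss
```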
Looking forward to your reply.
Yes, GPT, and even some 'open-weight' LLMs, e.g., Llama-Chat, do not disclose their training data and strategies. This lack of transparency does lead to uncertainty.
NovelEval aims to assess these models on uncontaminated (new) data, regardless of how they were trained. Compared to a static test set, this approach better reflects ranking ability in real-world scenarios.