Evaluation

My Answer is C: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models

The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging. One common evaluation approach uses multiple-choice questions to limit the response space. The model is then evaluated by …