Dirk Hovy
Evaluation
My Answer is C: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models
The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging. One common evaluation approach uses multiple-choice questions to limit the response space. The model is then evaluated by …