Evaluation

Evaluation methods allow model developers, users, and stakeholders to compare the capabilities and limitations of different LLMs, and to understand the scope of their utilities and risks in particular domains (Chang et al., 2024). Evaluations can inform all stages of model development, from mixing pre-training data (Held et al., 2025; Mizrahi et al., 2025) to selecting and optimizing reward models for alignment (Frick et al., 2024; Lambert et al., 2025). After models are trained, evaluations arguably become even more important. They act quality filters to decide whether and how companies deploy models (Liang et al., 2022). They inform researchers on the most important and promising directions for model improvement (Srivastava et al., 2022), and help anticipate models’ future capabilities (Hoffmann et al., 2022; Kaplan et al., 2020). And finally, they shape public perceptions (Liao & Sundar, 2022), and inform policy and other regulatory decisions (Eriksson et al., 2025).

Without a human-centered evaluation of LLMs, model development, deployment, and governance may be oriented not towards the long-term and collective good, but rather towards profit incentives and short-term gains on surface-level heuristics (Eriksson et al., 2025). In this chapter, we observe common pitfalls and highlight best practices in human-centered evaluation, spanning three levels as shown in the figure. First, we consider evaluations at the level of model outputs (Model-Level Evaluations), using both quantitative metrics (Quantitative Evaluation) and qualitative evaluations (Qualitative Evaluation). Beyond raw outputs, we also consider how people experience LLMs (Human-Level Evaluations), considering human values (Human Values), as well as concerns over bias (Bias and Fairness Evaluation) and safety (Safety Evaluations). Lastly, we discuss extrinsic evaluations at the societal level (Societal-level Evaluation), measuring the system’s real world impact.

Figure. In this chapter, we discuss common pitfalls and best practices for evaluating HCLLMs, considering three distinct levels of evaluation: the model level (Model-Level Evaluations), the human level (Human-Level Evaluations), and the societal level (Societal-level Evaluation).

Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., Ye, W., Zhang, Y., Chang, Y., Yu, P. S., Yang, Q., & Xie, X. (2024). A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol., 15(3). https://doi.org/10.1145/3641289

Eriksson, M., Purificato, E., Noroozian, A., Vinagre, J., Chaslot, G., Gomez, E., & Fernandez-Llorca, D. (2025). Can we trust ai benchmarks? an interdisciplinary review of current issues in ai evaluation. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 8(1), 850–864.

Frick, E., Li, T., Chen, C., Chiang, W.-L., Angelopoulos, A. N., Jiao, J., Zhu, B., Gonzalez, J. E., & Stoica, I. (2024). How to Evaluate Reward Models for RLHF. The Thirteenth International Conference on Learning Representations.

Held, W., Paranjape, B., Koura, P. S., Lewis, M., Zhang, F., & Mihaylov, T. (2025). Optimizing pretraining data mixtures with llm-estimated utility. arXiv Preprint arXiv:2501.11747.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. de L., Hendricks, L. A., Welbl, J., Clark, A., & others. (2022). Training compute-optimal large language models. ArXiv Preprint, abs/2203.15556. https://arxiv.org/abs/2203.15556

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models. ArXiv Preprint, abs/2001.08361. https://arxiv.org/abs/2001.08361

Lambert, N., Pyatkin, V., Morrison, J., Miranda, L. J. V., Lin, B. Y., Chandu, K., Dziri, N., Kumar, S., Zick, T., Choi, Y., & others. (2025). Rewardbench: Evaluating reward models for language modeling. Findings of the Association for Computational Linguistics: NAACL 2025, 1755–1797.

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., & others. (2022). Holistic evaluation of language models. ArXiv Preprint, abs/2211.09110. https://arxiv.org/abs/2211.09110

Liao, Q. V., & Sundar, S. S. (2022). Designing for responsible trust in AI systems: A communication perspective. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 1257–1268.

Mizrahi, D., Larsen, A. B. L., Allardice, J., Petryk, S., Gorokhov, Y., Li, J., Fang, A., Gardner, J., Gunter, T., & Dehghan, A. (2025). Language models improve when pretraining data matches target tasks. arXiv Preprint arXiv:2507.12466.

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., & others. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. ArXiv Preprint, abs/2206.04615. https://arxiv.org/abs/2206.04615

Evaluation

Model-Level Evaluations

Human-Level Evaluations

Societal-level Evaluation

Graph View

Backlinks

Literature NotesLiterature NotesLiterature NotesLiterature NotesLiterature NotesLiterature Notes

Model-Level Evaluations

Human-Level Evaluations

Societal-level Evaluation

Graph View

Backlinks