Benchmarks
Benchmarking has long been a key driver in the development of AI systems. Benchmarks act as a compass, encoding the values, priorities, and goals of the AI research community (Birhane et al., 2022ReferenceBirhane, A., Kalluri, P., Card, D., Agnew, W., Dotan, R., & Bao, M. (2022). The values encoded in machine learning research. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 173–184.; Ethayarajh & Jurafsky, 2020ReferenceEthayarajh, K., & Jurafsky, D. (2020). Utility is in the Eye of the User: A Critique of NLP Leaderboards. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4846–4853). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.393). They help determine not only how capable a model is, but also what we consider meaningful progress, allowing us to compare the strengths of different models. Recent research has argued for a shift in what and how we evaluate, especially for human-centered applications. As LLMs increasingly influence real-world decision-making, especially in domains like education, law, healthcare, and customer support, the limitations of traditional benchmarks become even more critical. Benchmarks must evolve to better represent human values such as fairness, robustness, usability, and positive societal impact. Below, we discuss a few general principles for thinking about evaluating human-centered LLMs.
Moving Away from “Exams” and Rethinking What We Evaluate.
Traditional benchmarks often mimic academic exams, assessing LLMs by how well they can replicate human outputs or solve static problems in standardized formats like multiple-choice questions (Hendrycks et al., 2021ReferenceHendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring Massive Multitask Language Understanding. International Conference on Learning Representations.), math problems (Sun et al., 2025ReferenceSun, H., Min, Y., Chen, Z., Zhao, W. X., Fang, L., Liu, Z., Wang, Z., & Wen, J.-R. (2025). Challenging the boundaries of reasoning: An olympiad-level math benchmark for large language models. arXiv Preprint arXiv:2503.21380.), and code generation (Jimenez et al., 2024ReferenceJimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. R. (2024). SWE-bench: Can Language Models Resolve Real-world Github Issues? The Twelfth International Conference on Learning Representations.). While useful, this framing compresses complex, multidimensional model behavior into a single metric. Even interaction-based, human-voting evaluations like ChatbotArena (Chiang et al., 2024ReferenceChiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhu, B., Zhang, H., Jordan, M. I., Gonzalez, J. E., & Stoica, I. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. ICML. https://openreview.net/forum?id=3MW8GKNyzI) are limited by their brittleness or their misalignment with how humans actually use AI (Singh et al., 2025ReferenceSingh, S., Nan, Y., Wang, A., D’Souza, D., Kapoor, S., Üstün, A., Koyejo, S., Deng, Y., Longpre, S., Smith, N. A., & others. (2025). The leaderboard illusion. arXiv Preprint arXiv:2504.20879.). In real-world use, LLMs are collaborators, copilots, or tools embedded in workflows, so it becomes necessary to evaluate them in natural, complex, and multi-step human-AI interaction settings, not just in isolation. 
One promising alternative is centaur evaluations (Haupt & Brynjolfsson, 2025ReferenceHaupt, A., & Brynjolfsson, E. (2025). AI should not be an imitation game: Centaur evaluations. URL Https://Www. Andyhaupt. Com/Assets/Papers/Centaur Evaluations. Pdf.), in which humans and models collaborate and the quantity of interest is the outcome of the combined system. These setups come closer to how AI is actually used in practice, whether for writing, analysis, customer support, diagnosis, or decision-making.
Ecological Validity.
A central challenge in evaluating LLMs for real-world use is ecological validity, the extent to which a benchmark setting reflects the complexity of how systems are actually used. Controlled evaluations may offer cleaner signals, but they often fail to generalize to interactive, user-facing deployments. Recent work (Li et al., 2025ReferenceLi, M., Barr Held, W., Ryan, M. J., Pipatanakul, K., Manakul, P., Zhu, H., & Yang, D. (2025). Mind the Gap! Static and Interactive Evaluations of Large Audio Models. arXiv E-Prints, arXiv-2502.) has shown that no single benchmark strongly correlates with interactive performance for audio models across 20 existing datasets. A model that excels at standard static tasks might still struggle in dynamic or collaborative environments. This mismatch suggests a need for richer, context-aware evaluations. One promising direction is to build evaluations bottom-up from in-the-wild data. For example, Röttger et al. (2025)ReferenceRöttger, P., Hinck, M., Hofmann, V., Hackenburg, K., Pyatkin, V., Brahman, F., & Hovy, D. (2025). IssueBench: millions of realistic prompts for measuring issue bias in LLM writing assistance. arXiv Preprint arXiv:2502.08395. evaluate perspective and framing biases in LLM responses to natural user queries. Benchmarks should also be robust and reliable, correlating good performance with success in real tasks. This requires vetted examples with accurate annotations and sufficient statistical power (Bowman & Dahl, 2021ReferenceBowman, S. R., & Dahl, G. (2021). What Will it Take to Fix Benchmarking in Natural Language Understanding? In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, & Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 4843–4855). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.385). 
Finally, effective benchmarks should reveal potential biases, artifacts, and any dual uses, as well as ways to mitigate such unintended consequences (Weidinger et al., 2022ReferenceWeidinger, L., Uesato, J., Rauh, M., Griffin, C., Huang, P.-S., Mellor, J., Glaese, A., Cheng, M., Balle, B., Kasirzadeh, A., & others. (2022). Taxonomy of risks posed by language models. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 214–229.).
Data Contamination and Dynamic Alternatives.
With LLMs trained on massive web-scale corpora, the risk of benchmark contamination has become a serious issue. Many popular benchmarks are at least partially contained in training data, making their validity as evaluation tools less convincing. The line between training and testing becomes blurry, especially for static tasks. This is one benefit of dynamic, evolving benchmarks. Examples like DynaBench (Kiela et al., 2021ReferenceKiela, D., Bartolo, M., Nie, Y., Kaushik, D., Geiger, A., Wu, Z., Vidgen, B., Prasad, G., Singh, A., Ringshia, P., Ma, Z., Thrush, T., Riedel, S., Waseem, Z., Stenetorp, P., Jia, R., Bansal, M., Potts, C., & Williams, A. (2021). Dynabench: Rethinking Benchmarking in NLP. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, & Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 4110–4124). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.324), Chatbot Arena (Chiang et al., 2024ReferenceChiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhu, B., Zhang, H., Jordan, M. I., Gonzalez, J. E., & Stoica, I. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. ICML. https://openreview.net/forum?id=3MW8GKNyzI), WebArena (Zhou et al., 2023ReferenceZhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., & others. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. The Twelfth International Conference on Learning Representations.), and WildVision-Arena (Lu et al., 2024ReferenceLu, Y., Jiang, D., Chen, W., Wang, W. Y., Choi, Y., & Lin, B. Y. (2024). Wildvision: Evaluating vision-language models in the wild with human preferences. Advances in Neural Information Processing Systems, 37, 48224–48255.) 
introduce a degree of human involvement that better mirrors real-world interaction. Such dynamic setups are promising both for evaluating generalization and interaction and for mitigating benchmark saturation and contamination.
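To make the contamination concern concrete, the following is a minimal sketch of one common style of check: flagging benchmark items that share a long n-gram with a training corpus. The function names, the whitespace tokenization, the choice of n = 5, and the toy corpus are all assumptions made for illustration; real audits typically use longer n-grams, normalized text, and far larger corpora.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(
    benchmark_items: Iterable[str], corpus_docs: Iterable[str], n: int = 5
) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    items = list(benchmark_items)
    if not items:
        return 0.0
    flagged = sum(1 for item in items if ngrams(item, n) & corpus_ngrams)
    return flagged / len(items)


# Hypothetical toy data: one benchmark question also appears in a scraped forum post.
corpus = ["forum post: what is the capital of france and why is it paris"]
benchmark = [
    "What is the capital of France?",
    "Name the largest planet in the solar system.",
]
rate = contamination_rate(benchmark, corpus, n=5)
print(f"flagged fraction: {rate:.2f}")
```

A check like this only detects verbatim overlap; paraphrased or translated leakage would pass unnoticed, which is part of why dynamic benchmarks are attractive.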
General-Purpose vs. Domain-Specific Evaluations.
Beyond general-purpose benchmarks, domain-specific benchmarks evaluate LLMs within a particular professional context. For example, DR Bench (Gao et al., 2023ReferenceGao, Y., Dligach, D., Miller, T., Caskey, J., Sharma, B., Churpek, M. M., & Afshar, M. (2023). Dr. bench: Diagnostic reasoning benchmark for clinical natural language processing. Journal of Biomedical Informatics, 138, 104286.) assesses LLMs’ diagnostic reasoning abilities, PubMedQA (Jin et al., 2019ReferenceJin, Q., Dhingra, B., Liu, Z., Cohen, W., & Lu, X. (2019). PubMedQA: A Dataset for Biomedical Research Question Answering. In K. Inui, J. Jiang, V. Ng, & X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 2567–2577). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1259) targets biomedical research question-answering, and LegalBench (Guha et al., 2023ReferenceGuha, N., Nyarko, J., Ho, D. E., Ré, C., Chilton, A., K, A., Chohlas-Wood, A., Peters, A., Waldon, B., Rockmore, D. N., Zambrano, D., Talisman, D., Hoque, E., Surani, F., Fagan, F., Sarfaty, G., Dickinson, G. M., Porat, H., Hegland, J., … Li, Z. (2023). LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/89e44582fd28ddfea1ea4dcb0ebbf4b0-Abstract-Datasets_and_Benchmarks.html) is designed for legal reasoning, including statutory interpretation and contract analysis. In education, benchmarks are emerging that evaluate LLMs’ effectiveness in providing innovative and meaningful feedback to teachers (Wang & Demszky, 2023ReferenceWang, R., & Demszky, D. (2023). Is ChatGPT a Good Teacher Coach? Measuring Zero-Shot Performance For Scoring and Providing Actionable Insights on Classroom Instruction. In E. Kochmar, J. Burstein, A. Horbach, R. Laarmann-Quante, N. Madnani, A. Tack, V. Yaneva, Z. Yuan, & T. Zesch (Eds.), Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) (pp. 626–667). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.bea-1.53) and their ability to emulate expert decision-making when providing tailored math remediation (Wang et al., 2024ReferenceWang, R., Zhang, Q., Robinson, C., Loeb, S., & Demszky, D. (2024). Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes. In K. Duh, H. Gomez, & S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 2174–2199). Association for Computational Linguistics. https://aclanthology.org/2024.naacl-long.120), helping bridge the gap between technological capability and educational needs. More recently, GDPval measures model performance on economically valuable, real-world tasks across 44 occupations (OpenAI, 2025ReferenceOpenAI. (2025). Measuring the Performance of Our Models on Real-World Tasks. https://openai.com/index/gdpval/). Because these specialized benchmarks ground evaluation in a specific context, they offer a more contextualized, real-world assessment of model performance than math and coding tasks alone. Collaborative efforts across domains are crucial to developing benchmarks that reflect the full complexity of human-LLM interactions and the contexts in which LLM systems are deployed.
Overall, a human-centered framework often transcends traditional metrics and benchmarks that continue to prioritize efficiency and profitability above all else. While these measures are useful in providing objective algorithmic reviews on quantitative criteria, they fail to capture, and sometimes do not even attempt to capture, the human factors and societal patterns inherent in these systems.
Quantitative Evaluation
Automatic Metrics. Notably, among automatic metrics, foundational methods have played a critical role in shaping intrinsic evaluations. These metrics provide a systematic way to evaluate model performance through standardized benchmarks, making the evaluation process more efficient and scalable (Askell et al., 2021ReferenceAskell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Kernion, J., Ndousse, K., Olsson, C., Amodei, D., Brown, T., Clark, J., … Kaplan, J. (2021). A General Language Assistant as a Laboratory for Alignment. In ArXiv preprint: Vol. abs/2112.00861. https://arxiv.org/abs/2112.00861; Hu & Zhou, 2024ReferenceHu, T., & Zhou, X.-H. (2024). Unveiling LLM Evaluation Focused on Metrics: Challenges and Solutions. https://arxiv.org/abs/2404.09135; Sai et al., 2022ReferenceSai, A. B., Mohankumar, A. K., & Khapra, M. M. (2022). A Survey of Evaluation Metrics Used for NLG Systems. ACM Comput. Surv., 55(2). https://doi.org/10.1145/3485766).
Metrics like BLEU (Papineni et al., 2002ReferencePapineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: a Method for Automatic Evaluation of Machine Translation. In P. Isabelle, E. Charniak, & D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135) and ROUGE (Lin, 2004ReferenceLin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out, 74–81. https://aclanthology.org/W04-1013) are valued for their simplicity and reproducibility. However, their shortcomings include reliance on strict token matching, which often penalizes valid paraphrases and fails to capture deeper semantic equivalence (Wieting et al., 2019ReferenceWieting, J., Berg-Kirkpatrick, T., Gimpel, K., & Neubig, G. (2019). Beyond BLEU: Training Neural Machine Translation with Semantic Similarity. In A. Korhonen, D. Traum, & L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 4344–4355). Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1427). Even embedding-based metrics like BERTScore (Hanna & Bojar, 2021ReferenceHanna, M., & Bojar, O. (2021). A Fine-Grained Analysis of BERTScore. In L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussa, C. Federmann, M. Fishel, A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz, P. Guzman, B. Haddow, M. Huck, A. J. Yepes, P. Koehn, T. Kocmi, A. Martins, M. Morishita, & C. Monz (Eds.), Proceedings of the Sixth Conference on Machine Translation (pp. 507–517). Association for Computational Linguistics. https://aclanthology.org/2021.wmt-1.59/) can be fooled by lexical similarity, ranking a more similar incorrect translation higher than a dissimilar but correct one.
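The token-matching limitation can be illustrated with a minimal sketch of clipped n-gram precision, the core ingredient of BLEU. This is not full BLEU (it omits the brevity penalty and the geometric mean over n-gram orders), and the function names and example sentences are hypothetical. It shows how a negated, incorrect sentence can outscore a valid paraphrase purely through surface overlap.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Share of candidate n-grams found in the reference, with counts clipped
    to the number of times each n-gram occurs in the reference."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "the cat sat on the mat"
negated = "the cat did not sit on the mat"            # wrong meaning, high overlap
paraphrase = "a feline was resting upon the carpet"   # right meaning, low overlap

negated_score = clipped_precision(negated, reference)
paraphrase_score = clipped_precision(paraphrase, reference)
print(f"negated: {negated_score:.3f}, paraphrase: {paraphrase_score:.3f}")
```

Here 5 of the 8 unigrams in the negated sentence match the reference, while only 1 of the 7 unigrams in the paraphrase does, so the metric ranks the incorrect sentence higher.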
More generally, quantitative metrics are limited in addressing other human needs, such as interpretability, latency, cognitive load, and user satisfaction. Optimizing solely for a metric like perplexity can lead to monotonous responses from a language model (Celikyilmaz et al., 2021ReferenceCelikyilmaz, A., Clark, E., & Gao, J. (2021). Evaluation of Text Generation: A Survey. https://arxiv.org/abs/2006.14799), which would be less appealing to a user. In high-stakes domains such as healthcare, existing metrics have been found to fail to capture trust, personalization, and empathy (Abbasian et al., 2024ReferenceAbbasian, M., Khatibi, E., Azimi, I., Oniani, D., Shakeri Hossein Abad, Z., Thieme, A., Sriram, R., Yang, Z., Wang, Y., Lin, B., & others. (2024). Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. NPJ Digital Medicine, 7(1), 82.). Finally, although these metrics are automatic, they often do not scale to tasks such as open-ended question answering and complex planning (Gehrmann et al., 2023ReferenceGehrmann, S., Clark, E., & Sellam, T. (2023). Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text. Journal of Artificial Intelligence Research, 77, 103–166.). These limitations have led to the development of complementary and alternative evaluation methods.
Reference-based Metrics. Reference-based approaches measure the similarity between the system output and predefined reference samples; examples include cosine similarity (Agarwal et al., 2024ReferenceAgarwal, D., Naaman, M., & Vashistha, A. (2024). AI Suggestions Homogenize Writing Toward Western Styles and Diminish Cultural Nuances. In ArXiv preprint: Vol. abs/2409.11360. https://arxiv.org/abs/2409.11360), the E2E benchmark (Banerjee et al., 2023ReferenceBanerjee, D., Singh, P., Avadhanam, A., & Srivastava, S. (2023). Benchmarking LLM powered Chatbots: Methods and Metrics. In ArXiv preprint: Vol. abs/2308.04624. https://arxiv.org/abs/2308.04624), HUSE (Hashimoto et al., 2019ReferenceHashimoto, T. B., Zhang, H., & Liang, P. (2019). Unifying Human and Statistical Evaluation for Natural Language Generation. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 1689–1701). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1169), and RewardBench (Lambert et al., 2025ReferenceLambert, N., Pyatkin, V., Morrison, J., Miranda, L. J. V., Lin, B. Y., Chandu, K., Dziri, N., Kumar, S., Zick, T., Choi, Y., & others. (2025). Rewardbench: Evaluating reward models for language modeling. Findings of the Association for Computational Linguistics: NAACL 2025, 1755–1797.). These methods retain the standardization and objectivity of automatic metrics, but they are limited by the quality of the references used. For instance, they may be inconsistent, disprove themselves when evaluated against new references, or reward closeness to a single gold standard even when the overall response quality is worse (Nguyen et al., 2024ReferenceNguyen, B., Yu, M., Huang, Y., & Jiang, M. (2024). Reference-based Metrics Disprove Themselves in Question Generation. https://arxiv.org/abs/2403.12242).
For creative tasks, such a gold standard may not even exist.
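The dependence on the chosen reference can be seen in a small sketch of cosine similarity over bag-of-words count vectors. The vectorization (lowercased whitespace tokens), the function name, and the example sentences are assumptions for illustration; practical systems usually compare learned embeddings instead.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


output = "paris is the capital of france"
# Two hypothetical references that are both acceptable answers.
reference_a = "the capital of france is paris"
reference_b = "france has its seat of government in paris"

score_a = cosine_similarity(output, reference_a)
score_b = cosine_similarity(output, reference_b)
print(f"against reference A: {score_a:.3f}, against reference B: {score_b:.3f}")
```

The same output scores near 1.0 against one reference and well below 0.5 against the other, so the reported quality depends heavily on which gold standard happened to be collected.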
Machine-learned Metrics. Machine-learned metrics such as reward models (Ryan et al., 2024ReferenceRyan, M. J., Held, W., & Yang, D. (2024). Unintended Impacts of LLM Alignment on Global Representation. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 16121–16140). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.853) and classifier-based scoring (Shaikh et al., 2024ReferenceShaikh, O., Chai, V. E., Gelfand, M., Yang, D., & Bernstein, M. S. (2024). Rehearsal: Simulating Conflict to Teach Conflict Resolution. In F. “Floyd” Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, & I. Shklovski (Eds.), Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, Honolulu, HI, USA, May 11-16, 2024 (p. 920:1-920:20). ACM. https://doi.org/10.1145/3613904.3642159) show some promise in capturing nuances of human judgment. However, it can be challenging to build pipelines that ground LLMs, whether in specific sources for factual correctness (Tang et al., 2024ReferenceTang, L., Laban, P., & Durrett, G. (2024). MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents. In Y. Al-Onaizan, M. Bansal, & Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 8818–8847). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.emnlp-main.499) or in social science theories that reflect human behavior and preferences (Shaikh et al., 2024ReferenceShaikh, O., Chai, V. E., Gelfand, M., Yang, D., & Bernstein, M. S. (2024). Rehearsal: Simulating Conflict to Teach Conflict Resolution. In F. “Floyd” Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, & I. Shklovski (Eds.), Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, Honolulu, HI, USA, May 11-16, 2024 (p. 920:1-920:20). ACM. https://doi.org/10.1145/3613904.3642159). Additionally, these methods face limitations in generalizing to out-of-distribution settings, particularly in addressing discrepancies in preferences across different groups of people worldwide (Ryan et al., 2024ReferenceRyan, M. J., Held, W., & Yang, D. (2024). Unintended Impacts of LLM Alignment on Global Representation. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 16121–16140). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.853).
Qualitative Evaluation
In contrast to quantitative evaluations (see Quantitative Evaluation), qualitative evaluations call for a more nuanced approach because they work directly with humans (or LLMs). For the same reason, they are perhaps more human-centered than automatic or machine-learned metrics, but they require more careful design to be fair and effective. We first discuss two paradigms of qualitative evaluation, LLM-as-a-Judge and Human Evaluation. We then end the section with a discussion of Extrinsic Evaluation.
LLM-as-a-Judge. The rise in popularity of LLMs has led to the “LLM-as-a-Judge” paradigm, which caters towards more human-centered systems. Given the cost and subjectivity of human evaluation, LLM evaluation proves to be a feasible alternative, and the results are generally consistent with results from human experts on some tasks (C.-H. Chiang & Lee, 2023ReferenceChiang, C.-H., & Lee, H. (2023). Can Large Language Models Be an Alternative to Human Evaluations? In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 15607–15631). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.870). Within the LLM judge paradigm, there are various use cases, such as LLM-derived metrics (embedding-based, probabilities, etc.) (Jia et al., 2023ReferenceJia, Q., Ren, S., Liu, Y., & Zhu, K. (2023). Zero-shot Faithfulness Evaluation for Text Summarization with Foundation Language Model. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 11017–11031). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.679; Xie et al., 2023ReferenceXie, Z., Li, M., Cohn, T., & Lau, J. (2023). DeltaScore: Fine-Grained Story Evaluation with Perturbations. In H. Bouamor, J. Pino, & K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 5317–5331). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-emnlp.353), prompting, fine-tuning LLMs with human evaluations (Ke et al., 2024ReferenceKe, P., Wen, B., Feng, A., Liu, X., Lei, X., Cheng, J., Wang, S., Zeng, A., Dong, Y., Wang, H., Tang, J., & Huang, M. (2024). CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation. In L.-W. Ku, A. Martins, & V. 
Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 13034–13054). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.704; Xu et al., 2023ReferenceXu, W., Wang, D., Pan, L., Song, Z., Freitag, M., Wang, W. Y., & Li, L. (2023). INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback. In ArXiv preprint: Vol. abs/2305.14282. https://arxiv.org/abs/2305.14282), and human-LLM collaborative evaluations (M. Gao et al., 2024ReferenceGao, M., Hu, X., Ruan, J., Pu, X., & Wan, X. (2024). LLM-based NLG Evaluation: Current Status and Challenges. In ArXiv preprint: Vol. abs/2402.01383. https://arxiv.org/abs/2402.01383). More recent methods employ multiple LLMs to engage in multi-agent debates for evaluations and have shown better alignment with human assessment (Chan et al., 2023ReferenceChan, C.-M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., Fu, J., & Liu, Z. (2023). ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. In ArXiv preprint: Vol. abs/2308.07201. https://arxiv.org/abs/2308.07201). However, LLM-based evaluators exhibit systematic limitations, including self-preference bias (Panickssery et al., 2024ReferencePanickssery, A., Bowman, S. R., & Feng, S. (2024). LLM Evaluators Recognize and Favor Their Own Generations. In ArXiv preprint: Vol. abs/2404.13076. https://arxiv.org/abs/2404.13076), where models favor their own outputs, and inconsistent application of evaluation criteria (X. Hu et al., 2024ReferenceHu, X., Gao, M., Hu, S., Zhang, Y., Chen, Y., Xu, T., & Wan, X. (2024). Are LLM-based Evaluators Confusing NLG Quality Criteria? In ArXiv preprint: Vol. abs/2402.12055. https://arxiv.org/abs/2402.12055), both of which reduce the reliability of their judgments. One set of limitations stems from hallucinations and lack of consistency and reproducibility that impacts accuracy of responses. 
Furthermore, LLMs can exhibit biases similar to human cognitive biases, e.g., gender and authority bias (Chen et al., 2024ReferenceChen, G. H., Chen, S., Liu, Z., Jiang, F., & Wang, B. (2024). Humans or LLMs as the Judge? A Study on Judgement Bias. In Y. Al-Onaizan, M. Bansal, & Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 8301–8327). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.emnlp-main.474). Researchers study agreement between human and LLM evaluations using metrics such as the Intraclass Correlation Coefficient (ICC) (Bartko, 1966ReferenceBartko, J. J. (1966). The intraclass correlation coefficient as a measure of reliability. Psychological Reports, 19(1), 3–11.) and Cohen’s Kappa (H. Li et al., 2024ReferenceLi, H., Dong, Q., Chen, J., Su, H., Zhou, Y., Ai, Q., Ye, Z., & Liu, Y. (2024). LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. https://arxiv.org/abs/2412.05579; Warrens, 2015ReferenceWarrens, M. J. (2015). Five ways to look at Cohen’s kappa. Journal of Psychology & Psychotherapy, 5.). The biases above are exacerbated when humans over-trust LLM outputs for their supposed objectivity in application settings (Bansal et al., 2021ReferenceBansal, G., Wu, T., Zhou, J., Fok, R., Nushi, B., Kamar, E., Ribeiro, M. T., & Weld, D. (2021). Does the whole exceed its parts? the effect of ai explanations on complementary team performance. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 1–16.). One approach to address this issue is the “LLM as a jury” paradigm proposed by Verga et al. (2024)ReferenceVerga, P., Hofstatter, S., Althammer, S., Su, Y., Piktus, A., Arkhangorodsky, A., Xu, M., White, N., & Lewis, P. (2024). Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. https://arxiv.org/abs/2404.18796, which uses a panel of diverse models to counteract the bias perpetuated by a single judge and thus better align with human evaluation.
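As an illustration of how such agreement is measured, the sketch below computes Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance, and combines several judges by majority vote. The labels, the three judges, and the use of simple majority voting as the pooling rule are assumptions for illustration, not the exact procedure of any cited paper.

```python
from collections import Counter
from typing import List, Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:  # both raters always used the same single label
        return 1.0
    return (observed - expected) / (1 - expected)


def jury_verdict(votes: List[Sequence[str]]) -> List[str]:
    """Per-item majority label across judges (use an odd panel to avoid ties)."""
    return [Counter(item_votes).most_common(1)[0][0] for item_votes in zip(*votes)]


# Hypothetical labels for six model responses.
human = ["good", "bad", "good", "good", "bad", "good"]
judge_1 = ["good", "bad", "good", "bad", "bad", "good"]
judge_2 = ["good", "good", "good", "good", "bad", "good"]
judge_3 = ["bad", "bad", "good", "good", "bad", "good"]

jury = jury_verdict([judge_1, judge_2, judge_3])
single_kappa = cohens_kappa(human, judge_1)
jury_kappa = cohens_kappa(human, jury)
print(f"single judge: {single_kappa:.3f}, jury: {jury_kappa:.3f}")
```

In this toy data each judge disagrees with the human on a different item, so the majority vote cancels the individual errors. Real panels offer no such guarantee when judges share the same bias.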
Human Evaluation. Crowd-sourcing platforms such as Amazon Mechanical Turk (MTurk) and Prolific have enabled large-scale experiments on a limited budget. Researchers have access to a wider range of evaluators than they would have in in-person studies. Nevertheless, human evaluators online may exhibit biases and quality issues (Ipeirotis et al., 2010ReferenceIpeirotis, P. G., Provost, F., & Wang, J. (2010). Quality management on Amazon Mechanical Turk. Proceedings of the ACM SIGKDD Workshop on Human Computation, 64–67. https://doi.org/10.1145/1837885.1837906). In addition, evaluators’ demographics could be skewed depending on the platform (Difallah et al., 2018ReferenceDifallah, D., Filatova, E., & Ipeirotis, P. (2018). Demographics and Dynamics of Mechanical Turk Workers. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 135–143. https://doi.org/10.1145/3159652.3159661). Correspondingly, data quality may differ between platforms. Douglas et al. (2023)ReferenceDouglas, B. D., Ewell, P. J., & Brauer, M. (2023). Data quality in online human-subjects research: Comparisons between MTurk, Prolific, CloudResearch, Qualtrics, and SONA. PLOS ONE, 18(3), 1–17. https://doi.org/10.1371/journal.pone.0279720 show that Prolific and CloudResearch are more likely to produce high-quality data than MTurk, Qualtrics, and SONA. However, these trends may be shifting as AI agents more readily mimic human respondents and bypass AI detection methods (Westwood, 2025ReferenceWestwood, S. J. (2025). The potential existential threat of large language models to online survey research. Proceedings of the National Academy of Sciences, 122(47), e2518075122.).
Such human evaluations must be designed according to best practices. Relevant questions include: how are human ratings collected, and what questions are asked? We must design human evaluations carefully to avoid low-quality annotations, yet standardizing them remains difficult. Prior work found that 25% of HITs (Human Intelligence Tasks) in MTurk NLP studies have technical issues, including unclear or incomplete instructions and poor communication. In some cases, humans may feel pressured to perform annotations they are unsure about. 35% of requesters were also assessed to pay poorly or very badly. Attempts to standardize human evaluations have been made in the form of inter-evaluator agreement, but it is not commonly reported (18% of 135 papers (Amidei et al., 2019ReferenceAmidei, J., Piwek, P., & Willis, A. (2019). Agreement is overrated: A plea for correlation to assess human evaluation reliability. In K. van Deemter, C. Lin, & H. Takamura (Eds.), Proceedings of the 12th International Conference on Natural Language Generation (pp. 344–354). Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-8642)) and is suggested to have limitations pertaining to human language variability (Amidei et al., 2018ReferenceAmidei, J., Piwek, P., & Willis, A. (2018). Rethinking the Agreement in Human Evaluation Tasks. In E. M. Bender, L. Derczynski, & P. Isabelle (Eds.), Proceedings of the 27th International Conference on Computational Linguistics (pp. 3318–3329). Association for Computational Linguistics. https://aclanthology.org/C18-1281/). Thus, the answers to the above questions remain insufficient. Such issues need to be resolved for human evaluations to have representative power.
There exist discrepancies between evaluation by human annotators and evaluation by actual users, and preferences do not always correlate directly with objective model performance (Mozannar et al., 2024ReferenceMozannar, H., Chen, V., Alsobay, M., Das, S., Zhao, S., Wei, D., Nagireddy, M., Sattigeri, P., Talwalkar, A., & Sontag, D. (2024). The RealHumanEval: Evaluating Large Language Models’ Abilities to Support Programmers. In ArXiv preprint: Vol. abs/2404.02806. https://arxiv.org/abs/2404.02806). This underscores the importance of capturing first-person user experience in evaluating human-centered LLMs. Such limitations in current mainstream human evaluation techniques raise the question: how do current human evaluations fit into a human-centered evaluation paradigm? It is vital that human-centered evaluation of language models follow the needs of human stakeholders, i.e., end-users. Any attempt to shortcut this process would result in inadequate task designs that serve only the designer of the tasks. Who the stakeholders of a task are is then an interesting question; for example, for a paper review generation task, the stakeholders would be domain experts (NLP researchers) (Q. Wang et al., 2020ReferenceWang, Q., Zeng, Q., Huang, L., Knight, K., Ji, H., & Rajani, N. (2020). ReviewRobot: Explainable Paper Review Generation based on Knowledge Synthesis. ArXiv, abs/2010.06119. https://api.semanticscholar.org/CorpusID:222310232). For other tasks, careful design around actual users of the system may be necessary to ensure the evaluations remain human-centered.