Model-Level Evaluations

Benchmarks

Benchmarking has long been a key driver in the development of AI systems. Benchmarks act as a compass, encoding the values, priorities, and goals of the AI research community (Birhane et al., 2022; Ethayarajh & Jurafsky, 2020). They help determine not only how capable a model is, but also what we consider meaningful progress, allowing us to compare the strengths of different models. Recent research has argued for a shift in what and how we evaluate, especially for human-centered applications. As LLMs increasingly influence real-world decision-making, especially in domains like education, law, healthcare, and customer support, the limitations of traditional benchmarks become even more critical. Benchmarks must evolve to better represent human values such as fairness, robustness, usability, and positive societal impact. Below, we discuss a few general principles for thinking about evaluating human-centered LLMs.

Moving Away from “Exams” and Rethinking What We Evaluate.

Traditional benchmarks often mimic academic exams, assessing LLMs by how well they can replicate human outputs or solve static problems in standardized formats like multiple-choice questions (Hendrycks et al., 2021), math problems (Sun et al., 2025), and code generation (Jimenez et al., 2024). While useful, this framing compresses complex, multidimensional model behavior into a single metric. Even interaction-based, human-voting evaluations like Chatbot Arena (Chiang et al., 2024) are limited by their brittleness or their misalignment with how humans actually use AI (Singh et al., 2025). In real-world use, LLMs are collaborators, copilots, or tools embedded in workflows, so it becomes necessary to evaluate them in natural, complex, and multi-step human-AI interaction settings, not just in isolation. One promising alternative is centaur evaluations (Haupt & Brynjolfsson, 2025), in which humans and models collaborate and what matters is the outcome of the combined system. These setups get closer to how AI is actually used in practice, whether in writing, analysis, customer support, diagnosis, or decision-making.

Ecological Validity.

A central challenge in evaluating LLMs for real-world use is ecological validity, the extent to which a benchmark setting reflects the complexity of how systems are actually used. Controlled evaluations may offer cleaner signals, but they often fail to generalize to interactive, user-facing deployments. Recent work (Li et al., 2025) has shown that no single benchmark strongly correlates with interactive performance for audio models across 20 existing datasets. A model that excels at standard static tasks might still struggle in dynamic or collaborative environments. This mismatch suggests a need for richer, context-aware evaluations. One promising direction is to build evaluations bottom-up from in-the-wild data. For example, Röttger et al. (2025) evaluate perspective and framing biases in LLM responses to natural user queries. Benchmarks should also be robust and reliable, correlating good performance with success in real tasks. This requires vetted examples with accurate annotations and sufficient statistical power (Bowman & Dahl, 2021). Finally, effective benchmarks should reveal potential biases, artifacts, and any dual uses, as well as ways to mitigate such unintended consequences (Weidinger et al., 2022).

Data Contamination and Dynamic Alternatives.

With LLMs trained on massive web-scale corpora, the risk of benchmark contamination has become a serious issue. Many popular benchmarks are at least partially contained in training data, which undermines their validity as evaluation tools. The line between training and testing becomes blurry, especially for static tasks. Resistance to this problem is one benefit of dynamic, evolving benchmarks. Examples like Dynabench (Kiela et al., 2021), Chatbot Arena (Chiang et al., 2024), WebArena (Zhou et al., 2023), and WildVision-Arena (Lu et al., 2024) introduce a degree of human involvement that better mirrors real-world interaction. Such dynamic setups are promising for evaluating generalization and interaction, and for mitigating saturation and contamination.
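One simple way to probe for contamination is to check how many of a benchmark item's word n-grams also appear verbatim in a training document. The sketch below only illustrates that heuristic and is not a method taken from the works cited above; the function names, the 5-gram window, and the example strings are all assumptions made for illustration.

```python
def word_ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 5) -> float:
    """Fraction of the item's n-grams that occur verbatim in the document."""
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & word_ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical benchmark question and two hypothetical training documents.
item = "what is the capital city of france and when was it founded"
leaked_doc = "quiz answers: what is the capital city of france and when was it founded ? paris"
clean_doc = "france is a country in western europe with a long history"

leaked = overlap_ratio(item, leaked_doc)
clean = overlap_ratio(item, clean_doc)
print(f"leaked document: {leaked:.2f}, clean document: {clean:.2f}")
```

A high ratio only flags verbatim reuse; paraphrased or translated copies of a benchmark item would pass this check unnoticed, which is part of why static benchmarks are hard to protect.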

General-Purpose vs. Domain-Specific Evaluations.

Domain-specific benchmarks complement general-purpose ones by testing models in the settings where they are deployed. For example, DR Bench (Gao et al., 2023) assesses LLMs’ diagnostic reasoning abilities, PubMedQA (Jin et al., 2019) targets biomedical research question-answering, and LegalBench (Guha et al., 2023) is designed for legal reasoning, including statutory interpretation and contract analysis. In education, benchmarks are emerging to evaluate LLMs’ effectiveness in providing innovative and meaningful feedback to teachers (Wang & Demszky, 2023) and in emulating expert decision-making to provide tailored math remediation (Wang et al., 2024), helping bridge the gap between technological capability and educational needs. Recently, GDPval was introduced to measure model performance on economically valuable, real-world tasks across 44 occupations (OpenAI, 2025). These specialized benchmarks ground evaluation in a specific context, offering a more contextualized, real-world assessment of model performance than math and coding tasks alone. Collaborative efforts across domains are crucial to developing benchmarks that reflect the full complexity of human-LLM interactions and the contexts in which LLM systems are deployed.

Overall, a human-centered framework often transcends traditional metrics and benchmarks that continue to prioritize efficiency and profitability above all else. While these measures are useful for providing objective algorithmic reviews on quantitative criteria, they fail to capture, and sometimes do not even attempt to capture, the human factors and societal patterns that are inherently present in these systems.

Quantitative Evaluation

Automatic Metrics. Among automatic metrics, foundational methods have played a critical role in shaping intrinsic evaluations. These metrics provide a systematic way to evaluate model performance through standardized benchmarks, making the evaluation process more efficient and scalable (Askell et al., 2021; Hu & Zhou, 2024; Sai et al., 2022).

Metrics like BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) are valued for their simplicity and reproducibility. However, they rely on strict token matching, which often penalizes valid paraphrases and fails to capture deeper semantic equivalence (Wieting et al., 2019). Even embedding-based metrics like BERTScore (Hanna & Bojar, 2021) can be fooled by lexical similarity, ranking a lexically similar but incorrect translation above a lexically dissimilar but correct one.
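The token-matching weakness is easy to reproduce with a toy score. The sketch below computes clipped unigram precision, which is the building block of BLEU but not full BLEU (no higher-order n-grams, no brevity penalty); the function name and the example sentences are assumptions chosen for illustration.

```python
from collections import Counter


def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """Share of candidate tokens that also occur in the reference.

    Each candidate token is credited at most as often as it appears
    in the reference (the "clipping" used in BLEU's modified precision).
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    matched = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / len(cand_tokens)


reference = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"  # same meaning, different words
wrong = "the cat sat on the hat"             # different meaning, similar words

paraphrase_score = clipped_unigram_precision(paraphrase, reference)
wrong_score = clipped_unigram_precision(wrong, reference)
print(f"paraphrase: {paraphrase_score:.3f}, wrong: {wrong_score:.3f}")
```

Here the valid paraphrase matches only one token out of six, while the sentence with the wrong meaning matches five out of six, so surface overlap ranks them in the opposite order from a human reader.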

More generally, quantitative metrics are limited in addressing other human needs, such as interpretability, latency, cognitive load, and user satisfaction. Optimizing solely for a metric like perplexity can lead to monotonous responses from a language model (Celikyilmaz et al., 2021), which would be less appealing to a user. In high-stakes domains such as healthcare, existing metrics have been found to fail to capture trust, personalization, and empathy (Abbasian et al., 2024). Finally, while these metrics may be automatic, they often do not scale to tasks such as open-ended question answering and complex planning (Gehrmann et al., 2023). These limitations have led to the development of complementary and alternative evaluation methods.

Reference-based Metrics. Reference-based approaches measure the similarity between the system output and predefined reference samples, such as cosine similarity (Agarwal et al., 2024), the E2E benchmark (Banerjee et al., 2023), HUSE (Hashimoto et al., 2019), and RewardBench (Lambert et al., 2025). These methods retain the standardized and objective character of automatic metrics, but they are limited by the quality of the reference used. For instance, they may give inconsistent or even self-contradictory results when scored against new references, or they may reward closeness to a single gold standard even when the overall response quality is worse (Nguyen et al., 2024). For creative tasks, such a gold standard may not even exist.
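To make the dependence on the chosen reference concrete, the following sketch scores one output against two equally acceptable references using cosine similarity over bag-of-words counts. This is a deliberately simple stand-in (practical systems typically compare learned embeddings instead), and the function name and sentences are assumptions for illustration only.

```python
import math
from collections import Counter


def bow_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(count * vec_b[tok] for tok, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


output = "paris is the capital of france"
reference_a = "the capital of france is paris"   # same words, reordered
reference_b = "france's capital city is paris"   # same answer, other wording

score_a = bow_cosine(output, reference_a)
score_b = bow_cosine(output, reference_b)
print(f"vs reference A: {score_a:.3f}, vs reference B: {score_b:.3f}")
```

The output is unchanged between the two comparisons, yet its score drops substantially with the second reference, so the number reflects the choice of reference as much as the quality of the response.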

Machine-learned Metrics. Machine-learned metrics such as reward models (Ryan et al., 2024) and classifier-based scoring (Shaikh et al., 2024) show some promise in capturing nuances of human judgment. However, it can be challenging to build pipelines that ground LLMs, whether in specific sources for factual correctness (Tang et al., 2024) or in social science theories that reflect human behavior and preferences (Shaikh et al., 2024). Additionally, these methods generalize poorly to out-of-distribution settings, particularly when preferences differ across groups of people worldwide (Ryan et al., 2024).

Qualitative Evaluation

In contrast to quantitative evaluations, qualitative evaluations require a more nuanced approach because they work directly with human (or LLM) judges. They are perhaps more human-centered than automatic or machine-learned metrics, but they require more careful design to be fair and effective. We first discuss two paradigms of qualitative evaluation, LLM-as-a-Judge and Human Evaluation. We then end the section with a coverage of Extrinsic Evaluation.

LLM-as-a-Judge. The rise in popularity of LLMs has led to the “LLM-as-a-Judge” paradigm, which caters to more human-centered systems. Given the cost and subjectivity of human evaluation, LLM evaluation is a feasible alternative, and its results are generally consistent with those of human experts on some tasks (C.-H. Chiang & Lee, 2023). Within the LLM judge paradigm, there are various approaches, such as LLM-derived metrics (embedding-based, probabilities, etc.) (Jia et al., 2023; Xie et al., 2023), prompting, fine-tuning LLMs with human evaluations (Ke et al., 2024; Xu et al., 2023), and human-LLM collaborative evaluations (M. Gao et al., 2024). More recent methods employ multiple LLMs that engage in multi-agent debates for evaluation and have shown better alignment with human assessment (Chan et al., 2023). Researchers study agreement between human and LLM evaluations using metrics such as the Intraclass Correlation Coefficient (ICC) (Bartko, 1966) and Cohen’s Kappa (H. Li et al., 2024; Warrens, 2015). However, LLM-based evaluators exhibit systematic limitations, including self-preference bias (Panickssery et al., 2024), where models favor their own outputs, and inconsistent application of evaluation criteria (X. Hu et al., 2024), both of which reduce the reliability of their judgments. Another set of limitations stems from hallucinations and a lack of consistency and reproducibility, which affect the accuracy of responses. Furthermore, LLMs can exhibit biases similar to human cognitive biases, e.g., gender and authority bias (Chen et al., 2024). These issues are exacerbated by humans over-trusting LLM outputs for their supposed objectivity in application settings (Bansal et al., 2021). One approach to address this issue is the “LLM as a jury” paradigm proposed by Verga et al. (2024), which keeps in check the bias perpetuated by a single judge and thus better aligns with human evaluation.
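Cohen’s Kappa, mentioned above as a measure of human-LLM agreement, corrects raw agreement for the agreement two raters would reach by chance. The sketch below implements the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e); the function name and the label lists are hypothetical and do not come from any cited study.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical quality ratings of eight responses.
human = ["good", "good", "bad", "bad", "good", "bad", "good", "bad"]
llm_judge = ["good", "good", "bad", "good", "good", "bad", "good", "good"]

kappa = cohens_kappa(human, llm_judge)
print(f"raw agreement: 0.75, kappa: {kappa:.2f}")
```

In this example the two raters agree on 75% of items, but because the judge labels most responses "good", half of that agreement is expected by chance and kappa is only 0.5. This is why chance-corrected statistics are preferred over raw agreement when a judge has a strong prior toward one label.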

Human Evaluation. Crowd-sourcing platforms such as Amazon Mechanical Turk (MTurk)¹ and Prolific² have enabled large-scale experiments within budget. Researchers have access to a wider range of evaluators than they would have in in-person studies. Nevertheless, human evaluators online may exhibit biases and quality issues (Ipeirotis et al., 2010). In addition, evaluators’ demographics could be skewed depending on the platform (Difallah et al., 2018). Correspondingly, data quality may differ between platforms. Douglas et al. (2023) show that Prolific and CloudResearch are more likely to produce high-quality data than MTurk, Qualtrics, and SONA. However, these trends may be shifting as AI agents more readily mimic human respondents and bypass AI detection methods (Westwood, 2025).

Such human evaluations must be designed according to best practices. Relevant questions include: how are human ratings collected, and what questions are asked? We must design human evaluations carefully to avoid low-quality annotations, yet human evaluations remain difficult to standardize. Huynh et al. (2021) found that 25% of HITs (Human Intelligence Tasks) in MTurk NLP studies have technical issues, with unclear or incomplete instructions and poor communication. In some cases, humans may feel pressured to perform annotations they are unsure about. In addition, 35% of requesters were assessed to pay poorly or very badly. Attempts to standardize human evaluations have been made in the form of inter-evaluator agreement, which is not commonly reported (18% of 135 papers (Amidei et al., 2019)) and is suggested to have limitations pertaining to human language variability (Amidei et al., 2018). Thus, the answers to the above questions remain insufficient. Such issues need to be resolved for human evaluations to have representative power.

There are discrepancies between evaluation by human annotators and evaluation by actual users, and preferences do not always correlate directly with objective model performance (Mozannar et al., 2024). This underscores the importance of capturing first-person user experience in evaluating human-centered LLMs. Such limitations in current mainstream human evaluation techniques raise the question: how do current human evaluations fit into a human-centered evaluation paradigm? It is vital that human-centered evaluation of language models follow the needs of human stakeholders, i.e., end-users. Any attempt to short-cut this process would result in inadequate task designs that serve only the designer of the tasks. Who the stakeholders of a task are is then an interesting question; for example, for a paper review generation task, the stakeholders would be domain experts (NLP researchers) (Q. Wang et al., 2020). For other tasks, careful design around actual users of the system may be necessary to ensure the evaluations remain human-centered.

Abbasian, M., Khatibi, E., Azimi, I., Oniani, D., Shakeri Hossein Abad, Z., Thieme, A., Sriram, R., Yang, Z., Wang, Y., Lin, B., & others. (2024). Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. NPJ Digital Medicine, 7(1), 82.

Agarwal, D., Naaman, M., & Vashistha, A. (2024). AI Suggestions Homogenize Writing Toward Western Styles and Diminish Cultural Nuances. In ArXiv preprint: Vol. abs/2409.11360. https://arxiv.org/abs/2409.11360

Amidei, J., Piwek, P., & Willis, A. (2018). Rethinking the Agreement in Human Evaluation Tasks. In E. M. Bender, L. Derczynski, & P. Isabelle (Eds.), Proceedings of the 27th International Conference on Computational Linguistics (pp. 3318–3329). Association for Computational Linguistics. https://aclanthology.org/C18-1281/

Amidei, J., Piwek, P., & Willis, A. (2019). Agreement is overrated: A plea for correlation to assess human evaluation reliability. In K. van Deemter, C. Lin, & H. Takamura (Eds.), Proceedings of the 12th International Conference on Natural Language Generation (pp. 344–354). Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-8642

Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Kernion, J., Ndousse, K., Olsson, C., Amodei, D., Brown, T., Clark, J., … Kaplan, J. (2021). A General Language Assistant as a Laboratory for Alignment. In ArXiv preprint: Vol. abs/2112.00861. https://arxiv.org/abs/2112.00861

Banerjee, D., Singh, P., Avadhanam, A., & Srivastava, S. (2023). Benchmarking LLM powered Chatbots: Methods and Metrics. In ArXiv preprint: Vol. abs/2308.04624. https://arxiv.org/abs/2308.04624

Bansal, G., Wu, T., Zhou, J., Fok, R., Nushi, B., Kamar, E., Ribeiro, M. T., & Weld, D. (2021). Does the whole exceed its parts? the effect of ai explanations on complementary team performance. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 1–16.

Bartko, J. J. (1966). The intraclass correlation coefficient as a measure of reliability. Psychological Reports, 19(1), 3–11.

Birhane, A., Kalluri, P., Card, D., Agnew, W., Dotan, R., & Bao, M. (2022). The values encoded in machine learning research. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 173–184.

Bowman, S. R., & Dahl, G. (2021). What Will it Take to Fix Benchmarking in Natural Language Understanding? In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, & Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 4843–4855). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.385

Celikyilmaz, A., Clark, E., & Gao, J. (2021). Evaluation of Text Generation: A Survey. https://arxiv.org/abs/2006.14799

Chan, C.-M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., Fu, J., & Liu, Z. (2023). ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. In ArXiv preprint: Vol. abs/2308.07201. https://arxiv.org/abs/2308.07201

Chen, G. H., Chen, S., Liu, Z., Jiang, F., & Wang, B. (2024). Humans or LLMs as the Judge? A Study on Judgement Bias. In Y. Al-Onaizan, M. Bansal, & Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 8301–8327). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.emnlp-main.474

Chiang, C.-H., & Lee, H. (2023). Can Large Language Models Be an Alternative to Human Evaluations? In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 15607–15631). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.870

Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhu, B., Zhang, H., Jordan, M. I., Gonzalez, J. E., & Stoica, I. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. ICML. https://openreview.net/forum?id=3MW8GKNyzI

Difallah, D., Filatova, E., & Ipeirotis, P. (2018). Demographics and Dynamics of Mechanical Turk Workers. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 135–143. https://doi.org/10.1145/3159652.3159661

Douglas, B. D., Ewell, P. J., & Brauer, M. (2023). Data quality in online human-subjects research: Comparisons between MTurk, Prolific, CloudResearch, Qualtrics, and SONA. PLOS ONE, 18(3), 1–17. https://doi.org/10.1371/journal.pone.0279720

Ethayarajh, K., & Jurafsky, D. (2020). Utility is in the Eye of the User: A Critique of NLP Leaderboards. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4846–4853). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.393

Gao, M., Hu, X., Ruan, J., Pu, X., & Wan, X. (2024). LLM-based NLG Evaluation: Current Status and Challenges. In ArXiv preprint: Vol. abs/2402.01383. https://arxiv.org/abs/2402.01383

Gao, Y., Dligach, D., Miller, T., Caskey, J., Sharma, B., Churpek, M. M., & Afshar, M. (2023). Dr. bench: Diagnostic reasoning benchmark for clinical natural language processing. Journal of Biomedical Informatics, 138, 104286.

Gehrmann, S., Clark, E., & Sellam, T. (2023). Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text. Journal of Artificial Intelligence Research, 77, 103–166.

Guha, N., Nyarko, J., Ho, D. E., Ré, C., Chilton, A., K, A., Chohlas-Wood, A., Peters, A., Waldon, B., Rockmore, D. N., Zambrano, D., Talisman, D., Hoque, E., Surani, F., Fagan, F., Sarfaty, G., Dickinson, G. M., Porat, H., Hegland, J., … Li, Z. (2023). LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/89e44582fd28ddfea1ea4dcb0ebbf4b0-Abstract-Datasets_and_Benchmarks.html

Hanna, M., & Bojar, O. (2021). A Fine-Grained Analysis of BERTScore. In L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussa, C. Federmann, M. Fishel, A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz, P. Guzman, B. Haddow, M. Huck, A. J. Yepes, P. Koehn, T. Kocmi, A. Martins, M. Morishita, & C. Monz (Eds.), Proceedings of the Sixth Conference on Machine Translation (pp. 507–517). Association for Computational Linguistics. https://aclanthology.org/2021.wmt-1.59/

Hashimoto, T. B., Zhang, H., & Liang, P. (2019). Unifying Human and Statistical Evaluation for Natural Language Generation. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 1689–1701). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1169

Haupt, A., & Brynjolfsson, E. (2025). Ai should not be an imitation game: Centaur evaluations. URL Https://Www. Andyhaupt. Com/Assets/Papers/Centaur Evaluations. Pdf.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring Massive Multitask Language Understanding. International Conference on Learning Representations.

Hu, T., & Zhou, X.-H. (2024). Unveiling LLM Evaluation Focused on Metrics: Challenges and Solutions. https://arxiv.org/abs/2404.09135

Hu, X., Gao, M., Hu, S., Zhang, Y., Chen, Y., Xu, T., & Wan, X. (2024). Are LLM-based Evaluators Confusing NLG Quality Criteria? In ArXiv preprint: Vol. abs/2402.12055. https://arxiv.org/abs/2402.12055

Huynh, J., Bigham, J., & Eskenazi, M. (2021). A Survey of NLP-Related Crowdsourcing HITs: what works and what does not. https://arxiv.org/abs/2111.05241

Ipeirotis, P. G., Provost, F., & Wang, J. (2010). Quality management on Amazon Mechanical Turk. Proceedings of the ACM SIGKDD Workshop on Human Computation, 64–67. https://doi.org/10.1145/1837885.1837906

Jia, Q., Ren, S., Liu, Y., & Zhu, K. (2023). Zero-shot Faithfulness Evaluation for Text Summarization with Foundation Language Model. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 11017–11031). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.679

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. R. (2024). SWE-bench: Can Language Models Resolve Real-world Github Issues? The Twelfth International Conference on Learning Representations.

Jin, Q., Dhingra, B., Liu, Z., Cohen, W., & Lu, X. (2019). PubMedQA: A Dataset for Biomedical Research Question Answering. In K. Inui, J. Jiang, V. Ng, & X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 2567–2577). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1259

Ke, P., Wen, B., Feng, A., Liu, X., Lei, X., Cheng, J., Wang, S., Zeng, A., Dong, Y., Wang, H., Tang, J., & Huang, M. (2024). CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 13034–13054). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.704

Kiela, D., Bartolo, M., Nie, Y., Kaushik, D., Geiger, A., Wu, Z., Vidgen, B., Prasad, G., Singh, A., Ringshia, P., Ma, Z., Thrush, T., Riedel, S., Waseem, Z., Stenetorp, P., Jia, R., Bansal, M., Potts, C., & Williams, A. (2021). Dynabench: Rethinking Benchmarking in NLP. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, & Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 4110–4124). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.324

Lambert, N., Pyatkin, V., Morrison, J., Miranda, L. J. V., Lin, B. Y., Chandu, K., Dziri, N., Kumar, S., Zick, T., Choi, Y., & others. (2025). Rewardbench: Evaluating reward models for language modeling. Findings of the Association for Computational Linguistics: NAACL 2025, 1755–1797.

Li, H., Dong, Q., Chen, J., Su, H., Zhou, Y., Ai, Q., Ye, Z., & Liu, Y. (2024). LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. https://arxiv.org/abs/2412.05579

Li, M., Barr Held, W., Ryan, M. J., Pipatanakul, K., Manakul, P., Zhu, H., & Yang, D. (2025). Mind the Gap! Static and Interactive Evaluations of Large Audio Models. arXiv E-Prints, arXiv-2502.

Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out, 74–81. https://aclanthology.org/W04-1013

Lu, Y., Jiang, D., Chen, W., Wang, W. Y., Choi, Y., & Lin, B. Y. (2024). Wildvision: Evaluating vision-language models in the wild with human preferences. Advances in Neural Information Processing Systems, 37, 48224–48255.

Mozannar, H., Chen, V., Alsobay, M., Das, S., Zhao, S., Wei, D., Nagireddy, M., Sattigeri, P., Talwalkar, A., & Sontag, D. (2024). The RealHumanEval: Evaluating Large Language Models’ Abilities to Support Programmers. In ArXiv preprint: Vol. abs/2404.02806. https://arxiv.org/abs/2404.02806

Nguyen, B., Yu, M., Huang, Y., & Jiang, M. (2024). Reference-based Metrics Disprove Themselves in Question Generation. https://arxiv.org/abs/2403.12242

OpenAI. (2025). Measuring the Performance of Our Models on Real-World Tasks. https://openai.com/index/gdpval/

Panickssery, A., Bowman, S. R., & Feng, S. (2024). LLM Evaluators Recognize and Favor Their Own Generations. In ArXiv preprint: Vol. abs/2404.13076. https://arxiv.org/abs/2404.13076

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: a Method for Automatic Evaluation of Machine Translation. In P. Isabelle, E. Charniak, & D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135

Röttger, P., Hinck, M., Hofmann, V., Hackenburg, K., Pyatkin, V., Brahman, F., & Hovy, D. (2025). IssueBench: millions of realistic prompts for measuring issue bias in LLM writing assistance. arXiv Preprint arXiv:2502.08395.

Ryan, M. J., Held, W., & Yang, D. (2024). Unintended Impacts of LLM Alignment on Global Representation. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 16121–16140). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.853

Sai, A. B., Mohankumar, A. K., & Khapra, M. M. (2022). A Survey of Evaluation Metrics Used for NLG Systems. ACM Comput. Surv., 55(2). https://doi.org/10.1145/3485766

Shaikh, O., Chai, V. E., Gelfand, M., Yang, D., & Bernstein, M. S. (2024). Rehearsal: Simulating Conflict to Teach Conflict Resolution. In F. “Floyd” Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, & I. Shklovski (Eds.), Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, Honolulu, HI, USA, May 11-16, 2024 (p. 920:1-920:20). ACM. https://doi.org/10.1145/3613904.3642159

Singh, S., Nan, Y., Wang, A., D’Souza, D., Kapoor, S., Üstün, A., Koyejo, S., Deng, Y., Longpre, S., Smith, N. A., & others. (2025). The leaderboard illusion. arXiv Preprint arXiv:2504.20879.

Sun, H., Min, Y., Chen, Z., Zhao, W. X., Fang, L., Liu, Z., Wang, Z., & Wen, J.-R. (2025). Challenging the boundaries of reasoning: An olympiad-level math benchmark for large language models. arXiv Preprint arXiv:2503.21380.

Tang, L., Laban, P., & Durrett, G. (2024). MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents. In Y. Al-Onaizan, M. Bansal, & Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 8818–8847). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.emnlp-main.499

Verga, P., Hofstatter, S., Althammer, S., Su, Y., Piktus, A., Arkhangorodsky, A., Xu, M., White, N., & Lewis, P. (2024). Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. https://arxiv.org/abs/2404.18796

Wang, Q., Zeng, Q., Huang, L., Knight, K., Ji, H., & Rajani, N. (2020). ReviewRobot: Explainable Paper Review Generation based on Knowledge Synthesis. ArXiv, abs/2010.06119. https://api.semanticscholar.org/CorpusID:222310232

Wang, R., & Demszky, D. (2023). Is ChatGPT a Good Teacher Coach? Measuring Zero-Shot Performance For Scoring and Providing Actionable Insights on Classroom Instruction. In E. Kochmar, J. Burstein, A. Horbach, R. Laarmann-Quante, N. Madnani, A. Tack, V. Yaneva, Z. Yuan, & T. Zesch (Eds.), Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) (pp. 626–667). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.bea-1.53

Wang, R., Zhang, Q., Robinson, C., Loeb, S., & Demszky, D. (2024). Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes. In K. Duh, H. Gomez, & S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 2174–2199). Association for Computational Linguistics. https://aclanthology.org/2024.naacl-long.120

Warrens, M. J. (2015). Five ways to look at Cohen’s kappa. Journal of Psychology & Psychotherapy, 5.

Weidinger, L., Uesato, J., Rauh, M., Griffin, C., Huang, P.-S., Mellor, J., Glaese, A., Cheng, M., Balle, B., Kasirzadeh, A., & others. (2022). Taxonomy of risks posed by language models. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 214–229.

Westwood, S. J. (2025). The potential existential threat of large language models to online survey research. Proceedings of the National Academy of Sciences, 122(47), e2518075122.

Wieting, J., Berg-Kirkpatrick, T., Gimpel, K., & Neubig, G. (2019). Beyond BLEU: Training Neural Machine Translation with Semantic Similarity. In A. Korhonen, D. Traum, & L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 4344–4355). Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1427

Xie, Z., Li, M., Cohn, T., & Lau, J. (2023). DeltaScore: Fine-Grained Story Evaluation with Perturbations. In H. Bouamor, J. Pino, & K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 5317–5331). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-emnlp.353

Xu, W., Wang, D., Pan, L., Song, Z., Freitag, M., Wang, W. Y., & Li, L. (2023). INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback. In ArXiv preprint: Vol. abs/2305.14282. https://arxiv.org/abs/2305.14282

Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., & others. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. The Twelfth International Conference on Learning Representations.
