With the increasingly pervasive influence of large language models (LLMs) across sensitive domains such as mental health (Abdurahman et al., 2024; Lawrence et al., 2024; Stade et al., 2024) and education (Wang & Zhang, 2024), and given the complex challenges these models present, evaluating their impact on users and society is crucial. Traditional evaluations of LLMs use datasets and benchmarks to assess potentially hazardous behaviors, but they often fail to bridge the “sociotechnical gap” between controlled assessments and real-world performance (Ibrahim et al., 2024; Weidinger et al., 2023). By focusing on models in isolation, these methods overlook crucial human factors, resulting in an inadequate understanding of human-model interactions and their consequences. Furthermore, these evaluations fail to account for the perpetuation and amplification of societal inequities and biases present in the data on which these models were trained. Thus, it becomes essential to employ a comprehensive extrinsic evaluation framework (Jones & Galliers, 1995) that considers various categories of social impact, such as bias and stereotypes, cultural values, performance disparities, privacy protection, financial implications, environmental costs, and content moderation labor (Solaiman et al., 2023).

As mentioned in HCI for HCLLMs, randomized controlled trials (RCTs) and large-scale behavioral assessments play a major role in understanding and evaluating LLMs’ behavioral impact on users and society. For instance, to evaluate the effect and quality of a newly developed AI co-tutor that provides LLM-generated feedback to real-life tutors on student performance, R. E. Wang et al. (2025) conducted an experiment in which tutors either had access to the AI copilot during their tutoring sessions or tutored without its assistance. This setup enabled R. E. Wang et al. (2025) to systematically evaluate the performance of their model in an ecologically valid setting, drawing working feedback from both the users and the model’s behavior (Brynjolfsson et al., 2024).
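The two-arm design described above can be sketched in a few lines. This is a minimal illustration, not the analysis used by R. E. Wang et al. (2025): the function names, the equal-split assignment, and the choice of a permutation test are all assumptions made here for the example, and any outcome values passed in would be hypothetical.

```python
import random
from statistics import mean


def assign_conditions(unit_ids, seed=0):
    """Randomly split units (e.g., tutors) into two arms of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": sorted(shuffled[:half]), "control": sorted(shuffled[half:])}


def difference_in_means(treated, control):
    """Estimated effect: mean outcome with the tool minus mean outcome without it."""
    return mean(treated) - mean(control)


def permutation_p_value(treated, control, n_permutations=5000, seed=0):
    """Two-sided permutation test for the difference in means.

    Repeatedly reshuffles the pooled outcomes into two groups of the original
    sizes and counts how often the reshuffled difference is at least as large
    (in absolute value) as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(difference_in_means(treated, control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_treated]) - mean(pooled[n_treated:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)
```

In use, one would call `assign_conditions` once before the study begins, collect an outcome per unit afterwards, and pass the two lists of outcomes to `difference_in_means` and `permutation_p_value`. A real trial would typically also account for clustering (for example, students nested within tutors), which this sketch ignores.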

The application of robust evaluation techniques spans various domains. In healthcare, where existing metrics often fail to capture critical factors like user comprehension and trust, researchers are developing new metrics to assess LLMs’ impact on end-user decision-making and expectations (Abbasian et al., 2024). These metrics measure both immediate behavioral changes and long-term adoption patterns. For example, Yang et al. (2023) evaluated LLMs for mental health analysis, showing that while ChatGPT demonstrates strong in-context learning, specialized methods often outperform it. They also found that effective prompt engineering with emotional cues can improve results. Similarly, Bak & Chin (2024) observed that 20% to 30% of the information LLMs provided was irrelevant when identifying users’ motivation states for health behavior change, highlighting their limitations.

When evaluating LLMs, it is essential to consider impact at scale through longitudinal and large-scale studies. Such studies are critical for understanding not just the immediate outcomes of LLM use but also their sustained effects, as LLMs already undergo significant change in response to user engagement (Liu et al., 2024). As an example, Eloundou et al. (2024) examine the impact of LLMs on the U.S. labor market, particularly the enhanced effects of LLM-powered software, finding that higher-income jobs may face greater exposure. These studies emphasize the importance of robust behavioral experimental design and of scale, in units of both time and users, when evaluating LLMs, as results obtained from small-scale, lab-controlled studies may not always generalize to larger, more diverse, real-life user populations. Furthermore, such evaluations allow us to explore cumulative effects, such as changes in users’ attitudes, considerations, and even decision-making.

Extrinsic Evaluation. Extrinsic evaluations cover behavioral impacts, self-efficacy reports, standardized evaluation, short-term outcomes, and long-term outcomes (D. Yang et al., 2024).

Behavioral impacts: Behavioral impacts track the changes in qualitatively coded participant behaviors before and after exposure to a system. Evaluations use task-based assessments, tracking engagement and task completion. For example, RealHumanEval measures the number of tasks completed, time to task success, acceptance rate, and the number of code copies from chat, to comprehensively analyze the quality of AI coding tools and their impacts on human users (Mozannar et al., 2024). Standardized evaluations are more objective and draw on pre-defined assessments, for instance, to measure the effects of AI-generated suggestions on writing style (Agarwal et al., 2024).
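As a rough illustration of how such behavioral metrics might be derived from interaction logs, the sketch below aggregates tasks completed, mean time to task success, and suggestion acceptance rate. The `TaskLog` schema and the `summarize` function are hypothetical names invented for this example; they are not the RealHumanEval data format or API.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class TaskLog:
    """One participant's attempt at one task (hypothetical schema)."""
    participant: str
    completed: bool
    seconds_to_success: Optional[float]  # None when the task was not completed
    suggestions_shown: int
    suggestions_accepted: int


def summarize(logs):
    """Aggregate behavioral metrics over a list of TaskLog records."""
    success_times = [
        log.seconds_to_success
        for log in logs
        if log.completed and log.seconds_to_success is not None
    ]
    shown = sum(log.suggestions_shown for log in logs)
    accepted = sum(log.suggestions_accepted for log in logs)
    return {
        "tasks_completed": sum(1 for log in logs if log.completed),
        # Mean is taken over completed tasks only; None if nothing was completed.
        "mean_seconds_to_success": mean(success_times) if success_times else None,
        # Acceptance rate pools all suggestions; None if none were shown.
        "acceptance_rate": accepted / shown if shown else None,
    }
```

Computing the same summary separately for each study condition (for example, with and without the assistant) gives the performance-based comparison that these evaluations rely on.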

Self-efficacy evaluation: Self-efficacy evaluation includes questionnaires of participants’ perceptions of a system’s usefulness and their own perceived levels of ownership when interacting with a system (Long et al., 2024). Together, these extrinsic methods distinguish between performance-based metrics, like tracking behavior or test performance, and perception-based metrics, such as user surveys.

Short-term and Long-term evaluation: Evaluations can be split into short-term and long-term (D. Yang et al., 2024), with short-term evaluations being constrained interactive sessions and long-term evaluations being much longer studies (e.g., one week). In short-term evaluations, new AI tools may receive subjectively higher scores due to novelty bias (Sadeghi et al., 2022; Shin et al., 2018). This is less of an issue with long-term evaluations. For example, consider a longitudinal study of an AI chain tool (Long et al., 2024) for creating a Tweetorial¹ chain on science communication. The study finds that after a “familiarization phase,” the perceived utility of the tool rises even above its level during the initial novelty period, suggesting that end users become more capable of using the tool by customizing the prompt and by other means. Thus, the AI tool was found to be more useful in the long term.
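One simple way to check for a novelty effect in longitudinal ratings is to compare mean perceived utility before and after a familiarization cutoff. The sketch below assumes per-session questionnaire ratings and a cutoff chosen by the analyst; the function name, the data layout, and the cutoff are illustrative assumptions, not the procedure reported by Long et al. (2024).

```python
from statistics import mean


def utility_by_phase(ratings_by_session, familiarization_sessions):
    """Compare mean perceived-utility ratings in an early and a later phase.

    ratings_by_session: list of lists; entry i holds every participant's
        rating for session i (hypothetical questionnaire scores).
    familiarization_sessions: number of initial sessions counted as the
        early phase (an analyst-chosen cutoff).
    """
    if not 0 < familiarization_sessions < len(ratings_by_session):
        raise ValueError("cutoff must leave at least one session in each phase")
    early = [r for s in ratings_by_session[:familiarization_sessions] for r in s]
    late = [r for s in ratings_by_session[familiarization_sessions:] for r in s]
    early_mean, late_mean = mean(early), mean(late)
    return {
        "early_mean": early_mean,
        "late_mean": late_mean,
        # Positive change: utility kept rising after the novelty period.
        # Negative change: ratings decayed, consistent with a novelty effect.
        "change": late_mean - early_mean,
    }
```

A negative `change` would be consistent with inflated early scores, while a positive one matches the pattern described above, where utility grows as users learn to customize the tool. A fuller analysis would model per-participant trajectories rather than pooled means.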

Drawing from economics and psychology research, other evaluations trace the impacts of AI assistance and the interaction of users’ short-term and long-term behaviors and attitudes. For example, evaluations of skill training programs have measured changes in participants’ productivity and wages after training (Adhvaryu et al., 2018; Chioda et al., 2021), and other measures look at how health, risk-taking behaviors, and levels of societal trust have changed over time (Barrera-Osorio et al., 2020).

Long-term evaluation of LLMs is necessary to ensure that they remain human-centered in the long run. If we perform only short-term evaluations, we risk measuring novelty bias, which can inflate LLMs’ apparent helpfulness to humans. Measuring the long-term effects of LLMs lets us assess their helpfulness more accurately, along with other human-centered metrics important to a given use case (e.g., creativity).

In summary, well-rounded extrinsic evaluation should integrate objective performance and subjective user experience. But while extrinsic evaluations offer a more holistic assessment for human-centered objectives, they are often more complex and expensive. Successful evaluations often make use of a mix of quantitative and qualitative methods to assess the quality of the system, yet there remains room for improvement in integrating human-centered evaluations.

Footnotes

  1. Lengthy Twitter posts connected as a chain

Abbasian, M., Khatibi, E., Azimi, I., Oniani, D., Shakeri Hossein Abad, Z., Thieme, A., Sriram, R., Yang, Z., Wang, Y., Lin, B., & others. (2024). Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. NPJ Digital Medicine, 7(1), 82.
Abdurahman, S., Atari, M., Karimi-Malekabadi, F., Xue, M. J., Trager, J., Park, P. S., Golazizian, P., Omrani, A., & Dehghani, M. (2024). Perils and opportunities in using large language models in psychological research. PNAS Nexus, 3(7), pgae245. https://doi.org/10.1093/pnasnexus/pgae245
Adhvaryu, A., Kala, N., & Nyshadham, A. (2018). The Skills to Pay the Bills: Returns to On-the-job Soft Skills Training (Working Paper No. 24313; Working Paper Series). National Bureau of Economic Research. https://doi.org/10.3386/w24313
Agarwal, D., Naaman, M., & Vashistha, A. (2024). AI Suggestions Homogenize Writing Toward Western Styles and Diminish Cultural Nuances. ArXiv Preprint, abs/2409.11360. https://arxiv.org/abs/2409.11360
Bak, M., & Chin, J. (2024). The potential and limitations of large language models in identification of the states of motivations for facilitating health behavior change. Journal of the American Medical Informatics Association, ocae057.
Barrera-Osorio, F., Kugler, A. D., & Silliman, M. I. (2020). Hard and Soft Skills in Vocational Training: Experimental Evidence from Colombia (Working Paper No. 27548; Working Paper Series). National Bureau of Economic Research. https://doi.org/10.3386/w27548
Brynjolfsson, E., Li, D., & Raymond, L. (2024). Generative AI at Work. https://arxiv.org/abs/2304.11771
Chioda, L., Contreras-Loya, D., Gertler, P., & Carney, D. (2021). Making Entrepreneurs: Returns to Training Youth in Hard Versus Soft Business Skills (Working Paper No. 28845; Working Paper Series). National Bureau of Economic Research. https://doi.org/10.3386/w28845
Eloundou, T., Manning, S., Mishkin, P., & Rock, D. (2024). GPTs are GPTs: Labor market impact potential of LLMs. Science, 384(6702), 1306–1308.
Ibrahim, L., Huang, S., Ahmad, L., & Anderljung, M. (2024). Beyond static AI evaluations: advancing human interaction evaluations for LLM harms and risks. ArXiv Preprint, abs/2405.10632. https://arxiv.org/abs/2405.10632
Jones, K. S., & Galliers, J. R. (1995). The framework: Scope and concepts. In K. S. Jones & J. R. Galliers (Eds.), Evaluating Natural Language Processing Systems: An Analysis and Review (pp. 3–63). Springer Berlin Heidelberg. https://doi.org/10.1007/BFb0027472
Lawrence, H. R., Schneider, R. A., Rubin, S. B., Matarić, M. J., McDuff, D. J., & Bell, M. J. (2024). The opportunities and risks of large language models in mental health. JMIR Mental Health, 11(1), e59479.
Liu, Y., Cong, T., Zhao, Z., Backes, M., Shen, Y., & Zhang, Y. (2024). Robustness Over Time: Understanding Adversarial Examples’ Effectiveness on Longitudinal Versions of Large Language Models. https://arxiv.org/abs/2308.07847
Long, T., Gero, K. I., & Chilton, L. B. (2024). Not Just Novelty: A Longitudinal Study on Utility and Customization of an AI Workflow. ArXiv Preprint, abs/2402.09894. https://arxiv.org/abs/2402.09894
Mozannar, H., Chen, V., Alsobay, M., Das, S., Zhao, S., Wei, D., Nagireddy, M., Sattigeri, P., Talwalkar, A., & Sontag, D. (2024). The RealHumanEval: Evaluating Large Language Models’ Abilities to Support Programmers. ArXiv Preprint, abs/2404.02806. https://arxiv.org/abs/2404.02806
Sadeghi, S., Gupta, S., Gramatovici, S., Lu, J., Ai, H., & Zhang, R. (2022). Novelty and Primacy: A Long-Term Estimator for Online Experiments. Technometrics, 64(4), 524–534. https://doi.org/10.1080/00401706.2022.2124309
Shin, G., Feng, Y., Jarrahi, M. H., & Gafinowitz, N. (2018). Beyond novelty effect: a mixed-methods exploration into the motivation for long-term activity tracker use. JAMIA Open, 2(1), 62–72. https://doi.org/10.1093/jamiaopen/ooy048
Solaiman, I., Talat, Z., Agnew, W., Ahmad, L., Baker, D., Blodgett, S. L., Chen, C., Daumé III, H., Dodge, J., Duan, I., & others. (2023). Evaluating the social impact of generative ai systems in systems and society. ArXiv Preprint, abs/2306.05949. https://arxiv.org/abs/2306.05949
Stade, E. C., Stirman, S. W., Ungar, L. H., Boland, C. L., Schwartz, H. A., Yaden, D. B., Sedoc, J., DeRubeis, R. J., Willer, R., & Eichstaedt, J. C. (2024). Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation. NPJ Mental Health Research, 3(1), 12.
Wang, D., & Zhang, S. (2024). Large language models in medical and healthcare fields: applications, advances, and challenges. Artificial Intelligence Review, 57(11), 299.
Wang, R. E., Ribeiro, A. T., Robinson, C. D., Loeb, S., & Demszky, D. (2025). Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise. https://arxiv.org/abs/2410.03017
Weidinger, L., Rauh, M., Marchal, N., Manzini, A., Hendricks, L. A., Mateos-Garcia, J., Bergman, S., Kay, J., Griffin, C., Bariach, B., & others. (2023). Sociotechnical safety evaluation of generative ai systems. ArXiv Preprint, abs/2310.11986. https://arxiv.org/abs/2310.11986
Yang, D., Ziems, C., Held, W., Shaikh, O., Bernstein, M. S., & Mitchell, J. (2024). Social Skill Training with Large Language Models. ArXiv Preprint, abs/2404.04204. https://arxiv.org/abs/2404.04204
Yang, K., Ji, S., Zhang, T., Xie, Q., Kuang, Z., & Ananiadou, S. (2023). Towards Interpretable Mental Health Analysis with Large Language Models. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 6056–6077). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.370