Societal-level Evaluation

With the increasingly pervasive influence of large language models (LLMs) across sensitive domains such as mental health (Abdurahman et al., 2024; Lawrence et al., 2024; Stade et al., 2024) and education (Wang & Zhang, 2024), and given the complex challenges these models present, evaluating their impact on users and society is crucial. Traditional evaluations of LLMs use datasets and benchmarks to assess potentially hazardous behaviors, but they often fail to bridge the “sociotechnical gap” between controlled assessments and real-world performance (Ibrahim et al., 2024; Weidinger et al., 2023). By focusing on models in isolation, these methods overlook crucial human factors, resulting in an inadequate understanding of human-model interactions and their consequences. Furthermore, these evaluations fail to account for the perpetuation and amplification of societal inequities and biases present in the data on which these models were trained. Thus, it becomes essential to employ a comprehensive extrinsic evaluation framework (Jones & Galliers, 1995) that considers various categories of social impacts, such as bias and stereotypes, cultural values, performance disparities, privacy protection, financial implications, environmental costs, and content moderation labor (Solaiman et al., 2023).

As mentioned in HCI for HCLLMs, randomized controlled trials (RCTs) and large-scale behavioral assessments play a major role in understanding and evaluating LLMs’ behavioral impact on users and society. For instance, to evaluate the effect and quality of a newly developed AI co-tutor that provides LLM-generated feedback to real-life tutors on student performance, R. E. Wang et al. (2025) conducted an experiment in which tutors either had access to the AI co-tutor during their tutoring sessions or tutored without its assistance. This setup enabled R. E. Wang et al. (2025) to systematically evaluate the performance of their model in an ecologically valid setting, receiving real-world feedback from both the users and the model behavior itself (Brynjolfsson et al., 2024).
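As a rough illustration of how outcomes from such a two-arm experiment might be compared, the sketch below estimates the difference in mean outcome between a treatment group (with AI assistance) and a control group (without), and attaches a permutation-test p-value. It is not the analysis used in the cited study; the outcome scores, variable names, and function names are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def difference_in_means(treatment, control):
    """Simple effect estimate: mean(treatment) - mean(control)."""
    return mean(treatment) - mean(control)


def permutation_p_value(treatment, control, n_permutations=10_000, seed=0):
    """Two-sided permutation test for the difference in means.

    Group labels are repeatedly shuffled; the p-value is the share of
    shuffles whose absolute difference is at least the observed one.
    """
    rng = random.Random(seed)
    observed = abs(difference_in_means(treatment, control))
    pooled = list(treatment) + list(control)
    n_treated = len(treatment)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_treated]) - mean(pooled[n_treated:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical per-student outcome scores; not data from any cited study.
with_ai = [0.72, 0.65, 0.80, 0.70, 0.77, 0.69]
without_ai = [0.61, 0.66, 0.58, 0.70, 0.63, 0.60]

effect = difference_in_means(with_ai, without_ai)
p_value = permutation_p_value(with_ai, without_ai)
print(f"estimated effect: {effect:.3f}, permutation p-value: {p_value:.4f}")
```

A permutation test is used here because it makes few distributional assumptions and works with small samples; a real study would more likely rely on a pre-registered regression model that accounts for clustering (for example, students nested within tutors).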

The application of robust evaluation techniques spans various domains. In healthcare, where existing metrics often fail to capture critical factors like user comprehension and trust, researchers are developing new metrics to assess LLMs’ impact on end-user decision-making and expectations (Abbasian et al., 2024). These metrics measure both immediate behavioral changes and long-term adoption patterns. For example, Yang et al. (2023) evaluated LLMs for mental health analysis, showing that while ChatGPT demonstrates strong in-context learning, specialized methods often outperform it. They also found that effective prompt engineering with emotional cues can improve results. Similarly, Bak & Chin (2024) observed that LLMs provided 20%–30% irrelevant information when identifying users’ motivation states for health behavior change, highlighting their limitations.

When evaluating LLMs, it is essential to consider their impact at scale through longitudinal and large-scale studies. Such studies are critical for understanding not just the immediate outcomes of LLM use but also its sustained effects, as LLMs already undergo significant change in response to user engagement (Liu et al., 2024). As an example, Eloundou et al. (2024) examine the impact of LLMs on the U.S. labor market, particularly the enhanced effects of LLM-powered software, finding that higher-income jobs may face greater exposure. These studies emphasize the importance of robust behavioral experimental design and of scale, in units of both time and users, in evaluating LLMs, as results obtained from small-scale and lab-controlled studies may not always generalize to larger, more diverse, real-life user populations. Furthermore, such evaluations allow us to explore cumulative effects, such as changes in users’ attitudes, considerations, and even decision-making.

Extrinsic Evaluation. Extrinsic evaluations cover behavioral impacts, self-efficacy reports, standardized evaluation, short-term outcomes, and long-term outcomes (D. Yang et al., 2024).

Behavioral impacts: Behavioral impacts track the changes in qualitatively coded participant behaviors before and after exposure to a system. Evaluations use task-based assessments, tracking engagement and task completion. For example, realHumanEval measures the number of tasks completed, time to task success, acceptance rate, and number of chat code copies to comprehensively analyze the quality of AI-coding tools and their impacts on human users (Mozannar et al., 2024). Standardized evaluations are also more objective and draw on pre-defined assessments, for instance to measure the effects of AI-generated suggestions on writing style (Agarwal et al., 2024).
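To make the behavioral metrics listed above concrete, the following minimal sketch aggregates them from per-session logs. The `Session` record and its field names are assumptions made for this example; they are not the realHumanEval data schema, and the numbers are invented.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Session:
    """One participant session; every field is a hypothetical log field."""
    tasks_completed: int
    seconds_to_first_success: Optional[float]  # None if no task was solved
    suggestions_shown: int
    suggestions_accepted: int
    chat_code_copies: int


def summarize(sessions):
    """Aggregate behavioral metrics across sessions."""
    solved_times = [
        s.seconds_to_first_success
        for s in sessions
        if s.seconds_to_first_success is not None
    ]
    shown = sum(s.suggestions_shown for s in sessions)
    accepted = sum(s.suggestions_accepted for s in sessions)
    return {
        "mean_tasks_completed": mean(s.tasks_completed for s in sessions),
        # Averaged only over sessions in which at least one task was solved.
        "mean_time_to_success_s": mean(solved_times) if solved_times else None,
        # Pooled over all suggestions rather than averaged per session.
        "acceptance_rate": accepted / shown if shown else None,
        "mean_chat_code_copies": mean(s.chat_code_copies for s in sessions),
    }


# Invented example sessions, for illustration only.
sessions = [
    Session(3, 120.0, 10, 4, 2),
    Session(1, None, 5, 1, 0),
    Session(2, 60.0, 5, 5, 1),
]
summary = summarize(sessions)
print(summary)
```

Pooling the acceptance rate over all suggestions weights heavy users more than light users; averaging per-session rates instead would weight every participant equally, and which is preferable depends on the question being asked.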

Self-efficacy evaluation: Self-efficacy evaluation includes questionnaires about participants’ perceptions of a system’s usefulness and their own perceived levels of ownership when interacting with a system (Long et al., 2024). Together, these extrinsic methods distinguish between performance-based metrics, like tracking behavior or test performance, and perception-based metrics, such as user surveys.

Short-term and Long-term evaluation: Evaluations can be split into short-term and long-term (D. Yang et al., 2024), with short-term evaluations being constrained interactive sessions and long-term evaluations being much longer studies (e.g., one week). In short-term evaluations, new AI tools may receive subjectively higher scores due to novelty bias (Sadeghi et al., 2022; Shin et al., 2018). This is less of an issue with long-term evaluations. For example, consider one longitudinal study of an AI chain tool for creating Tweetorial¹ chains on science communication (Long et al., 2024). The study finds that after a “familiarization phase,” the perceived utility of the tool rises even above its level under novelty bias, suggesting that end-users become more capable of using the tool, for example by customizing the prompt. Thus, the AI tool was found to be more useful in the long term.
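One simple way to check whether early ratings differ from later ones is to compare the mean perceived-utility rating during the familiarization phase with the mean rating afterwards. The sketch below does this under stated assumptions: ratings are grouped by week, the familiarization phase has a known length in weeks, and the data are invented. It is not the analysis performed in the cited longitudinal study.

```python
from statistics import mean


def weekly_means(ratings_by_week):
    """Mean rating for each week; ratings_by_week is a list of lists."""
    return [mean(week) for week in ratings_by_week]


def novelty_gap(ratings_by_week, familiarization_weeks=1):
    """Mean of later weekly means minus mean of early weekly means.

    A negative gap is consistent with ratings inflated by novelty that
    later fade; a positive gap suggests utility grew with familiarity.
    """
    if not 0 < familiarization_weeks < len(ratings_by_week):
        raise ValueError("need at least one early week and one later week")
    means = weekly_means(ratings_by_week)
    early = mean(means[:familiarization_weeks])
    late = mean(means[familiarization_weeks:])
    return late - early


# Invented 1-5 utility ratings from two participants over four weeks.
ratings = [[4, 4], [3, 3], [5, 5], [5, 5]]
gap = novelty_gap(ratings, familiarization_weeks=2)
print(f"weekly means: {weekly_means(ratings)}, gap after familiarization: {gap}")
```

Averaging weekly means rather than pooling raw ratings gives each week equal weight even if participation varies from week to week; a fuller analysis would also model per-participant trends.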

Drawing from economics and psychology research, other evaluations trace the impacts of AI assistance and the interaction of users’ short-term and long-term behaviors and attitudes. For example, skill training systems have measured the changes in productivity and wages in participants after training (Adhvaryu et al., 2018; Chioda et al., 2021), and other measures look at how health, risk-taking behaviors, and levels of societal trust have changed over time (Barrera-Osorio et al., 2020).

Long-term evaluation of LLMs is necessary to ensure that LLMs remain human-centered in the long run. If we only perform short-term evaluations, we risk measuring novelty bias, which can inflate an LLM’s apparent helpfulness to humans. By measuring the long-term effects of LLMs, we will be able to accurately measure their helpfulness, along with other human-centered metrics that matter for a given use case (e.g., creativity).

In summary, well-rounded extrinsic evaluation should integrate objective performance and subjective user experience. But while extrinsic evaluations offer a more holistic assessment for human-centered objectives, they are often more complex and expensive. Successful evaluations often make use of a mix of quantitative and qualitative methods to assess the quality of the system, yet there remains room for improvement in integrating human-centered evaluations.

¹ Lengthy Twitter posts connected as a chain.

Abbasian, M., Khatibi, E., Azimi, I., Oniani, D., Shakeri Hossein Abad, Z., Thieme, A., Sriram, R., Yang, Z., Wang, Y., Lin, B., & others. (2024). Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. NPJ Digital Medicine, 7(1), 82.

Abdurahman, S., Atari, M., Karimi-Malekabadi, F., Xue, M. J., Trager, J., Park, P. S., Golazizian, P., Omrani, A., & Dehghani, M. (2024). Perils and opportunities in using large language models in psychological research. PNAS Nexus, 3(7), pgae245. https://doi.org/10.1093/pnasnexus/pgae245

Adhvaryu, A., Kala, N., & Nyshadham, A. (2018). The Skills to Pay the Bills: Returns to On-the-job Soft Skills Training (Working Paper No. 24313; Working Paper Series). National Bureau of Economic Research. https://doi.org/10.3386/w24313

Agarwal, D., Naaman, M., & Vashistha, A. (2024). AI Suggestions Homogenize Writing Toward Western Styles and Diminish Cultural Nuances. ArXiv Preprint, abs/2409.11360. https://arxiv.org/abs/2409.11360

Bak, M., & Chin, J. (2024). The potential and limitations of large language models in identification of the states of motivations for facilitating health behavior change. Journal of the American Medical Informatics Association, ocae057.

Barrera-Osorio, F., Kugler, A. D., & Silliman, M. I. (2020). Hard and Soft Skills in Vocational Training: Experimental Evidence from Colombia (Working Paper No. 27548; Working Paper Series). National Bureau of Economic Research. https://doi.org/10.3386/w27548

Brynjolfsson, E., Li, D., & Raymond, L. (2024). Generative AI at Work. https://arxiv.org/abs/2304.11771

Chioda, L., Contreras-Loya, D., Gertler, P., & Carney, D. (2021). Making Entrepreneurs: Returns to Training Youth in Hard Versus Soft Business Skills (Working Paper No. 28845; Working Paper Series). National Bureau of Economic Research. https://doi.org/10.3386/w28845

Eloundou, T., Manning, S., Mishkin, P., & Rock, D. (2024). GPTs are GPTs: Labor market impact potential of LLMs. Science, 384(6702), 1306–1308.

Ibrahim, L., Huang, S., Ahmad, L., & Anderljung, M. (2024). Beyond static AI evaluations: advancing human interaction evaluations for LLM harms and risks. ArXiv Preprint, abs/2405.10632. https://arxiv.org/abs/2405.10632

Jones, K. S., & Galliers, J. R. (1995). The framework: Scope and concepts. In K. S. Jones & J. R. Galliers (Eds.), Evaluating Natural Language Processing Systems: An Analysis and Review (pp. 3–63). Springer Berlin Heidelberg. https://doi.org/10.1007/BFb0027472

Lawrence, H. R., Schneider, R. A., Rubin, S. B., Matarić, M. J., McDuff, D. J., & Bell, M. J. (2024). The opportunities and risks of large language models in mental health. JMIR Mental Health, 11(1), e59479.

Liu, Y., Cong, T., Zhao, Z., Backes, M., Shen, Y., & Zhang, Y. (2024). Robustness Over Time: Understanding Adversarial Examples’ Effectiveness on Longitudinal Versions of Large Language Models. https://arxiv.org/abs/2308.07847

Long, T., Gero, K. I., & Chilton, L. B. (2024). Not Just Novelty: A Longitudinal Study on Utility and Customization of an AI Workflow. ArXiv Preprint, abs/2402.09894. https://arxiv.org/abs/2402.09894

Mozannar, H., Chen, V., Alsobay, M., Das, S., Zhao, S., Wei, D., Nagireddy, M., Sattigeri, P., Talwalkar, A., & Sontag, D. (2024). The RealHumanEval: Evaluating Large Language Models’ Abilities to Support Programmers. ArXiv Preprint, abs/2404.02806. https://arxiv.org/abs/2404.02806

Sadeghi, S., Gupta, S., Gramatovici, S., Lu, J., Ai, H., & Zhang, R. (2022). Novelty and Primacy: A Long-Term Estimator for Online Experiments. Technometrics, 64(4), 524–534. https://doi.org/10.1080/00401706.2022.2124309

Shin, G., Feng, Y., Jarrahi, M. H., & Gafinowitz, N. (2018). Beyond novelty effect: a mixed-methods exploration into the motivation for long-term activity tracker use. JAMIA Open, 2(1), 62–72. https://doi.org/10.1093/jamiaopen/ooy048

Solaiman, I., Talat, Z., Agnew, W., Ahmad, L., Baker, D., Blodgett, S. L., Chen, C., Daumé III, H., Dodge, J., Duan, I., & others. (2023). Evaluating the social impact of generative AI systems in systems and society. ArXiv Preprint, abs/2306.05949. https://arxiv.org/abs/2306.05949

Stade, E. C., Stirman, S. W., Ungar, L. H., Boland, C. L., Schwartz, H. A., Yaden, D. B., Sedoc, J., DeRubeis, R. J., Willer, R., & Eichstaedt, J. C. (2024). Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation. NPJ Mental Health Research, 3(1), 12.

Wang, D., & Zhang, S. (2024). Large language models in medical and healthcare fields: applications, advances, and challenges. Artificial Intelligence Review, 57(11), 299.

Wang, R. E., Ribeiro, A. T., Robinson, C. D., Loeb, S., & Demszky, D. (2025). Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise. https://arxiv.org/abs/2410.03017

Weidinger, L., Rauh, M., Marchal, N., Manzini, A., Hendricks, L. A., Mateos-Garcia, J., Bergman, S., Kay, J., Griffin, C., Bariach, B., & others. (2023). Sociotechnical safety evaluation of generative AI systems. ArXiv Preprint, abs/2310.11986. https://arxiv.org/abs/2310.11986

Yang, D., Ziems, C., Held, W., Shaikh, O., Bernstein, M. S., & Mitchell, J. (2024). Social Skill Training with Large Language Models. ArXiv Preprint, abs/2404.04204. https://arxiv.org/abs/2404.04204

Yang, K., Ji, S., Zhang, T., Xie, Q., Kuang, Z., & Ananiadou, S. (2023). Towards Interpretable Mental Health Analysis with Large Language Models. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 6056–6077). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.370
