Next, we discuss how a human-centered lens can be applied to training and evaluating LLMs to enable a future of work that jointly benefits the different stakeholders we have discussed. To start, much of the current discourse around LLM systems emphasizes autonomous execution, that is, systems that complete complex, multi-step tasks with minimal human involvement (Shen et al., 2025). Yet focusing primarily on this form of autonomous interaction overlooks the potential gains from systems in which humans and LLMs collaborate. For example, Wang et al. (2026) have raised exactly this concern in the context of AI coding agents, a domain where models perform well and that has seen significant adoption, especially among professionals in the field (Appel et al., 2025). They argue that the field has moved too quickly toward full automation and that human involvement deserves to be treated as a first-class design consideration rather than a transitional inconvenience. Taking this critique seriously requires asking what it would actually take, technically, to build human-centered LLMs for the workforce.

Designing Models as Collaborators.

One prerequisite is that models need to know how to collaborate. Collaboration, however, is not simply a matter of adding a confirmation step before a model takes action. It requires a more nuanced understanding of when to act autonomously, when to defer, and how to communicate uncertainty or request input in ways that feel natural rather than disruptive. As Shen et al. (2025) show, building more capable models does not necessarily yield better collaboration, so collaboration requires dedicated effort. In fact, current training paradigms, such as preference tuning over single conversational turns, can even hinder collaboration abilities (Wu et al., 2025). Furthermore, successful collaboration does not mean indiscriminately putting a human in the loop. While ostensibly beneficial, doing so can lead to unnecessary verification and inefficiencies that ultimately slow down the collaboration process. The best form of collaboration will necessarily vary, depending on the task at hand as well as the skill sets of the human and LLM involved. For example, sometimes the human may need to audit the LLM’s work; other times the right form of collaboration may be a back-and-forth conversation; and in still other cases end-to-end execution by the model may suffice.

To start, we must return to the core question: what are the ingredients for a successful collaboration? In human organizations, a common collaboration paradigm is delegation. At the moment, agents are capable of decomposing complex tasks into more manageable units of action. The next step, however, is being able to assign these actions. We therefore need to know which tasks to hand off to these models, which may be better suited for humans to complete, and which to delegate across multiple different models (Wang et al., 2025). This delegation capability must furthermore be adaptive across users, tasks, context, and model capabilities (Fügener et al., 2022; Guggenberger et al., 2023; Tomašev et al., 2026). Beyond being able to delegate, there are many other properties of a good collaborator that existing models lack, such as proactivity (Lu et al., 2024), transparency (Liao & Vaughan, 2023), and social norm adherence (Forbes et al., 2020; Shaikh et al., 2025; Ziems et al., 2023), all of which present fruitful areas for future inquiry.
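To make the idea of adaptive delegation concrete, the following is a minimal illustrative sketch, not a method drawn from any of the cited works. It assumes hypothetical per-subtask estimates (`model_confidence`, `stakes`, `ambiguity`) and a hypothetical `user_expertise` score, all in [0, 1], together with arbitrary thresholds, and it maps each subtask to one of the collaboration forms discussed above: end-to-end execution, human audit, back-and-forth conversation, or human completion.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    MODEL_EXECUTES = "end-to-end execution by the model"
    HUMAN_AUDITS = "model drafts, human audits"
    CONVERSATION = "back-and-forth conversation"
    HUMAN_EXECUTES = "human completes the task"


@dataclass
class Subtask:
    name: str
    model_confidence: float  # assumed estimate of model reliability on this subtask, in [0, 1]
    stakes: float            # assumed cost of an error, in [0, 1]
    ambiguity: float         # assumed degree of underspecification, in [0, 1]


def choose_mode(task: Subtask, user_expertise: float) -> Mode:
    """Pick a collaboration mode for one subtask (thresholds are illustrative assumptions)."""
    # Underspecified subtasks are clarified through dialogue before anyone acts.
    if task.ambiguity > 0.6:
        return Mode.CONVERSATION
    # Rough expected risk: chance the model errs, weighted by how costly an error is.
    risk = (1.0 - task.model_confidence) * task.stakes
    if risk < 0.1:
        return Mode.MODEL_EXECUTES
    # When risk is non-trivial, the model drafts only if it is at least as capable
    # as the user on this subtask; otherwise the human does the work directly.
    if task.model_confidence >= user_expertise:
        return Mode.HUMAN_AUDITS
    return Mode.HUMAN_EXECUTES


if __name__ == "__main__":
    subtasks = [
        Subtask("format citations", model_confidence=0.95, stakes=0.2, ambiguity=0.1),
        Subtask("draft contract clause", model_confidence=0.7, stakes=0.9, ambiguity=0.2),
        Subtask("scope the project", model_confidence=0.8, stakes=0.5, ambiguity=0.8),
    ]
    for st in subtasks:
        print(f"{st.name}: {choose_mode(st, user_expertise=0.5).value}")
```

In this sketch the same subtask can be routed differently for different users: raising `user_expertise` above the model's confidence shifts a risky subtask from human audit to human completion. In practice these quantities would themselves have to be estimated, which is part of what makes adaptive delegation an open problem.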

Improving our Understanding of Workers.

A second technical requirement is a richer understanding of the users these systems are meant to serve. One push in this direction is developing better user simulators that can capture the behavioral patterns and habits of individuals (Lei et al., 2026; Paglieri et al., 2026). As our ability to simulate users improves, a question we ought to return to is which users are being simulated. At the moment, many efforts concentrate on generic user behavior or on knowledge workers such as developers (Naous et al., 2025; Wang et al., 2026). This provides a deep but narrow view of the workforce. The population of workers whose jobs will intersect with LLMs is far broader, spanning domains like healthcare, retail, education, logistics, and skilled trades (Handa et al., 2025; Tomlinson et al., 2025). Developing better user simulators for these contexts requires going beyond theoretical modeling to gain a deeper understanding of how people are actually using these systems in the real world. This is a genuine challenge, because the most valuable behavioral data tends to be held by the companies deploying models and is rarely made available to the broader research community. Finding ways to bridge this gap, whether through privacy-preserving data sharing, partnerships, or the careful design of in-the-wild studies, is a prerequisite for building systems that serve a broader swath of workers.

Evaluating for Ecologically Valid Tasks.

Finally, across chapters we have emphasized the importance of data in the development of, and progress towards, human-centered LLMs. To see progress in this area, we must also have benchmarks that actually reflect the complexity of real work. Many of the standard benchmarks used to evaluate LLM performance consist of synthetic or highly constrained tasks that bear little resemblance to the messy, context-dependent, and often ambiguous nature of professional work. More general-purpose benchmarks, such as GDPval (Patwardhan et al., 2025), represent a meaningful step toward measuring model performance on more ecologically valid tasks, and domain-specific benchmarks provide an even more precise look at how these models perform in settings such as finance (Fan et al., 2025), law (Guha et al., 2023; Li et al., 2025), and medicine (Arora et al., 2025). In tandem, we must also consider what these benchmarks fail to capture but that we may still want to evaluate. For instance, how can we better account for how workers actually interact with these systems, or for how performance holds up in the messy, open-ended conditions of real work?

Appel, R., McCrory, P., Tamkin, A., McCain, M., Neylon, T., & Stern, M. (2025). Anthropic economic index report: Uneven geographic and enterprise ai adoption. arXiv Preprint arXiv:2511.15080.
Arora, R. K., Wei, J., Hicks, R. S., Bowman, P., Quiñonero-Candela, J., Tsimpourlas, F., Sharman, M., Shah, M., Vallone, A., Beutel, A., & others. (2025). Healthbench: Evaluating large language models towards improved human health. arXiv Preprint arXiv:2505.08775.
Fan, T., Yang, Y., Jiang, Y., Zhang, Y., Chen, Y., & Huang, C. (2025). AI-Trader: Benchmarking Autonomous Agents in Real-Time Financial Markets. arXiv Preprint arXiv:2512.10971.
Forbes, M., Hwang, J. D., Shwartz, V., Sap, M., & Choi, Y. (2020). Social chemistry 101: Learning to reason about social and moral norms. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 653–670.
Fügener, A., Grahl, J., Gupta, A., & Ketter, W. (2022). Cognitive challenges in human–artificial intelligence collaboration: Investigating the path toward productive delegation. Information Systems Research, 33(2), 678–696.
Guggenberger, T., Lämmermann, L., Urbach, N., Walter, A. M., & Hofmann, P. (2023). Task delegation from AI to humans: A principal-agent perspective.
Guha, N., Nyarko, J., Ho, D. E., Ré, C., Chilton, A., K, A., Chohlas-Wood, A., Peters, A., Waldon, B., Rockmore, D. N., Zambrano, D., Talisman, D., Hoque, E., Surani, F., Fagan, F., Sarfaty, G., Dickinson, G. M., Porat, H., Hegland, J., … Li, Z. (2023). LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/89e44582fd28ddfea1ea4dcb0ebbf4b0-Abstract-Datasets_and_Benchmarks.html
Handa, K., Tamkin, A., McCain, M., Huang, S., Durmus, E., Heck, S., Mueller, J., Hong, J., Ritchie, S., Belonax, T., & others. (2025). Which economic tasks are performed with ai? evidence from millions of claude conversations. arXiv Preprint arXiv:2503.04761.
Lei, Y., Wang, T., Lian, J., Hu, Z., Lian, D., & Xie, X. (2026). HumanLLM: Towards Personalized Understanding and Simulation of Human Nature. arXiv Preprint arXiv:2601.15793.
Li, H., Chen, J., Yang, J., Ai, Q., Jia, W., Liu, Y., Lin, K., Wu, Y., Yuan, G., Hu, Y., & others. (2025). Legalagentbench: Evaluating llm agents in legal domain. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2322–2344.
Liao, Q. V., & Vaughan, J. W. (2023). AI transparency in the age of LLMs: A human-centered research roadmap. arXiv Preprint arXiv:2306.01941. https://arxiv.org/abs/2306.01941
Lu, Y., Yang, S., Qian, C., Chen, G., Luo, Q., Wu, Y., Wang, H., Cong, X., Zhang, Z., Lin, Y., & others. (2024). Proactive agent: Shifting llm agents from reactive responses to active assistance. arXiv Preprint arXiv:2410.12361.
Naous, T., Laban, P., Xu, W., & Neville, J. (2025). Flipping the Dialogue: Training and Evaluating User Language Models. arXiv Preprint arXiv:2510.06552.
Paglieri, D., Cross, L., Cunningham, W. A., Leibo, J. Z., & Vezhnevets, A. S. (2026). Persona Generators: Generating Diverse Synthetic Personas at Scale. arXiv Preprint arXiv:2602.03545.
Patwardhan, T., Dias, R., Proehl, E., Kim, G., Wang, M., Watkins, O., Fishman, S. P., Aljubeh, M., Thacker, P., Fauconnet, L., & others. (2025). Gdpval: Evaluating ai model performance on real-world economically valuable tasks. arXiv Preprint arXiv:2510.04374.
Shaikh, O., Sapkota, S., Rizvi, S., Horvitz, E., Park, J. S., Yang, D., & Bernstein, M. S. (2025). Creating General User Models from Computer Use. Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. https://doi.org/10.1145/3746059.3747722
Shen, S. Z., Chen, V., Gu, K., Ross, A., Ma, Z., Ross, J., Gu, A., Si, C., Chi, W., Peng, A., & others. (2025). Completion ≠ Collaboration: Scaling Collaborative Effort with Agents. arXiv Preprint arXiv:2510.25744.
Tomašev, N., Franklin, M., & Osindero, S. (2026). Intelligent AI Delegation. arXiv Preprint arXiv:2602.11865.
Tomlinson, K., Jaffe, S., Wang, W., Counts, S., & Suri, S. (2025). Working with AI: measuring the applicability of generative AI to occupations. arXiv Preprint arXiv:2507.07935.
Wang, Z. Z., Shao, Y., Shaikh, O., Fried, D., Neubig, G., & Yang, D. (2025). How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations. arXiv Preprint arXiv:2510.22780.
Wang, Z. Z., Yang, J., Lieret, K., Tartaglini, A., Chen, V., Wei, Y., Zhang, Z. W. L., Narasimhan, K., Schmidt, L., Neubig, G., & others. (2026). Position: Humans are Missing from AI Coding Agent Research.
Wu, S., Galley, M., Peng, B., Cheng, H., Li, G., Dou, Y., Cai, W., Zou, J., Leskovec, J., & Gao, J. (2025). CollabLLM: From Passive Responders to Active Collaborators. International Conference on Machine Learning, 67260–67283.
Ziems, C., Dwivedi-Yu, J., Wang, Y.-C., Halevy, A., & Yang, D. (2023). NormBank: A Knowledge Bank of Situational Social Norms. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 7756–7776). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.429