Developing HCLLMs for the Future of Work

Next, we discuss how we can apply a human-centered lens to training and evaluating LLMs to enable a future of work that jointly benefits the different stakeholders we have discussed. To start, much of the current discourse around LLM systems emphasizes autonomous execution, that is, systems that complete complex, multi-step tasks with minimal human involvement (Shen et al., 2025). Yet focusing primarily on this form of autonomous interaction overlooks the potential gains from systems in which humans and LLMs collaborate. For example, Wang et al. (2026) have raised exactly this concern in the context of AI coding agents, a domain where models are quite performant and which has seen significant adoption, especially among professionals in the field (Appel et al., 2025). They argue that the field has moved too quickly toward full automation and that human involvement deserves to be treated as a first-class design consideration rather than a transitional inconvenience. Taking this critique seriously requires asking what it would actually take, technically, to build human-centered LLMs for the workforce.

Designing Models as Collaborators.

One prerequisite is that models need to know how to collaborate. However, collaboration is not simply a matter of adding a confirmation step before a model takes action. It requires a more nuanced understanding of when to act autonomously, when to defer, and how to communicate uncertainty or request input in ways that feel natural rather than disruptive. As Shen et al. (2025) show, building more capable models does not necessarily yield better collaboration, so collaboration requires dedicated effort in its own right. In fact, current training paradigms, such as preference tuning over single conversational turns, can even hinder collaboration abilities (Wu et al., 2025). Furthermore, successful collaboration does not mean indiscriminately putting a human in the loop. While ostensibly beneficial, this interaction paradigm can lead to unnecessary verification and inefficiencies that ultimately slow down the collaboration process. The best form of collaboration will necessarily vary, depending on the task at hand as well as the skillsets of the human and LLM involved. For example, sometimes the human may need to audit the LLM’s work; other times the right form of collaboration may be a back-and-forth conversation; and in still others, end-to-end execution by the model may suffice.

To start, we must return to the core question: what are the ingredients for a successful collaboration? In human organizations, a common collaboration paradigm is delegation. At the moment, agents are capable of decomposing complex tasks into more manageable units of action. The next step is being able to assign these actions: we need to know which tasks to hand off to these models, which are better suited for humans to complete, and which to delegate across multiple different models (Wang et al., 2025). This delegation capability must furthermore be adaptive across users, tasks, context, and model capabilities (Fügener et al., 2022; Guggenberger et al., 2023; Tomašev et al., 2026). Beyond being able to delegate, there are many other properties of a good collaborator that existing models lack, such as proactivity (Lu et al., 2024), transparency (Liao & Vaughan, 2023), and social norm adherence (Forbes et al., 2020; Shaikh et al., 2025; Ziems et al., 2023), all of which present fruitful areas for future inquiry.
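To make the idea of adaptive delegation concrete, the following is a minimal sketch of one possible assignment rule, not a method drawn from any of the works cited above. It assumes, hypothetically, that we have per-subtask estimates of how reliable the model and the human are, along with rough costs for errors and for human time. The names `Subtask` and `choose_mode`, every field, and the expected-cost formulation are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    """One unit of a decomposed task. All fields are hypothetical estimates."""
    name: str
    model_success_prob: float  # assumed chance the model completes it correctly
    human_success_prob: float  # assumed chance the human completes (or catches) it correctly
    error_cost: float          # assumed cost of an uncaught error
    human_time_cost: float     # assumed cost of the human doing the subtask
    review_time_cost: float    # assumed cost of the human auditing the model's output


def choose_mode(task: Subtask) -> str:
    """Return the collaboration mode with the lowest expected cost.

    Assumption: during an audit, the human catches a model error with
    probability `human_success_prob`.
    """
    model_fail = 1.0 - task.model_success_prob
    human_fail = 1.0 - task.human_success_prob
    expected_cost = {
        # The model acts alone; every model error goes uncaught.
        "model_end_to_end": model_fail * task.error_cost,
        # The human reviews; an error survives only if both the model and the human miss it.
        "model_with_human_audit": task.review_time_cost
        + model_fail * human_fail * task.error_cost,
        # The human does the work directly.
        "human": task.human_time_cost + human_fail * task.error_cost,
    }
    return min(expected_cost, key=expected_cost.get)


if __name__ == "__main__":
    tasks = [
        Subtask("format citations", 0.98, 0.99, 1.0, 5.0, 1.0),
        Subtask("draft contract clause", 0.80, 0.95, 100.0, 20.0, 5.0),
        Subtask("negotiate with client", 0.30, 0.95, 100.0, 10.0, 15.0),
    ]
    for t in tasks:
        print(f"{t.name}: {choose_mode(t)}")
```

Because the estimates are inputs rather than constants, the same rule can return different assignments for different users, tasks, and models, which is the sense of adaptivity discussed above. In practice, obtaining trustworthy estimates is itself an open problem, and a real system would also need to account for the conversational, back-and-forth modes of collaboration that this sketch leaves out.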

Improving our Understanding of Workers.

A second technical requirement is a richer understanding of the users these systems are meant to serve. One push in this direction is developing better user simulators that can capture the behavioral patterns and habits of individuals (Lei et al., 2026; Paglieri et al., 2026). As our ability to simulate users improves, a question we ought to return to is which users are being simulated. At the moment, many efforts concentrate on generic user behavior or on knowledge workers (e.g., developers) (Naous et al., 2025; Wang et al., 2026). This provides a deep but narrow view of the workforce. The population of workers whose jobs will intersect with LLMs is far broader, spanning domains like healthcare, retail, education, logistics, and skilled trades (Handa et al., 2025; Tomlinson et al., 2025). Developing better user simulators for these contexts requires going beyond theoretical modeling to gain a deeper understanding of how people are actually using these systems in the real world. This is a genuine challenge, because the most valuable behavioral data tends to be held by the companies deploying models and is rarely made available to the broader research community. Finding ways to bridge this gap, whether through privacy-preserving data sharing, partnerships, or the careful design of in-the-wild studies, is a prerequisite for building systems that serve a broader swath of workers.

Evaluating for Ecologically Valid Tasks.

Finally, across chapters we have emphasized the importance of data in the development of, and progress towards, human-centered LLMs. To see progress in this area, we must also have benchmarks that actually reflect the complexity of real work. Many of the standard benchmarks used to evaluate LLM performance consist of synthetic or highly constrained tasks that bear little resemblance to the messy, context-dependent, and often ambiguous nature of professional work. More general-purpose benchmarks, such as GDPval (Patwardhan et al., 2025), represent a meaningful step toward measuring model performance on more ecologically valid tasks, and domain-specific benchmarks provide an even more precise look at how these models perform in settings such as finance (Fan et al., 2025), law (Guha et al., 2023; Li et al., 2025), and medicine (Arora et al., 2025). In tandem, we must also consider what these benchmarks fail to capture that we may nonetheless want to evaluate. For instance, how can we better account for how workers actually interact with these systems, or for how performance holds up in the messy, open-ended conditions of real work?

Appel, R., McCrory, P., Tamkin, A., McCain, M., Neylon, T., & Stern, M. (2025). Anthropic economic index report: Uneven geographic and enterprise ai adoption. arXiv Preprint arXiv:2511.15080.

Arora, R. K., Wei, J., Hicks, R. S., Bowman, P., Quiñonero-Candela, J., Tsimpourlas, F., Sharman, M., Shah, M., Vallone, A., Beutel, A., & others. (2025). Healthbench: Evaluating large language models towards improved human health. arXiv Preprint arXiv:2505.08775.

Fan, T., Yang, Y., Jiang, Y., Zhang, Y., Chen, Y., & Huang, C. (2025). AI-Trader: Benchmarking Autonomous Agents in Real-Time Financial Markets. arXiv Preprint arXiv:2512.10971.

Forbes, M., Hwang, J. D., Shwartz, V., Sap, M., & Choi, Y. (2020). Social chemistry 101: Learning to reason about social and moral norms. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 653–670.

Fügener, A., Grahl, J., Gupta, A., & Ketter, W. (2022). Cognitive challenges in human–artificial intelligence collaboration: Investigating the path toward productive delegation. Information Systems Research, 33(2), 678–696.

Guggenberger, T., Lämmermann, L., Urbach, N., Walter, A. M., & Hofmann, P. (2023). Task delegation from AI to humans: A principal-agent perspective.

Guha, N., Nyarko, J., Ho, D. E., Ré, C., Chilton, A., K, A., Chohlas-Wood, A., Peters, A., Waldon, B., Rockmore, D. N., Zambrano, D., Talisman, D., Hoque, E., Surani, F., Fagan, F., Sarfaty, G., Dickinson, G. M., Porat, H., Hegland, J., … Li, Z. (2023). LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/89e44582fd28ddfea1ea4dcb0ebbf4b0-Abstract-Datasets_and_Benchmarks.html

Handa, K., Tamkin, A., McCain, M., Huang, S., Durmus, E., Heck, S., Mueller, J., Hong, J., Ritchie, S., Belonax, T., & others. (2025). Which economic tasks are performed with ai? evidence from millions of claude conversations. arXiv Preprint arXiv:2503.04761.

Lei, Y., Wang, T., Lian, J., Hu, Z., Lian, D., & Xie, X. (2026). HumanLLM: Towards Personalized Understanding and Simulation of Human Nature. arXiv Preprint arXiv:2601.15793.

Li, H., Chen, J., Yang, J., Ai, Q., Jia, W., Liu, Y., Lin, K., Wu, Y., Yuan, G., Hu, Y., & others. (2025). Legalagentbench: Evaluating llm agents in legal domain. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2322–2344.

Liao, Q. V., & Vaughan, J. W. (2023). Ai transparency in the age of llms: A human-centered research roadmap. ArXiv Preprint, abs/2306.01941. https://arxiv.org/abs/2306.01941

Lu, Y., Yang, S., Qian, C., Chen, G., Luo, Q., Wu, Y., Wang, H., Cong, X., Zhang, Z., Lin, Y., & others. (2024). Proactive agent: Shifting llm agents from reactive responses to active assistance. arXiv Preprint arXiv:2410.12361.

Naous, T., Laban, P., Xu, W., & Neville, J. (2025). Flipping the Dialogue: Training and Evaluating User Language Models. arXiv Preprint arXiv:2510.06552.

Paglieri, D., Cross, L., Cunningham, W. A., Leibo, J. Z., & Vezhnevets, A. S. (2026). Persona Generators: Generating Diverse Synthetic Personas at Scale. arXiv Preprint arXiv:2602.03545.

Patwardhan, T., Dias, R., Proehl, E., Kim, G., Wang, M., Watkins, O., Fishman, S. P., Aljubeh, M., Thacker, P., Fauconnet, L., & others. (2025). Gdpval: Evaluating ai model performance on real-world economically valuable tasks. arXiv Preprint arXiv:2510.04374.

Shaikh, O., Sapkota, S., Rizvi, S., Horvitz, E., Park, J. S., Yang, D., & Bernstein, M. S. (2025). Creating General User Models from Computer Use. Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. https://doi.org/10.1145/3746059.3747722

Shen, S. Z., Chen, V., Gu, K., Ross, A., Ma, Z., Ross, J., Gu, A., Si, C., Chi, W., Peng, A., & others. (2025). Completion ≠ Collaboration: Scaling Collaborative Effort with Agents. arXiv Preprint arXiv:2510.25744.

Tomašev, N., Franklin, M., & Osindero, S. (2026). Intelligent AI Delegation. arXiv Preprint arXiv:2602.11865.

Tomlinson, K., Jaffe, S., Wang, W., Counts, S., & Suri, S. (2025). Working with AI: measuring the applicability of generative AI to occupations. arXiv Preprint arXiv:2507.07935.

Wang, Z. Z., Shao, Y., Shaikh, O., Fried, D., Neubig, G., & Yang, D. (2025). How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations. arXiv Preprint arXiv:2510.22780.

Wang, Z. Z., Yang, J., Lieret, K., Tartaglini, A., Chen, V., Wei, Y., Zhang, Z. W. L., Narasimhan, K., Schmidt, L., Neubig, G., & others. (2026). Position: Humans are Missing from AI Coding Agent Research.

Wu, S., Galley, M., Peng, B., Cheng, H., Li, G., Dou, Y., Cai, W., Zou, J., Leskovec, J., & Gao, J. (2025). CollabLLM: From Passive Responders to Active Collaborators. International Conference on Machine Learning, 67260–67283.

Ziems, C., Dwivedi-Yu, J., Wang, Y.-C., Halevy, A., & Yang, D. (2023). NormBank: A Knowledge Bank of Situational Social Norms. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 7756–7776). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.429
