Defining the What: Principles and Challenges for Designing HCLLMs

While involving humans throughout the LLM lifecycle helps us understand who is affected by these systems, it also raises fundamental questions about how to design for their needs. Designing human-centered LLMs requires more than simply optimizing model capabilities. It demands careful attention to how users interact with, make sense of, and form relationships with these systems across diverse contexts.

Designing human-AI interaction is not a new challenge. Before the rise of LLMs, Yang et al. (2020) delineated what makes human-AI interaction uniquely difficult to design, highlighting (1) uncertainty about model capabilities and (2) the complexity of AI outputs, both of which remain problems even as model capabilities have improved. In tandem, others have enumerated best practices for designing human-AI interaction (Amershi et al., 2019; Yildirim et al., 2023). These principles span the lifecycle of model development, including the initial conceptualization (e.g., what values are imbued in the system, how privacy and fairness are handled), the model development process (e.g., what data are used, how the model is trained), the model deployment (e.g., how errors are handled, how users’ preferences are accounted for), and the interface layer where humans interact with the system (e.g., how users’ expectations are calibrated, how transparent the model’s outputs are) (Wright et al., 2020). While these principles provide a strong foundation, human-centered LLMs require adapting and extending them to address the unique characteristics of large language models, accounting for their open-ended generation capabilities, evolving social and relational roles, and wide-scale deployment.

In this section, we explore specific challenges that arise when designing HCLLMs. Although there are challenges beyond those covered here, our goal is to illustrate the different levels of consideration that design must attend to. These range from considerations at the level of the individual, such as scaffolding users’ interactions with models and model outputs, to societal-level questions about deploying models across diverse cultures and contexts.

Bridging the Gulf of Envisioning.

Within HCI, two foundational frameworks for design are @norman1988design’s gulfs of execution and evaluation. The gulf of execution arises when users cannot figure out how to perform the actions they want, whereas the gulf of evaluation arises when users cannot interpret the system’s output. These gulfs persist across the design of many technologies, from everyday physical objects like door handles to cutting-edge technologies. However, LLMs pose a new challenge for users: the “gulf of envisioning” (Subramonyam et al., 2024). This gulf refers to the distance between what users intend to do with an LLM and the prompts that they ultimately produce.

Why does this gulf of envisioning occur? Although LLMs have proven capable across a wide range of tasks, they still require users to guide how they are used. The de facto form of user interaction with LLMs is prompting, or providing textual instructions delineating the desired interactions. While prompting is ostensibly a simple task, users consistently struggle to write prompts: they underspecify instructions and forgo the appropriate level of detail, ultimately yielding unsatisfactory model outputs (Zamfirescu-Pereira et al., 2023). Furthermore, users must contend with the indeterminacy of model outputs as well as the “black-box” nature of how their inputs are transformed into outputs (Agrawala, 2023; Subramonyam et al., 2024; Yang et al., 2020).

Thus, one core challenge for designing human-centered LLMs is overcoming this gulf of envisioning. One approach focuses on designing interfaces that better scaffold users’ interactions with models. For example, recent works have introduced direct manipulation interfaces as an alternative to prompt writing (Arawjo et al., 2024; Masson et al., 2024; Wu et al., 2022), such as a visual programming environment that helps users create more complex prompt chains (Arawjo et al., 2024). Others have operationalized prompting pipelines as modular code components to provide a more systematic process for optimizing and refining inputs (Khattab et al., 2024). Other approaches go beyond intervening at the prompt level. One line of work has explored how to improve the model’s understanding of the user, such as by building better user models or improving the context that LLMs have about the user (Lei et al., 2026; Naous et al., 2025; Shaikh et al., 2025). These approaches aim to reduce the gap between users’ intentions and what they specify when interacting with an LLM. The improving capabilities of models benefit users only insofar as users can harness them. Efforts from both directions (making it easier for users to guide model outputs with less effort, and improving models’ understanding of users) are required to reduce this gulf of envisioning.
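To make the idea of prompting pipelines as modular components more concrete, the following is a minimal sketch of a prompt chain in Python. It is an illustration under our own assumptions, not the API of any of the cited systems: the names `PromptStep`, `run_chain`, and `echo_model` are hypothetical, and `echo_model` is a stand-in for a real LLM call so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PromptStep:
    """One reusable step in a prompt chain (hypothetical structure)."""
    name: str
    template: str  # "{input}" marks where the previous step's output goes

    def render(self, text: str) -> str:
        return self.template.format(input=text)


def run_chain(steps: List[PromptStep], user_input: str,
              model: Callable[[str], str]) -> List[str]:
    """Feed each step's output into the next, keeping every intermediate result."""
    outputs = []
    current = user_input
    for step in steps:
        current = model(step.render(current))
        outputs.append(current)
    return outputs


def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption): wraps the prompt in brackets."""
    return f"[{prompt}]"


steps = [
    PromptStep("outline", "Outline the key points of: {input}"),
    PromptStep("draft", "Write a paragraph from this outline: {input}"),
]
results = run_chain(steps, "remote work", echo_model)
for step, output in zip(steps, results):
    print(f"{step.name}: {output}")
```

Because every intermediate output is retained, a user or an interface built on top of such a chain can inspect and revise one step at a time rather than rewriting a single monolithic prompt.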

Interpreting LLM Outputs.

So far, our discussion of human-LLM interaction has mainly focused on how users communicate intent to models. We now turn our attention to how users evaluate and make sense of model outputs. As is, model outputs can be verbose and unstructured, making them difficult for users to understand and leaving users feeling overwhelmed (Jiang et al., 2023). To address this challenge, prior work in human-computer interaction has proposed new interfaces to support user sensemaking, that is, providing external representations that encode data for task-specific purposes. For example, works such as Sensecape (Suh et al., 2023) and Graphologue (Jiang et al., 2023) provide interactive visualizations that allow users to visually explore LLMs’ outputs in a structured format rather than having to parse large amounts of text.

Beyond helping users understand the content that LLMs output, we must also interrogate how users subsequently interpret and act upon these outputs. Questions surrounding user trust and reliance on AI systems are long-standing issues that predate the rise of LLMs (Papenmeier et al., 2022; Vasconcelos et al., 2023). Nonetheless, LLMs introduce new challenges to these established problems. The complexity of these models means that why and how they produce their outputs is inscrutable to experts and end users alike (Ameisen et al., 2025). However, user trust and reliance are not solely model-specific issues; the design of LLM applications also introduces new challenges. For example, Swoopes et al. (2025) demonstrate how the chat-based design of most user-facing LLMs can hide the inherent stochasticity of the models, making it difficult for users to calibrate trust. Furthermore, anthropomorphic features of models can foster user trust even when it may be unwarranted (Cohn et al., 2024). To combat these issues, researchers have proposed design features that can foster appropriate reliance, such as generating explanations, expressing uncertainty, and adding sources to claims (Kim et al., 2025; Zhou et al., 2023). These solutions are not perfect; for instance, significant technical challenges remain in ensuring that cited sources are correct and relevant. The hypothesized benefit of features like sourcing is their potential to engage users in slower, more careful thinking, but how to design LLMs that effectively empower users’ critical thinking about their outputs remains a key open question.
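As a small illustration of how an interface might surface stochasticity rather than hide it, the sketch below summarizes how often repeated samples for the same prompt agree. This is our own simplified example, not the method of any cited study: the function name `agreement_summary` is hypothetical, the sample responses are made up for illustration, and exact string matching after normalization is an assumption that would be too crude for free-form text.

```python
from collections import Counter
from typing import List, Tuple


def agreement_summary(samples: List[str]) -> Tuple[str, float]:
    """Return the most frequent normalized response and its share of all samples."""
    if not samples:
        raise ValueError("need at least one sample")
    # Normalize lightly so trivial differences in case or spacing do not count.
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


# Made-up responses standing in for several samples drawn for one prompt.
samples = ["Paris", "paris ", "Lyon", "Paris"]
answer, share = agreement_summary(samples)
print(f"{answer} ({share:.0%} of {len(samples)} samples)")
```

Showing a summary like this alongside a single chat response is one possible design choice; whether it actually helps users calibrate trust is an empirical question of the kind discussed above.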

Navigating Human-LLM Relationships.

Next, we consider the evolving role of LLMs relative to users. Traditionally, AI systems have functioned as assistants: tools that can augment human capabilities in restricted ways and that require users to delegate tasks. However, as model capabilities approach or exceed human performance in certain domains, visions of models as equal collaborators are becoming increasingly plausible. For instance, Shao et al. (2024) articulate a framework in which humans engage in bidirectional collaboration with LLM-based agents across a diverse set of tasks. Shifting roles from assistant to collaborator carries significant implications for the design of LLMs. For example, while assistants wait for explicit instructions to execute tasks, a model that serves as a collaborator might proactively suggest alternative approaches, challenge assumptions, or redirect problem-solving strategies, demanding fundamentally different interface affordances. These questions echo and build upon classic debates in HCI between direct manipulation interfaces, which afford users high degrees of control, and interface agents, which can act autonomously on users’ behalf (Shneiderman & Maes, 1997). How much control is required cannot be prescribed in advance; rather, it will be modulated by contextual factors such as the type of task or the user’s expertise.

Furthermore, as LLMs adopt more expansive roles beyond serving as functional tools in our lives, we must also consider the affective dimension of human-LLM interaction. Models are increasingly anthropomorphized, meaning that they are perceived as having human-like characteristics, a tendency that is exacerbated by the linguistic expressions in generated outputs (Cheng et al., 2024, 2025). Already, users are interacting with these models not only as coworkers or collaborators but also as friends, companions, and romantic partners (Pataranutaporn et al., 2025; Zhang et al., 2025). LLMs are also increasingly used in sensitive domains, such as providing emotional support or therapy (Zao-Sanders, 2025). These affective interactions are not always intentional. For example, Zhang et al. (2025) observed that companionship-oriented interactions emerge even when users are not primarily interacting with LLMs for emotionally laden tasks. While human-LLM relationships can have positive outcomes for users, including combating loneliness and reducing distress (De Freitas et al., 2025), there are also many documented adverse impacts, such as fostering emotional dependence (Laestadius et al., 2024; Pentina et al., 2023), encouraging harmful behavior (Dupré, 2024; R. Zhang et al., 2025), or, in the extreme, triggering intense mental health crises (Morrin et al., 2025). Balancing the benefits of human-LLM relationships against these very real harms is an open challenge that necessitates interdisciplinary interventions from technologists, social scientists, ethicists, and policymakers. As one example of work in this area, Kirk et al. (2025) called for the community to prioritize the socio-affective alignment of LLMs, accounting for how models fit into and actively shape individual users’ social and psychological ecosystems. Nonetheless, how exactly to design human-LLM relationships that promote long-term user well-being or other pro-social outcomes remains an open area for exploration.

Designing for Diverse Cultures and Contexts.

Finally, we turn our attention to the critical challenge of designing LLMs that are sensitive and adaptive to diverse cultural contexts. LLMs are not culturally neutral artifacts. As discussed in more detail in Data for HCLLMs, they are trained predominantly on data from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies (Mihalcea et al., 2025). As such, models inherently encode and propagate specific cultural values, communication styles, and social norms (Durmus et al., 2023; Naous et al., 2024; Ryan et al., 2024). These misalignments can render models unhelpful at best and culturally insensitive or actively harmful to users at worst.

HCI offers critical theoretical lenses that help us not only question the assumptions underlying LLM design but also identify more generative opportunities for design. For example, @bardzell2010feminist’s foundational work on Feminist HCI argues that we ought to attend to marginalized user groups when designing rather than focusing only on a presumed “default”. In this vein, critical theories of postcolonial computing and literature on decolonial practices have urged designers to decenter dominant Western perspectives and account for the plurality of worldviews and epistemologies that exist [@irani2010postcolonial; @alvarado2021decolonial]. These works push designers to move beyond simply “de-biasing” models and instead question the fundamental assumptions embedded within them: whose knowledge is centered, whose values are prioritized, and whose ways of being are marginalized? For example, users from different cultural backgrounds may have different expectations of an ideal AI system and its use cases, requiring that we localize models to these needs rather than assuming a single default [@ge2024culture; @qadri2025case; @phutane2025disability]. Systematically prioritizing knowledge from certain groups or cultures over others is not simply a design challenge for better human-LLM interaction; it can also produce quality-of-service differences with material impacts on users [@wilson2025no; @dev2022measures]. How to design, build, and evaluate LLMs that are genuinely context-aware and culturally adaptive remains a significant and vital open question for the field.

Agrawala, M. (2023). Unpredictable black boxes are terrible interfaces. ACM TechTalks.

Ameisen, E., Lindsey, J., Pearce, A., Gurnee, W., Turner, N. L., Chen, B., Citro, C., Abrahams, D., Carter, S., Hosmer, B., & others. (2025). Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread, 6. https://transformer-circuits.pub/2025/attribution-graphs/methods.html

Amershi, S., Weld, D., Vorvoreanu, M., Fourney, A., Nushi, B., Collisson, P., Suh, J., Iqbal, S., Bennett, P. N., Inkpen, K., & others. (2019). Guidelines for human-AI interaction. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–13.

Arawjo, I., Swoopes, C., Vaithilingam, P., Wattenberg, M., & Glassman, E. L. (2024). ChainForge: A visual toolkit for prompt engineering and LLM hypothesis testing. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1–18.

Cheng, M., Blodgett, S. L., DeVrio, A., Egede, L., & Olteanu, A. (2025). Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems. In W. Che, J. Nabende, E. Shutova, & M. T. Pilehvar (Eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.

Cheng, M., Gligoric, K., Piccardi, T., & Jurafsky, D. (2024). AnthroScore: A Computational Linguistic Measure of Anthropomorphism. In Y. Graham & M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 807–825). Association for Computational Linguistics. https://aclanthology.org/2024.eacl-long.49

Cohn, M., Pushkarna, M., Olanubi, G. O., Moran, J. M., Padgett, D., Mengesha, Z., & Heldreth, C. (2024). Believing anthropomorphism: examining the role of anthropomorphic cues on trust in large language models. Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 1–15.

De Freitas, J., Oğuz-Uğuralp, Z., Uğuralp, A. K., & Puntoni, S. (2025). AI companions reduce loneliness. Journal of Consumer Research, ucaf040.

Dupré, M. H. (2024, December). AI Chatbots Are Encouraging Teens to Engage in Self-Harm. Futurism. https://futurism.com/ai-chatbots-teens-self-harm

Durmus, E., Nguyen, K., Liao, T. I., Schiefer, N., Askell, A., Bakhtin, A., Chen, C., Hatfield-Dodds, Z., Hernandez, D., Joseph, N., & others. (2023). Towards measuring the representation of subjective global opinions in language models. ArXiv Preprint, abs/2306.16388. https://arxiv.org/abs/2306.16388

Jiang, P., Rayan, J., Dow, S. P., & Xia, H. (2023). Graphologue: Exploring large language model responses with interactive diagrams. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 1–20.

Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., A, S. V., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., Miller, H., Zaharia, M., & Potts, C. (2024). DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines. The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=sY5N0zY5Od

Kim, S. S., Vaughan, J. W., Liao, Q. V., Lombrozo, T., & Russakovsky, O. (2025). Fostering appropriate reliance on large language models: The role of explanations, sources, and inconsistencies. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 1–19.

Kirk, H. R., Gabriel, I., Summerfield, C., Vidgen, B., & Hale, S. A. (2025). Why human–AI relationships need socioaffective alignment. Humanities and Social Sciences Communications, 12(1), 1–9.

Laestadius, L., Bishop, A., Gonzalez, M., Illenčík, D., & Campos-Castillo, C. (2024). Too human and not human enough: A grounded theory analysis of mental health harms from emotional dependence on the social chatbot Replika. New Media & Society, 26(10), 5923–5941.

Lei, Y., Wang, T., Lian, J., Hu, Z., Lian, D., & Xie, X. (2026). HumanLLM: Towards Personalized Understanding and Simulation of Human Nature. arXiv Preprint arXiv:2601.15793.

Masson, D., Malacria, S., Casiez, G., & Vogel, D. (2024). DirectGPT: A direct manipulation interface to interact with large language models. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1–16.

Mihalcea, R., Ignat, O., Bai, L., Borah, A., Chiruzzo, L., Jin, Z., Kwizera, C., Nwatu, J., Poria, S., & Solorio, T. (2025). Why AI Is WEIRD and Shouldn’t Be This Way: Towards AI for Everyone, with Everyone, by Everyone. Proceedings of the AAAI Conference on Artificial Intelligence, 39(27), 28657–28670.

Morrin, H., Nicholls, L., Levin, M., Yiend, J., Iyengar, U., DelGuidice, F., Bhattacharyya, S., MacCabe, J., Tognin, S., Twumasi, R., & others. (2025). Delusions by design? How everyday AIs might be fuelling psychosis (and what can be done about it). OSF.

Naous, T., Laban, P., Xu, W., & Neville, J. (2025). Flipping the Dialogue: Training and Evaluating User Language Models. arXiv Preprint arXiv:2510.06552.

Naous, T., Ryan, M. J., Ritter, A., & Xu, W. (2024). Having Beer after Prayer? Measuring Cultural Bias in Large Language Models. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 16366–16393). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.862

Papenmeier, A., Kern, D., Englebienne, G., & Seifert, C. (2022). It’s complicated: The relationship between user trust, model accuracy and explanations in AI. ACM Transactions on Computer-Human Interaction (TOCHI), 29(4), 1–33.

Pataranutaporn, P., Karny, S., Archiwaranguprok, C., Albrecht, C., Liu, A. R., & Maes, P. (2025). “My Boyfriend is AI”: A Computational Analysis of Human-AI Companionship in Reddit’s AI Community. arXiv Preprint arXiv:2509.11391.

Pentina, I., Hancock, T., & Xie, T. (2023). Exploring relationship development with social chatbots: A mixed-method study of Replika. Computers in Human Behavior, 140, 107600.

Ryan, M. J., Held, W., & Yang, D. (2024). Unintended Impacts of LLM Alignment on Global Representation. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 16121–16140). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.853

Shaikh, O., Sapkota, S., Rizvi, S., Horvitz, E., Park, J. S., Yang, D., & Bernstein, M. S. (2025). Creating General User Models from Computer Use. Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. https://doi.org/10.1145/3746059.3747722

Shao, Y., Samuel, V., Jiang, Y., Yang, J., & Yang, D. (2024). Collaborative gym: A framework for enabling and evaluating human-agent collaboration. arXiv Preprint arXiv:2412.15701.

Shneiderman, B., & Maes, P. (1997). Direct manipulation vs. interface agents. Interactions, 4(6), 42–61.

Subramonyam, H., Pea, R., Pondoc, C., Agrawala, M., & Seifert, C. (2024). Bridging the gulf of envisioning: Cognitive challenges in prompt based interactions with LLMs. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1–19.

Suh, S., Min, B., Palani, S., & Xia, H. (2023). Sensecape: Enabling multilevel exploration and sensemaking with large language models. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 1–18.

Swoopes, C., Holloway, T., & Glassman, E. L. (2025). The Impact of Revealing Large Language Model Stochasticity on Trust, Reliability, and Anthropomorphization. arXiv Preprint arXiv:2503.16114.

Vasconcelos, H., Jörke, M., Grunde-McLaughlin, M., Gerstenberg, T., Bernstein, M. S., & Krishna, R. (2023). Explanations can reduce overreliance on AI systems during decision-making. Proceedings of the ACM on Human-Computer Interaction, 7(CSCW1), 1–38.

Wright, A. P., Wang, Z. J., Park, H., Guo, G., Sperrle, F., El-Assady, M., Endert, A., Keim, D., & Chau, D. H. (2020). A comparative analysis of industry human-AI interaction guidelines. arXiv Preprint arXiv:2010.11761.

Wu, T., Terry, M., & Cai, C. J. (2022). AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts. In S. D. J. Barbosa, C. Lampe, C. Appert, D. A. Shamma, S. M. Drucker, J. R. Williamson, & K. Yatani (Eds.), CHI ’22: CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA, 29 April 2022 - 5 May 2022 (p. 385:1-385:22). ACM. https://doi.org/10.1145/3491102.3517582

Yang, Q., Steinfeld, A., Rosé, C. P., & Zimmerman, J. (2020). Re-examining Whether, Why, and How Human-AI Interaction Is Uniquely Difficult to Design. In R. Bernhaupt, F. “Floyd” Mueller, D. Verweij, J. Andres, J. McGrenere, A. Cockburn, I. Avellino, A. Goguey, P. Bjøn, S. Zhao, B. P. Samson, & R. Kocielnik (Eds.), CHI ’20: CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, April 25-30, 2020 (pp. 1–13). ACM. https://doi.org/10.1145/3313831.3376301

Yildirim, N., Pushkarna, M., Goyal, N., Wattenberg, M., & Viégas, F. (2023). Investigating how practitioners use human-AI guidelines: A case study on the People + AI Guidebook. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1–13.

Zamfirescu-Pereira, J. D., Wong, R. Y., Hartmann, B., & Yang, Q. (2023). Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3544548.3581388

Zao-Sanders, M. (2025). How People Are Really Using Gen AI in 2025. Harvard Business Review. https://hbr.org/2025/04/how-people-are-really-using-gen-ai-in-2025

Zhang, R., Li, H., Meng, H., Zhan, J., Gan, H., & Lee, Y.-C. (2025). The dark side of AI companionship: A taxonomy of harmful algorithmic behaviors in human-AI relationships. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 1–17.

Zhang, Y., Zhao, D., Hancock, J. T., Kraut, R., & Yang, D. (2025). The Rise of AI Companions: How Human-Chatbot Relationships Influence Well-Being. arXiv Preprint arXiv:2506.12605.

Zhou, K., Jurafsky, D., & Hashimoto, T. (2023). Navigating the Grey Area: How Expressions of Uncertainty and Overconfidence Affect Language Models. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 5506–5524). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.335
