Addressing these design challenges for HCLLMs requires coordinated interventions and across the entire LLM development pipeline—from the data curation and model training processes to the user interface design. Already when introducing these challenges, we have discussed interface-level interventions. For example, at the model level, improving model capabilities via post-training, such as instruction tuning and preference learning (as discussed in NLP for HCLLMs), can help models better interpret underspecified prompts, helping bridge the gulf of envisioning. Decisions that developers make around what sources to include in the training data will affect models’ capabilities across different cultures and context (Data for HCLLMs). Beyond the model training pipeline, choices around the technical design of the system, such as how to handle source attribution or what safety guardrails are put in place (Responsible Human-Centered LLMs), will mold users’ interactions, impacting relationships that may form. Nonetheless, the critical point of this chapter is that these technical interventions are most effective when informed by and evaluated against the human-centered principles outlined above. The challenges we have charted are fundamentally sociotechnical problems; they cannot be solved by better models alone, nor by better interfaces alone, but through the careful co-design of both.
Case Study: Motivating Physical Activity with HCLLMs
To make the human-centered LLM design process concrete, we present a case study based on two related systems for LLM physical activity coaching: (1) GPTCoach (Jörke et al., 2025ReferenceJörke, M., Sapkota, S., Warkenthien, L., Vainio, N., Schmiedmayer, P., Brunskill, E., & Landay, J. A. (2025). GPTCoach: Towards LLM-Based Physical Activity Coaching. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3706598.3713819), an LLM-based coach that implements motivational interviewing, and (2) Bloom (Jörke et al., 2026ReferenceJörke, M., Genç, D., Teutschbein, V., Sapkota, S., Chung, S., Schmiedmayer, P., Campero, M. I., King, A. C., Brunskill, E., & Landay, J. A. (2026). Bloom: Designing for LLM-Augmented Behavior Change Interactions. Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems.), a mobile application that integrates GPTCoach with established, UI-based behavior change interactions. Physical inactivity is a major public health concern, with large portions of the population falling short of recommended guidelines for physical activity. LLMs present a promising opportunity to combine the scalability of existing mobile health interventions with the personalization of human coaching. Through this case study of designing an LLM health coach, we illustrate how a human-centered process can help realize this opportunity.
Let us start by considering the status quo approach that treats training a good LLM health coach as an instruction-following problem. The approach is to collect user data (e.g., common barriers to activity, wearable data), feed it to a model, and have the model generate personalized nudges, exercise plans, or advice. This framing is an intuitive starting point, and it maps cleanly onto standard LLM training pipelines. However, it also implicitly encodes a set of assumptions: that users always want or need recommendations and advice, that more information yields better outcomes, and that the primary bottleneck is the model’s ability to produce accurate health advice.
GPTCoach and Bloom instead took a human-centered approach, exemplifying the following three concepts discussed in the chapter:
-
Working with stakeholders to shape the system. Rather than starting from the status quo, Jörke et al. (2025)ReferenceJörke, M., Sapkota, S., Warkenthien, L., Vainio, N., Schmiedmayer, P., Brunskill, E., & Landay, J. A. (2025). GPTCoach: Towards LLM-Based Physical Activity Coaching. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3706598.3713819 conducted formative interviews with health experts and prospective end-users. These interviews revealed that health experts emphasized the importance of facilitative, non-prescriptive support that refrains from giving unsolicited advice—a mode of engagement that runs counter to how a standard LLM chatbot operates. Experts described their role as staying “in the passenger seat” and helping clients surface their own goals and barriers, rather than telling them what to do. This learning fundamentally reframed the problem to be solved, the system that was designed to address it, and the evaluation criteria. Moreover, engagement with experts did not end after the formative study. After the lab study, the authors hired trained experts to code all transcripts to measure adherence to motivational interviewing. Expert coding indicated that GPTCoach used conversational strategies that were consistent with motivational interviewing or neutral over 93% of the time, but qualitative feedback revealed important gaps compared to skilled human practitioners, highlighting specific areas for improvement that would not have surfaced without expert involvement.
-
Focusing on the interaction. A second theme from this case study is that model capability, while important, is not sufficient in and of itself for a successful human-centered system. While the authors could have devoted significant efforts to training, prompt chaining proved sufficient to enable LLM-based motivational interviewing. This created space for the authors to focus on how different aspects of the interaction design could shape users’ experiences in substantive ways. For example, GPTCoach’s non-prescriptive and non-judgmental communication style had a greater impact on participants’ overall experience than its analysis of their health data. In Bloom, the authors represented the coaching agent as a bee avatar named Beebo. Beebo’s capabilities are the same as those of a generic chatbot, but it’s representation substantially changed the nature of the interaction. Many participants resonated with the avatar and described Beebo in relational terms, leading to increased engagement and adherence. Beebo’s clear role as a “coach,” not a general purpose assistant, helped set expectations when Beebo redirected conversations back towards physical activity, or when guardrails triggered refusals for medical advice. This dynamic points to the design challenge of navigating human-LLM relationships from Defining the What. Taken together, these design choices reflect a move beyond thinking narrowly about model performance toward a more holistic understanding of how users will interact with these systems.
-
Evaluating with users. Finally, this case study showcases how the different methods from HCI offer new ways of knowing for evaluating HCLLMs. In a four-week randomized field study (N=54) comparing Bloom to a no-LLM control, the authors used a mixed methods evaluation, synthesizing insights across qualitative coding of participant interviews, survey data, app usage logs, and wearable data. The quantitative data revealed a 5x increase in overall app usage time in the LLM condition, while mean physical activity levels stayed comparable in both conditions. Meanwhile, survey measures revealed substantial shifts in physical activity mindsets and satisfaction. Qualitative coding added rich nuance to these findings, with participants in the LLM condition reporting stronger beliefs that activity was beneficial to their health, greater enjoyment of exercise, an expanded appreciation of “what counts” as activity, and increased self-compassion when goals were missed. Most importantly, participants attributed these mindset shifts to interactions with Beebo that kept them in control of their own behavior change, such as finding flexible alternatives when plans fell through. More broadly, this illustrates how qualitative “thick” understanding can surface the why behind user experiences, which are insights that a purely quantitative evaluation might miss.
Overall, this case study demonstrates how critically engaging with humans early in the design process can reframe the problem being solved and the evaluation target, leading to qualitatively different solutions that better serve human needs. Notably, the most consequential outcomes in both studies—positive changes in participants’ beliefs about physical activity and their own abilities—emerged from the design process rather than from improvements in model capability.