Steerable HCLLMs

Current Approaches to Steerability

Steerability is the second dimension we highlight for HCLLM deployment. It refers to the degree to which a model can be aligned along a particular dimension (Miehling et al., 2025), such as preferences, norms, or user constraints (Chen et al., 2025). Static notions of alignment aim to produce a single globally acceptable behavior. Steerability instead emphasizes conditional control: the ability to modulate LLM outputs according to the needs of particular users. This property is especially important for HCLLMs, which are designed to interact with diverse users embedded in heterogeneous social, cultural, and institutional contexts.

Steerability spans multiple dimensions. First, personalization aims for an LLM’s outputs to adaptively reflect the preferences of individual users or groups of users, along with their prior knowledge, goals, and needs (Tseng et al., 2024). Personalized models should provide more relevant recommendations (Hu et al., 2024) and be calibrated to each user’s preferred writing style (Zhang et al., 2024), learning style (Park et al., 2024), and norms around privacy (Asthana et al., 2024; Shao et al., 2024) and social behavior (S. Li et al., 2024). A user’s preferences can be derived from explicit feedback, as in pairwise preference datasets, or from implicit feedback such as historical interaction data. For a more in-depth discussion of personalization methods, see Personalization.
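
One common way to turn explicit pairwise feedback into a user-specific preference signal is a Bradley-Terry reward model, where the probability that response a is preferred over response b is sigmoid(r(a) - r(b)). The sketch below is a minimal, dependency-free illustration that assumes each response has already been reduced to a small, hypothetical feature vector (here, formality and verbosity). Real personalized reward models typically learn over model representations rather than hand-picked features, and the function names are illustrative, not taken from any cited work.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
# One explicit judgment: (features of preferred response, features of rejected response).
Comparison = Tuple[Vector, Vector]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_user_reward(
    comparisons: List[Comparison],
    dim: int,
    lr: float = 0.5,
    epochs: int = 200,
    l2: float = 0.01,
) -> Vector:
    """Fit a linear reward r(x) = w . phi(x) for a single user.

    Bradley-Terry model: P(a preferred over b) = sigmoid(r(a) - r(b)).
    Runs batch gradient ascent on the L2-regularized mean log-likelihood.
    """
    w = [0.0] * dim
    if not comparisons:
        return w  # no feedback yet: stay neutral
    n = len(comparisons)
    for _ in range(epochs):
        grad = [0.0] * dim
        for preferred, rejected in comparisons:
            diff = [p - r for p, r in zip(preferred, rejected)]
            # d/dw log sigmoid(w . diff) = (1 - sigmoid(w . diff)) * diff
            scale = 1.0 - sigmoid(dot(w, diff))
            for i in range(dim):
                grad[i] += scale * diff[i]
        w = [wi + lr * (gi / n - l2 * wi) for wi, gi in zip(w, grad)]
    return w


# Hypothetical features per response: (formality, verbosity), each in [0, 1].
user_feedback = [
    ([0.2, 0.3], [0.9, 0.8]),
    ([0.1, 0.5], [0.7, 0.4]),
    ([0.3, 0.2], [0.8, 0.9]),
]
w_user = fit_user_reward(user_feedback, dim=2)
candidates = {"casual": [0.2, 0.3], "formal": [0.9, 0.7]}
print(max(candidates, key=lambda name: dot(w_user, candidates[name])))  # casual
```

Fitting one weight vector per user is the simplest design. Methods for heterogeneous preferences, such as PAL (D. Chen et al., 2024), aim to share structure across users rather than fitting each one in isolation, which matters when individual users provide only a handful of comparisons.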

Related to personalization is persona alignment, or role play. Here, the goal is for HCLLMs to adopt consistent identities or roles, such as software developers, expert tutors, empathetic counselors, or skeptical reviewers, that remain stable across interactions (J. Li et al., 2024; Samuel et al., 2024; Shanahan et al., 2023). Persona alignment is especially important in multi-agent settings where multiple LLM personas interact and collaborate (Guo et al., 2024; J. S. Park et al., 2023).

Beyond personalization and role play, steerable HCLLMs should be able to understand and adapt to low-resource languages, dialects, and sociolinguistic varieties (Ziems et al., 2023). We refer to this target as linguistic alignment. This form of steerability is critical for equitable access, as models trained predominantly on high-resource, standardized corpora often underperform for marginalized linguistic communities (see Data Representation, Bias and Ethics). Finally, cultural alignment means that models can be steered to reflect the norms, values, and narratives of particular communities or demographic groups (Santurkar et al., 2023). Unlike personalization, which targets individuals, cultural alignment operates at the level of shared practices and collective meaning-making.

A range of technical mechanisms supports steerability. At inference time, models may be steered through in-context learning (Wies et al., 2023), prompt engineering, or output filtering. Post-hoc control methods can condition generation on specific attributes or enforce constraints through the decoding strategy. More structurally, personalized reward models and fine-tuning procedures can encode user- or group-specific objectives into model parameters (D. Chen et al., 2024).
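
Decoding-time control can be illustrated by editing next-token scores before each token is chosen: a hard constraint removes forbidden tokens entirely, while an additive bias softly favors tokens associated with a target attribute. The sketch below assumes a hypothetical toy bigram "model" in place of an LLM and uses greedy decoding. It shows the mechanism only; it is not any particular published control method.

```python
from typing import Callable, Dict, FrozenSet, List, Mapping, Optional

# Any function from a token prefix to next-token logits can stand in for the LM.
LogitFn = Callable[[List[str]], Dict[str, float]]

# Hypothetical toy bigram scores, keyed by the previous token.
TOY_BIGRAMS: Dict[str, Dict[str, float]] = {
    "<bos>": {"hey": 2.0, "good": 1.5},
    "hey": {"dude": 2.0, "there": 1.0},
    "good": {"morning": 2.0},
    "dude": {"<eos>": 1.0},
    "there": {"<eos>": 1.0},
    "morning": {"<eos>": 1.0},
}


def toy_lm(prefix: List[str]) -> Dict[str, float]:
    return TOY_BIGRAMS[prefix[-1]]


def steer_logits(
    logits: Mapping[str, float],
    banned: FrozenSet[str],
    bias: Mapping[str, float],
) -> Dict[str, float]:
    steered = {}
    for token, score in logits.items():
        if token in banned:
            continue  # hard constraint: this token can never be emitted
        steered[token] = score + bias.get(token, 0.0)  # soft attribute control
    return steered


def greedy_decode(
    lm: LogitFn,
    prefix: List[str],
    max_new_tokens: int = 10,
    banned: FrozenSet[str] = frozenset(),
    bias: Optional[Mapping[str, float]] = None,
    eos: str = "<eos>",
) -> List[str]:
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        steered = steer_logits(lm(tokens), banned, bias or {})
        if not steered:
            break  # every candidate was banned; stop rather than violate the constraint
        best = max(steered, key=lambda t: steered[t])
        if best == eos:
            break
        tokens.append(best)
    return tokens


print(greedy_decode(toy_lm, ["<bos>"]))                              # ['<bos>', 'hey', 'dude']
print(greedy_decode(toy_lm, ["<bos>"], banned=frozenset({"dude"})))  # ['<bos>', 'hey', 'there']
print(greedy_decode(toy_lm, ["<bos>"], bias={"good": 1.0}))          # ['<bos>', 'good', 'morning']
```

Generation libraries expose comparable knobs (for instance, Hugging Face Transformers' `generate` accepts `bad_words_ids` for hard bans), though the available options differ across libraries and versions.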

Steerability is still fundamentally constrained by the representational biases embedded in pre-training and post-training data (Mihalcea et al., 2025). If certain identities, linguistic forms, or cultural narratives are underrepresented or stereotyped in the training corpus, then prompt-based steering may have limited expressive range. In this sense, steerability is not just a matter of control at inference time; it is rooted much earlier, in data provenance (Data Provenance) and evaluation (Evaluation). Improving steerability therefore starts with measuring where biases arise, localizing their sources in the data pipeline, and redesigning collection and annotation practices accordingly.

Looking Forward

Looking forward, we envision several research directions that extend current notions of steerability. First, we consider continual learning for preference, cultural, and pluralistic alignment. Continual preference learning would allow models to adapt dynamically to evolving user needs by tracking implicit cues in interaction patterns or contextual data (Shaikh et al., 2025). Such systems often rely on persistent memory, such as stored chat histories or retrieval-augmented generation over past interactions. A central challenge is achieving this adaptivity while preserving user autonomy and privacy (Zhang et al., 2025). For cultural and pluralistic alignment, continual learning could likewise let HCLLMs accommodate evolving norms and intra-group disagreement rather than optimizing toward a static representation of culture or user values. One way to elicit these evolving norms is to facilitate community-centered discussion and debate, following methods like STELA (Bergman et al., 2024). Rather than treating communities as homogeneous preference aggregates, pluralistic alignment frameworks should model disagreement as a first-class signal.
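
One simple way to let a preference estimate follow evolving needs is an exponentially weighted moving average over implicit cues, so that recent behavior outweighs older behavior. The sketch below assumes a hypothetical upstream step that has already turned interactions into (attribute, signal) pairs with signals in [-1, 1], for example a user repeatedly trimming long answers. It stores only per-attribute aggregates rather than raw chat history, which is one possible way to limit what is retained, though by itself it does not guarantee privacy.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PreferenceMemory:
    """Exponentially weighted estimate of a user's preferences, per attribute.

    `decay` is the share of the old estimate kept at each update: values near 1
    adapt slowly, values near 0 follow the most recent cue almost immediately.
    """

    decay: float = 0.8
    scores: Dict[str, float] = field(default_factory=dict)

    def observe(self, attribute: str, signal: float) -> None:
        if not -1.0 <= signal <= 1.0:
            raise ValueError("signal must lie in [-1, 1]")
        previous = self.scores.get(attribute, 0.0)
        self.scores[attribute] = self.decay * previous + (1.0 - self.decay) * signal

    def preference(self, attribute: str) -> float:
        # Unseen attributes default to 0.0, i.e. no learned preference.
        return self.scores.get(attribute, 0.0)


memory = PreferenceMemory()
for _ in range(3):
    memory.observe("verbosity", 1.0)   # early on, the user rewards detailed answers
print(round(memory.preference("verbosity"), 3))  # 0.488
for _ in range(5):
    memory.observe("verbosity", -1.0)  # later, the user keeps asking for shorter replies
print(round(memory.preference("verbosity"), 3))  # -0.512
```

The single `decay` parameter trades stability against responsiveness; a deployed system would also need a way for users to inspect and reset what has been inferred, in line with the autonomy concerns above.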

Pluralistic alignment raises both technical and political challenges. On the technical side, post-training can induce mode collapse, in which heterogeneous group preferences and opinions are compressed in a lossy manner, suppressing minority viewpoints and preserving only the majority preferences for a given group (Bisbee et al., 2024; Durmus et al., 2024; Röttger et al., 2024). To elicit more diverse generations, some inference-time methods use iterative prompting or multi-LLM collaboration (Feng et al., 2024; Hayati et al., 2024), but the resulting outputs are not calibrated to real-world distributions of perspectives. Other methods add explicit identity terms for different user groups to the prompt (Giulianelli et al., 2023), but identity-coded names lead models to draw on stereotypical representations or out-group perceptions rather than in-group perspectives (Wang et al., 2025). To address mode collapse in a manner that preserves in-group perspectives, it may be necessary to update LLM priors implicitly through distributionally aligned in-context examples, following methods like spectrum tuning (Sorensen et al., 2025).
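
Mode collapse and distributional calibration become measurable once a model's answer distribution on a survey-style question is compared with the empirical distribution of a reference group. The sketch below uses two simple diagnostics: normalized entropy (how spread out the model's answers are) and a normalized earth mover's distance over ordered answer options, which is similar in spirit to the Wasserstein-based representativeness score of Santurkar et al. (2023). The five-point distributions in the usage lines are invented for illustration, not results from any cited study.

```python
import math
from typing import Sequence


def _validate(p: Sequence[float]) -> None:
    if len(p) < 2:
        raise ValueError("need at least two answer options")
    if any(x < 0 for x in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("probabilities must be non-negative and sum to 1")


def normalized_entropy(p: Sequence[float]) -> float:
    """Shannon entropy divided by its maximum log(N): 0 = a single answer, 1 = uniform."""
    _validate(p)
    h = -sum(x * math.log(x) for x in p if x > 0)
    return h / math.log(len(p))


def ordinal_emd(p: Sequence[float], q: Sequence[float]) -> float:
    """Earth mover's distance between distributions over the same N ordered options.

    With unit spacing between adjacent options, the 1-D distance is the sum of
    absolute CDF differences; dividing by N - 1 rescales it to [0, 1].
    """
    _validate(p)
    _validate(q)
    if len(p) != len(q):
        raise ValueError("distributions must cover the same options")
    cdf_p = cdf_q = total = 0.0
    for pi, qi in zip(p[:-1], q[:-1]):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total / (len(p) - 1)


# Illustrative 5-point Likert distributions (invented numbers).
group = [0.10, 0.20, 0.30, 0.25, 0.15]  # reference human group
model = [0.00, 0.05, 0.90, 0.05, 0.00]  # collapsed onto the middle option
print(round(normalized_entropy(group), 2), round(normalized_entropy(model), 2))
print(round(ordinal_emd(group, model), 4))  # 0.2125
```

Low entropy alone is not evidence of miscalibration, since a group can genuinely agree; the distance to that group's own answer distribution is the more direct check.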

Achieving localized, domain-specific alignment processes that enable communities and stakeholders to meaningfully shape model behavior is a political challenge as well as a methodological one (Delgado et al., 2023). Methodologically, the participatory HCI approaches covered in Participatory Approaches offer a starting point. Politically, however, most model developers lack incentives to share control with communities (Gabriel, 2020). Current alignment pipelines are centralized within the small set of companies that have the resources to develop LLMs, and academic institutions and governments can likewise centralize decision-making across the LLM development pipeline (Suresh et al., 2024). Less centralized alternatives exist. For example, Masakhane (Orife et al., 2020) is a network of researchers working on NLP for African languages, with community involvement at every stage, from dataset creation and annotation to model training. EleutherAI is a distributed, volunteer-driven research collective that has created open pre-training corpora (Gao et al., 2021), models (Black et al., 2022), and scaling checkpoints (Biderman et al., 2023). BigScience was a research effort in which over 1,000 researchers from academia and industry coordinated through Hugging Face to build BLOOM (Workshop et al., 2022), a 176B-parameter multilingual autoregressive language model, with decision-making steps clearly and publicly documented.

These initiatives illustrate that steerability does not need to be confined to top-down fine-tuning interfaces or proprietary alignment pipelines. Distributed model development allows communities to steer models from the ground up, establishing linguistic resources, value functions, and development practices that ultimately shape downstream behavior. At the same time, decentralization introduces its own tensions, including coordination costs, uneven resource distribution, and challenges of accountability. Open and community-led efforts may broaden participation, but they must still grapple with questions of safety, quality control, and transparent maintenance of HCLLMs. In the following sections, we will discuss safety (Safe HCLLMs) and interpretability (Interpretable and Explainable HCLLMs), as well as the tensions between these objectives.

Asthana, S., Im, J., Chen, Z., & Banovic, N. (2024). “I know even if you don’t tell me”: Understanding Users’ Privacy Preferences Regarding AI-based Inferences of Sensitive Information for Personalization. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1–21.

Bergman, S., Marchal, N., Mellor, J., Mohamed, S., Gabriel, I., & Isaac, W. (2024). STELA: a community-centred approach to norm elicitation for AI alignment. Scientific Reports, 14(1), 6616.

Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., & others. (2023). Pythia: A suite for analyzing large language models across training and scaling. International Conference on Machine Learning, 2397–2430.

Bisbee, J., Clinton, J. D., Dorff, C., Kenkel, B., & Larson, J. M. (2024). Synthetic replacements for human survey data? The perils of large language models. Political Analysis, 32(4), 401–416.

Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., & others. (2022). GPT-NeoX-20B: An open-source autoregressive language model. Proceedings of BigScience Episode #5: Workshop on Challenges & Perspectives in Creating Large Language Models, 95–136.

Chen, D., Chen, Y., Rege, A., & Vinayak, R. K. (2024). PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences. NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability. https://openreview.net/forum?id=wVg2kVQOzq

Chen, K., He, Z., Shi, T., & Lerman, K. (2025). STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models. arXiv Preprint arXiv:2505.20645.

Delgado, F., Yang, S., Madaio, M., & Yang, Q. (2023). The participatory turn in AI design: Theoretical foundations and the current state of practice. Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, 1–23.

Durmus, E., Nguyen, K., Liao, T., Schiefer, N., Askell, A., Bakhtin, A., Chen, C., Hatfield-Dodds, Z., Hernandez, D., Joseph, N., Lovitt, L., McCandlish, S., Sikder, O., Tamkin, A., Thamkul, J., Kaplan, J., Clark, J., & Ganguli, D. (2024). Towards Measuring the Representation of Subjective Global Opinions in Language Models. First Conference on Language Modeling. https://openreview.net/forum?id=zl16jLb91v

Feng, S., Sorensen, T., Liu, Y., Fisher, J., Park, C. Y., Choi, Y., & Tsvetkov, Y. (2024). Modular Pluralism: Pluralistic Alignment via Multi-LLM Collaboration. In Y. Al-Onaizan, M. Bansal, & Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 4151–4171). Association for Computational Linguistics. https://aclanthology.org/2024.emnlp-main.240

Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and Machines, 30(3), 411–437.

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., & Leahy, C. (2021). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv Preprint arXiv:2101.00027. https://arxiv.org/abs/2101.00027

Giulianelli, M., Baan, J., Aziz, W., Fernández, R., & Plank, B. (2023). What Comes Next? Evaluating Uncertainty in Neural Text Generators Against Human Production Variability. Conference on Empirical Methods in Natural Language Processing. https://api.semanticscholar.org/CorpusID:258823318

Guo, T., Chen, X., Wang, Y., Chang, R., Pei, S., Chawla, N. V., Wiest, O., & Zhang, X. (2024). Large language model based multi-agents: a survey of progress and challenges. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 8048–8057.

Hayati, S. A., Lee, M., Rajagopal, D., & Kang, D. (2024). How Far Can We Extract Diverse Perspectives from Large Language Models? In Y. Al-Onaizan, M. Bansal, & Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 5336–5366). Association for Computational Linguistics. https://aclanthology.org/2024.emnlp-main.306

Hu, J., Xia, W., Zhang, X., Fu, C., Wu, W., Huan, Z., Li, A., Tang, Z., & Zhou, J. (2024). Enhancing sequential recommendation via LLM-based semantic embedding learning. Companion Proceedings of the ACM Web Conference 2024, 103–111.

Li, J., Peris, C., Mehrabi, N., Goyal, P., Chang, K.-W., Galstyan, A., Zemel, R., & Gupta, R. (2024). The steerability of large language models toward data-driven personas. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 7290–7305.

Li, S., Sun, T., Cheng, Q., & Qiu, X. (2024). Agent Alignment in Evolving Social Norms. https://arxiv.org/abs/2401.04620

Miehling, E., Desmond, M., Ramamurthy, K. N., Daly, E. M., Varshney, K. R., Farchi, E., Dognin, P., Rios, J., Bouneffouf, D., Liu, M., & others. (2025). Evaluating the prompt steerability of large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 7874–7900).

Mihalcea, R., Ignat, O., Bai, L., Borah, A., Chiruzzo, L., Jin, Z., Kwizera, C., Nwatu, J., Poria, S., & Solorio, T. (2025). Why AI Is WEIRD and Shouldn’t Be This Way: Towards AI for Everyone, with Everyone, by Everyone. Proceedings of the AAAI Conference on Artificial Intelligence, 39(27), 28657–28670.

Orife, I., Kreutzer, J., Sibanda, B., Whitenack, D., Siminyu, K., Martinus, L., Ali, J. T., Abbott, J., Marivate, V., Kabongo, S., & others. (2020). Masakhane–machine translation for Africa. arXiv Preprint arXiv:2003.11529.

Park, J. S., O’Brien, J., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 1–22.

Park, M., Kim, S., Lee, S., Kwon, S., & Kim, K. (2024). Empowering personalized learning through a conversation-based tutoring system with student modeling. Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 1–10.

Röttger, P., Hofmann, V., Pyatkin, V., Hinck, M., Kirk, H., Schütze, H., & Hovy, D. (2024). Political compass or spinning arrow? Towards more meaningful evaluations for values and opinions in large language models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 15295–15311.

Samuel, V., Zou, H. P., Zhou, Y., Chaudhari, S., Kalyan, A., Rajpurohit, T., Deshpande, A., Narasimhan, K., & Murahari, V. (2024). PersonaGym: Evaluating persona agents and LLMs. arXiv Preprint arXiv:2407.18416.

Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P., & Hashimoto, T. (2023). Whose Opinions Do Language Models Reflect? In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, & J. Scarlett (Eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (Vol. 202, pp. 29971–30004). PMLR. https://proceedings.mlr.press/v202/santurkar23a.html

Shaikh, O., Sapkota, S., Rizvi, S., Horvitz, E., Park, J. S., Yang, D., & Bernstein, M. S. (2025). Creating General User Models from Computer Use. Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. https://doi.org/10.1145/3746059.3747722

Shanahan, M., McDonell, K., & Reynolds, L. (2023). Role play with large language models. Nature, 623(7987), 493–498.

Shao, Y., Li, T., Shi, W., Liu, Y., & Yang, D. (2024). PrivacyLens: Evaluating privacy norm awareness of language models in action. Advances in Neural Information Processing Systems, 37, 89373–89407.

Sorensen, T., Newman, B., Moore, J., Park, C., Fisher, J., Mireshghallah, N., Jiang, L., & Choi, Y. (2025). Spectrum tuning: Post-training for distributional coverage and in-context steerability. arXiv Preprint arXiv:2510.06084.

Suresh, H., Tseng, E., Young, M., Gray, M., Pierson, E., & Levy, K. (2024). Participation in the age of foundation models. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 1609–1621.

Tseng, Y.-M., Huang, Y.-C., Hsiao, T.-Y., Chen, W.-L., Huang, C.-W., Meng, Y., & Chen, Y.-N. (2024). Two tales of persona in LLMs: A survey of role-playing and personalization. Findings of the Association for Computational Linguistics: EMNLP 2024, 16612–16631.

Wang, A., Morgenstern, J., & Dickerson, J. P. (2025). Large language models that replace human participants can harmfully misportray and flatten identity groups. Nature Machine Intelligence, 7(3), 400–411.

Wies, N., Levine, Y., & Shashua, A. (2023). The learnability of in-context learning. Advances in Neural Information Processing Systems, 36, 36637–36651.

Workshop, B., Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., & others. (2022). BLOOM: A 176B-parameter open-access multilingual language model. arXiv Preprint arXiv:2211.05100.

Zhang, Z., Rossi, R. A., Kveton, B., Shao, Y., Yang, D., Zamani, H., Dernoncourt, F., Barrow, J., Yu, T., Kim, S., & others. (2024). Personalization of large language models: A survey. Transactions on Machine Learning Research.

Zhang, Z., Zhang, Y. E., Shi, F., & Li, T. (2025). Autonomy Matters: A Study on Personalization-Privacy Dilemma in LLM Agents. arXiv Preprint arXiv:2510.04465.

Ziems, C., Held, W., Yang, J., Dhamala, J., Gupta, R., & Yang, D. (2023). Multi-VALUE: A framework for cross-dialectal English NLP. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 744–768.
