Safe HCLLMs

The third dimension we emphasize when building responsible HCLLMs is safety. As defined in Safety Evaluations, safety is conceptualized as preventing LLMs from producing undesirable outputs (i.e., those that may be toxic, harmful, discriminatory, or dangerous), even when prompted to do so. For example, the widespread use of LLMs raises critical concerns about ethical and social risks related to their outputs, including discrimination, hate speech, exclusion, misinformation harms, malicious uses, and so on (Bender et al., 2021; Zhang et al., 2022). At the same time, there are concerns that LLMs can be used as agents of harm, such as using AI-generated propaganda for misinformation purposes (Goldstein et al., 2024) or spreading information that can facilitate harmful actions like the manufacturing of weapons (Shaikh et al., 2023).

We begin by discussing the existing methods that are employed for addressing safety concerns both at the model training and interaction layers. Looking forward, we advocate for expanding beyond this current definition of safety, which focuses on preventing harms, to encompass how we can build HCLLMs that also maximize user benefits.

Current Approaches to Safety

First, we will discuss current methods for measuring and mitigating safety concerns. Red-teaming is a common practice for identifying harmful behavior through adversarial testing prior to deployment. Red-teaming approaches differ across model providers, and details are often not publicly disclosed as these practices are conducted in industry settings (Feffer et al., 2024). As Feffer et al. (2024) survey, red-teamers typically come from three pools: subject-matter experts (Ahmad et al., 2025), crowdworkers (Ganguli et al., 2022), or automated methods, such as the language models themselves (Ganguli et al., 2022; Perez et al., 2022). The objectives for red-teaming can range from broad mandates to identify any harmful behavior to targeted assessments of specific risks, such as those related to national security. After deployment, model providers may also run bug bounty programs that offer incentives for discovering safety or security vulnerabilities (Anthropic, 2025; OpenAI, 2025). In addition to red-teaming efforts, model developers make use of benchmarks and other safety evaluations, which we discuss in detail in Safety Evaluations.

Mitigations can appear at various stages of the model development pipeline. At the pre-training phase, there is interest in filtering datasets to remove toxic content as a preventative safeguard (Mendu et al., 2025; O’Brien et al., 2025; Stranisci & Hardmeier, 2025). At the same time, other work has argued that filtering toxic data during pre-training can have detrimental downstream effects, and that including such data at pre-training can actually make these behaviors easier to remove through fine-tuning (Li et al., 2025; Longpre et al., 2024; Maini et al., 2025). Many efforts also address safety concerns during post-training. Foundational techniques for modern LLMs, such as RLHF, are useful not only for increasing model helpfulness but also for aligning models to be more harmless (Ouyang et al., 2022). Building on these principles, additional post-training methods can help automate parts of this process. For instance, Constitutional AI allows researchers to pre-determine a set of ethical principles for the model to adhere to (Bai et al., 2022). Instruction-tuning methods can also reduce toxicity (see Current Practices in Instruction Tuning). Once deployed, guardrail models are used to help moderate both user inputs and generated outputs (Dong et al., 2024; Inan et al., 2023; Rebedea et al., 2023).

Looking Forward

Like other human-centered objectives, safety can be an underspecified, ambiguous, or contested target. As we anticipate the development of more human-centered LLMs, we should be asking whose definitions of safety are prioritized, and how we can design models that not only prevent harm but can actively promote user flourishing.

Paying heed to long-term harms.

As surveyed above, existing AI safety research tends to focus on immediate harms that users face when interacting with models, such as exposure to toxic speech or the production of misinformation. Of course, these harms carry long-term societal consequences. However, an underexplored class of safety problems involves behavior that appears innocuous in the short term but can compound over repeated usage to become problematic. The recent work in this vein has identified specific model properties, such as sycophancy, which can affect users’ psychological states and behaviors (Cheng et al., 2025). An open challenge lies in measuring these long-term interaction harms, as they are difficult to capture with standard evaluation practices like benchmarking. One alternative approach, as discussed in HCI for HCLLMs, can be to run controlled experiments to understand the effect of model properties on users (Cheng et al., 2025; Kirk et al., 2025) or to conduct qualitative inquiry (Mathur et al., 2025). However, this process can be time-intensive and furthermore potentially exposes participants to the very harms being studied. This concern raises the question of what alternative valid methods exist, such as those that can simulate these harms in silica. Beyond measurement, mitigations also remain largely underexplored—both in terms of interventions at the model design and user interaction paradigms. Researchers have identified properties that can exacerbate harms (e.g., steering models towards being relationship-seeking in model design (Kirk et al., 2025), sending emotionally laden messages as users try to exit a platform (De Freitas et al., 2025)), but translating these insights into preventative measures remains an important and open area for exploration.

Expanding the definition of safety.

A second area of exploration involves expanding whose definition of safety is prioritized. Definitions of safety vary considerably across demographic groups, along factors such as ethnicity, age, and gender (Ali et al., 2025; Aroyo et al., 2023; Gabriel, 2020; Movva et al., 2024; Rastogi et al., 2025). These variances are then encoded into models through alignment processes, significantly changing model behavior (Ali et al., 2025). Thus, instead of assuming a generic definition of safety, there is interest in better capturing and modeling the diversity of conceptualizations that exist. This direction presents a natural continuation of the existing focus on pluralistic alignment within human-centered LLM research. How do we capture these differing definitions of safety? Some work has tackled the issue through recruiting diverse sets of annotators to rate safety perceptions (Aroyo et al., 2023; Rastogi et al., 2025). Others advocate for engaging with communities in a more participatory fashion to elicit safety goals, which offer a richer understanding of how different communities understand the potential harms of these technologies (Bergman et al., 2024; Qadri et al., 2025).

Despite the benefits of moving away from this “view from nowhere” conception of safety, it is important to remember that communities themselves are not monolithic. Disagreements inevitably arise about what constitutes safe or harmful content or what model behaviors are deemed desired or unacceptable (Gordon et al., 2021, 2022). There is a legitimate concern that implementing democratic methods could inadvertently drown out the voices of minority groups. Yet this is not grounds to dismiss democratic or participatory methods for AI safety as a lost cause. As Zimmermann et al. (2025) outline in their work, there are viable paths forward for reconciling these tensions by drawing on practices from political philosophy. As we move towards a more pluralistic definition of safety, this requires thinking normatively about the contexts in which we jointly maximize or balance considerations across groups of people, recognizing that collective decisions may at times conflict with individual desires or goals.

Considering not only harms but also benefits.

Finally, much of the discussion so far has focused on mitigating harmful behaviors. However, we emphasize that avoiding harm is not the same as maximizing user benefits. In pursuit of human-centered LLMs, we must also prioritize building models that bring positive change for users. This raises important questions about what beneficial model behavior looks like, and whether our current conceptions of safety align with what is truly beneficial. First, we must challenge the assumption that “safe” models are necessarily the best for promoting benefits. As Cai et al. (2024) challenge, perhaps there is a need to design antagonistic AI systems that are “actively dismissive, disagreeable, closed-off, critical, flippant, difficult, interrupting.” Much like, for instance, a student may be challenged by their teacher in the learning process, when we design for benefits rather than merely minimizing harms, the desired model behavior changes.

A second provocation concerns the scope of benefit: rather than thinking only about the one-to-one benefit of a model on an individual user, what about one-to-many benefits? We can envision designing models for collective or group-level good—for example, models deployed to promote democratic health by finding common ground through deliberation (Tessler et al., 2024), or models designed to benefit teams by serving as collaborators within group settings. Just as there are differing definitions of what safety means, there are similarly diverse conceptions of benefit, raising parallel questions about whose definition should be prioritized and how we reconcile conflicting views of what constitutes a beneficial outcome.

Ahmad, L., Agarwal, S., Lampe, M., & Mishkin, P. (2025). OpenAI’s Approach to External Red Teaming for AI Models and Systems. arXiv Preprint arXiv:2503.16431.

Ali, D., Zhao, D., Koenecke, A., & Papakyriakopoulos, O. (2025). Operationalizing Pluralistic Values in Large Language Model Alignment Reveals Trade-offs in Safety, Inclusivity, and Model Behavior. arXiv Preprint arXiv:2511.14476.

Anthropic. (2025). Model Safety Bug Bounty Program. https://support.claude.com/en/articles/12119250-model-safety-bug-bounty-program

Aroyo, L., Taylor, A., Diaz, M., Homan, C., Parrish, A., Serapio-Garcı́a, G., Prabhakaran, V., & Wang, D. (2023). Dices dataset: Diversity in conversational ai evaluation for safety. Advances in Neural Information Processing Systems, 36, 53330–53342.

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., & others. (2022). Constitutional ai: Harmlessness from ai feedback. ArXiv Preprint, abs/2212.08073. https://arxiv.org/abs/2212.08073

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3442188.3445922

Bergman, S., Marchal, N., Mellor, J., Mohamed, S., Gabriel, I., & Isaac, W. (2024). STELA: a community-centred approach to norm elicitation for AI alignment. Scientific Reports, 14(1), 6616.

Cai, A., Arawjo, I., & Glassman, E. L. (2024). Antagonistic ai. arXiv Preprint arXiv:2402.07350.

Cheng, M., Lee, C., Khadpe, P., Yu, S., Han, D., & Jurafsky, D. (2025). Sycophantic AI Decreases Prosocial Intentions and Promotes Dependence. arXiv Preprint arXiv:2510.01395.

De Freitas, J., Oguz-Uguralp, Z., & Kaan-Uguralp, A. (2025). Emotional manipulation by ai companions. arXiv Preprint arXiv:2508.19258.

Dong, Y., Mu, R., Jin, G., Qi, Y., Hu, J., Zhao, X., Meng, J., Ruan, W., & Huang, X. (2024). Building guardrails for large language models. ArXiv Preprint, abs/2402.01822. https://arxiv.org/abs/2402.01822

Feffer, M., Sinha, A., Deng, W. H., Lipton, Z. C., & Heidari, H. (2024). Red-teaming for generative AI: Silver bullet or security theater? Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 7, 421–437.

Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and Machines, 30(3), 411–437.

Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., & others. (2022). Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv Preprint arXiv:2209.07858.

Goldstein, J. A., Chao, J., Grossman, S., Stamos, A., & Tomz, M. (2024). How persuasive is AI-generated propaganda? PNAS Nexus, 3(2), pgae034.

Gordon, M. L., Lam, M. S., Park, J. S., Patel, K., Hancock, J., Hashimoto, T., & Bernstein, M. S. (2022). Jury learning: Integrating dissenting voices into machine learning models. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 1–19.

Gordon, M. L., Zhou, K., Patel, K., Hashimoto, T., & Bernstein, M. S. (2021). The disagreement deconvolution: Bringing machine learning performance metrics in line with reality. Proceedings of the 2021 Chi Conference on Human Factors in Computing Systems, 1–14.

Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., & others. (2023). Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv Preprint arXiv:2312.06674.

Kirk, H. R., Davidson, H., Saunders, E., Luettgau, L., Vidgen, B., Hale, S. A., & Summerfield, C. (2025). Neural steering vectors reveal dose and exposure-dependent impacts of human-AI relationships. arXiv Preprint arXiv:2512.01991.

Li, K., Chen, Y., Viégas, F., & Wattenberg, M. (2025). When Bad Data Leads to Good Models. Forty-Second International Conference on Machine Learning.

Longpre, S., Yauney, G., Reif, E., Lee, K., Roberts, A., Zoph, B., Zhou, D., Wei, J., Robinson, K., Mimno, D., & Ippolito, D. (2024). A Pretrainer’s Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity. In K. Duh, H. Gomez, & S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 3245–3276). Association for Computational Linguistics. https://aclanthology.org/2024.naacl-long.179

Maini, P., Goyal, S., Sam, D., Robey, A., Savani, Y., Jiang, Y., Zou, A., Fredrikson, M., Lipton, Z. C., & Kolter, J. Z. (2025). Safety pretraining: Toward the next generation of safe ai. arXiv Preprint arXiv:2504.16980.

Mathur, N., Zubatiy, T., Rozga, A., Forlizzi, J., & Mynatt, E. (2025). “ Sometimes You Need Facts, and Sometimes a Hug”: Understanding Older Adults’ Preferences for Explanations in LLM-Based Conversational AI Systems. arXiv Preprint arXiv:2510.06697.

Mendu, S. K., Yenala, H., Gulati, A., Kumar, S., & Agrawal, P. (2025). Towards safer pretraining: Analyzing and filtering harmful content in webscale datasets for responsible llms. arXiv Preprint arXiv:2505.02009.

Movva, R., Koh, P. W., & Pierson, E. (2024). Annotation alignment: Comparing LLM and human annotations of conversational safety. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 9048–9062.

O’Brien, K., Casper, S., Anthony, Q., Korbak, T., Kirk, R., Davies, X., Mishra, I., Irving, G., Gal, Y., & Biderman, S. (2025). Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs. arXiv Preprint arXiv:2508.06601.

OpenAI. (2025). GPT-5 bio bug bounty: Testing universal jailbreaks for biorisks in GPT-5. https://openai.com/gpt-5-bio-bug-bounty/

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. http://papers.nips.cc/paper%5C_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html

Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., & Irving, G. (2022). Red Teaming Language Models with Language Models. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 3419–3448.

Qadri, R., Diaz, M., Wang, D., & Madaio, M. (2025). The case for" thick evaluations" of cultural representation in ai. arXiv Preprint arXiv:2503.19075.

Rastogi, C., Teh, T. H., Mishra, P., Patel, R., Wang, D., Diaz, M., Parrish, A., Davani, A. M., Ashwood, Z., Paganini, M., & others. (2025). Whose View of Safety? A Deep DIVE Dataset for Pluralistic Alignment of Text-to-Image Models. The Thirty-Ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Rebedea, T., Dinu, R., Sreedhar, M. N., Parisien, C., & Cohen, J. (2023). Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 431–445.

Shaikh, O., Zhang, H., Held, W., Bernstein, M., & Yang, D. (2023). On Second Thought, Let`s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 4454–4470). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.244

Stranisci, M. A., & Hardmeier, C. (2025). What Are They Filtering Out? A Survey of Filtering Strategies for Harm Reduction in Pretraining Datasets. arXiv Preprint arXiv:2503.05721.

Tessler, M. H., Bakker, M. A., Jarrett, D., Sheahan, H., Chadwick, M. J., Koster, R., Evans, G., Campbell-Gillingham, L., Collins, T., Parkes, D. C., & others. (2024). AI can help humans find common ground in democratic deliberation. Science, 386(6719), eadq2852.

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M. T., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., & Zettlemoyer, L. (2022). OPT: Open Pre-trained Transformer Language Models. ArXiv, abs/2205.01068. https://api.semanticscholar.org/CorpusID:248496292

Zimmermann, A., Zeppa, A., Pandey, S., & Diao, K. (2025). Don’t Give Up on Democratizing AI for the Wrong Reasons. The Thirty-Ninth Annual Conference on Neural Information Processing Systems Position Paper Track.

Safe HCLLMs

Current Approaches to Safety

Looking Forward

Paying heed to long-term harms.

Expanding the definition of safety.

Considering not only harms but also benefits.

Graph View

Table of Contents

Backlinks

Literature NotesLiterature NotesLiterature NotesLiterature NotesLiterature NotesLiterature Notes

Current Approaches to Safety

Looking Forward

Paying heed to long-term harms.

Expanding the definition of safety.

Considering not only harms but also benefits.

Graph View

Table of Contents

Backlinks