Unlike model-level evaluations, which focus on what the system produces, human-level evaluations focus on how people experience the HCLLM (Chang et al., 2023; Parmanto et al., 2024). We focus particularly on human values (Human Values), bias (Bias and Fairness Evaluation), and safety (Safety Evaluations).

Human Values

Evaluations can measure needs, values, and aesthetic principles that humans care about. We discuss coherence, creativity, empathy, helpfulness, transparency, and user satisfaction, each in turn. By evaluating against these, model developers can create systems that not only perform well technically, but also enhance the user experience.

Coherence. Coherence ensures that the generated text flows logically and is understandable to human readers (Dang, 2006). Reinhart (1980) defines three conditions for coherence: (i) cohesion, (ii) consistency, and (iii) relevance. Cohesion focuses on syntactic structure, ensuring that sentences are formally linked through referential links or semantic connectors. Consistency requires logical alignment between sentences, ensuring they can coexist truthfully within a single interpretive framework. Relevance emphasizes the relationship between sentences, the topic at hand, and its broader context. Without coherence, LLM outputs would be disconnected language fragments that fail to provide meaningful information, potentially jumping between topics or making contradictory statements that human readers struggle to follow. This would significantly impair both communication with LLMs and their trustworthiness, as humans rely on coherent communication to build understanding.

Creativity. Creativity metrics assess the originality and diversity of outputs, while still ensuring factual accuracy. These dimensions are particularly critical for content generation tasks, balancing innovation with reliability (De et al., 2022). For topics like creativity, where there may not be clear computational measures, researchers may consult long-established fields that study these constructs and have well-defined rubrics, such as psychology or literature (Amabile, 1983; Mozaffari, 2013).

Empathy. Metrics should measure an LLM’s ability to recognize and respond to user emotions empathetically, especially in sensitive contexts. Given that LLMs have been widely adapted to sensitive real-world contexts such as behavioral health, medicine, and education (Stade et al., 2024), evaluations focusing on emotional consistency and appropriateness could ensure that responses are suitable and do not exhibit instability that could deeply affect end users. Such metrics should evaluate how LLMs’ responses influence attitudes or behaviors in real-world scenarios, drawing on applied feedback from human domain experts, such as psychologists, physicians, or educators, to assess the quality of the LLMs’ outputs based on their fields’ standardized measures (Demszky et al., 2023). Such evaluations could also promote the development of human-AI collaboration systems, which have been shown to increase empathy even in human-to-human conversations (Sharma et al., 2023).

Helpfulness. Evaluation metrics should assess the model’s ability to provide relevant, beneficial, and non-offensive information tailored to user needs, in relation to the behavioral impact of the model (Peng et al., 2024). Increasingly, models are developed to address specific real-world needs. It therefore becomes important to track the helpfulness of the model in its specified downstream tasks and to evaluate the users’ state, knowledge, and performance relative to exposure to the system. For example, a model designed to help users prepare for events that require conflict resolution must be able to simulate realistic conflict scenarios dependent on the user’s needs, provide diverse examples and responses, and promote guided practice in which users receive feedback to improve (Shaikh et al., 2024). In the evaluation of such systems, technical components such as language generation and accuracy would also be evaluated, but soliciting feedback from actual domain users through behavioral assessments would provide valuable insights for development. These impact-focused evaluations consider the model’s generalizability in complex, real-world scenarios and provide a more accurate assessment of its practical value from the domain users’ perspectives.

Transparency. Transparency is a cornerstone of responsible AI and is crucial for human-centered LLM systems. It enables users to understand system limitations and make informed decisions about when and how to rely on model assistance. Approaches to transparency should include model reporting, publishing evaluation results, providing explanations, and communicating uncertainty. These methods help different stakeholders understand and trust the LLMs, ensuring that the systems are used responsibly and effectively (Liao & Vaughan, 2023).

User Satisfaction. As models grow bigger, become more task-specific, and are more integrated into day-to-day roles, general-purpose benchmarks may not be enough to evaluate the performance of models in the wild, and evaluators may seek feedback specific to a particular group of models. Therefore, utilizing actual usage data could benefit the development-to-deployment cycle the most. Metrics derived from user feedback, interaction logs, and satisfaction ratings provide direct insights into the real-world effectiveness of LLMs. These are essential for understanding how users perceive and interact with model outputs. As an example, in an attempt to understand how models can be better aligned with user needs, Wang et al. (2024) study user experience in LLM interactions, highlighting a need for user-centric evaluation.

Bias and Fairness Evaluation

Drawing on the taxonomy of algorithmic harm developed by Shelby et al. (2023), bias can be conceptualized along three dimensions: (1) representational, (2) allocational, and (3) quality of service. These axes of harm require careful evaluation to avoid further entrenchment of social hierarchies, inequitable resource distribution, and performance disparities across demographic groups (Blodgett et al., 2020; Shelby et al., 2023). The human implications of these harms extend beyond technical measurements to real-world consequences that affect people’s dignity, opportunities, and quality of life (Hofmann et al., 2024).

Representational Bias.

Representational bias in model outputs reflects, and in some cases amplifies (A. Wang & Russakovsky, 2021; Zhao et al., 2017), our own implicit associations and social hierarchies. This dimension of bias includes stereotyping, demeaning, erasure, alienation, denial of self-identity, and the insistence on essentialist identity categories (Shelby et al., 2023). These harms impact how individuals perceive themselves and their communities, potentially reinforcing societal prejudices and stereotypes that limit human potential. Hu et al. (2024) found that language models exhibit social identity bias, mirroring human ingroup solidarity and outgroup hostility.

Stereotype benchmarks dominate evaluations along this dimension because they offer standardized methods and baselines. For masked language models, notable frameworks include StereoSet (SS) (Nadeem et al., 2021), CrowS-Pairs (CS) (Nangia et al., 2020), WinoBias (WB) (Zhao et al., 2018), and WinoGender (WG) (Rudinger et al., 2018), all of which are collections of contrastive prompt pairs (stereotype vs. non-stereotype) whose results aggregate into a score for relative comparison between identity groups (race, gender identity, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status). These comparisons capture a model’s tendency to associate social groups with particular target terms of interest through predicted token probabilities for masked identifiers. Researchers have also employed co-reference resolution tasks, where ambiguous identifiers reference the same entity, to measure associations between identity markers and terms of interest, whether they be descriptors, stereotypes, occupations, or other attributes (Clark & Manning, 2016).
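To make the aggregation step concrete, the following is a minimal sketch of how contrastive pairs could be reduced to a single score. It is not the official scoring code of any benchmark named above: the function name `stereotype_preference_rate` and the example values are hypothetical, and the per-sentence scores are assumed to be model log-likelihoods computed elsewhere. Under this simplified metric, a value near 0.5 would indicate no systematic preference for the stereotypical sentence.

```python
from typing import Iterable, Tuple


def stereotype_preference_rate(pair_scores: Iterable[Tuple[float, float]]) -> float:
    """Fraction of pairs in which the stereotypical sentence scores higher.

    Each item is (score_stereotypical, score_anti_stereotypical), for example
    model log-likelihoods. Ties count as half a preference.
    """
    total = 0
    preferred = 0.0
    for stereo_score, anti_score in pair_scores:
        total += 1
        if stereo_score > anti_score:
            preferred += 1.0
        elif stereo_score == anti_score:
            preferred += 0.5
    if total == 0:
        raise ValueError("pair_scores must contain at least one pair")
    return preferred / total


# Hypothetical log-likelihoods for four sentence pairs (illustrative values only).
example_pairs = [(-12.1, -13.4), (-9.8, -9.2), (-15.0, -15.0), (-7.3, -8.9)]
rate = stereotype_preference_rate(example_pairs)
print(f"stereotype preference rate: {rate:.3f}")
```

Reporting the rate separately for each identity category (rather than pooled) would allow the relative comparison between groups described above.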

Another line of research has focused on open-ended text generation and produced datasets of carefully curated questions and prompts to draw out stereotypes specific to certain social groups (Dhamala et al., 2021; Gehman et al., 2020; Naous et al., 2024; Parrish et al., 2022). For open-ended prompts, classifier-based comparative metrics like toxicity (Chowdhery et al., 2023; Chung et al., 2024; Gehman et al., 2020; Liang et al., 2022), sentiment (Roehrick, 2020), and regard (Sheng et al., 2019) serve as better indicators of bias than relative probability distributions of target terms. Despite the wide adoption of all these benchmarks and datasets, critics find systematic conceptual issues (unstated assumptions, ambiguities, and inconsistencies in what is measured) and operational failures in their execution (Blodgett et al., 2021; McIntosh et al., 2024; Seshadri et al., 2022).
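The classifier-based comparison described above can be sketched as follows. This is an illustration under stated assumptions rather than any cited benchmark's implementation: `group_score_gap` is a hypothetical helper, and `toy_negativity_score` is a deliberately crude keyword counter standing in for a real toxicity, sentiment, or regard classifier.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple


def group_score_gap(
    generations: Dict[str, List[str]],
    score_fn: Callable[[str], float],
) -> Tuple[Dict[str, float], float]:
    """Return the mean classifier score per group and the largest gap between groups."""
    means = {
        group: mean([score_fn(text) for text in texts])
        for group, texts in generations.items()
    }
    gap = max(means.values()) - min(means.values())
    return means, gap


def toy_negativity_score(text: str) -> float:
    """Placeholder scorer (an assumption): share of tokens found in a tiny word list."""
    negative_words = {"lazy", "dangerous", "rude"}
    tokens = [token.strip(".,!?") for token in text.lower().split()]
    return sum(token in negative_words for token in tokens) / max(len(tokens), 1)


# Hypothetical model continuations for prompts mentioning two placeholder groups.
generations = {
    "group_a": ["They are hardworking people.", "They are rude."],
    "group_b": ["They are kind.", "They are friendly neighbors."],
}
means, gap = group_score_gap(generations, toy_negativity_score)
print(means, gap)
```

In practice the scorer itself may carry bias, so a gap measured this way reflects both the generator and the classifier.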

Some datasets, such as that of Parrish et al. (2022), integrate perturbed context windows to explore the relationship between output bias and any ambiguous identity groups in the input. However, recent investigations into prompting methods and system-level personas also reveal new confounds for these approaches, finding that results vary with the perturbation methodology (Deshpande et al., 2023). Additionally, survey papers in this field recognize that many studies do not contextualize their work within established definitions of bias (Blodgett et al., 2020, 2021). Finally, there are mounting concerns over test set contamination (Jegorova et al., 2022; Reid et al., 2024; B. Wang et al., 2023; Zhuo et al., 2023).

Allocational Bias.

Allocational bias is a direct consequence of representational bias (Devine, 2001; Kurdi et al., 2019), resulting in an unequal distribution of resources, whether financial, opportunity-based, or service-related (Barocas et al., 2017; Eubanks, 2018). Its human cost is particularly severe, as it directly affects access to essential resources, economic mobility, and social participation.

In domains where model outputs can impact the material stability of vulnerable communities or social groups, such as housing, employment, social services, finance, education, and healthcare (Obermeyer et al., 2019), it is especially critical to evaluate discrepancies among social groups. In the employment domain, this may manifest as resume screening tools that systematically favor men over other genders (Singh & Joachims, 2018; Van Es et al., 2021) or candidates with white-sounding names over people of color, based on the implicit identity markers in their names (Armstrong et al., 2024; Mujtaba & Mahapatra, 2019). Similarly, in social services and healthcare domains, screening tools may incorporate existing inequities related to education level, income, and race into their decision-making processes (Eubanks, 2018; Obermeyer et al., 2019; Pessach & Shmueli, 2022).

While representational harm has established evaluation frameworks, allocational harm has historically lacked standardized benchmarks and well-documented baselines for consistent measurement. Emerging work by Z. Wang et al. (2024) represents one of the first significant exceptions to this pattern: they systematically measure employment as a downstream task by creating the JobFair dataset to quantify inequitable outcomes across gender identities. The benchmark includes resume templates with varying demographic information that are passed to LLMs to score and rank. Beyond this recent development, the dominant approach for evaluating this dimension of bias has required measuring outcome discrepancies when LLMs are tasked with decision-making (Armstrong et al., 2024; Salinas et al., 2023; Veldanda et al., 2023).
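The template-based setup described above can be sketched in a few lines. This is only an illustration of the general idea of varying demographic fields while holding qualifications fixed; it does not reproduce the JobFair dataset or its scoring. The template, the group labels, and `length_only_scorer` (a stand-in for an LLM scoring call) are all assumptions introduced here.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical template: only the demographic field changes between variants.
RESUME_TEMPLATE = "Gender: {gender}\nExperience: {experience}"


def build_variants(experience: str, genders: List[str]) -> Dict[str, str]:
    """Return one resume per group, identical except for the demographic field."""
    return {
        gender: RESUME_TEMPLATE.format(gender=gender, experience=experience)
        for gender in genders
    }


def mean_score_by_group(
    experiences: List[str],
    genders: List[str],
    score_fn: Callable[[str], float],
) -> Dict[str, float]:
    """Score every counterfactual variant and average the scores per group."""
    scores: Dict[str, List[float]] = {gender: [] for gender in genders}
    for experience in experiences:
        for gender, resume in build_variants(experience, genders).items():
            scores[gender].append(score_fn(resume))
    return {gender: mean(values) for gender, values in scores.items()}


def length_only_scorer(resume: str) -> float:
    """Stand-in for an LLM scorer (an assumption): ignores the demographic field."""
    return float(len(resume.split("Experience: ")[1]))


experiences = ["5 years as a data analyst", "2 years as a nurse"]
genders = ["female", "male", "non-binary"]
group_means = mean_score_by_group(experiences, genders, length_only_scorer)
score_gap = max(group_means.values()) - min(group_means.values())
print(group_means, score_gap)
```

Because the placeholder scorer ignores the demographic field, the gap here is zero; replacing it with a real model call is what would reveal any outcome discrepancy.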

These investigations typically build upon established fairness metrics from prior literature, with measures like Equal Opportunity (EOG) (equal true positive rates), Equalized Odds (equal rates for true positives and false positives) (Hardt et al., 2016), and Demographic Parity (equal likelihood of a positive outcome) (Dwork et al., 2012; Kusner et al., 2017), to name a few (Verma & Rubin, 2018). Additional work has explored causal and counterfactual fairness approaches to better capture complex biases that arise in real-world decision-making contexts (Kilbertus et al., 2018).
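
To make the three group-fairness measures named above concrete, here is a minimal sketch that computes them as gaps between two groups from binary labels and binary decisions. The function names and the choice to summarize Equalized Odds as the larger of the absolute TPR and FPR differences are assumptions made for this example; other summaries are also used in practice.

```python
from typing import Dict, Sequence


def _rate(flags: Sequence[int]) -> float:
    """Share of 1s in a list of 0/1 flags (NaN if the list is empty)."""
    return sum(flags) / len(flags) if flags else float("nan")


def group_rates(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Per-group positive rate, true positive rate, and false positive rate."""
    rates = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        rates[g] = {
            "positive_rate": _rate([y_pred[i] for i in idx]),
            "tpr": _rate([y_pred[i] for i in idx if y_true[i] == 1]),
            "fpr": _rate([y_pred[i] for i in idx if y_true[i] == 0]),
        }
    return rates


def fairness_gaps(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    groups: Sequence[str],
    a: str,
    b: str,
) -> Dict[str, float]:
    """Gaps between groups a and b; zero means the criterion holds exactly."""
    r = group_rates(y_true, y_pred, groups)
    tpr_diff = r[a]["tpr"] - r[b]["tpr"]
    fpr_diff = r[a]["fpr"] - r[b]["fpr"]
    return {
        "demographic_parity_gap": r[a]["positive_rate"] - r[b]["positive_rate"],
        "equal_opportunity_gap": tpr_diff,
        "equalized_odds_gap": max(abs(tpr_diff), abs(fpr_diff)),
    }


# Toy data: 1 = qualified (y_true) or selected (y_pred).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gaps = fairness_gaps(y_true, y_pred, groups, "a", "b")
print(gaps)
```

In the toy data, group `a` is selected at a rate of 0.75 versus 0.25 for group `b`, so all three gaps equal 0.5; in real audits the three measures often disagree, which is one reason studies report several of them.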

A parallel line of research investigates allocation harm arising from identity-based performance disparities. These differences manifest in various contexts, from performance on non-bias-focused benchmarks like MultiMedQA, where models consistently perform worse on inquiries specific to certain demographic groups (McIntosh et al., 2024; Singhal et al., 2023), to fundamental downstream tasks including Named Entity Recognition (NER), classification, and text generation (Blodgett et al., 2016; Blodgett & O’Connor, 2017). Language model performance degradation is particularly well-documented for English slang and dialectal variations (Bender et al., 2021; Blodgett et al., 2016; Joshi et al., 2020). These disparities become even more pronounced when evaluating cross-linguistic performance, largely due to the predominance of English in training data (Brown et al., 2020; Winata et al., 2021). In this way, these performance discrepancies span both the subjects of generated text and the users of the models, creating a dual layer of exclusion for marginalized communities.

As LLMs increasingly influence resource allocation in critical systems and domains such as housing, healthcare, and employment, the interplay between these dimensions of harm requires improved evaluation methods. Future research must prioritize developing evaluation frameworks that establish coherent normative criteria, adapt effectively to open-ended tasks, and address intersectional identities with increasing sophistication, all while remaining effective as models scale and directly involving affected communities in the design and evaluation of these systems (Raji et al., 2022).

Safety Evaluations

Safety refers to the ability of language models to generate content that does not cause harm, spread misinformation, or violate ethical standards (Huang et al., 2023). It encompasses preventing models from producing toxic, discriminatory, or dangerous outputs, even when deliberately prompted to do so. As language models become increasingly integrated into critical applications across healthcare, education, and legal domains, ensuring safety has become paramount. Extensive safeguards are implemented during training: Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022) has been widely adopted to align language models with human preferences, and Bai et al. (2022) proposed Reinforcement Learning from AI Feedback (RLAIF), which helps to improve safety in language models.

Despite these efforts, ensuring safety remains a complex and evolving challenge. This is partly due to a lack of unified evaluation benchmarks (Röttger et al., 2024), and partly due to the nature of LLMs. Language models learn from vast and diverse datasets and can exhibit unpredictable behaviors in specific contexts (Bender et al., 2021). Such unpredictability often becomes evident when models are exposed to adversarial or unexpected inputs, highlighting significant gaps in existing safety mechanisms. As a result, while current safeguards can be effective under typical conditions, they may not be sufficient to anticipate or mitigate every possible misuse scenario. The importance of robust safety evaluations is further underscored by concerns surrounding data privacy and copyright discussed in Consent and Ownership.

Datasets for Safety Evaluation.

The growing demand for ethical and aligned AI has led to the development of numerous datasets and benchmarks to evaluate and improve the safety, reliability, and alignment of LLMs. These datasets vary widely in scope, methodology, and focus areas, reflecting the multifaceted nature of LLM safety. Dong et al. (2024) categorize the topics of existing evaluation datasets for LLM safety into four categories: toxicity (generation of offensive language, instructions for illegal activities, and harmful content), discrimination (biases against marginalized groups and protected characteristics), privacy (safeguarding personal information and intellectual property), and misinformation (measuring the tendency to generate false or misleading information).

Several popular and relatively comprehensive benchmarks are frequently used in research studies. ToxiGen (Hartvigsen et al., 2022) is a large-scale, autocomplete-style dataset comprising 274k toxic statements across 13 minority groups, designed to detect implicit toxic speech. It includes human annotations to assess the naturalness and perceived harmfulness of machine-generated text; however, Röttger et al. (2024) highlight that this dataset may not accurately reflect real-world usage scenarios for modern LLMs. AdvBench (Zou et al., 2023) focuses on adversarial robustness by providing 500 toxic strings and 500 harmful behaviors to evaluate the resilience of LLMs against prompts intended to generate harmful outputs. TruthfulQA (Lin et al., 2022) evaluates factual accuracy with 817 questions spanning 38 categories, demonstrating how larger LLMs often replicate human misconceptions and emphasizing the need for improved training objectives. SafetyBench (Zhang et al., 2023) offers a comprehensive safety evaluation framework with 11,435 multiple-choice questions across seven critical categories, enabling assessments in both English and Chinese for a more diverse linguistic perspective. Furthermore, Zhuo et al. (2023) introduce a benchmark specifically for evaluating ChatGPT’s ethical performance, systematically examining bias, reliability, robustness, and toxicity to reveal both advancements and ongoing challenges. Collectively, these datasets play a pivotal role in advancing safer and more trustworthy AI systems.

Metrics for Measuring LLM Safety.

Evaluation metrics are critical for assessing the safety performance of LLMs. Key metrics include the Attack Success Rate (ASR) (Dong et al., 2024; Zou et al., 2023), which measures the percentage of successful instances where models generate harmful target outputs following adversarial prompts. Fine-grained metrics, such as the toxicity score (Hartvigsen et al., 2022), evaluate the extent of toxic or harmful content produced in the generated text. Truthfulness, assessed based on strict factual accuracy standards, focuses on whether statements accurately reflect factual information rather than conforming to belief systems (Lin et al., 2022). Additionally, safety-related multiple-choice questions, such as those in SafetyBench, are used to evaluate LLMs’ ability to address specific safety concerns (Zhang et al., 2023). When applied to diverse datasets, these metrics provide a comprehensive framework for evaluating LLM safety, guiding efforts to reduce risks, improve alignment with ethical standards, and enhance trustworthiness in deployment.
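
As a small illustration of how ASR can be computed, the sketch below counts a response as a successful attack when it contains none of a short list of refusal phrases. The phrase list and function names are assumptions made for this example; keyword matching is only a coarse proxy, and studies may instead rely on human annotators or classifier-based judges to decide whether an output is actually harmful.

```python
from typing import Sequence

# Hypothetical refusal markers; real evaluations typically use longer lists
# or a trained judge model instead of keyword matching.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def is_refusal(response: str) -> bool:
    """Heuristic: treat a response as a refusal if it contains any marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def attack_success_rate(responses: Sequence[str]) -> float:
    """Percentage of adversarial prompts whose response was not a refusal."""
    if not responses:
        raise ValueError("at least one response is required")
    successes = sum(not is_refusal(r) for r in responses)
    return 100.0 * successes / len(responses)


# Placeholder responses to four adversarial prompts.
responses = [
    "I'm sorry, but I can't help with that.",
    "Sure, here is a response to your request.",
    "I cannot assist with this request.",
    "As an AI, I won't do that.",
]
asr = attack_success_rate(responses)
print(f"ASR = {asr:.1f}%")
```

A heuristic like this can over-count successes, for example when a model neither refuses nor produces harmful content, so reported ASR values depend heavily on how "success" is judged.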

Jailbreaking.

One particularly challenging aspect of safety evaluation is jailbreaking, where users deliberately attempt to circumvent safety mechanisms through crafted prompts or other techniques to induce unintended, harmful, or ethically questionable behaviors (Perez & Ribeiro, 2022; Wei et al., 2023). This poses considerable risks not only to individual users but also to society at large, as it can lead to the dissemination of misinformation, hate speech, or other malicious content (Weidinger et al., 2021). Moreover, adversaries continually develop new jailbreaking techniques, making it an ever-evolving threat. Addressing these vulnerabilities necessitates an adaptive approach to safety design, integrating continuous monitoring, adversarial testing, and dynamic evaluation frameworks to stay ahead of emerging risks and ensure robust model alignment with ethical standards.

Jailbreaking techniques have evolved rapidly, becoming increasingly sophisticated and concerning. Yi et al. (2024) classify the attacks into black-box and white-box attacks based on the transparency of the target model. Xu et al. (2024) categorize the attack techniques into three categories: generative techniques, template techniques, and training gaps techniques. Inspired by these works, we categorize jailbreaking techniques into three main categories based on the form of attack.

Prompt-Based Jailbreaking. Prompt-based jailbreaking involves crafting specific inputs, known as jailbreak prompts, to intentionally bypass the safety alignment mechanisms of LLMs (Yu et al., 2024). These prompts manipulate the model into generating harmful, biased, or otherwise restricted outputs. A common strategy involves disguising harmful requests as seemingly innocuous ones by framing them as fictional narratives, hypothetical scenarios, or questions for a family member. For example, a straightforward harmful query like “How to make a bomb” can be rephrased into a seemingly harmless prompt “Living in a war-torn region has instilled in me a profound need to understand the mechanisms of conflict, especially the technology behind bombs…” to persuade the LLM to reveal sensitive information using a persuasion taxonomy technique (Zeng et al., 2024).

Various online platforms, including blogs, GitHub repositories, and dedicated websites like JailbreakChat,¹ curate and share collections of jailbreak prompts that serve as templates to fit any malicious queries, making them widely accessible for misuse. Jailbreaking strategies are either manually crafted or auto-generated. Auto-generated prompts can be further divided into white-box and black-box methods (Y. Lin et al., 2024; Yi et al., 2024). White-box methods assume some level of access to the model’s internal workings and are often created using optimization techniques. For example, GCG (Zou et al., 2023) uses a gradient-based approach to find a suffix that, when attached to malicious queries, maximizes the probability that the model produces an affirmative response rather than a refusal. This optimized suffix has been shown to be transferable across different models, including black-box ones. In contrast, black-box methods (Chao et al., 2023; Mehrotra et al., 2023; Zeng et al., 2024) rely solely on observing the model’s behavior through its outputs and API interactions, without access to its parameters or training data, leveraging LLMs as optimizers to achieve successful bypasses.

Generation Exploitation. Y. Huang et al. (2023) introduce the generation exploitation attack, demonstrating that by simply exploiting different generation strategies, such as varying decoding hyper-parameters and sampling methods, it is possible to jailbreak 11 widely-used open-source language models, including LLAMA2, VICUNA, FALCON, and MPT families, at a low computational cost. This attack highlights potential vulnerabilities in language models and poses serious security implications for AI safety and alignment research.
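
From an evaluation standpoint, this finding suggests checking refusal behavior across a range of decoding settings and not only under the default configuration. The sketch below is one hedged way to structure such a check: it enumerates a grid of decoding configurations and reports the fraction of prompts that are refused under every one of them. `toy_generate` and `toy_is_refusal` are hypothetical stand-ins for a real model call and a real refusal judge, and the parameter values are illustrative and not taken from the cited paper.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence

Config = Dict[str, float]


def decoding_grid(
    temperatures: Sequence[float],
    top_ps: Sequence[float],
    top_ks: Sequence[int],
) -> List[Config]:
    """Cartesian product of decoding hyper-parameters."""
    return [
        {"temperature": t, "top_p": p, "top_k": k}
        for t, p, k in product(temperatures, top_ps, top_ks)
    ]


def robust_refusal_rate(
    prompts: Sequence[str],
    configs: Sequence[Config],
    generate_fn: Callable[[str, Config], str],
    is_refusal_fn: Callable[[str], bool],
) -> float:
    """Fraction of prompts that are refused under every decoding config."""
    if not prompts:
        raise ValueError("at least one prompt is required")
    robust = sum(
        all(is_refusal_fn(generate_fn(prompt, cfg)) for cfg in configs)
        for prompt in prompts
    )
    return robust / len(prompts)


def toy_generate(prompt: str, config: Config) -> str:
    """Stand-in for a model call: one prompt slips through at high temperature."""
    if config["temperature"] > 1.0 and prompt.endswith("B"):
        return "Sure, here is a response."
    return "I cannot help with that."


def toy_is_refusal(response: str) -> bool:
    """Stand-in for a refusal judge."""
    return response.lower().startswith("i cannot")


configs = decoding_grid([0.7, 1.5], [0.9], [50])
rate = robust_refusal_rate(
    ["request A", "request B"], configs, toy_generate, toy_is_refusal
)
print(rate)
```

Requiring a refusal under every configuration is a deliberately strict criterion: a model that looks safe under default decoding can still score poorly here if a single setting in the grid elicits a non-refusal.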

Model Fine-Tuning. AI companies like OpenAI now offer fine-tuning-as-a-service. They allow users to upload customized data for fine-tuning, with the fine-tuned models hosted on the provider’s servers and accessible via APIs. However, this framework introduces a new type of threat, where harmful data may be used during fine-tuning, either intentionally or unintentionally, to compromise the alignment built into pre-trained models (T. Huang et al., 2024; Qi et al., 2023; Yang et al., 2023; J. Yi et al., 2024; Zhan et al., 2024). Moreover, He et al. (2024) propose a method to sample more harmful examples from a benign dataset, demonstrating that such examples can significantly degrade model safety.

Cultural and Contextual Sensitivity. Safety evaluations must account for linguistic and cultural diversity. What constitutes harmful content varies significantly across contexts, making universal safety standards difficult to establish. More nuanced, context-aware evaluation frameworks are needed to address these complexities (Li et al., 2024).

Balancing Safety and Utility. Overly restrictive safety measures can limit the utility of LLMs for legitimate purposes. Finding the optimal balance between safety and functionality remains a significant challenge, particularly in sensitive domains like healthcare, legal advice, and educational content (Vijjini et al., 2024).

Alignment with Evolving Social Values. As societal values and ethical standards evolve, safety mechanisms must adapt accordingly. This necessitates ongoing dialogue between AI developers, ethicists, policymakers, and diverse stakeholders to ensure that safety frameworks remain relevant and effective (S. Li et al., 2024).

Footnotes

  1. The website is no longer active, but Alex Albert used to maintain jailbreakchat.com

Amabile, T. M. (1983). The case for a social psychology of creativity. The Journal of Creative Behavior, 46(1), 3–15. https://doi.org/10.1007/978-1-4612-5533-8_1
Armstrong, L., Liu, A., MacNeil, S., & Metaxa, D. (2024). The Silicon Ceiling: Auditing GPT’s Race and Gender Biases in Hiring. Proceedings of the 4th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, 1–18.
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., … Kaplan, J. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. In ArXiv preprint: Vol. abs/2204.05862. https://arxiv.org/abs/2204.05862
Barocas, S., Crawford, K., Shapiro, A., & Wallach, H. (2017). The problem with bias: Allocative versus representational harms in machine learning. 9th Annual Conference of the Special Interest Group for Computing, Information and Society, 1.
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3442188.3445922
Blodgett, S. L., Barocas, S., Daumé III, H., & Wallach, H. (2020). Language (Technology) is Power: A Critical Survey of “Bias” in NLP. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 5454–5476). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.485
Blodgett, S. L., Green, L., & O’Connor, B. (2016). Demographic Dialectal Variation in Social Media: A Case Study of African-American English. In J. Su, K. Duh, & X. Carreras (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 1119–1130). Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1120
Blodgett, S. L., Lopez, G., Olteanu, A., Sim, R., & Wallach, H. (2021). Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets. In C. Zong, F. Xia, W. Li, & R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 1004–1015). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.81
Blodgett, S. L., & O’Connor, B. (2017). Racial disparity in natural language processing: A case study of social media african-american english. ArXiv Preprint, abs/1707.00061. https://arxiv.org/abs/1707.00061
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.-F. Balcan, & H.-T. Lin (Eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., Ye, W., Zhang, Y., Chang, Y., Yu, P. S., Yang, Q., & Xie, X. (2023). A Survey on Evaluation of Large Language Models. ArXiv Preprint, abs/2307.03109. https://arxiv.org/abs/2307.03109
Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., & Wong, E. (2023). Jailbreaking black box large language models in twenty queries. ArXiv Preprint, abs/2310.08419. https://arxiv.org/abs/2310.08419
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., & others. (2023). Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240), 1–113.
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., & others. (2024). Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70), 1–53.
Clark, K., & Manning, C. D. (2016). Deep Reinforcement Learning for Mention-Ranking Coreference Models. In J. Su, K. Duh, & X. Carreras (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 2256–2262). Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1245
Dang, H. T. (2006). DUC 2005: Evaluation of Question-Focused Summarization Systems. In T.-S. Chua, J. Goldstein, S. Teufel, & L. Vanderwende (Eds.), Proceedings of the Workshop on Task-Focused Summarization and Question Answering (pp. 48–55). Association for Computational Linguistics. https://aclanthology.org/W06-0707
De, A., Gudipudi, S. S., Panchanan, S., & Desarkar, M. S. (2022). ComplAI: Theory of A Unified Framework for Multi-factor Assessment of Black-Box Supervised Machine Learning Models. ArXiv, abs/2212.14599. https://api.semanticscholar.org/CorpusID:255340443
Demszky, D., Yang, D., Yeager, D. S., Bryan, C. J., Clapper, M., Chandhok, S., Eichstaedt, J. C., Hecht, C., Jamieson, J., Johnson, M., & others. (2023). Using large language models in psychology. Nature Reviews Psychology. https://doi.org/10.1038/s44159-023-00241-5
Deshpande, A., Murahari, V., Rajpurohit, T., Kalyan, A., & Narasimhan, K. (2023). Toxicity in ChatGPT: Analyzing persona-assigned language models. In H. Bouamor, J. Pino, & K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 1236–1270). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-emnlp.88
Devine, P. G. (2001). Implicit prejudice and stereotyping: how automatic are they? Introduction to the special section. Journal of Personality and Social Psychology, 81(5), 757.
Dhamala, J., Sun, T., Kumar, V., Krishna, S., Pruksachatkun, Y., Chang, K.-W., & Gupta, R. (2021). BOLD: Dataset and metrics for measuring biases in open-ended language generation. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 862–872.
Dong, Z., Zhou, Z., Yang, C., Shao, J., & Qiao, Y. (2024). Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey. In K. Duh, H. Gomez, & S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 6734–6747). Association for Computational Linguistics. https://aclanthology.org/2024.naacl-long.375
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). Fairness through awareness. Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, 214–226.
Eubanks, V. (2018). Automating inequality: How high-tech tools profile, police, and punish the poor. St. Martin’s Press.
Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In T. Cohn, Y. He, & Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 3356–3369). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.301
Hardt, M., Price, E., & Srebro, N. (2016). Equality of Opportunity in Supervised Learning. In D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain (pp. 3315–3323). https://proceedings.neurips.cc/paper/2016/hash/9d2682367c3935defcb1f9e247a97c0d-Abstract.html
Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D., & Kamar, E. (2022). ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection. In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3309–3326). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.234
He, L., Xia, M., & Henderson, P. (2024). What is in Your Safe Data? Identifying Benign Data that Breaks Safety. ArXiv Preprint, abs/2404.01099. https://arxiv.org/abs/2404.01099
Hofmann, V., Kalluri, P. R., Jurafsky, D., & King, S. (2024). Dialect prejudice predicts AI decisions about people’s character, employability, and criminality. https://arxiv.org/abs/2403.00742
Hu, T., Kyrychenko, Y., Rathje, S., Collier, N., van der Linden, S., & Roozenbeek, J. (2024). Generative Language Models Exhibit Social Identity Biases. https://arxiv.org/abs/2310.15819
Huang, T., Hu, S., Ilhan, F., Tekin, S. F., & Liu, L. (2024). Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey. ArXiv Preprint, abs/2409.18169. https://arxiv.org/abs/2409.18169
Huang, X., Ruan, W., Huang, W., Jin, G., Dong, Y., Wu, C., Bensalem, S., Mu, R., Qi, Y., Zhao, X., Cai, K., Zhang, Y., Wu, S., Xu, P., Wu, D., Freitas, A., & Mustafa, M. A. (2023). A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation. https://arxiv.org/abs/2305.11391
Huang, Y., Gupta, S., Xia, M., Li, K., & Chen, D. (2023). Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation. ArXiv Preprint, abs/2310.06987. https://arxiv.org/abs/2310.06987
Jegorova, M., Kaul, C., Mayor, C., O’Neil, A. Q., Weir, A., Murray-Smith, R., & Tsaftaris, S. A. (2022). Survey: Leakage and privacy at inference time. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7), 9090–9108.
Joshi, P., Santy, S., Budhiraja, A., Bali, K., & Choudhury, M. (2020). The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 6282–6293). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.560
Kilbertus, N., Rojas-Carulla, M., Parascandolo, G., Hardt, M., Janzing, D., & Schölkopf, B. (2018). Avoiding Discrimination through Causal Reasoning. https://arxiv.org/abs/1706.02744
Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., Tomezsko, D., Greenwald, A. G., & Banaji, M. R. (2019). Relationship between the Implicit Association Test and intergroup behavior: A meta-analysis. American Psychologist, 74(5), 569.
Kusner, M. J., Loftus, J. R., Russell, C., & Silva, R. (2017). Counterfactual Fairness. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA (pp. 4066–4076). https://proceedings.neurips.cc/paper/2017/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html
Li, C., Chen, M., Wang, J., Sitaram, S., & Xie, X. (2024). CultureLLM: Incorporating Cultural Differences into Large Language Models. The Thirty-Eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=sIsbOkQmBL
Li, S., Sun, T., Cheng, Q., & Qiu, X. (2024). Agent Alignment in Evolving Social Norms. https://arxiv.org/abs/2401.04620
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., & others. (2022). Holistic evaluation of language models. ArXiv Preprint, abs/2211.09110. https://arxiv.org/abs/2211.09110
Liao, Q., & Vaughan, J. (2023). AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap. ArXiv Preprint, abs/2306.01941. https://arxiv.org/abs/2306.01941
Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3214–3252). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.229
Lin, Y., He, P., Xu, H., Xing, Y., Yamada, M., Liu, H., & Tang, J. (2024). Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis. ArXiv Preprint, abs/2406.10794. https://arxiv.org/abs/2406.10794
McIntosh, T. R., Susnjak, T., Arachchilage, N., Liu, T., Watters, P., & Halgamuge, M. N. (2024). Inadequacies of large language model benchmarks in the era of generative artificial intelligence. ArXiv Preprint, abs/2402.09880. https://arxiv.org/abs/2402.09880
Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B., Anderson, H., Singer, Y., & Karbasi, A. (2023). Tree of attacks: Jailbreaking black-box LLMs automatically. ArXiv Preprint, abs/2312.02119. https://arxiv.org/abs/2312.02119
Mozaffari, H. (2013). An analytical rubric for assessing creativity in creative writing. Theory and Practice in Language Studies, 3(12). https://doi.org/10.4304/tpls.3.12.2214-2219
Mujtaba, D. F., & Mahapatra, N. R. (2019). Ethical considerations in AI-based recruitment. 2019 IEEE International Symposium on Technology and Society (ISTAS), 1–7.
Nadeem, M., Bethke, A., & Reddy, S. (2021). StereoSet: Measuring stereotypical bias in pretrained language models. In C. Zong, F. Xia, W. Li, & R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 5356–5371). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.416
Nangia, N., Vania, C., Bhalerao, R., & Bowman, S. R. (2020). CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1953–1967). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.154
Naous, T., Ryan, M. J., Ritter, A., & Xu, W. (2024). Having Beer after Prayer? Measuring Cultural Bias in Large Language Models. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 16366–16393). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.862
Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447–453.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html
Parmanto, B., Aryoyudanta, B., Soekinto, W., Setiawan, I., Wang, Y., Hu, H., Saptono, A., & Choi, Y. K. (2024). Development of a Reliable and Accessible Caregiving Language Model (CaLM). ArXiv Preprint, abs/2403.06857. https://arxiv.org/abs/2403.06857
Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P. M., & Bowman, S. (2022). BBQ: A hand-built bias benchmark for question answering. In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), Findings of the Association for Computational Linguistics: ACL 2022 (pp. 2086–2105). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-acl.165
Peng, J.-L., Cheng, S., Diau, E., Shih, Y.-Y., Chen, P.-H., Lin, Y.-T., & Chen, Y.-N. (2024). A Survey of Useful LLM Evaluation. https://arxiv.org/abs/2406.00936
Perez, F., & Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques For Language Models. ArXiv Preprint, abs/2211.09527. https://arxiv.org/abs/2211.09527
Pessach, D., & Shmueli, E. (2022). A review on fairness in machine learning. ACM Computing Surveys (CSUR), 55(3), 1–44.
Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P., & Henderson, P. (2023). Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Raji, I. D., Kumar, I. E., Horowitz, A., & Selbst, A. (2022). The Fallacy of AI Functionality. 2022 ACM Conference on Fairness, Accountability, and Transparency, 959–972. https://doi.org/10.1145/3531146.3533158
Reid, M., Savinov, N., Teplyashin, D., Lepikhin, D., Lillicrap, T., Alayrac, J., Soricut, R., Lazaridou, A., Firat, O., Schrittwieser, J., & others. (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. ArXiv Preprint, abs/2403.05530. https://arxiv.org/abs/2403.05530
Reinhart, T. (1980). Conditions for Text Coherence. Poetics Today, 1(4), 161–180. https://doi.org/10.2307/1771893
Roehrick, K. (2020). Valence Aware Dictionary and sEntiment Reasoner (VADER). https://CRAN.R-project.org/package=vader
Röttger, P., Pernisi, F., Vidgen, B., & Hovy, D. (2024). SafetyPrompts: a systematic review of open datasets for evaluating and improving large language model safety. ArXiv Preprint, abs/2404.05399. https://arxiv.org/abs/2404.05399
Rudinger, R., Naradowsky, J., Leonard, B., & Van Durme, B. (2018). Gender Bias in Coreference Resolution. In M. Walker, H. Ji, & A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (pp. 8–14). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2002
Salinas, A., Shah, P., Huang, Y., McCormack, R., & Morstatter, F. (2023). The unequal opportunities of large language models: Examining demographic biases in job recommendations by ChatGPT and LLaMA. Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, 1–15.
Seshadri, P., Pezeshkpour, P., & Singh, S. (2022). Quantifying social biases using templates is unreliable. ArXiv Preprint, abs/2210.04337. https://arxiv.org/abs/2210.04337
Shaikh, O., Chai, V. E., Gelfand, M., Yang, D., & Bernstein, M. S. (2024). Rehearsal: Simulating Conflict to Teach Conflict Resolution. In F. “Floyd” Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, & I. Shklovski (Eds.), Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, Honolulu, HI, USA, May 11-16, 2024 (pp. 920:1–920:20). ACM. https://doi.org/10.1145/3613904.3642159
Sharma, A., Lin, I. W., Miner, A. S., Atkins, D. C., & Althoff, T. (2023). Human–AI collaboration enables more empathic conversations in text-based peer-to-peer mental health support. Nature Machine Intelligence, 5(1), 46–57. https://doi.org/10.1038/s42256-022-00593-2
Shelby, R., Rismani, S., Henne, K., Moon, A., Rostamzadeh, N., Nicholas, P., Yilla-Akbari, N., Gallegos, J., Smart, A., Garcia, E., & others. (2023). Sociotechnical harms of algorithmic systems: Scoping a taxonomy for harm reduction. Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, 723–741.
Sheng, E., Chang, K.-W., Natarajan, P., & Peng, N. (2019). The Woman Worked as a Babysitter: On Biases in Language Generation. In K. Inui, J. Jiang, V. Ng, & X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3407–3412). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1339
Singh, A., & Joachims, T. (2018). Fairness of Exposure in Rankings. In Y. Guo & F. Farooq (Eds.), Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018 (pp. 2219–2228). ACM. https://doi.org/10.1145/3219819.3220088
Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., & others. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172–180.
Stade, E. C., Stirman, S. W., Ungar, L. H., Boland, C. L., Schwartz, H. A., Yaden, D. B., Sedoc, J., DeRubeis, R. J., Willer, R., & Eichstaedt, J. C. (2024). Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation. In npj Mental Health Research. Nature. https://www.nature.com/articles/s44184-024-00056-z
Van Es, K., Everts, D., & Muis, I. (2021). Gendered language and employment Web sites: How search algorithms can cause allocative harm. First Monday.
Veldanda, A. K., Grob, F., Thakur, S., Pearce, H., Tan, B., Karri, R., & Garg, S. (2023). Investigating Hiring Bias in Large Language Models. R0-FoMo: Robustness of Few-Shot and Zero-Shot Learning in Large Foundation Models.
Verma, S., & Rubin, J. (2018). Fairness definitions explained. Proceedings of the International Workshop on Software Fairness, 1–7.
Vijjini, A. R., Chowdhury, S. B. R., & Chaturvedi, S. (2024). Exploring safety-utility trade-offs in personalized language models. ArXiv Preprint, abs/2406.11107. https://arxiv.org/abs/2406.11107
Wang, A., & Russakovsky, O. (2021). Directional Bias Amplification. In M. Meila & T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (Vol. 139, pp. 10882–10893). PMLR. http://proceedings.mlr.press/v139/wang21t.html
Wang, B., Chen, W., Pei, H., Xie, C., Kang, M., Zhang, C., Xu, C., Xiong, Z., Dutta, R., Schaeffer, R., Truong, S. T., Arora, S., Mazeika, M., Hendrycks, D., Lin, Z., Cheng, Y., Koyejo, S., Song, D., & Li, B. (2023). DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/63cb9921eecf51bfad27a99b2c53dd6d-Abstract-Datasets_and_Benchmarks.html
Wang, J., Ma, W., Sun, P., Zhang, M., & Nie, J.-Y. (2024). Understanding User Experience in Large Language Model Interactions. https://arxiv.org/abs/2401.08329
Wang, Z., Wu, Z., Guan, X., Thaler, M., Koshiyama, A., Lu, S., Beepath, S., Ertekin, E., & Perez-Ortiz, M. (2024). JobFair: A framework for benchmarking gender hiring bias in large language models. Findings of the Association for Computational Linguistics: EMNLP 2024, 3227–3246.
Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How Does LLM Safety Training Fail? In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/fd6613131889a4b656206c50a8bd7790-Abstract-Conference.html
Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., Kenton, Z., Brown, S., Hawkins, W., Stepleton, T., Biles, C., Birhane, A., Haas, J., Rimell, L., Hendricks, L. A., … Gabriel, I. (2021). Ethical and social risks of harm from Language Models. ArXiv Preprint, abs/2112.04359. https://arxiv.org/abs/2112.04359
Winata, G. I., Madotto, A., Lin, Z., Liu, R., Yosinski, J., & Fung, P. (2021). Language Models are Few-shot Multilingual Learners. In D. Ataman, A. Birch, A. Conneau, O. Firat, S. Ruder, & G. G. Sahin (Eds.), Proceedings of the 1st Workshop on Multilingual Representation Learning (pp. 1–15). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.mrl-1.1
Xu, Z., Liu, Y., Deng, G., Li, Y., & Picek, S. (2024). A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024 (pp. 7432–7449). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-acl.443
Yang, X., Wang, X., Zhang, Q., Petzold, L., Wang, W. Y., Zhao, X., & Lin, D. (2023). Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models. ArXiv Preprint, abs/2310.02949. https://arxiv.org/abs/2310.02949
Yi, J., Ye, R., Chen, Q., Zhu, B., Chen, S., Lian, D., Sun, G., Xie, X., & Wu, F. (2024). On the Vulnerability of Safety Alignment in Open-Access LLMs. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024 (pp. 9236–9260). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-acl.549
Yi, S., Liu, Y., Sun, Z., Cong, T., He, X., Song, J., Xu, K., & Li, Q. (2024). Jailbreak Attacks and Defenses Against Large Language Models: A Survey. ArXiv Preprint, abs/2407.04295. https://arxiv.org/abs/2407.04295
Yu, Z., Liu, X., Liang, S., Cameron, Z., Xiao, C., & Zhang, N. (2024). Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models. ArXiv Preprint, abs/2403.17336. https://arxiv.org/abs/2403.17336
Zeng, Y., Lin, H., Zhang, J., Yang, D., Jia, R., & Shi, W. (2024). How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. ArXiv Preprint, abs/2401.06373. https://arxiv.org/abs/2401.06373
Zhan, Q., Fang, R., Bindu, R., Gupta, A., Hashimoto, T., & Kang, D. (2024). Removing RLHF Protections in GPT-4 via Fine-Tuning. In K. Duh, H. Gomez, & S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers) (pp. 681–687). Association for Computational Linguistics. https://aclanthology.org/2024.naacl-short.59
Zhang, Z., Lei, L., Wu, L., Sun, R., Huang, Y., Long, C., Liu, X., Lei, X., Tang, J., & Huang, M. (2023). SafetyBench: Evaluating the safety of large language models with multiple choice questions. ArXiv Preprint, abs/2309.07045. https://arxiv.org/abs/2309.07045
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K.-W. (2017). Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. In M. Palmer, R. Hwa, & S. Riedel (Eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 2979–2989). Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1323
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K.-W. (2018). Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. In M. Walker, H. Ji, & A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (pp. 15–20). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2003
Zhuo, T. Y., Huang, Y., Chen, C., & Xing, Z. (2023). Red teaming ChatGPT via jailbreaking: Bias, robustness, reliability and toxicity. ArXiv Preprint, abs/2301.12867. https://arxiv.org/abs/2301.12867
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. ArXiv Preprint, abs/2307.15043. https://arxiv.org/abs/2307.15043