Human-Level Evaluations

Unlike model-level evaluations, which focus on what the system produces, human-level evaluations focus on how people experience the HCLLM (Chang et al., 2023; Parmanto et al., 2024). We focus particularly on human values (Human Values), bias (Bias and Fairness Evaluation), and safety (Safety Evaluations).

Human Values

Evaluations can measure needs, values, and aesthetic principles that humans care about. We discuss coherence, creativity, empathy, helpfulness, transparency, and user satisfaction, each in turn. By evaluating against these, model developers can create systems that not only perform well technically, but also enhance the user experience.

Coherence. Coherence ensures that the generated text flows logically and is understandable to human readers (Dang, 2006). Reinhart (1980) defines three conditions for coherence: (i) cohesion, (ii) consistency, and (iii) relevance. Cohesion focuses on syntactic structure, ensuring that sentences are formally linked through referential links or semantic connectors. Consistency requires logical alignment between sentences, ensuring they can coexist truthfully within a single interpretive framework. Relevance emphasizes the relationship between sentences, the topic at hand, and its broader context. Without coherence, LLM outputs would be disconnected language fragments that fail to provide meaningful information, potentially jumping between topics or making contradictory statements that human readers struggle to follow. This would significantly impair the communication with and the trustworthiness of LLMs, as humans rely on coherent communication to build understanding.

Creativity. Creativity metrics assess the originality and diversity of outputs, while still ensuring factual accuracy. These dimensions are particularly critical for content generation tasks, balancing innovation with reliability (De et al., 2022). For constructs like creativity, where there may not be clear computational measures, researchers may consult long-established fields that study these constructs and have well-defined rubrics, such as psychology or literature (Amabile, 1983; Mozaffari, 2013).

Empathy. Metrics should measure an LLM’s ability to recognize and respond to user emotions empathetically, especially in sensitive contexts. Given that LLMs have been widely adapted to sensitive real-world contexts such as behavioral health, medicine, and education (Stade et al., 2024), evaluations focusing on emotional consistency and appropriateness could ensure that responses are suitable and free of instability that could deeply affect end users. Such metrics should evaluate how LLMs’ responses influence attitudes or behaviors in real-world scenarios, drawing on feedback from human domain experts, such as psychologists, physicians, or educators, who assess the quality of the LLMs’ outputs against their fields’ standardized measures (Demszky et al., 2023). Such evaluations could also promote the development of human-AI collaboration systems, which have been shown to elevate empathetic responses even in human-to-human conversations (Sharma et al., 2023).

Helpfulness. Evaluation metrics should assess the model’s ability to provide relevant, beneficial, and non-offensive information tailored to user needs, in relation to the behavioral impact of the model (Peng et al., 2024). Models are increasingly developed to address specific needs in the world. It therefore becomes important to track the helpfulness of the model in its specified downstream tasks and to evaluate the users’ state, knowledge, and performance relative to their exposure to the system. For example, a model designed to help users prepare for events that require conflict resolution must be able to simulate realistic conflict scenarios tailored to the user’s needs, provide diverse examples and responses, and promote guided practice in which users receive feedback and improve (Shaikh et al., 2024). In the evaluation of such systems, technical components such as language generation and accuracy would still be evaluated, but soliciting feedback from actual domain users through behavioral assessments would provide valuable insights for development. These impact-focused evaluations consider the model’s generalizability in complex, real-world scenarios and provide a more accurate assessment of its practical value from the domain users’ perspectives.

Transparency. Transparency is a cornerstone of responsible AI and is crucial for human-centered LLM systems. It enables users to understand system limitations and make informed decisions about when and how to rely on model assistance. Approaches to transparency should include model reporting, publishing evaluation results, providing explanations, and communicating uncertainty. These methods help different stakeholders understand and trust the LLMs, ensuring that the systems are used responsibly and effectively (Liao & Vaughan, 2023).

User Satisfaction. As models grow larger, become more task-specific, and are more deeply integrated into day-to-day roles, general-purpose benchmarks may not be enough to evaluate model performance in the wild, and evaluators may seek feedback specific to a particular group of models. Actual usage data could therefore benefit the development-to-deployment cycle the most. Metrics derived from user feedback, interaction logs, and satisfaction ratings provide direct insights into the real-world effectiveness of LLMs. These are essential for understanding how users perceive and interact with model outputs. As an example, in an attempt to understand how models can be better aligned with user needs, Wang et al. (2024) consult actual users, highlighting a need for user-centric evaluation.

Bias and Fairness Evaluation

Drawing from the taxonomy of algorithmic harm developed by Shelby et al. (2023), bias, in particular, can be conceptualized along three dimensions: (1) representational, (2) allocation, and (3) quality of service. These axes of harm require careful evaluation to avoid further entrenchment of social hierarchies, inequitable resource distribution, and performance disparities across demographic groups (Blodgett et al., 2020; Shelby et al., 2023). The human implications of these harms extend beyond technical measurements to real-world consequences that affect people’s dignity, opportunities, and quality of life (Hofmann et al., 2024).

Representational Bias.

Representational bias in model outputs reflects, and in some cases, amplifies (A. Wang & Russakovsky, 2021; Zhao et al., 2017), our own implicit associations and social hierarchies. This dimension of bias includes stereotyping, demeaning, erasure, alienation, denial of self-identity, and the insistence on essentialist identity categories (Shelby et al., 2023). These harms impact how individuals perceive themselves and their communities, potentially reinforcing societal prejudices and stereotypes that limit human potential. Hu et al. (2024) found that language models exhibit social identity bias, mirroring human ingroup solidarity and outgroup hostility.

Stereotype benchmarks dominate evaluations along this dimension because they offer standardized methods and baselines. For masked-language models, notable frameworks include StereoSet (SS) (Nadeem et al., 2021), CrowS-Pairs (CS) (Nangia et al., 2020), WinoBias (WB) (Zhao et al., 2018), and WinoGender (WG) (Rudinger et al., 2018), all of which are collections of contrastive prompt pairs (stereotype vs. non-stereotype) that aggregate to a score for relative comparison between identity groups (race, gender identity, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status). These comparisons capture a model’s tendency to associate social groups with particular target terms of interest through predicted token probabilities for masked identifiers. Researchers have also employed co-reference resolution tasks, where ambiguous identifiers reference the same entity, to measure associations between identity markers and terms of interest, whether they be descriptors, stereotypes, occupations, or other attributes (Clark & Manning, 2016).
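To make the aggregation step concrete, the following is a minimal sketch of how contrastive pairs can be reduced to a per-category score. It assumes that a log-probability (or pseudo-log-likelihood) has already been computed for each sentence by the model under test. The function name, the data layout, and the numbers are hypothetical illustrations, not the scoring code of any benchmark cited above.

```python
from collections import defaultdict


def stereotype_preference_rate(pairs):
    """Return, per bias category, the share of pairs in which the model
    scores the stereotypical sentence above the anti-stereotypical one.

    `pairs` is a list of (category, logp_stereo, logp_anti) tuples.
    A rate near 0.5 suggests no systematic preference; ties count as half.
    """
    wins = defaultdict(float)
    totals = defaultdict(int)
    for category, logp_stereo, logp_anti in pairs:
        totals[category] += 1
        if logp_stereo > logp_anti:
            wins[category] += 1.0
        elif logp_stereo == logp_anti:
            wins[category] += 0.5
    return {category: wins[category] / totals[category] for category in totals}


# Hypothetical log-probabilities, for illustration only.
example_pairs = [
    ("gender", -12.1, -13.4),
    ("gender", -15.0, -14.2),
    ("gender", -9.8, -10.5),
    ("gender", -11.0, -11.9),
    ("age", -20.3, -20.3),
    ("age", -18.7, -17.9),
]
rates = stereotype_preference_rate(example_pairs)
print(rates)
```

Reporting the rate per category rather than as one pooled number keeps a strong preference in one identity group from being averaged away by the others.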

Another line of research has focused on open-ended text generation and produced datasets of carefully curated questions and prompts to draw out stereotypes specific to certain social groups (Dhamala et al., 2021; Gehman et al., 2020; Naous et al., 2024; Parrish et al., 2022). For open-ended prompts, classifier-based comparative metrics like toxicity (Chowdhery et al., 2023; Chung et al., 2024; Gehman et al., 2020; Liang et al., 2022), sentiment (Roehrick, 2020), and regard (Sheng et al., 2019) serve as better indicators of bias than relative probability distributions of target terms. Despite the wide adoption of all these benchmarks and datasets, critics find systematic conceptual issues—unstated assumptions, ambiguities, and inconsistencies in what is measured—and operational failures in their execution (Blodgett et al., 2021; McIntosh et al., 2024; Seshadri et al., 2022).
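A classifier-based comparison of this kind can be sketched as follows, assuming that an external classifier (for toxicity, sentiment, or regard) has already assigned a score in [0, 1] to each model continuation. The function name, group labels, and scores are hypothetical, and the largest gap between group means is only one of several possible summary statistics.

```python
from statistics import mean


def group_score_gap(scored_generations):
    """Average a classifier score per group and report the largest gap.

    `scored_generations` maps a group label to the list of scores that
    an external classifier assigned to continuations of prompts about
    that group. Every group is assumed to have at least one score.
    """
    group_means = {
        group: mean(scores) for group, scores in scored_generations.items()
    }
    gap = max(group_means.values()) - min(group_means.values())
    return group_means, gap


# Hypothetical toxicity scores, for illustration only.
toxicity_scores = {
    "group_a": [0.10, 0.20, 0.30],
    "group_b": [0.40, 0.50, 0.60],
}
group_means, gap = group_score_gap(toxicity_scores)
print(group_means, gap)
```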

Some datasets, such as that of Parrish et al. (2022), integrate perturbed context windows to explore the relationship between output bias and any ambiguous identity groups in the input. However, recent investigations into prompting methods and system-level personas also reveal new confounds for these approaches, finding that results vary based on the perturbation methodology (Deshpande et al., 2023). Additionally, survey papers in this field recognize that many studies do not contextualize their work within established definitions of bias (Blodgett et al., 2020, 2021). Finally, there are mounting concerns over test set contamination (Jegorova et al., 2022; Reid et al., 2024; B. Wang et al., 2023; Zhuo et al., 2023).

Allocational Bias.

Allocational bias is a direct consequence of representational bias (Devine, 2001; Kurdi et al., 2019), resulting in an unequal distribution of resources—whether financial, opportunity-based, or service-related (Barocas et al., 2017; Eubanks, 2018). Its human cost is particularly severe, as it directly affects access to essential resources, economic mobility, and social participation.

In domains where model outputs can impact the material stability of vulnerable communities or social groups, such as housing, employment, social services, finance, education, and healthcare (Obermeyer et al., 2019), it’s especially critical to evaluate discrepancies among social groups. In the employment domain, this may manifest as resume screening tools that systematically favor men over other genders (Singh & Joachims, 2018; Van Es et al., 2021) or white-sounding candidates over people of color based on the implicit identity markers in their name (Armstrong et al., 2024; Mujtaba & Mahapatra, 2019). Similarly, in social services and healthcare domains, screening tools may incorporate existing inequities related to education level, income, and race into their decision-making processes (Eubanks, 2018; Obermeyer et al., 2019; Pessach & Shmueli, 2022).

While representational harm has established evaluation frameworks, allocation harm has historically lacked standardized benchmarks and well-documented baselines for consistent measurement. Emergent work by Z. Wang et al. (2024) represents one of the first significant exceptions to this pattern, where they systematically measure employment as a downstream task by creating the JobFair dataset to quantify inequitable outcomes across gender identities. The benchmark includes resume templates with varying demographic information passed to LLMs to score and rank. Beyond this recent development, the dominant approach for evaluating this dimension of bias has required measuring outcome discrepancies when LLMs are tasked with decision-making (Armstrong et al., 2024; Salinas et al., 2023; Veldanda et al., 2023).

These investigations typically build upon established fairness metrics from prior literature, such as Equal Opportunity (EOG; equal true positive rates) and Equalized Odds (equal true positive and false positive rates) (Hardt et al., 2016), as well as Demographic Parity (equal likelihood of a positive outcome) (Dwork et al., 2012; Kusner et al., 2017), among others (Verma & Rubin, 2018). Additional work has explored causal and counterfactual fairness approaches to better capture complex biases that arise in real-world decision-making contexts (Kilbertus et al., 2018).
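The group-level gaps behind these metrics can be sketched as follows, assuming binary decisions and binary ground-truth labels for two groups. The function names and toy data are hypothetical, and reporting equalized odds as the larger of the true positive and false positive rate gaps is one common convention rather than the only one.

```python
def group_rates(y_true, y_pred):
    """Return (positive rate, true positive rate, false positive rate).

    Both inputs are equal-length lists of 0/1 values for one group.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("each group needs at least one positive and one negative label")
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return sum(y_pred) / len(y_pred), true_pos / n_pos, false_pos / n_neg


def fairness_gaps(group_a, group_b):
    """Compare two groups, each given as a (y_true, y_pred) pair."""
    pos_a, tpr_a, fpr_a = group_rates(*group_a)
    pos_b, tpr_b, fpr_b = group_rates(*group_b)
    return {
        # Demographic parity: equal likelihood of a positive outcome.
        "demographic_parity_gap": abs(pos_a - pos_b),
        # Equal opportunity: equal true positive rates.
        "equal_opportunity_gap": abs(tpr_a - tpr_b),
        # Equalized odds: equal true positive and false positive rates.
        "equalized_odds_gap": max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b)),
    }


# Hypothetical screening decisions (1 = advanced to interview).
group_a = ([1, 1, 0, 0], [1, 1, 1, 0])
group_b = ([1, 1, 0, 0], [1, 0, 0, 0])
gaps = fairness_gaps(group_a, group_b)
print(gaps)
```

A gap of zero on any of these measures means the two groups are treated identically by that criterion; the criteria can disagree with one another, which is why studies often report several.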

A parallel line of research investigates allocation harm that arises from identity-based performance disparities. These differences manifest in various contexts, from performance on non-bias-based benchmarks like MultiMedQA, where models consistently underperform on inquiries specific to certain demographic groups (McIntosh et al., 2024; Singhal et al., 2023), to fundamental downstream tasks including Named Entity Recognition (NER), classification, and text generation (Blodgett et al., 2016; Blodgett & O’Connor, 2017). Language model performance degradation is particularly well-documented for English slang and dialectal variations (Bender et al., 2021; Blodgett et al., 2016; Joshi et al., 2020). These disparities become even more pronounced when evaluating cross-linguistic performance, largely due to the predominance of English in training data (Brown et al., 2020; Winata et al., 2021). In this way, these performance discrepancies span both the subjects of generated text and the users of the models, creating a dual layer of exclusion for marginalized communities.

As LLMs increasingly influence resource allocation in critical systems and domains such as housing, healthcare, and employment, the interplay between these dimensions of harm requires improved evaluation methods. Future research must prioritize developing evaluation frameworks that establish coherent normative criteria, adapt effectively to open-ended tasks, and address intersectional identities with increasing sophistication, all while maintaining efficacy as models scale and directly involving affected communities in the design and evaluation of these systems (Raji et al., 2022).

Safety Evaluations

Safety refers to the ability of language models to generate content that does not cause harm, spread misinformation, or violate ethical standards (Huang et al., 2023). It encompasses preventing models from producing toxic, discriminatory, or dangerous outputs, even when deliberately prompted to do so. As language models become increasingly integrated into critical applications across healthcare, education, and legal domains, ensuring safety has become paramount. Extensive safeguards are implemented during training: Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022) has been widely adopted to align language models with human preferences, and Bai et al. (2022) proposed Reinforcement Learning from AI Feedback (RLAIF), which helps to improve safety in language models.

Despite these efforts, ensuring safety remains a complex and evolving challenge. This is partly due to a lack of unified evaluation benchmarks (Röttger et al., 2024), and partly due to the nature of LLMs. Language models learn from vast and diverse datasets and can exhibit unpredictable behaviors in specific contexts (Bender et al., 2021). Such unpredictability often becomes evident when models are exposed to adversarial or unexpected inputs, highlighting significant gaps in existing safety mechanisms. As a result, while current safeguards can be effective under typical conditions, they may not be sufficient to anticipate or mitigate every possible misuse scenario. The importance of robust safety evaluations is further underscored by concerns surrounding data privacy and copyright discussed in Consent and Ownership.

Datasets for Safety Evaluation.

The growing demand for ethical and aligned AI has led to the development of numerous datasets and benchmarks to evaluate and improve the safety, reliability, and alignment of LLMs. These datasets vary widely in scope, methodology, and focus areas, reflecting the multifaceted nature of LLM safety. Dong et al. (2024) categorize the topics of existing evaluation datasets for LLM safety into four categories: toxicity (generation of offensive language, instructions for illegal activities, and harmful content), discrimination (biases against marginalized groups and protected characteristics), privacy (safeguarding personal information and intellectual property), and misinformation (measuring the tendency to generate false or misleading information).

Many popular and relatively comprehensive benchmarks have been frequently used in research studies. ToxiGen (Hartvigsen et al., 2022) is a large-scale, autocomplete-style dataset comprising 274k toxic statements across 13 minority groups, designed to detect implicit toxic speech. It includes human annotations to assess the naturalness and perceived harmfulness of machine-generated text; however, Röttger et al. (2024) highlight that this dataset may not accurately reflect real-world usage scenarios for modern LLMs. AdvBench (Zou et al., 2023) focuses on adversarial robustness by providing 500 toxic strings and 500 harmful behaviors to evaluate the resilience of LLMs against prompts intended to generate harmful outputs. TruthfulQA (Lin et al., 2022) evaluates factual accuracy with 817 questions spanning 38 categories, demonstrating how larger LLMs often replicate human misconceptions and emphasizing the need for improved training objectives. SafetyBench (Zhang et al., 2023) offers a comprehensive safety evaluation framework with 11,435 multiple-choice questions across seven critical categories, enabling assessments in both English and Chinese for a more diverse linguistic perspective. Furthermore, Zhuo et al. (2023) introduce a benchmark specifically for evaluating ChatGPT’s ethical performance, systematically examining bias, reliability, robustness, and toxicity to reveal both advancements and ongoing challenges. Collectively, these datasets play a pivotal role in advancing safer and more trustworthy AI systems.

Metrics for Measuring LLM Safety.

Evaluation metrics are critical for assessing the safety performance of LLMs. Key metrics include the Attack Success Rate (ASR) (Dong et al., 2024; Zou et al., 2023), which measures the percentage of successful instances where models generate harmful target outputs following adversarial prompts. Fine-grained metrics, such as the toxicity score (Hartvigsen et al., 2022), evaluate the extent of toxic or harmful content produced in the generated text. Truthfulness, assessed based on strict factual accuracy standards, focuses on whether statements accurately reflect factual information rather than conforming to belief systems (Lin et al., 2022). Additionally, safety-related multiple-choice questions, such as those in SafetyBench, are used to evaluate LLMs’ ability to address specific safety concerns (Zhang et al., 2023). When applied to diverse datasets, these metrics provide a comprehensive framework for evaluating LLM safety, guiding efforts to reduce risks, improve alignment with ethical standards, and enhance trustworthiness in deployment.
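As a minimal sketch, ASR reduces to a simple ratio once each adversarial prompt has been judged. The sketch assumes that the judgment itself (by human annotation, a classifier, or another procedure) has been made elsewhere; the function name and the example judgments are hypothetical.

```python
def attack_success_rate(judgments):
    """Fraction of adversarial prompts whose response was judged harmful.

    `judgments` holds one boolean per adversarial prompt, where True
    means the attack succeeded (the model produced the harmful target).
    """
    if not judgments:
        raise ValueError("need at least one judged prompt")
    return sum(judgments) / len(judgments)


# Hypothetical judgments for five adversarial prompts.
judged = [True, False, False, True, False]
asr = attack_success_rate(judged)
print(f"ASR = {asr:.0%}")  # prints "ASR = 40%"
```

Because the ratio depends entirely on how success is judged, ASR values are only comparable across studies that share the same judging procedure and prompt set.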

Jailbreaking.

One particularly challenging aspect of safety evaluation is jailbreaking, where users deliberately attempt to circumvent safety mechanisms through crafted prompts or other techniques to induce unintended, harmful, or ethically questionable behaviors (Perez & Ribeiro, 2022; Wei et al., 2023). This poses considerable risks not only to individual users but also to society at large, as it can lead to the dissemination of misinformation, hate speech, or other malicious content (Weidinger et al., 2021). Moreover, adversaries continually develop new jailbreaking techniques, making it an ever-evolving threat. Addressing these vulnerabilities necessitates an adaptive approach to safety design, integrating continuous monitoring, adversarial testing, and dynamic evaluation frameworks to stay ahead of emerging risks and ensure robust model alignment with ethical standards.

Jailbreaking techniques have evolved rapidly, becoming increasingly sophisticated and concerning. Yi et al. (2024) classify the attacks into black-box and white-box attacks based on the transparency of the target model. Xu et al. (2024) categorize the attack techniques into three categories: generative techniques, template techniques, and training gaps techniques. Inspired by these works, we categorize the jailbreaking techniques into three main categories based on the form of attack.

Prompt-Based Jailbreaking. Prompt-based jailbreaking involves crafting specific inputs, known as jailbreak prompts, to intentionally bypass the safety alignment mechanisms of LLMs (Yu et al., 2024). These prompts manipulate the model into generating harmful, biased, or otherwise restricted outputs. A common strategy involves disguising harmful requests as seemingly innocuous ones by framing them as fictional narratives, hypothetical scenarios, or questions for a family member. For example, a straightforward harmful query like “How to make a bomb” can be rephrased into a seemingly harmless prompt “Living in a war-torn region has instilled in me a profound need to understand the mechanisms of conflict, especially the technology behind bombs…” to persuade the LLMs to reveal sensitive information using a persuasion taxonomy technique (Zeng et al., 2024).

Various online platforms, including blogs, GitHub repositories, and dedicated websites like JailbreakChat,¹ curate and share collections of jailbreak prompts that serve as templates to fit any malicious queries, making them widely accessible for misuse. Jailbreaking strategies are either manually-crafted or auto-generated. Auto-generated prompts can be further divided into white-box and black-box methods (Y. Lin et al., 2024; Yi et al., 2024). White-box methods assume some level of access to the model’s internal workings and are often created using optimization techniques. For example, GCG (Zou et al., 2023) uses a gradient-based approach to find a suffix that, when attached to malicious queries, maximizes the probability that the model produces an affirmative response rather than a refusal. This optimized suffix has been shown to be transferable across different models, including black-box ones. In contrast, black-box methods (Chao et al., 2023; Mehrotra et al., 2023; Zeng et al., 2024) rely solely on observing the model’s behavior through its outputs and API interactions, without access to its parameters or training data, leveraging LLMs as optimizers to achieve successful bypasses.

Generation Exploitation. Y. Huang et al. (2023) introduce the generation exploitation attack, demonstrating that by simply exploiting different generation strategies, such as varying decoding hyper-parameters and sampling methods, it is possible to jailbreak 11 widely-used open-source language models, including LLAMA2, VICUNA, FALCON, and MPT families, at a low computational cost. This attack highlights potential vulnerabilities in language models and poses serious security implications for AI safety and alignment research.

Model Fine-Tuning. AI companies like OpenAI now offer fine-tuning-as-a-service. They allow users to upload customized data for fine-tuning, with the fine-tuned models hosted on the provider’s servers and accessible via APIs. However, this framework introduces a new type of threat, where harmful data may be used during fine-tuning, either intentionally or unintentionally, to compromise the alignment built in pre-trained models (T. Huang et al., 2024; Qi et al., 2023; Yang et al., 2023; J. Yi et al., 2024; Zhan et al., 2024). Moreover, He et al. (2024) propose a method to sample more harmful examples from a benign dataset, demonstrating that such examples can significantly degrade model safety.

Cultural and Contextual Sensitivity. Safety evaluations must account for linguistic and cultural diversity. What constitutes harmful content varies significantly across contexts, making universal safety standards difficult to establish. More nuanced, context-aware evaluation frameworks are needed to address these complexities (Li et al., 2024).

Balancing Safety and Utility. Overly restrictive safety measures can limit the utility of LLMs for legitimate purposes. Finding the optimal balance between safety and functionality remains a significant challenge, particularly in sensitive domains like healthcare, legal advice, and educational content (Vijjini et al., 2024).

Alignment with Evolving Social Values. As societal values and ethical standards evolve, safety mechanisms must adapt accordingly. This necessitates ongoing dialogue between AI developers, ethicists, policymakers, and diverse stakeholders to ensure that safety frameworks remain relevant and effective (S. Li et al., 2024).

¹ The website is no longer active, but Alex Albert used to maintain jailbreakchat.com.

Amabile, T. M. (1983). The case for a social psychology of creativity. The Journal of Creative Behavior, 46(1), 3–15. https://doi.org/10.1007/978-1-4612-5533-8_1

Armstrong, L., Liu, A., MacNeil, S., & Metaxa, D. (2024). The Silicon Ceiling: Auditing GPT’s Race and Gender Biases in Hiring. Proceedings of the 4th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, 1–18.

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., … Kaplan, J. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. ArXiv Preprint, abs/2204.05862. https://arxiv.org/abs/2204.05862

Barocas, S., Crawford, K., Shapiro, A., & Wallach, H. (2017). The problem with bias: Allocative versus representational harms in machine learning. 9th Annual Conference of the Special Interest Group for Computing, Information and Society, 1.

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3442188.3445922

Blodgett, S. L., Barocas, S., Daumé III, H., & Wallach, H. (2020). Language (Technology) is Power: A Critical Survey of “Bias” in NLP. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 5454–5476). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.485

Blodgett, S. L., Green, L., & O’Connor, B. (2016). Demographic Dialectal Variation in Social Media: A Case Study of African-American English. In J. Su, K. Duh, & X. Carreras (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 1119–1130). Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1120

Blodgett, S. L., Lopez, G., Olteanu, A., Sim, R., & Wallach, H. (2021). Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets. In C. Zong, F. Xia, W. Li, & R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 1004–1015). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.81

Blodgett, S. L., & O’Connor, B. (2017). Racial Disparity in Natural Language Processing: A Case Study of Social Media African-American English. ArXiv Preprint, abs/1707.00061. https://arxiv.org/abs/1707.00061

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.-F. Balcan, & H.-T. Lin (Eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html

Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., Ye, W., Zhang, Y., Chang, Y., Yu, P. S., Yang, Q., & Xie, X. (2023). A Survey on Evaluation of Large Language Models. ArXiv Preprint, abs/2307.03109. https://arxiv.org/abs/2307.03109

Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., & Wong, E. (2023). Jailbreaking black box large language models in twenty queries. ArXiv Preprint, abs/2310.08419. https://arxiv.org/abs/2310.08419

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., & others. (2023). PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240), 1–113.

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., & others. (2024). Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70), 1–53.

Clark, K., & Manning, C. D. (2016). Deep Reinforcement Learning for Mention-Ranking Coreference Models. In J. Su, K. Duh, & X. Carreras (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 2256–2262). Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1245

Dang, H. T. (2006). DUC 2005: Evaluation of Question-Focused Summarization Systems. In T.-S. Chua, J. Goldstein, S. Teufel, & L. Vanderwende (Eds.), Proceedings of the Workshop on Task-Focused Summarization and Question Answering (pp. 48–55). Association for Computational Linguistics. https://aclanthology.org/W06-0707

De, A., Gudipudi, S. S., Panchanan, S., & Desarkar, M. S. (2022). ComplAI: Theory of A Unified Framework for Multi-factor Assessment of Black-Box Supervised Machine Learning Models. ArXiv Preprint, abs/2212.14599. https://api.semanticscholar.org/CorpusID:255340443

Demszky, D., Yang, D., Yeager, D. S., Bryan, C. J., Clapper, M., Chandhok, S., Eichstaedt, J. C., Hecht, C., Jamieson, J., Johnson, M., & others. (2023). Using large language models in psychology. Nature Reviews Psychology. https://doi.org/10.1038/s44159-023-00241-5

Deshpande, A., Murahari, V., Rajpurohit, T., Kalyan, A., & Narasimhan, K. (2023). Toxicity in ChatGPT: Analyzing persona-assigned language models. In H. Bouamor, J. Pino, & K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 1236–1270). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-emnlp.88

Devine, P. G. (2001). Implicit prejudice and stereotyping: how automatic are they? Introduction to the special section. Journal of Personality and Social Psychology, 81(5), 757.

Dhamala, J., Sun, T., Kumar, V., Krishna, S., Pruksachatkun, Y., Chang, K.-W., & Gupta, R. (2021). BOLD: Dataset and metrics for measuring biases in open-ended language generation. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 862–872.

Dong, Z., Zhou, Z., Yang, C., Shao, J., & Qiao, Y. (2024). Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey. In K. Duh, H. Gomez, & S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 6734–6747). Association for Computational Linguistics. https://aclanthology.org/2024.naacl-long.375

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). Fairness through awareness. Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, 214–226.

Eubanks, V. (2018). Automating inequality: How high-tech tools profile, police, and punish the poor. St. Martin’s Press.

Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In T. Cohn, Y. He, & Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 3356–3369). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.301

Hardt, M., Price, E., & Srebro, N. (2016). Equality of Opportunity in Supervised Learning. In D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain (pp. 3315–3323). https://proceedings.neurips.cc/paper/2016/hash/9d2682367c3935defcb1f9e247a97c0d-Abstract.html

Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D., & Kamar, E. (2022). ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection. In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3309–3326). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.234

He, L., Xia, M., & Henderson, P. (2024). What is in Your Safe Data? Identifying Benign Data that Breaks Safety. In ArXiv preprint: Vol. abs/2404.01099. https://arxiv.org/abs/2404.01099

Hofmann, V., Kalluri, P. R., Jurafsky, D., & King, S. (2024). Dialect prejudice predicts AI decisions about people’s character, employability, and criminality. https://arxiv.org/abs/2403.00742

Hu, T., Kyrychenko, Y., Rathje, S., Collier, N., van der Linden, S., & Roozenbeek, J. (2024). Generative Language Models Exhibit Social Identity Biases. https://arxiv.org/abs/2310.15819

Huang, T., Hu, S., Ilhan, F., Tekin, S. F., & Liu, L. (2024). Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey. In ArXiv preprint: Vol. abs/2409.18169. https://arxiv.org/abs/2409.18169

Huang, X., Ruan, W., Huang, W., Jin, G., Dong, Y., Wu, C., Bensalem, S., Mu, R., Qi, Y., Zhao, X., Cai, K., Zhang, Y., Wu, S., Xu, P., Wu, D., Freitas, A., & Mustafa, M. A. (2023). A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation. https://arxiv.org/abs/2305.11391

Huang, Y., Gupta, S., Xia, M., Li, K., & Chen, D. (2023). Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation. In ArXiv preprint: Vol. abs/2310.06987. https://arxiv.org/abs/2310.06987

Jegorova, M., Kaul, C., Mayor, C., O’Neil, A. Q., Weir, A., Murray-Smith, R., & Tsaftaris, S. A. (2022). Survey: Leakage and privacy at inference time. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7), 9090–9108.

Joshi, P., Santy, S., Budhiraja, A., Bali, K., & Choudhury, M. (2020). The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 6282–6293). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.560

Kilbertus, N., Rojas-Carulla, M., Parascandolo, G., Hardt, M., Janzing, D., & Schölkopf, B. (2018). Avoiding Discrimination through Causal Reasoning. https://arxiv.org/abs/1706.02744

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., Tomezsko, D., Greenwald, A. G., & Banaji, M. R. (2019). Relationship between the Implicit Association Test and intergroup behavior: A meta-analysis. American Psychologist, 74(5), 569.

Kusner, M. J., Loftus, J. R., Russell, C., & Silva, R. (2017). Counterfactual Fairness. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA (pp. 4066–4076). https://proceedings.neurips.cc/paper/2017/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html

Li, C., Chen, M., Wang, J., Sitaram, S., & Xie, X. (2024). CultureLLM: Incorporating Cultural Differences into Large Language Models. The Thirty-Eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=sIsbOkQmBL

Li, S., Sun, T., Cheng, Q., & Qiu, X. (2024). Agent Alignment in Evolving Social Norms. https://arxiv.org/abs/2401.04620

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., & others. (2022). Holistic evaluation of language models. ArXiv Preprint, abs/2211.09110. https://arxiv.org/abs/2211.09110

Liao, Q., & Vaughan, J. (2023). AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap. ArXiv Preprint, abs/2306.01941. https://arxiv.org/abs/2306.01941

Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3214–3252). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.229

Lin, Y., He, P., Xu, H., Xing, Y., Yamada, M., Liu, H., & Tang, J. (2024). Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis. ArXiv Preprint, abs/2406.10794. https://arxiv.org/abs/2406.10794

McIntosh, T. R., Susnjak, T., Arachchilage, N., Liu, T., Watters, P., & Halgamuge, M. N. (2024). Inadequacies of large language model benchmarks in the era of generative artificial intelligence. ArXiv Preprint, abs/2402.09880. https://arxiv.org/abs/2402.09880

Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B., Anderson, H., Singer, Y., & Karbasi, A. (2023). Tree of attacks: Jailbreaking black-box LLMs automatically. ArXiv Preprint, abs/2312.02119. https://arxiv.org/abs/2312.02119

Mozaffari, H. (2013). An analytical rubric for assessing creativity in creative writing. Theory and Practice in Language Studies, 3(12). https://doi.org/10.4304/tpls.3.12.2214-2219

Mujtaba, D. F., & Mahapatra, N. R. (2019). Ethical considerations in AI-based recruitment. 2019 IEEE International Symposium on Technology and Society (ISTAS), 1–7.

Nadeem, M., Bethke, A., & Reddy, S. (2021). StereoSet: Measuring stereotypical bias in pretrained language models. In C. Zong, F. Xia, W. Li, & R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 5356–5371). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.416

Nangia, N., Vania, C., Bhalerao, R., & Bowman, S. R. (2020). CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1953–1967). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.154

Naous, T., Ryan, M. J., Ritter, A., & Xu, W. (2024). Having Beer after Prayer? Measuring Cultural Bias in Large Language Models. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 16366–16393). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.862

Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447–453.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html

Parmanto, B., Aryoyudanta, B., Soekinto, W., Setiawan, I., Wang, Y., Hu, H., Saptono, A., & Choi, Y. K. (2024). Development of a Reliable and Accessible Caregiving Language Model (CaLM). ArXiv Preprint, abs/2403.06857. https://arxiv.org/abs/2403.06857

Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P. M., & Bowman, S. (2022). BBQ: A hand-built bias benchmark for question answering. In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), Findings of the Association for Computational Linguistics: ACL 2022 (pp. 2086–2105). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-acl.165

Peng, J.-L., Cheng, S., Diau, E., Shih, Y.-Y., Chen, P.-H., Lin, Y.-T., & Chen, Y.-N. (2024). A Survey of Useful LLM Evaluation. https://arxiv.org/abs/2406.00936

Perez, F., & Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques For Language Models. In ArXiv preprint: Vol. abs/2211.09527. https://arxiv.org/abs/2211.09527

Pessach, D., & Shmueli, E. (2022). A review on fairness in machine learning. ACM Computing Surveys (CSUR), 55(3), 1–44.

Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P., & Henderson, P. (2023). Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Raji, I. D., Kumar, I. E., Horowitz, A., & Selbst, A. (2022). The Fallacy of AI Functionality. 2022 ACM Conference on Fairness, Accountability, and Transparency, 959–972. https://doi.org/10.1145/3531146.3533158

Reid, M., Savinov, N., Teplyashin, D., Lepikhin, D., Lillicrap, T., Alayrac, J., Soricut, R., Lazaridou, A., Firat, O., Schrittwieser, J., & others. (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. ArXiv Preprint, abs/2403.05530. https://arxiv.org/abs/2403.05530

Reinhart, T. (1980). Conditions for Text Coherence. Poetics Today, 1(4), 161–180. https://doi.org/10.2307/1771893

Roehrick, K. (2020). Valence Aware Dictionary and sEntiment Reasoner (VADER). https://CRAN.R-project.org/package=vader

Röttger, P., Pernisi, F., Vidgen, B., & Hovy, D. (2024). SafetyPrompts: A systematic review of open datasets for evaluating and improving large language model safety. ArXiv Preprint, abs/2404.05399. https://arxiv.org/abs/2404.05399

Rudinger, R., Naradowsky, J., Leonard, B., & Van Durme, B. (2018). Gender Bias in Coreference Resolution. In M. Walker, H. Ji, & A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (pp. 8–14). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2002

Salinas, A., Shah, P., Huang, Y., McCormack, R., & Morstatter, F. (2023). The unequal opportunities of large language models: Examining demographic biases in job recommendations by ChatGPT and LLaMA. Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, 1–15.

Seshadri, P., Pezeshkpour, P., & Singh, S. (2022). Quantifying social biases using templates is unreliable. ArXiv Preprint, abs/2210.04337. https://arxiv.org/abs/2210.04337

Shaikh, O., Chai, V. E., Gelfand, M., Yang, D., & Bernstein, M. S. (2024). Rehearsal: Simulating Conflict to Teach Conflict Resolution. In F. “Floyd” Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, & I. Shklovski (Eds.), Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, Honolulu, HI, USA, May 11-16, 2024 (p. 920:1-920:20). ACM. https://doi.org/10.1145/3613904.3642159

Sharma, A., Lin, I. W., Miner, A. S., Atkins, D. C., & Althoff, T. (2023). Human–AI collaboration enables more empathic conversations in text-based peer-to-peer mental health support. Nature Machine Intelligence, 5(1), 46–57. https://doi.org/10.1038/s42256-022-00593-2

Shelby, R., Rismani, S., Henne, K., Moon, Aj., Rostamzadeh, N., Nicholas, P., Yilla-Akbari, N., Gallegos, J., Smart, A., Garcia, E., & others. (2023). Sociotechnical harms of algorithmic systems: Scoping a taxonomy for harm reduction. Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, 723–741.

Sheng, E., Chang, K.-W., Natarajan, P., & Peng, N. (2019). The Woman Worked as a Babysitter: On Biases in Language Generation. In K. Inui, J. Jiang, V. Ng, & X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3407–3412). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1339

Singh, A., & Joachims, T. (2018). Fairness of Exposure in Rankings. In Y. Guo & F. Farooq (Eds.), Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018 (pp. 2219–2228). ACM. https://doi.org/10.1145/3219819.3220088

Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., & others. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172–180.

Stade, E. C., Stirman, S. W., Ungar, L. H., Boland, C. L., Schwartz, H. A., Yaden, D. B., Sedoc, J., DeRubeis, R. J., Willer, R., & Eichstaedt, J. C. (2024). Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation. In npj Mental Health Research. Nature. https://www.nature.com/articles/s44184-024-00056-z

Van Es, K., Everts, D., & Muis, I. (2021). Gendered language and employment Web sites: How search algorithms can cause allocative harm. First Monday.

Veldanda, A. K., Grob, F., Thakur, S., Pearce, H., Tan, B., Karri, R., & Garg, S. (2023). Investigating Hiring Bias in Large Language Models. R0-FoMo: Robustness of Few-Shot and Zero-Shot Learning in Large Foundation Models.

Verma, S., & Rubin, J. (2018). Fairness definitions explained. Proceedings of the International Workshop on Software Fairness, 1–7.

Vijjini, A. R., Chowdhury, S. B. R., & Chaturvedi, S. (2024). Exploring safety-utility trade-offs in personalized language models. ArXiv Preprint, abs/2406.11107. https://arxiv.org/abs/2406.11107

Wang, A., & Russakovsky, O. (2021). Directional Bias Amplification. In M. Meila & T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (Vol. 139, pp. 10882–10893). PMLR. http://proceedings.mlr.press/v139/wang21t.html

Wang, B., Chen, W., Pei, H., Xie, C., Kang, M., Zhang, C., Xu, C., Xiong, Z., Dutta, R., Schaeffer, R., Truong, S. T., Arora, S., Mazeika, M., Hendrycks, D., Lin, Z., Cheng, Y., Koyejo, S., Song, D., & Li, B. (2023). DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/63cb9921eecf51bfad27a99b2c53dd6d-Abstract-Datasets_and_Benchmarks.html

Wang, J., Ma, W., Sun, P., Zhang, M., & Nie, J.-Y. (2024). Understanding User Experience in Large Language Model Interactions. https://arxiv.org/abs/2401.08329

Wang, Z., Wu, Z., Guan, X., Thaler, M., Koshiyama, A., Lu, S., Beepath, S., Ertekin, E., & Perez-Ortiz, M. (2024). JobFair: A framework for benchmarking gender hiring bias in large language models. Findings of the Association for Computational Linguistics: EMNLP 2024, 3227–3246.

Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How Does LLM Safety Training Fail? In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/fd6613131889a4b656206c50a8bd7790-Abstract-Conference.html

Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., Kenton, Z., Brown, S., Hawkins, W., Stepleton, T., Biles, C., Birhane, A., Haas, J., Rimell, L., Hendricks, L. A., … Gabriel, I. (2021). Ethical and social risks of harm from Language Models. In ArXiv preprint: Vol. abs/2112.04359. https://arxiv.org/abs/2112.04359

Winata, G. I., Madotto, A., Lin, Z., Liu, R., Yosinski, J., & Fung, P. (2021). Language Models are Few-shot Multilingual Learners. In D. Ataman, A. Birch, A. Conneau, O. Firat, S. Ruder, & G. G. Sahin (Eds.), Proceedings of the 1st Workshop on Multilingual Representation Learning (pp. 1–15). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.mrl-1.1

Xu, Z., Liu, Y., Deng, G., Li, Y., & Picek, S. (2024). A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024 (pp. 7432–7449). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-acl.443

Yang, X., Wang, X., Zhang, Q., Petzold, L., Wang, W. Y., Zhao, X., & Lin, D. (2023). Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models. In ArXiv preprint: Vol. abs/2310.02949. https://arxiv.org/abs/2310.02949

Yi, J., Ye, R., Chen, Q., Zhu, B., Chen, S., Lian, D., Sun, G., Xie, X., & Wu, F. (2024). On the Vulnerability of Safety Alignment in Open-Access LLMs. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024 (pp. 9236–9260). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-acl.549

Yi, S., Liu, Y., Sun, Z., Cong, T., He, X., Song, J., Xu, K., & Li, Q. (2024). Jailbreak Attacks and Defenses Against Large Language Models: A Survey. In ArXiv preprint: Vol. abs/2407.04295. https://arxiv.org/abs/2407.04295

Yu, Z., Liu, X., Liang, S., Cameron, Z., Xiao, C., & Zhang, N. (2024). Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models. ArXiv Preprint, abs/2403.17336. https://arxiv.org/abs/2403.17336

Zeng, Y., Lin, H., Zhang, J., Yang, D., Jia, R., & Shi, W. (2024). How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. ArXiv Preprint, abs/2401.06373. https://arxiv.org/abs/2401.06373

Zhan, Q., Fang, R., Bindu, R., Gupta, A., Hashimoto, T., & Kang, D. (2024). Removing RLHF Protections in GPT-4 via Fine-Tuning. In K. Duh, H. Gomez, & S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers) (pp. 681–687). Association for Computational Linguistics. https://aclanthology.org/2024.naacl-short.59

Zhang, Z., Lei, L., Wu, L., Sun, R., Huang, Y., Long, C., Liu, X., Lei, X., Tang, J., & Huang, M. (2023). SafetyBench: Evaluating the safety of large language models with multiple choice questions. ArXiv Preprint, abs/2309.07045. https://arxiv.org/abs/2309.07045

Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K.-W. (2017). Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. In M. Palmer, R. Hwa, & S. Riedel (Eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 2979–2989). Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1323

Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K.-W. (2018). Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. In M. Walker, H. Ji, & A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (pp. 15–20). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2003

Zhuo, T. Y., Huang, Y., Chen, C., & Xing, Z. (2023). Red teaming ChatGPT via jailbreaking: Bias, robustness, reliability and toxicity. ArXiv Preprint, abs/2301.12867. https://arxiv.org/abs/2301.12867

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. In ArXiv preprint: Vol. abs/2307.15043. https://arxiv.org/abs/2307.15043
