The story of LLM training data is a story about whose voices become computationally legible and whose are overwritten or erased. In Data Provenance, we considered how the story of data is shaped by its sources, filtering decisions, annotation pipelines, and synthetic generation practices. Imbalances or biases in the provenance of LLM pre- and post-training data can exacerbate representational, allocational, and quality-of-service harms for those who use these models. Representational harms include stereotyping, denigration, and misrecognition, which occur when LLMs perpetuate and amplify distorted and harmful portrayals of personal identities and social groups (Blodgett, 2021). Allocational harms arise when LLMs reinforce or amplify inequality in the distribution of opportunities and resources (Barocas et al., 2017; Eubanks, 2018). Quality-of-service harms involve performance disparities across different user groups, which may cascade into both representational and allocational harms. For further discussion on how to define and measure these harms, see Bias and Fairness Evaluation.
Sociotechnical harms become harder to diagnose when data provenance is incomplete. Without visibility into the linguistic, cultural, and geographic origins of the data, as well as the filtering and curation pipelines, researchers cannot identify: (1) why the model stereotypes certain voices, (2) why specific groups are absent from generated outputs, or (3) how certain narrative tropes became dominant. We will briefly discuss the relationship between data provenance and each of these harm outcome categories, as well as data-based mitigation strategies.
Quality-of-Service Harms
Quality-of-service harms are disparities in model utility for users from different sociodemographic groups (Shelby et al., 2023). These disparities are often rooted in the composition and curation of data, as discussed in Data Provenance. Pre-training data scraping, quality filtering, instruction-tuning templates, and alignment data collection tend to over-represent native English speakers from wealthy, Western nations and under-represent the language and perspectives of marginalized communities. As a result, LLMs introduce quality-of-service harms for individuals from these communities (Shah et al., 2020).
LLM performance degrades for speakers of non-standard language varieties or dialects on a wide range of tasks (Joshi et al., 2025; Kantharuban et al., 2023), from text classification (Lwowski & Rios, 2021) and machine translation (Ahia et al., 2023) to question answering (Fleisig et al., 2024; Ziems et al., 2023) and conversational AI (Artemova et al., 2024). This inequitable distribution of utility to LLM users can result in allocational harms like inequitable wages and quality of life, especially as LLMs are integrated into the workplace (Shao et al., 2025) and become general purpose technologies (Eloundou et al., 2024). Quality-of-service bias also contributes to the representational harm of erasure, and may derive in part from representational biases.
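One simple way to make a quality-of-service disparity concrete is to compare task accuracy across user groups. The sketch below is illustrative only: the group labels, the toy records, and the function names are assumptions introduced here, not a metric taken from the works cited above, and a real audit would need larger samples and uncertainty estimates.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return per-group accuracy from (group, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}


def quality_of_service_gap(records):
    """Difference between the best-served and worst-served group."""
    accuracies = accuracy_by_group(records)
    return max(accuracies.values()) - min(accuracies.values())


# Hypothetical evaluation results: the same questions asked in a
# standardized variety and in a dialect, scored as correct or not.
toy_records = [
    ("standard", True), ("standard", True),
    ("standard", True), ("standard", False),
    ("dialect", True), ("dialect", False),
    ("dialect", False), ("dialect", False),
]

per_group = accuracy_by_group(toy_records)
gap = quality_of_service_gap(toy_records)
print(per_group, gap)
```

A gap near zero would suggest comparable utility on this task, while a large gap flags a group for which the model is less useful. The choice of max-minus-min is one of several reasonable summaries.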
Representational Harms
LLMs demonstrate representational harms when they propagate negative or skewed representations of social groups, including cultural misrepresentation, stereotypes, essentialist language, and erasure (Chien & Danks, 2024). Representational harms can derive from pre-training data (Chu et al., 2024), not only from its explicitly harmful, stereotypical, and toxic language (Luccioni & Viviano, 2021), but also from implicitly biased language, framing effects (Feng et al., 2023), and the sparsity of socioculturally representative data (Naous et al., 2024). Data quality filters exacerbate racial and linguistic biases that skew pre-training data away from in-group perspectives in favor of unrepresentative and misinformed out-group perspectives (Wang et al., 2025). Post-training data can further induce mode collapse, flattening the model's output distributions so that groups are portrayed one-dimensionally (Bisbee et al., 2024; Durmus et al., 2024; Röttger et al., 2024). This kind of distributional flattening is a form of essentializing that is particularly harmful for groups historically portrayed as one-dimensional (Wang et al., 2025).
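Distributional flattening can be illustrated by comparing how spread out human answers to a survey question are with how spread out a model's sampled answers are. The sketch below uses Shannon entropy as a rough proxy. It is an assumption made here for illustration, not the measure used in the cited studies, and the answer lists are invented toy data.

```python
import math
from collections import Counter


def shannon_entropy(responses):
    """Entropy in bits of the empirical distribution over responses."""
    counts = Counter(responses)
    total = len(responses)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )


# Hypothetical answers to one survey item from members of a group,
# and hypothetical model samples when prompted to answer as that group.
human_answers = ["agree", "disagree", "neutral", "agree"]
model_answers = ["agree", "agree", "agree", "agree"]

human_entropy = shannon_entropy(human_answers)
model_entropy = shannon_entropy(model_answers)

# A model entropy far below the human entropy suggests the model is
# collapsing a varied group onto a single answer.
flattening = human_entropy - model_entropy
print(human_entropy, model_entropy, flattening)
```

Entropy ignores which answers are given, so a model could match the human entropy while still favoring the wrong answers. A fuller comparison would also look at a divergence between the two distributions.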
Unsurprisingly, LLMs are known to generate harmful stereotypes in question answering (Naous et al., 2024), machine translation (Ghosh & Caliskan, 2023), and open-ended generation (Dhamala et al., 2021). These issues are only exacerbated when the prompts are written in non-standard dialects, which may trigger demeaning or condescending responses from models (Fleisig et al., 2024). LLM-simulated personas also collapse into stereotypical caricatures (Cheng, Durmus, et al., 2023; Cheng, Piccardi, et al., 2023; Gupta et al., 2023). These simulations systematically misrepresent, flatten, and essentialize the perspectives of underrepresented groups based on protected characteristics like age, gender, and disability (Wang et al., 2025).
Allocational Harms
Allocational harms are disparities in individuals’ access to material resources like jobs, housing, credit, healthcare, childcare, education, and transportation (Cyberey et al., 2025). When LLMs are embedded in decision-making systems, they can introduce, amplify, or otherwise reinforce allocational disparities, in part as a result of representational biases in the training data (Chien & Danks, 2024; Sen et al., 2025). In pre-training, skewed representations can lead models to encode assumptions about who is qualified, creditworthy, employable, or deserving of services (Mehrabi et al., 2021). During post-training, alignment data privileges the annotators’ norms of professionalism, risk, and appropriate behavior (Conitzer et al., 2024). These encoded assumptions and norms surface in downstream allocational biases, as the following examples show.
LLMs used in hiring decisions can be more likely to recommend less-prestigious jobs to speakers of marginalized dialects (Hofmann et al., 2024). In content moderation, models disproportionately flag text written in African American English as offensive or hateful (Sap et al., 2019). In automated essay scoring, models show disparate performance for students with backgrounds not represented in training data (Schaller et al., 2024). And more broadly, LLM decisions are biased against underrepresented groups across domains such as business (e.g., funding a startup), finance (e.g., approving a credit card), relationships (e.g., resolving conflicts), law (e.g., issuing a passport), science (e.g., approving a research study), and the arts (e.g., awarding a filmmaking prize) (Levy et al., 2024; Tamkin et al., 2023).
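Allocational disparities of this kind are often probed by holding a decision prompt fixed, varying only the group signal, and comparing how often each group receives the favorable outcome. The following is a minimal sketch of that comparison. The group names and decision lists are invented, the functions are hypothetical helpers, and the ratio shown is only one of many possible disparity summaries rather than the metric used in the cited studies.

```python
def selection_rates(decisions):
    """Map each group to its share of favorable (True) decisions."""
    return {
        group: sum(outcomes) / len(outcomes)
        for group, outcomes in decisions.items()
    }


def disparity_ratio(decisions):
    """Lowest selection rate divided by the highest (1.0 means parity)."""
    rates = selection_rates(decisions)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group received a favorable decision
    return min(rates.values()) / highest


# Hypothetical yes/no decisions from a model on otherwise identical
# applications that differ only in the applicant's group signal.
toy_decisions = {
    "group_a": [True, True, True, False],
    "group_b": [True, False, False, False],
}

rates = selection_rates(toy_decisions)
ratio = disparity_ratio(toy_decisions)
print(rates, ratio)
```

A ratio well below 1.0 indicates that one group receives favorable decisions much less often than another under matched conditions, which is the pattern an allocational audit looks for.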
Mitigating Harms
Mitigating sociotechnical harms requires interventions across the data pipeline. The first step is to establish transparent data provenance through documentation practices like Datasheets for Datasets and Data Statements (Bender & Friedman, 2018; Gebru et al., 2021). Explicitly recording the linguistic, demographic, and geographic composition of datasets, as well as filtering and annotation decisions, makes it easier to diagnose representational gaps and biases. Model cards and system cards further extend this transparency to downstream users by documenting intended use, performance disparities, and known limitations (Mitchell et al., 2019). Recent work argues that provenance-aware documentation should include not only source descriptions but also transformation histories, like filtering, deduplication, and synthetic augmentation (Scheuerman et al., 2021).
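To illustrate what a provenance record with a transformation history might look like in practice, here is a minimal sketch. The class names, field names, and example values are all hypothetical choices made for this illustration. They are not a schema prescribed by Datasheets for Datasets, Data Statements, or model cards.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TransformationStep:
    kind: str          # e.g. "dedup", "quality_filter", "synthetic_augment"
    description: str
    rows_before: int
    rows_after: int


@dataclass
class ProvenanceRecord:
    dataset_name: str
    languages: List[str]
    regions: List[str]
    sources: List[str]
    annotation_notes: str = ""
    transformations: List[TransformationStep] = field(default_factory=list)

    def add_step(self, kind, description, rows_before, rows_after):
        """Append one processing step to the transformation history."""
        self.transformations.append(
            TransformationStep(kind, description, rows_before, rows_after)
        )

    def retention(self):
        """Final row count divided by the row count before the first step."""
        if not self.transformations:
            return 1.0
        first = self.transformations[0].rows_before
        last = self.transformations[-1].rows_after
        return last / first


# Hypothetical dataset and processing history.
record = ProvenanceRecord(
    dataset_name="example-forum-corpus",
    languages=["en", "sw"],
    regions=["KE", "US"],
    sources=["public forum export"],
)
record.add_step("dedup", "exact-match deduplication", 1000, 800)
record.add_step("quality_filter", "classifier-based quality filter", 800, 600)

record_json = json.dumps(asdict(record), indent=2)
print(record_json)
```

Recording row counts before and after each step makes it possible to ask later which step removed the most data, and, if group or language labels are tracked per step, whose data it removed.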
With transparent data provenance, a second mitigation step is to involve stakeholders in the process of data creation and diversification, using participatory methods (Vaughn & Jacquez, 2020), following Participatory Approaches. With community-level organization, it is possible to develop rich data resources for low-resource languages and underrepresented communities (Heidt, 2025; Orife et al., 2020). However, diversification alone may prove insufficient without governance structures that prevent extractive data practices and ensure ongoing community oversight (Benjamin, 2023).
A third mitigation approach is to collect learnable data from user interactions with LLMs at the individual level. Personalization methods collect individual preference data from user interactions and use it to shape subsequent model behavior through prompt-based (Hebert et al., 2024), retrieval-based (Salemi et al., 2024), or alignment-based techniques (Ryan et al., 2025). For more discussion on this direction, see Personalization.