Data is central to the language modeling paradigm, just as it has been throughout the history of machine learning (Halevy et al., 2009). Every stage of language model development, from pretraining to evaluation and deployment, depends on the availability of massive text corpora (Sun et al., 2017). Critically, the scale, diversity, and quality of this linguistic data can determine a model’s downstream utility for users (Kaplan et al., 2020; Liu et al., 2024; Zhou et al., 2023). Data quality and quantity can quickly become bottlenecks (Villalobos et al., 2024), limiting progress in AI (Longpre et al., 2024). Thus, one of the most urgent challenges in AI is identifying diverse and representative sources of data.
From the human perspective, data is more than the fuel behind AI progress. Data is a dynamic reflection of lived human experience. It reflects the people, institutions, cultures, histories, and social contexts that produce it. In this sense, data is never neutral. Rather, data encodes viewpoints and values (Dotan & Milli, 2020), assumptions and biases (Paullada et al., 2021), and even political and social structures (Capel & Brereton, 2023; Miceli et al., 2022; Scheuerman et al., 2021). The origins of such data may be the subject of legal claims and privacy concerns, and its content may be highly personal or sensitive (Bender & Friedman, 2018). To understand the human impact of LLMs, it becomes necessary to consider the human origins of the data that shapes our models, particularly in pretraining, instruction tuning, and alignment (the figure).
In the first subsection of this chapter, we examine the provenance of data used to develop LLMs (Data Provenance). We ask where this data comes from, who produced it, under what conditions it was produced, and how it was transformed throughout this process. In this way, we recognize how data encodes implicit values, perspectives, and cultures that shape LLM behavior. From here, we are positioned to understand human-centered concerns around representation and bias (Data Representation, Bias and Ethics): how the data’s origins systematically skew, misrepresent, and erase the perspectives of underrepresented groups, leading to representational and allocational harms. While rich community and personal data may be used to mitigate some of these harms, using it raises issues of consent and ownership, which we consider next (Consent and Ownership). Finally, we consider some of the biggest data challenges facing LLM developers today, and how proposed solutions like synthetic data (Expanding Data Sources: Synthetic and Non-Traditional Data) account or fail to account for the human-centered objectives we have outlined.