Data for HCLLMs

Data is central to the language modeling paradigm, just as it has been throughout the history of machine learning (Halevy et al., 2009). Every stage of language model development, from pretraining to evaluation and deployment, depends on the availability of massive text corpora (Sun et al., 2017). Critically, the scale, diversity, and quality of this linguistic data can determine a model’s downstream utility for users (Kaplan et al., 2020; Liu et al., 2024; Zhou et al., 2023). Data quality and quantity can quickly become bottlenecks (Villalobos et al., 2024), limiting progress in AI (Longpre et al., 2024). Thus, one of the most urgent challenges in AI is identifying diverse and representative sources of data.

From the human perspective, data is more than the fuel behind AI progress. Data is a dynamic reflection of lived human experience. It reflects the people, institutions, cultures, histories, and social contexts that produce it. In this sense, data is never neutral. Rather, data encodes viewpoints and values (Dotan & Milli, 2020), assumptions and biases (Paullada et al., 2021), and even political and social structures (Capel & Brereton, 2023; Miceli et al., 2022; Scheuerman et al., 2021). The origins of such data may be the subject of legal claims and privacy concerns, and its content may be highly personal or sensitive (Bender & Friedman, 2018). To understand the human impact of LLMs, it becomes necessary to consider the human origins of the data that shapes our models, particularly in pretraining, instruction tuning, and alignment (see the figure).

In the first subsection of this chapter, we examine the provenance of data used to develop LLMs (Data Provenance). We ask where this data comes from, who produced it, under what conditions it was produced, and how it was transformed throughout this process. In this way, we recognize how data encodes implicit values, perspectives, and cultures that shape LLM behavior. From here, we are positioned to understand human-centered concerns around representation and bias (Data Representation, Bias and Ethics): how the data’s origins systematically skew, misrepresent, and erase the perspectives of underrepresented groups, leading to representational and allocational harms. While rich community and personal data may be used to mitigate some of these harms, using such data raises issues of consent and ownership (Consent and Ownership), which we consider next. Finally, we consider some of the biggest data challenges facing LLM developers today, and how proposed solutions like synthetic data (Expanding Data Sources: Synthetic and Non-Traditional Data) account or fail to account for the human-centered objectives we have outlined.

Bender, E. M., & Friedman, B. (2018). Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6, 587–604.

Capel, T., & Brereton, M. (2023). What is Human-Centered about Human-Centered AI? A Map of the Research Landscape. In A. Schmidt, K. Väänänen, T. Goyal, P. O. Kristensson, A. Peters, S. Mueller, J. R. Williamson, & M. L. Wilson (Eds.), Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI 2023, Hamburg, Germany, April 23-28, 2023 (p. 359:1-359:23). ACM. https://doi.org/10.1145/3544548.3580959

Dotan, R., & Milli, S. (2020). Value-laden disciplinary shifts in machine learning. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 294–294.

Halevy, A., Norvig, P., & Pereira, F. (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2), 8–12.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models. ArXiv Preprint, abs/2001.08361. https://arxiv.org/abs/2001.08361

Liu, W., Zeng, W., He, K., Jiang, Y., & He, J. (2024). What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning. ICLR.

Longpre, S., Mahari, R., Lee, A., Lund, C., Oderinwale, H., Brannon, W., Saxena, N., Obeng-Marnu, N., South, T., Hunter, C., & others. (2024). Consent in crisis: The rapid decline of the AI data commons. Advances in Neural Information Processing Systems, 37, 108042–108087.

Miceli, M., Posada, J., & Yang, T. (2022). Studying up machine learning data: Why talk about bias when we mean power? Proceedings of the ACM on Human-Computer Interaction, 6(GROUP), 1–14.

Paullada, A., Raji, I. D., Bender, E. M., Denton, E., & Hanna, A. (2021). Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns, 2(11).

Scheuerman, M. K., Hanna, A., & Denton, R. (2021). Do datasets have politics? Disciplinary values in computer vision dataset development. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW2), 1–37.

Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017). Revisiting unreasonable effectiveness of data in deep learning era. Proceedings of the IEEE International Conference on Computer Vision, 843–852.

Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., & Hobbhahn, M. (2024). Position: Will we run out of data? Limits of LLM scaling based on human-generated data. Proceedings of the 41st International Conference on Machine Learning.

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G., Lewis, M., Zettlemoyer, L., & Levy, O. (2023). LIMA: Less Is More for Alignment. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/ac662d74829e4407ce1d126477f4a03a-Abstract-Conference.html
