Consent and Ownership

The data used to pre-train and post-train LLMs may include sensitive personal information (Data Provenance). Such personal data may help LLM systems become more capable, useful, and proactive. For example, by drawing on a user’s hidden behaviors, personal habits, and broader computer use patterns, LLMs can infer what the user might need before a request is even made (Shaikh et al., 2025). However, the use of private or personal data raises an array of legal and ethical challenges (Subramani et al., 2023; Yan et al., 2024). Sensitive, private, or copyrighted information can be inadvertently leaked or reproduced without authorization, which further complicates compliance with laws and regulations (Khan & Hanna, 2022; Wachter, 2019; Zhang et al., 2024). Here, we consider these concerns regarding consent for, and ownership of, data.

Data Privacy Considerations

Data privacy is broadly defined as the ability of individuals to control their personal information. Privacy leaks can occur either when sensitive personal information is explicitly encoded in the data or when it can be inferred from the data (Kshetri, 2023; Staab et al., 2023; Yan et al., 2024). In the former setting, LLMs can memorize personally identifiable information from training data and expose these details at inference time (Carlini et al., 2019; Staab et al., 2023; Yan et al., 2024). Attackers can exploit vulnerabilities in LLMs through methods such as backdoor attacks, membership inference attacks, and model inversion attacks, which can extract sensitive information embedded in the model during pre-training or fine-tuning (Carlini et al., 2019; Yan et al., 2024). For instance, Carlini et al. (2021) demonstrated that it is possible to recover individual training examples, including names and phone numbers, by attacking the language model. Similarly, Zhao et al. (2024) showed that LLMs can generate infringing content when prompted with partial information from copyrighted materials. These risks are more severe in larger models and when longer contexts are used in prompting, which makes the vulnerabilities increasingly difficult to address in current LLMs (Carlini et al., 2021, 2023; Karamolegkou et al., 2023).
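To make the memorization risk concrete, the following is a minimal sketch of a prefix-based extraction check in the spirit of the tests described above. It is an illustration under stated assumptions, not the procedure of any cited paper: `generate` is a hypothetical stand-in for a model's deterministic decoding function, `make_lookup_model` is a toy "model" that has memorized its corpus verbatim, and prefixes are measured in characters rather than tokens for simplicity.

```python
from typing import Callable, Iterable


def make_lookup_model(corpus: Iterable[str]) -> Callable[[str, int], str]:
    """Toy stand-in for an LLM (hypothetical): continues any prefix it has seen verbatim."""
    docs = list(corpus)

    def generate(prefix: str, max_chars: int) -> str:
        for doc in docs:
            if doc.startswith(prefix):
                return doc[len(prefix):len(prefix) + max_chars]
        return ""  # no memorized continuation for this prefix

    return generate


def is_extractable(example: str,
                   generate: Callable[[str, int], str],
                   prefix_len: int) -> bool:
    """Return True if prompting with the example's prefix reproduces its suffix exactly."""
    prefix, suffix = example[:prefix_len], example[prefix_len:]
    if not suffix:
        return False  # nothing left to extract
    return generate(prefix, len(suffix)) == suffix


if __name__ == "__main__":
    model = make_lookup_model(["Alice's number is 555-0100."])
    print(is_extractable("Alice's number is 555-0100.", model, 18))
    print(is_extractable("Bob's number is 555-0199.", model, 16))
```

With a real model, `generate` would wrap greedy decoding, and the check would be repeated across many training examples and prefix lengths to estimate how much of the training data can be extracted.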

Additionally, legal frameworks play a critical role in governing the use of personal and copyrighted data. Since 2018, the General Data Protection Regulation (GDPR) in the European Union has mandated data minimization, consent requirements, and the “Right to Erasure.” These provisions can be reinterpreted to apply to AI systems, though with limitations once the data collection process is complete (Neel & Chang, 2023; Wachter, 2019). Further regulations and protocols are needed to meet ethical obligations. Copyright law introduces another layer of complexity: it grants creators exclusive rights to use and distribute their work, with specific exceptions. Under §107 of the United States Copyright Act, the fair use doctrine permits limited use of copyrighted materials without permission, typically for purposes such as commentary, research, or information extraction, but not for verbatim reproduction (Karamolegkou et al., 2023). With the increasing influence of LLMs, the use of online data has come under heightened scrutiny; justifications under principles like “Legitimate Interests” for personal data and “Fair Use” for copyrighted content are being questioned more rigorously (Franceschelli & Musolesi, 2022). Notably, companies such as OpenAI, Stability AI, and Microsoft have faced various legal challenges, including consumer privacy lawsuits and copyright infringement claims, underscoring the growing contention surrounding privacy and copyright issues in AI development (Brittain, 2024a, 2024b; Claburn, 2023; Vincent, 2023).

Proactive vs. Reactive Privacy Strategies

Adopting a proactive approach to privacy is essential. Rather than deferring mitigation until after model training, privacy considerations should inform every stage of data collection and curation. This includes implementing privacy-preserving data collection protocols, robust anonymization techniques, and consent-based frameworks from the outset. For instance, it is critical to obtain consent and minimize the collection of sensitive information, to employ tools that detect and remove personally identifiable information, and to use more sophisticated data anonymization techniques to better protect against privacy leakage (Subramani et al., 2023; Yan et al., 2024). Consent-based data collection should be adopted in scenarios like web scraping to respect individuals’ rights (Subramani et al., 2023). Web architectures like Solid (Sambra et al., 2016) and Consent Tagging (Zhang et al., 2023) aim to streamline consent acquisition (Zhang et al., 2024).
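As a small illustration of the detect-and-remove step, the sketch below redacts two kinds of personally identifiable information with regular expressions. The patterns, the placeholder tags, and the `redact_pii` helper are assumptions made for this example; they cover only email addresses and US-style phone numbers, and practical pipelines need much broader coverage (names, addresses, identifiers), typically with trained detectors.

```python
import re

# Illustrative patterns only (assumed for this sketch); they are far from exhaustive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def redact_pii(text: str) -> str:
    """Replace matched emails and US-style phone numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    sample = "Contact Jane at jane.doe@example.com or 555-123-4567."
    print(redact_pii(sample))
```

Replacing matches with typed placeholders, rather than deleting them, keeps sentence structure intact for training while removing the identifying value itself.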

Reactive privacy strategies apply after the data collection stage, and various techniques have been proposed to mitigate these issues. Data cleaning methods aim to remove or generalize sensitive information from datasets before training (Bai et al., 2022; Brown et al., 2020; Kandpal et al., 2022; Ouyang et al., 2022). Federated Learning approaches decentralize the training process to enhance privacy by keeping data local and aggregating updates instead of sharing raw data (Chen et al., 2023; Hoory et al., 2021; Xu et al., 2023; Yu et al., 2024). Differential Privacy methods extract useful statistical information from datasets without revealing individual records by introducing controlled random noise or applying aggregation techniques (Du & Mi, 2021; Hoory et al., 2021; Li et al., 2022; Shi et al., 2022; Wu et al., 2022). Additionally, Knowledge Unlearning techniques selectively forget or remove sensitive information from models to mitigate privacy risks (J. Chen & Yang, 2023; Eldan & Russinovich, 2023; Seyitoğlu et al., 2024).
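The idea of "controlled random noise" can be illustrated with the classic Laplace mechanism applied to a counting query. This is a deliberately simple sketch and not the training-time approach of the cited work, which applies differential privacy to model gradients or activations; the function names and the sample records are assumptions for the example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a noisy count satisfying epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon suffices.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    records = [{"age": 30}, {"age": 45}, {"age": 52}, {"age": 19}]  # made-up data
    print(private_count(records, lambda r: r["age"] > 25, epsilon=1.0, rng=rng))
```

Smaller values of `epsilon` add more noise and give stronger privacy at the cost of accuracy, which is the same trade-off that appears when differential privacy is applied during model training.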

Open Challenges in Data Privacy

Currently, privacy risks persist across the entire LLM lifecycle, encompassing not only model-centric issues but also human-centered factors. On the data side, stronger anonymization techniques and tools capable of identifying memorized personal information must keep up with LLMs’ evolving capabilities (Staab et al., 2023; Subramani et al., 2023). Caution is also warranted when scaling HCLLMs, as discussed in Scaling Human Centered LLMs: risks from memorization increase with scale when the training data contain repeated examples (Hernandez et al., 2022). In addition, the complexities of obtaining consent, especially in scenarios involving third-party or inaccessible data sources, underscore the need for more robust frameworks to ensure transparent data sourcing and meaningful user control (Zhang et al., 2024). HCI researchers now also advocate for improved LLM interaction paradigms, a deeper understanding of user mental models, and systems that enable end-users to reclaim ownership over their personal data (T. Li et al., 2024). Despite significant progress in addressing data privacy concerns, much of the research focuses on well-known LLMs of relatively small scale. In contrast, recently released models with larger parameter counts have received less attention because of the challenges posed by their scale, data transparency issues, and the lagging development of privacy-preserving technologies (Yan et al., 2024). Overall, greater efforts are needed to enhance legal frameworks, strengthen regulatory oversight, and advance research and technology to better safeguard privacy and copyright in the era of LLMs, a responsibility that developers, users, and policymakers can jointly share.
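Because repeated training data amplifies memorization, one basic safeguard is to deduplicate the corpus before training. The sketch below removes exact duplicates after light normalization; the `normalize` and `deduplicate` helpers are assumptions for illustration, and the approach does not catch near-duplicates or repeated substrings, which require fuzzier matching.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def deduplicate(docs):
    """Keep the first occurrence of each document, comparing normalized content hashes."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    corpus = ["Call me at 555", "call  me at 555", "Other text"]
    print(deduplicate(corpus))
```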

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., … Kaplan, J. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. ArXiv Preprint, abs/2204.05862. https://arxiv.org/abs/2204.05862

Brittain, B. (2024a). US newspapers sue OpenAI for copyright infringement over AI training. Reuters. https://www.reuters.com/legal/us-newspapers-sue-openai-copyright-infringement-over-ai-training-2024-04-30/

Brittain, B. (2024b). OpenAI, Microsoft defeat US consumer-privacy lawsuit for now. Reuters. https://www.reuters.com/legal/transactional/openai-microsoft-defeat-us-consumer-privacy-lawsuit-now-2024-05-24/

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.-F. Balcan, & H.-T. Lin (Eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html

Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramèr, F., & Zhang, C. (2023). Quantifying Memorization Across Neural Language Models. The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. https://openreview.net/pdf?id=TatRHT_1cK

Carlini, N., Liu, C., Erlingsson, Ú., Kos, J., & Song, D. (2019). The secret sharer: Evaluating and testing unintended memorization in neural networks. 28th USENIX Security Symposium (USENIX Security 19), 267–284.

Carlini, N., Tramèr, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, Ú., & others. (2021). Extracting training data from large language models. 30th USENIX Security Symposium (USENIX Security 21), 2633–2650.

Chen, C., Feng, X., Zhou, J., Yin, J., & Zheng, X. (2023). Federated large language model: A position paper. ArXiv Preprint, abs/2307.08925. https://arxiv.org/abs/2307.08925

Chen, J., & Yang, D. (2023). Unlearn What You Want to Forget: Efficient Unlearning for LLMs. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 12041–12052). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.738

Claburn, T. (2023). GitHub, Microsoft, OpenAI fail to wriggle out of Copilot copyright lawsuit. The Register. https://www.theregister.com/2023/05/12/github_microsoft_openai_copilot/

Du, J., & Mi, H. (2021). DP-FP: Differentially private forward propagation for large models. ArXiv Preprint, abs/2112.14430. https://arxiv.org/abs/2112.14430

Eldan, R., & Russinovich, M. (2023). Who’s Harry Potter? Approximate Unlearning in LLMs. ArXiv Preprint, abs/2310.02238. https://arxiv.org/abs/2310.02238

Franceschelli, G., & Musolesi, M. (2022). Copyright in generative deep learning. Data & Policy, 4, e17.

Hernandez, D., Brown, T., Conerly, T., DasSarma, N., Drain, D., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Henighan, T., Hume, T., & others. (2022). Scaling laws and interpretability of learning from repeated data. ArXiv Preprint, abs/2205.10487. https://arxiv.org/abs/2205.10487

Hoory, S., Feder, A., Tendler, A., Erell, S., Peled-Cohen, A., Laish, I., Nakhost, H., Stemmer, U., Benjamini, A., Hassidim, A., & Matias, Y. (2021). Learning and Evaluating a Differentially Private Pre-trained Language Model. In M.-F. Moens, X. Huang, L. Specia, & S. W. Yih (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 1178–1189). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-emnlp.102

Kandpal, N., Wallace, E., & Raffel, C. (2022). Deduplicating Training Data Mitigates Privacy Risks in Language Models. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári, G. Niu, & S. Sabato (Eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA (Vol. 162, pp. 10697–10707). PMLR. https://proceedings.mlr.press/v162/kandpal22a.html

Karamolegkou, A., Li, J., Zhou, L., & Søgaard, A. (2023). Copyright Violations and Large Language Models. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 7403–7412). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.458

Khan, M., & Hanna, A. (2022). The subjects and stages of AI dataset development: A framework for dataset accountability. Ohio St. Tech. LJ, 19, 171.

Kshetri, N. (2023). Cybercrime and privacy threats of large language models. IT Professional, 25(3), 9–13.

Li, T., Das, S., Lee, H.-P., Wang, D., Yao, B., & Zhang, Z. (2024). Human-Centered Privacy Research in the Age of Large Language Models. Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 1–4.

Li, X., Tramèr, F., Liang, P., & Hashimoto, T. (2022). Large Language Models Can Be Strong Differentially Private Learners. The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. https://openreview.net/forum?id=bVuP3ltATMz

Neel, S., & Chang, P. (2023). Privacy issues in large language models: A survey. ArXiv Preprint, abs/2312.06717. https://arxiv.org/abs/2312.06717

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html

Sambra, A. V., Mansour, E., Hawke, S., Zereba, M., Greco, N., Ghanem, A., Zagidulin, D., Aboulnaga, A., & Berners-Lee, T. (2016). Solid: a platform for decentralized social applications based on linked data. MIT CSAIL & Qatar Computing Research Institute, Tech. Rep., 2016.

Seyitoğlu, A., Kuvshinov, A., Schwinn, L., & Günnemann, S. (2024). Extracting Unlearned Information from LLMs with Activation Steering. ArXiv Preprint, abs/2411.02631. https://arxiv.org/abs/2411.02631

Shaikh, O., Sapkota, S., Rizvi, S., Horvitz, E., Park, J. S., Yang, D., & Bernstein, M. S. (2025). Creating General User Models from Computer Use. Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. https://doi.org/10.1145/3746059.3747722

Shi, W., Shea, R., Chen, S., Zhang, C., Jia, R., & Yu, Z. (2022). Just Fine-tune Twice: Selective Differential Privacy for Large Language Models. In Y. Goldberg, Z. Kozareva, & Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (pp. 6327–6340). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.emnlp-main.425

Staab, R., Vero, M., Balunović, M., & Vechev, M. (2023). Beyond Memorization: Violating Privacy Via Inference with Large Language Models. ArXiv Preprint, abs/2310.07298. https://arxiv.org/abs/2310.07298

Subramani, N., Luccioni, S., Dodge, J., & Mitchell, M. (2023). Detecting Personal Information in Training Corpora: an Analysis. In A. Ovalle, K.-W. Chang, N. Mehrabi, Y. Pruksachatkun, A. Galystan, J. Dhamala, A. Verma, T. Cao, A. Kumar, & R. Gupta (Eds.), Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023) (pp. 208–220). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.trustnlp-1.18

Vincent, J. (2023). Getty Images sues AI art generator Stable Diffusion in the US for copyright infringement. The Verge. https://www.theverge.com/2023/2/6/23587393/ai-art-copyright-lawsuit-getty-images-stable-diffusion

Wachter, S. (2019). Data protection in the age of big data. Nature Electronics, 2(1), 6–7.

Wu, X., Gong, L., & Xiong, D. (2022). Adaptive Differential Privacy for Language Model Training. In B. Y. Lin, C. He, C. Xie, F. Mireshghallah, N. Mehrabi, T. Li, M. Soltanolkotabi, & X. Ren (Eds.), Proceedings of the First Workshop on Federated Learning for Natural Language Processing (FL4NLP 2022) (pp. 21–26). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.fl4nlp-1.3

Xu, M., Cai, D., Wu, Y., Li, X., & Wang, S. (2023). FwdLLM: Efficient FedLLM using forward gradient. ArXiv Preprint, abs/2308.13894. https://arxiv.org/abs/2308.13894

Yan, B., Li, K., Xu, M., Dong, Y., Zhang, Y., Ren, Z., & Cheng, X. (2024). On protecting the data privacy of large language models (LLMs): A survey. ArXiv Preprint, abs/2403.05156. https://arxiv.org/abs/2403.05156

Yu, S., Munoz, J. P., & Jannesari, A. (2024). Federated Foundation Models: Privacy-Preserving and Collaborative Learning for Large Models. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 7174–7184). https://aclanthology.org/2024.lrec-main.630

Zhang, D., Xia, B., Liu, Y., Xu, X., Hoang, T., Xing, Z., Staples, M., Lu, Q., & Zhu, L. (2023). Tag your fish in the broken net: A responsible web framework for protecting online privacy and copyright. ArXiv Preprint, abs/2310.07915. https://arxiv.org/abs/2310.07915

Zhang, D., Xia, B., Liu, Y., Xu, X., Hoang, T., Xing, Z., Staples, M., Lu, Q., & Zhu, L. (2024). Privacy and Copyright Protection in Generative AI: A Lifecycle Perspective. Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering-Software Engineering for AI, 92–97.

Zhao, W., Shao, H., Xu, Z., Duan, S., & Zhang, D. (2024). Measuring Copyright Risks of Large Language Model via Partial Information Probing. ArXiv Preprint, abs/2409.13831. https://arxiv.org/abs/2409.13831
