Synthetic Data
We often lack high-quality, diverse, and privacy-compliant data (Almeida, 2024). Filtering methods (Data Provenance) can remove as much as 90% of raw web text from the Common Crawl. To replace this data, synthetic generation is one solution employed in Nemotron-CC (Su et al., 2025) and other popular pre-training corpora. Synthetic data generation can also preserve individuals’ confidentiality by replicating only the statistical properties of real datasets without retaining any personally identifiable information. In addition, LLM-generated synthetic text can serve as fine-tuning and evaluation data (Vongthongsri, 2025), where it is invaluable for addressing class imbalances (Moon et al., 2024), especially in domains like healthcare (Guo & Chen, 2024), where data is sensitive, and mathematical reasoning, where gold examples are costly to produce (Chan et al., 2024).
Methods Used to Generate Synthetic Data.
Even medium-size language models can effectively expand pre-training corpora by paraphrasing existing data (Maini et al., 2024). Moreover, LLMs can generate entirely new content from scratch, including textbooks for pre-training (Gunasekar et al., 2023) and instruction-tuning data for post-training (Wang et al., 2023). Procuring high-quality synthetic data with LLMs typically involves three stages: generation, curation, and evaluation (Long et al., 2024). Generation often involves prompt engineering to elicit LLM responses in the required format. Common strategies include task definition, conditional prompting, in-context learning, and multi-step generation, which address context limitations and degradation over reasoning steps (Long et al., 2024; C. Wang et al., 2024).
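As a minimal sketch of how task definition, conditional prompting, and in-context learning can be combined in one generation prompt, consider the following. The names `GenerationSpec` and `build_prompt`, the prompt wording, and the sentiment example are all hypothetical illustrations and are not taken from the cited surveys. The sketch only assembles the prompt text; sending it to a model, and chaining several such prompts for multi-step generation, is left out.

```python
from dataclasses import dataclass, field


@dataclass
class GenerationSpec:
    """Hypothetical container for the pieces of a generation prompt."""
    task: str                                        # task definition
    conditions: dict = field(default_factory=dict)   # conditional prompting
    examples: list = field(default_factory=list)     # in-context demonstrations


def build_prompt(spec: GenerationSpec) -> str:
    """Assemble a single prompt string from the spec."""
    parts = [f"Task: {spec.task}"]
    if spec.conditions:
        # Sort keys so the same spec always yields the same prompt.
        attrs = ", ".join(f"{k}={v}" for k, v in sorted(spec.conditions.items()))
        parts.append(f"Constraints: {attrs}")
    for i, (inp, out) in enumerate(spec.examples, start=1):
        parts.append(f"Example {i}\nInput: {inp}\nOutput: {out}")
    parts.append("Now produce one new example in the same Input/Output format.")
    return "\n\n".join(parts)


spec = GenerationSpec(
    task="Write a short customer review and label its sentiment.",
    conditions={"sentiment": "negative", "length": "under 40 words"},
    examples=[("The battery lasts two days.", "positive")],
)
prompt = build_prompt(spec)
print(prompt)
```

Varying the `conditions` dictionary across calls is one simple way to steer the attributes of generated samples, under the assumption that the model follows the stated constraints.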
The generated data often contains noise or corrupted samples due to hallucination, and is generally curated using sample filtering and label enhancement techniques. Sample filtering can rely on simple heuristics or leverage the language-understanding capabilities of LLMs to assign each data point a quality-based confidence score and reject samples with low scores (Chung et al., 2023). Label enhancement strategies can include human inspection and annotation of low-confidence samples. These techniques are described in Pretraining Data.
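The curation step described above might look roughly like the sketch below. It assumes each sample already carries a `confidence` score in [0, 1] produced by some LLM-based quality judge; the judge itself, the field names, the heuristic, and both thresholds are hypothetical choices made for illustration. Samples that fail the heuristic or score very low are rejected, while mid-range samples are routed to human review rather than discarded.

```python
def heuristic_ok(text: str, min_words: int = 5) -> bool:
    """Cheap rule-based check: long enough and not one word repeated."""
    words = text.split()
    if len(words) < min_words:
        return False
    return len(set(words)) / len(words) > 0.5


def curate(samples, accept_at=0.8, review_at=0.5):
    """Split samples into accepted, human-review, and rejected lists."""
    accepted, review, rejected = [], [], []
    for s in samples:
        if not heuristic_ok(s["text"]) or s["confidence"] < review_at:
            rejected.append(s)
        elif s["confidence"] < accept_at:
            review.append(s)   # low confidence: send to a human annotator
        else:
            accepted.append(s)
    return accepted, review, rejected


# Illustrative samples; the confidence values are made up.
samples = [
    {"text": "The delivery arrived late and the box was damaged.", "confidence": 0.93},
    {"text": "good good good good good good", "confidence": 0.90},
    {"text": "It was fine, I guess, nothing special here.", "confidence": 0.62},
    {"text": "The product cured my cat of all known diseases.", "confidence": 0.20},
]
accepted, review, rejected = curate(samples)
print(len(accepted), len(review), len(rejected))
```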
After curation, the generated data must be evaluated along several dimensions, including the statistical similarity between synthetic and real data, the impact on model performance, and whether the synthetic data preserves essential patterns and relationships. Xia et al. (2024) capture these requirements in their proposed fidelity, utility, and privacy framework.
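One simple way to quantify the statistical similarity between synthetic and real data is the total variation distance between their label distributions, sketched below. This particular metric is chosen here only for illustration and is not claimed to be the measure used by Xia et al. (2024); the label counts are made up. A value of 0 means the two distributions match exactly, and 1 means they do not overlap at all.

```python
from collections import Counter


def distribution(values):
    """Empirical distribution of a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def total_variation(real, synthetic):
    """Total variation distance between two empirical distributions."""
    p, q = distribution(real), distribution(synthetic)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


# Illustrative label sets; the proportions are invented.
real_labels = ["pos"] * 60 + ["neg"] * 30 + ["neutral"] * 10
synthetic_labels = ["pos"] * 50 + ["neg"] * 40 + ["neutral"] * 10
tv = total_variation(real_labels, synthetic_labels)
print(round(tv, 3))
```

Utility and privacy need separate checks, for example training on synthetic data and testing on held-out real data, which this sketch does not cover.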
Making Synthetic Data More Human-Centric.
A human-centered approach to synthetic data creation should explicitly incorporate human values, perspectives, and audits at all stages of development, from generation to curation and evaluation. First, generation should serve to reflect authentic human interactions and preferences when real data collection proves slow or costly (Hämäläinen et al., 2023). Rather than simply increasing dataset sizes, synthetic data should contain realistic social interactions between individuals with diverse personalities and backgrounds. This requires persona alignment (Steerable HCLLMs) or role-play in which the LLM portrays a consistent identity (Tseng et al., 2024), possibly simulating a person from a particular sociodemographic background (Lutz et al., 2025), or an agent with a role, like a tutor or counselor (Li et al., 2024; Samuel et al., 2024; Shanahan et al., 2023). Persona alignment has been used to generate synthetic dialogues (Andukuri et al., 2024; Occhipinti et al., 2024) and preference data (Castricato et al., 2025).
At the curation stage, stratified sampling should reflect real-world distributions along known axes of variation, such as opinions and preferences (Sorensen et al., 2025). Finally, robust human-in-the-loop validation and auditing are essential. Human annotators and experts can review synthetic outputs, flag problematic patterns, and iteratively refine generation procedures. One major concern is that LLMs may reproduce biases and harms present in their training data, leaking private information or reinforcing existing social inequalities. Das et al. (2024) compare LLM-generated datasets with human-annotated benchmarks and highlight ethical concerns related to disparities in task performance and representational coverage. Chen et al. (2024) identify several failure modes in LLM-generated query-answer pairs, including instruction-following errors. To mitigate these risks, they propose unlearning techniques to improve the reliability of synthetic queries. To preserve privacy, Ramesh et al. (2023) propose decentralized frameworks designed to reduce the likelihood of sensitive information exposure during data synthesis. Together, these steps can enhance the quality, fairness, and usability of synthetic data and align it with ethical standards and user expectations.
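The stratified sampling mentioned at the curation stage can be sketched as follows: given a pool of synthetic samples tagged along one axis of variation, draw a subset whose shares match a target real-world distribution. The function name, the `opinion` field, and the target shares are hypothetical; in practice the target distribution would come from survey or census data.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(pool, key, target_shares, n, seed=0):
    """Draw about n items so that each stratum matches its target share.

    Quotas are rounded per stratum, so the total can differ slightly from n.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for item in pool:
        by_stratum[item[key]].append(item)
    sample = []
    for stratum, share in target_shares.items():
        quota = round(share * n)
        candidates = by_stratum.get(stratum, [])
        if len(candidates) < quota:
            raise ValueError(f"not enough samples for stratum {stratum!r}")
        sample.extend(rng.sample(candidates, quota))
    return sample


# A synthetic pool skewed toward "agree", as an over-agreeable generator might produce.
pool = (
    [{"id": i, "opinion": "agree"} for i in range(70)]
    + [{"id": 70 + i, "opinion": "disagree"} for i in range(20)]
    + [{"id": 90 + i, "opinion": "neutral"} for i in range(10)]
)
target = {"agree": 0.4, "disagree": 0.4, "neutral": 0.2}  # assumed real-world shares
subset = stratified_sample(pool, "opinion", target, n=20)
counts = Counter(item["opinion"] for item in subset)
print(dict(counts))
```

Raising an error when a stratum is under-supplied, rather than silently sampling fewer items, makes coverage gaps visible so that more data can be generated for that group.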
Non-traditional Data
Recent progress in LLM research has shown the value of using non-traditional data to make models more human-centered. This discussion focuses on three primary areas. The first is multimodal data, which allows LLMs to work with inputs like speech, images, and touch. The second is human-AI interaction data, such as user feedback, edits, and eye-tracking, which helps improve how well LLMs understand and respond to user needs. Lastly, human-human interaction data uses examples of real human interactions to teach LLMs how people communicate, enabling models to better handle context, complex emotions, and relationships.
Multimodal Data.
Recent research has sought to extend Large Language Models to multiple modalities (C. Li et al., 2024; Yin et al., 2024), significantly enhancing human-LLM interaction by allowing systems to process and respond to a diverse range of input formats beyond text, such as speech (Huang et al., 2024; Rubenstein et al., 2023), sound (Huang et al., 2024; Zhang et al., 2023), vision (Achiam et al., 2023; Fu et al., 2024; B. Li et al., 2025; J. Li et al., 2023; Zhang et al., 2023), and tactile data (Fu et al., 2024; Yu et al., 2024), creating richer and increasingly human-like communication channels. By integrating multiple sensory modalities, AI can better mirror human communication, which could further improve human-AI interaction. For instance, a multimodal AI assistant could analyze a user’s tone of voice, facial expressions, and spoken words to assess emotional states (Cheng et al., 2024; X. Zhang et al., 2024), tailoring its responses accordingly. Recent works also explore integrating human physiological data (e.g., EEG, BVP) with LLMs to enhance empathic human-AI interaction (Dongre et al., 2024). In applications such as education, healthcare, and accessibility, multimodality fosters inclusivity by accommodating users with diverse needs (Belyaeva et al., 2023; Chang et al., 2024; Yildirim et al., 2024). Ultimately, multimodal AI systems bridge the gap between machine efficiency and human communication, making interactions more seamless, adaptive, and human-centered.
Human-AI Interaction Data.
Expanding the scope of human-AI interaction data has opened new pathways for enhancing Large Language Models through both supervised fine-tuning and reinforcement learning from human feedback (RLHF). For example, Vicuna is trained on a large collection of user-shared conversations with ChatGPT to achieve high-quality outputs (Chiang et al., 2023). Another valuable type of human-AI interaction data is human edits, where users adjust the outputs of LLMs to better match their desired results. This data can be leveraged to fine-tune LLMs for improved preference alignment (Shaikh et al., 2024) or to extract user preferences more effectively (Gao et al., 2024). Beyond text-based interaction data, non-traditional signals such as eye gaze offer an additional source of interaction data. Eye-gaze data, in particular, provides a real-time, implicit feedback mechanism that enhances context awareness and alignment with user intent (Engel et al., 2023; Konrad et al., 2024; Lopez-Cardona et al., 2024; Prokofieva et al., 2019). These gaze-based interactions have been shown to improve multi-modal conversational understanding and can be leveraged in RLHF workflows to refine LLM outputs dynamically (Lopez-Cardona et al., 2024). Integrating gaze data into multi-modal frameworks would help create richer, contextually adaptive systems, fostering more intuitive, personalized, and effective interactions across diverse applications.
Human-Human Interaction Data.
Real-world human-human interaction data captures the nuances of human communication, including implicit cues, turn-taking dynamics, and diverse conversational contexts. Such data has the potential to improve LLMs by fostering deeper understanding of relational and situational context, thereby enabling models to generate responses that feel more natural, empathetic, and contextually appropriate. Recent advancements demonstrate how mining teacher-student interaction data, such as dialogue transcripts and collaborative problem-solving sessions, can align LLM outputs with human cognitive and emotional patterns. This allows LLMs to address complex, interdisciplinary challenges in fields such as education, psychology, and social science by emulating and learning from authentic human interaction styles (R. Wang et al., 2023, 2024; R. Wang & Demszky, 2024; Xu et al., 2024). By leveraging human-human interaction as an informative data source, we can expand the capacity of LLMs to foster meaningful, human-centered interactions in diverse real-world applications.
The integration of multimodal data, human-AI interaction data, and human-human interaction data can help LLMs more closely approximate the complexity of human communication, in turn making models more usable and reliable across high-impact domains like healthcare, education, and social services. As we exhaust traditional text data sources, recent efforts such as MINT-1T, an open-source multimodal dataset of interleaved text and images (Awadalla et al., 2024), will be fundamental to advancing the performance of frontier models.