Expanding Data Sources: Synthetic and Non-Traditional Data

Synthetic Data

High-quality, diverse, and privacy-compliant data is often scarce (Almeida, 2024). Filtering methods (Data Provenance) can remove as much as 90% of the raw web text in Common Crawl. Synthetic generation is one way to make up for the discarded data, and it is used in Nemotron-CC (Su et al., 2025) and other popular pre-training corpora. When a synthetic dataset replicates only the statistical properties of real data and retains no personally identifiable information, it can also help preserve individuals’ confidentiality. LLM-generated synthetic text can additionally serve as fine-tuning and evaluation data (Vongthongsri, 2025), where it is valuable for addressing class imbalances (Moon et al., 2024), especially in domains like healthcare (Guo & Chen, 2024), where data is sensitive, and mathematical reasoning, where gold examples are costly to produce (Chan et al., 2024).

Methods Used to Generate Synthetic Data.

Even medium-size language models can effectively expand pre-training corpora by paraphrasing existing data (Maini et al., 2024). LLMs can also generate entirely new content from scratch, including textbooks for pre-training (Gunasekar et al., 2023) and instruction-tuning data for post-training (Wang et al., 2023). Procuring high-quality synthetic data with LLMs typically involves three stages: generation, curation, and evaluation (Long et al., 2024). Generation often relies on prompt engineering to elicit LLM responses in the required format, using strategies such as task definition, conditional prompting, in-context learning, and multi-step generation, which address context limitations and degradation over reasoning steps (Long et al., 2024; C. Wang et al., 2024).
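As a rough illustration of how task definition, conditional prompting, and in-context learning can be combined in a single generation prompt, consider the minimal sketch below. The `GenerationSpec` class, the `build_prompt` function, and the prompt layout are hypothetical names and choices made for this example; they are not taken from Long et al. (2024) or any particular library.

```python
from dataclasses import dataclass, field


@dataclass
class GenerationSpec:
    """Hypothetical container for the pieces of a synthetic-data prompt."""
    task_definition: str  # what the model is asked to produce
    conditions: dict[str, str] = field(default_factory=dict)  # attribute -> required value
    demonstrations: list[tuple[str, str]] = field(default_factory=list)  # (input, output) pairs


def build_prompt(spec: GenerationSpec, new_input: str) -> str:
    """Assemble task definition, conditions, and in-context examples into one prompt."""
    parts = [f"Task: {spec.task_definition}"]
    # Conditional prompting: pin down attributes such as topic, label, or style.
    for name, value in sorted(spec.conditions.items()):
        parts.append(f"Constraint: {name} = {value}")
    # In-context learning: show a few worked examples in the target format.
    for example_input, example_output in spec.demonstrations:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    # The final slot is left open for the model to complete.
    parts.append(f"Input: {new_input}\nOutput:")
    return "\n\n".join(parts)
```

The resulting string would be sent to whichever LLM is used for generation; multi-step generation could be approximated by calling `build_prompt` repeatedly and feeding earlier outputs back in as demonstrations.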

The generated data often contains noise or corrupted samples due to hallucination, and is generally curated using sample filtering and label enhancement techniques. Sample filtering could involve simple heuristic-based strategies or leverage the advanced language-understanding capabilities of LLMs to generate confidence scores for data points based on quality and reject samples with low scores (Chung et al., 2023). Label enhancement strategies could include human inspection and annotation of low-confidence samples. These techniques are described in Pretraining Data.
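The two-step curation described above, a cheap heuristic filter followed by confidence-based rejection, might look roughly like the following sketch. The thresholds are arbitrary, and `scorer` is a hypothetical callable standing in for an LLM-based quality judge; this is not the procedure of Chung et al. (2023).

```python
from typing import Callable, Iterable


def heuristic_ok(text: str, min_words: int = 5, max_repeat_ratio: float = 0.5) -> bool:
    """Rule-based check: reject very short or highly repetitive samples."""
    words = text.lower().split()
    if len(words) < min_words:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) <= max_repeat_ratio


def filter_samples(
    samples: Iterable[str],
    scorer: Callable[[str], float],  # assumed to return a confidence in [0, 1]
    threshold: float = 0.7,
) -> tuple[list[str], list[str]]:
    """Split samples into those kept and those routed to human review."""
    kept, needs_review = [], []
    for sample in samples:
        if not heuristic_ok(sample):
            continue  # discarded outright by the heuristic
        if scorer(sample) >= threshold:
            kept.append(sample)
        else:
            needs_review.append(sample)  # low confidence: candidate for label enhancement
    return kept, needs_review
```

Routing low-confidence samples to a review queue instead of dropping them is one way to connect sample filtering with the human inspection step used for label enhancement.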

After curation, the generated data must be evaluated on several dimensions, including the statistical similarity between synthetic and real data, the impact on downstream model performance, and whether the synthetic data preserves essential patterns and relationships. Xia et al. (2024) capture these requirements in their proposed fidelity, utility, and privacy framework.
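One simple way to quantify the statistical similarity between synthetic and real text is to compare their word distributions. The sketch below uses the Jensen-Shannon divergence over whitespace-tokenized unigrams; this choice of metric and tokenization is an assumption made for illustration and is not the evaluation procedure of Xia et al. (2024).

```python
import math
from collections import Counter


def unigram_distribution(texts: list[str]) -> dict[str, float]:
    """Relative frequency of each lowercased, whitespace-separated token."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {tok: count / total for tok, count in counts.items()}


def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical distributions, 1 for disjoint ones."""
    divergence = 0.0
    for tok in set(p) | set(q):
        p_i, q_i = p.get(tok, 0.0), q.get(tok, 0.0)
        m_i = 0.5 * (p_i + q_i)  # mixture distribution
        if p_i > 0:
            divergence += 0.5 * p_i * math.log2(p_i / m_i)
        if q_i > 0:
            divergence += 0.5 * q_i * math.log2(q_i / m_i)
    return divergence
```

A low divergence only indicates surface-level fidelity; utility and privacy would still need separate checks, for example training a downstream model on the synthetic data and testing for memorized records.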

Making Synthetic Data More Human-Centric.

A human-centered approach to synthetic data creation should explicitly incorporate human values, perspectives, and audits at all stages of development, from generation to curation and evaluation. First, generation should serve to reflect authentic human interactions and preferences when real data collection proves slow or costly (Hämäläinen et al., 2023). Rather than simply increasing dataset sizes, synthetic data should contain realistic social interactions between individuals with diverse personalities and backgrounds. This requires persona alignment (Steerable HCLLMs) or role-play in which the LLM portrays a consistent identity (Tseng et al., 2024), possibly simulating a person from a particular sociodemographic background (Lutz et al., 2025), or an agent with a role, like a tutor or counselor (Li et al., 2024; Samuel et al., 2024; Shanahan et al., 2023). Persona alignment has been used to generate synthetic dialogues (Andukuri et al., 2024; Occhipinti et al., 2024) and preference data (Castricato et al., 2025).

At the curation stage, stratified sampling should reflect real-world distributions along known axes of variation, such as opinions and preferences (Sorensen et al., 2025). Finally, robust human-in-the-loop validation and auditing are essential. Human annotators and experts can review synthetic outputs, flag problematic patterns, and iteratively refine generation procedures. One major concern is that LLMs may reproduce biases and harms present in their training data, leaking private information or reinforcing existing social inequalities. Das et al. (2024) compare LLM-generated datasets with human-annotated benchmarks and highlight ethical concerns related to disparities in task performance and representational coverage. Chen et al. (2024) identify several failure modes in LLM-generated query-answer pairs, including instruction-following errors, and propose unlearning techniques to mitigate these risks and improve the reliability of synthetic queries. To preserve privacy, Ramesh et al. (2023) propose decentralized frameworks designed to reduce the likelihood of sensitive information exposure during data synthesis. Taken together, these steps can enhance the quality, fairness, and usability of synthetic data and bring it closer to ethical standards and user expectations.
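The stratified sampling step mentioned at the start of this paragraph can be sketched as follows. The function name, the `target_shares` mapping, and the example strata are hypothetical; in practice the target shares would come from survey data or another trusted estimate of the real-world distribution.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, Sequence


def stratified_sample(
    items: Sequence[dict],
    key: Callable[[dict], Hashable],          # maps an item to its stratum
    target_shares: dict[Hashable, float],     # desired share per stratum, summing to 1
    n: int,
    seed: int = 0,
) -> list[dict]:
    """Draw about n items so that each stratum's share follows target_shares."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for item in items:
        by_stratum[key(item)].append(item)
    sample = []
    for stratum, share in target_shares.items():
        pool = by_stratum.get(stratum, [])
        quota = min(round(share * n), len(pool))  # cannot draw more than exists
        sample.extend(rng.sample(pool, quota))
    return sample
```

If a stratum has fewer synthetic items than its quota, the shortfall is a signal to generate more data for that group instead of oversampling the others.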

Non-traditional Data

Recent progress in LLM research has shown the value of using non-traditional data to make models more human-centered. This discussion focuses on three primary areas. The first is multimodal data, which allows LLMs to work with inputs like speech, images, and touch. The second is human-AI interaction data, such as user feedback, edits, and eye-tracking, which helps improve how well LLMs understand and respond to user needs. Lastly, human-human interaction data uses examples of real human interactions to teach LLMs how people communicate, enabling models to better handle context, complex emotions, and relationships.

Multimodal Data.

Recent research has extended large language models to multiple modalities (C. Li et al., 2024; Yin et al., 2024). These systems can process and respond to a diverse range of input formats beyond text, such as speech (Huang et al., 2024; Rubenstein et al., 2023), sound (Huang et al., 2024; Zhang et al., 2023), vision (Achiam et al., 2023; Fu et al., 2024; B. Li et al., 2025; J. Li et al., 2023; Zhang et al., 2023), and tactile data (Fu et al., 2024; Yu et al., 2024), which creates richer and increasingly human-like communication channels. By integrating multiple sensory modalities, AI can better mirror human communication, which could further improve human-AI interaction. For instance, a multimodal AI assistant could analyze a user’s tone of voice, facial expressions, and spoken words to assess emotional states (Cheng et al., 2024; X. Zhang et al., 2024), tailoring its responses accordingly. Recent work also explores integrating human physiological data (e.g., EEG, BVP) with LLMs to enhance empathic human-AI interaction (Dongre et al., 2024). In applications such as education, healthcare, and accessibility, multimodality fosters inclusivity by accommodating users with diverse needs (Belyaeva et al., 2023; Chang et al., 2024; Yildirim et al., 2024). Ultimately, multimodal AI systems can help bridge the gap between machine efficiency and human communication, making interactions more seamless, adaptive, and human-centered.

Human-AI Interaction Data.

Expanding the scope of human-AI interaction data has opened new pathways for improving large language models through both supervised fine-tuning and reinforcement learning from human feedback (RLHF). For example, Vicuna was fine-tuned on a large collection of user-shared conversations with ChatGPT to achieve high-quality outputs (Chiang et al., 2023). Another valuable type of human-AI interaction data is human edits, where users adjust the outputs of LLMs to better match their desired results. This data can be leveraged to fine-tune LLMs for improved preference alignment (Shaikh et al., 2024) or to extract user preferences more effectively (Gao et al., 2024). Beyond text-based interaction data, non-traditional modalities such as eye-gaze signals offer an additional source of feedback. Eye-gaze data, in particular, provides a real-time, implicit feedback mechanism that enhances context awareness and alignment with user intent (Engel et al., 2023; Konrad et al., 2024; Lopez-Cardona et al., 2024; Prokofieva et al., 2019). These gaze-based interactions have been shown to improve multimodal conversational understanding and can be leveraged in RLHF workflows to refine LLM outputs dynamically (Lopez-Cardona et al., 2024). Integrating gaze data into multimodal frameworks could help create richer, contextually adaptive systems, fostering more intuitive, personalized, and effective interactions across diverse applications.
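To make the idea of learning from human edits more concrete, the sketch below turns a log of user edits into preference pairs in which the edited text is preferred over the original model output. The record fields (`prompt`, `model_output`, `user_edit`) and the similarity cutoff are assumptions for illustration; this is not the method of Shaikh et al. (2024) or Gao et al. (2024).

```python
from difflib import SequenceMatcher


def edits_to_preference_pairs(edit_log: list[dict], max_similarity: float = 0.98) -> list[dict]:
    """Convert (prompt, model_output, user_edit) records into chosen/rejected pairs."""
    pairs = []
    for record in edit_log:
        draft, edited = record["model_output"], record["user_edit"]
        # Skip near-identical edits: they carry little preference signal.
        similarity = SequenceMatcher(None, draft, edited).ratio()
        if similarity >= max_similarity:
            continue
        pairs.append({"prompt": record["prompt"], "chosen": edited, "rejected": draft})
    return pairs
```

Pairs in this shape match the input that preference-optimization pipelines commonly expect, though the exact format depends on the training framework used.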

Human-Human Interaction Data.

Real-world human-human interaction data captures the nuances of human communication, including implicit cues, turn-taking dynamics, and diverse conversational contexts. Such data has the potential to improve LLMs by fostering a deeper understanding of relational and situational context, thereby enabling models to generate responses that feel more natural, empathetic, and contextually appropriate. Recent work demonstrates how mining teacher-student interaction data, such as dialogue transcripts and collaborative problem-solving sessions, can align LLM outputs with human cognitive and emotional patterns. This alignment allows LLMs to address complex, interdisciplinary challenges in fields such as education, psychology, and social science by emulating and learning from authentic human interaction styles (R. Wang et al., 2023, 2024; R. Wang & Demszky, 2024; Xu et al., 2024). By leveraging human-human interaction as an informative data source, we can expand the capacity of LLMs to foster meaningful, human-centered interactions in diverse real-world applications.

The integration of multimodal data, human-AI interaction data, and human-human interaction data can help LLMs more closely approximate the complexity of human communication, in turn making models more usable and reliable across high-impact domains like healthcare, education, and social services. As traditional text data sources are exhausted, recent efforts such as MINT-1T, an open-source multimodal dataset of interleaved text and images (Awadalla et al., 2024), will be fundamental to advancing the performance of frontier models.

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., & others. (2023). Gpt-4 technical report. ArXiv Preprint, abs/2303.08774. https://arxiv.org/abs/2303.08774

Almeida, D. R. (2024). Synthetic Data Generation (Part 1). Open in GitHub. https://cookbook.openai.com/examples/sdg1

Andukuri, C., Fränken, J.-P., Gerstenberg, T., & Goodman, N. D. (2024). Star-gate: Teaching language models to ask clarifying questions. ArXiv Preprint, abs/2403.19154. https://arxiv.org/abs/2403.19154

Awadalla, A., Xue, L., Lo, O., Shu, M., Lee, H., Guha, E., Jordan, M., Shen, S., Awadalla, M., Savarese, S., Xiong, C., Xu, R., Choi, Y., & Schmidt, L. (2024). MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. arXiv preprint arXiv:2406.11271v5.

Belyaeva, A., Cosentino, J., Hormozdiari, F., Eswaran, K., Shetty, S., Corrado, G., Carroll, A., McLean, C. Y., & Furlotte, N. A. (2023). Multimodal LLMs for health grounded in individual-specific data. In ArXiv preprint: Vol. abs/2307.09018. https://arxiv.org/abs/2307.09018

Castricato, L., Lile, N., Rafailov, R., Fränken, J.-P., & Finn, C. (2025). Persona: A reproducible testbed for pluralistic alignment. Proceedings of the 31st International Conference on Computational Linguistics, 11348–11368.

Chan, Y.-C., Pu, G., Shanker, A., Suresh, P., Jenks, P., Heyer, J., & Denton, S. (2024). Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs. arXiv preprint arXiv:2409.19759v3 [cs.CL].

Chang, R.-C., Liu, Y., & Guo, A. (2024). WorldScribe: Towards Context-Aware Live Visual Descriptions. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1–18.

Chen, J., Zhang, Y., Wang, B., Zhao, W. X., Wen, J.-R., & Chen, W. (2024). Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024, 14855–14865.

Cheng, Z., Cheng, Z.-Q., He, J.-Y., Sun, J., Wang, K., Lin, Y., Lian, Z., Peng, X., & Hauptmann, A. (2024). Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning. ArXiv Preprint, abs/2406.11161. https://arxiv.org/abs/2406.11161

Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., & Xing, E. P. (2023). Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/

Chung, J., Kamar, E., & Amershi, S. (2023). Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 575–593). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.34

Das, D., Langis, K. D., Martin-Boyle, A., Kim, J., Lee, M., Kim, Z. M., Hayati, S. A., Owan, R., Hu, B., Parkar, R., Koo, R., Park, J., Tyagi, A., Ferland, L., Roy, S., Liu, V., & Kang, D. (2024). Under the Surface: Tracking the Artifactuality of LLM-Generated Data. In ArXiv preprint: Vol. abs/2401.14698. https://arxiv.org/abs/2401.14698

Dongre, P., Behravan, M., Gupta, K., Billinghurst, M., & Gračanin, D. (2024). Integrating Physiological Data with Large Language Models for Empathic Human-AI Interaction. In ArXiv preprint: Vol. abs/2404.15351. https://arxiv.org/abs/2404.15351

Engel, J., Somasundaram, K., Goesele, M., Sun, A., Gamino, A., Turner, A., Talattof, A., Yuan, A., Souti, B., Meredith, B., & others. (2023). Project aria: A new tool for egocentric multi-modal ai research. ArXiv Preprint, abs/2308.13561. https://arxiv.org/abs/2308.13561

Fu, L., Datta, G., Huang, H., Panitch, W. C.-H., Drake, J., Ortiz, J., Mukadam, M., Lambeta, M., Calandra, R., & Goldberg, K. (2024). A Touch, Vision, and Language Dataset for Multimodal Alignment. Forty-First International Conference on Machine Learning. https://openreview.net/forum?id=tFEOOH9eH0

Gao, G., Taymanov, A., Salinas, E., Mineiro, P., & Misra, D. (2024). Aligning LLM Agents by Learning Latent Preference from User Edits. In ArXiv preprint: Vol. abs/2404.15269. https://arxiv.org/abs/2404.15269

Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Del Giorno, A., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., & others. (2023). Textbooks are all you need. arXiv Preprint arXiv:2306.11644.

Guo, X., & Chen, Y. (2024). Generative AI for Synthetic Data Generation: Methods, Challenges and the Future. arXiv Preprint arXiv:2403.04190. https://doi.org/10.48550/arXiv.2403.04190

Hämäläinen, P., Tavast, M., & Kunnari, A. (2023). Evaluating Large Language Models in Generating Synthetic HCI Research Data: a Case Study. In A. Schmidt, K. Väänänen, T. Goyal, P. O. Kristensson, A. Peters, S. Mueller, J. R. Williamson, & M. L. Wilson (Eds.), Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI 2023, Hamburg, Germany, April 23-28, 2023 (p. 433:1-433:19). ACM. https://doi.org/10.1145/3544548.3580688

Huang, R., Li, M., Yang, D., Shi, J., Chang, X., Ye, Z., Wu, Y., Hong, Z., Huang, J., Liu, J., Ren, Y., Zou, Y., Zhao, Z., & Watanabe, S. (2024). AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head. In M. J. Wooldridge, J. G. Dy, & S. Natarajan (Eds.), Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver, Canada (pp. 23802–23804). AAAI Press. https://doi.org/10.1609/AAAI.V38I21.30570

Konrad, R., Padmanaban, N., Buckmaster, J. G., Boyle, K. C., & Wetzstein, G. (2024). Gazegpt: Augmenting human capabilities using gaze-contingent contextual ai for smart eyewear. ArXiv Preprint, abs/2401.17217. https://arxiv.org/abs/2401.17217

Li, B., Zhang, Y., Chen, L., Wang, J., Pu, F., Cahyono, J. A., Yang, J., Li, C., & Liu, Z. (2025). Otter: A multi-modal model with in-context instruction tuning. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Li, C., Gan, Z., Yang, Z., Yang, J., Li, L., Wang, L., Gao, J., & others. (2024). Multimodal foundation models: From specialists to general-purpose assistants. Foundations and Trends® in Computer Graphics and Vision, 16(1–2), 1–214.

Li, J., Li, D., Savarese, S., & Hoi, S. C. H. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, & J. Scarlett (Eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (Vol. 202, pp. 19730–19742). PMLR. https://proceedings.mlr.press/v202/li23q.html

Li, J., Peris, C., Mehrabi, N., Goyal, P., Chang, K.-W., Galstyan, A., Zemel, R., & Gupta, R. (2024). The steerability of large language models toward data-driven personas. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 7290–7305.

Long, L., Wang, R., Xiao, R., Zhao, J., Ding, X., Chen, G., & Wang, H. (2024). On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey. arXiv preprint arXiv:2406.15126v1 [cs.CL]. https://arxiv.org/abs/2406.15126v1

Lopez-Cardona, A., Segura, C., Karatzoglou, A., Abadal, S., & Arapakis, I. (2024). Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models. ArXiv Preprint, abs/2410.01532. https://arxiv.org/abs/2410.01532

Lutz, M., Sen, I., Ahnert, G., Rogers, E., & Strohmaier, M. (2025). The prompt makes the person (a): A systematic evaluation of sociodemographic persona prompting for large language models. arXiv Preprint arXiv:2507.16076.

Maini, P., Seto, S., Bai, R., Grangier, D., Zhang, Y., & Jaitly, N. (2024). Rephrasing the web: A recipe for compute and data-efficient language modeling. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 14044–14072.

Moon, Y.-B., Nam, H.-W., Choi, W., Kim, N., Kwak, S., & Oh, T.-H. (2024). SYNAuG: Exploiting Synthetic Data for Data Imbalance Problems (Version v3).

Occhipinti, D., Tekiroglu, S., & Guerini, M. (2024). PRODIGy: a PROfile-based DIalogue Generation dataset. In K. Duh, H. Gomez, & S. Bethard (Eds.), Findings of the Association for Computational Linguistics: NAACL 2024 (pp. 3500–3514). Association for Computational Linguistics. https://aclanthology.org/2024.findings-naacl.222

Prokofieva, A., Celikyilmaz, F. A., Hakkani-Tur, D. Z., Heck, L., & Slaney, M. (2019). Eye gaze for spoken language understanding in multi-modal conversational interactions. Google Patents.

Ramesh, V., Zhao, R., & Goel, N. (2023). Decentralised, Scalable and Privacy-Preserving Synthetic Data Generation. In ArXiv preprint: Vol. abs/2310.20062. https://arxiv.org/abs/2310.20062

Rubenstein, P. K., Asawaroengchai, C., Nguyen, D. D., Bapna, A., Borsos, Z., de Chaumont Quitry, F., Chen, P., Badawy, D. E., Han, W., Kharitonov, E., Muckenhirn, H., Padfield, D., Qin, J., Rozenberg, D., Sainath, T., Schalkwyk, J., Sharifi, M., Ramanovich, M. T., Tagliasacchi, M., … Frank, C. (2023). AudioPaLM: A Large Language Model That Can Speak and Listen. In ArXiv preprint: Vol. abs/2306.12925. https://arxiv.org/abs/2306.12925

Samuel, V., Zou, H. P., Zhou, Y., Chaudhari, S., Kalyan, A., Rajpurohit, T., Deshpande, A., Narasimhan, K., & Murahari, V. (2024). Personagym: Evaluating persona agents and llms. arXiv Preprint arXiv:2407.18416.

Shaikh, O., Lam, M., Hejna, J., Shao, Y., Bernstein, M., & Yang, D. (2024). Show, Don’t Tell: Aligning Language Models with Demonstrated Feedback. ArXiv Preprint, abs/2406.00888. https://arxiv.org/abs/2406.00888

Shanahan, M., McDonell, K., & Reynolds, L. (2023). Role play with large language models. Nature, 623(7987), 493–498.

Sorensen, T., Newman, B., Moore, J., Park, C., Fisher, J., Mireshghallah, N., Jiang, L., & Choi, Y. (2025). Spectrum tuning: Post-training for distributional coverage and in-context steerability. arXiv Preprint arXiv:2510.06084.

Su, D., Kong, K., Lin, Y., Jennings, J., Norick, B., Kliegl, M., Patwary, M., Shoeybi, M., & Catanzaro, B. (2025). Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2459–2475.

Tseng, Y.-M., Huang, Y.-C., Hsiao, T.-Y., Chen, W.-L., Huang, C.-W., Meng, Y., & Chen, Y.-N. (2024). Two tales of persona in llms: A survey of role-playing and personalization. Findings of the Association for Computational Linguistics: EMNLP 2024, 16612–16631.

Vongthongsri, K. (2025). Using LLMs For Synthetic Data Generation: The Definitive Guide. https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms

Wang, C., Deng, Y., Lyu, Z., Zeng, L., He, J., Yan, S., & An, B. (2024). Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning. https://arxiv.org/abs/2406.14283v4

Wang, R., & Demszky, D. (2024). Edu-ConvoKit: An Open-Source Library for Education Conversation Data. In K.-W. Chang, A. Lee, & N. Rajani (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations) (pp. 61–69). Association for Computational Linguistics. https://aclanthology.org/2024.naacl-demo.6

Wang, R., Wirawarn, P., Goodman, N., & Demszky, D. (2023). SIGHT: A Large Annotated Dataset on Student Insights Gathered from Higher Education Transcripts. In E. Kochmar, J. Burstein, A. Horbach, R. Laarmann-Quante, N. Madnani, A. Tack, V. Yaneva, Z. Yuan, & T. Zesch (Eds.), Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) (pp. 315–351). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.bea-1.27

Wang, R., Zhang, Q., Robinson, C., Loeb, S., & Demszky, D. (2024). Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes. In K. Duh, H. Gomez, & S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 2174–2199). Association for Computational Linguistics. https://aclanthology.org/2024.naacl-long.120

Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., & Hajishirzi, H. (2023). Self-Instruct: Aligning Language Models with Self-Generated Instructions. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 13484–13508). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.754

Xia, Y., Wang, C.-H., Mabry, J., & Cheng, G. (2024). Advancing Retail Data Science: Comprehensive Evaluation of Synthetic Data. arXiv preprint arXiv:2406.13130v1 [cs.LG].

Xu, S., Huang, X., Lo, C. K., Chen, G., & yung Siu-Jong, M. (2024). Evaluating the performance of ChatGPT and GPT-4o in coding classroom discourse data: A study of synchronous online mathematics instruction. Computers and Education: Artificial Intelligence, 7, 100325. https://doi.org/10.1016/j.caeai.2024.100325

Yildirim, N., Richardson, H., Wetscherek, M. T., Bajwa, J., Jacob, J., Pinnock, M. A., Harris, S., de Castro, D. C., Bannur, S., Hyland, S. L., Ghosh, P., Ranjit, M., Bouzid, K., Schwaighofer, A., Pérez-García, F., Sharma, H., Oktay, O., Lungren, M. P., Alvarez-Valle, J., … Thieme, A. (2024). Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology. In F. “Floyd” Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, & I. Shklovski (Eds.), Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, Honolulu, HI, USA, May 11-16, 2024 (p. 444:1-444:22). ACM. https://doi.org/10.1145/3613904.3642013

Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., & Chen, E. (2024). A survey on multimodal large language models. National Science Review, nwae403.

Yu, S., Lin, K., Xiao, A., Duan, J., & Soh, H. (2024). Octopi: Object Property Reasoning with Large Tactile-Language Models. ArXiv Preprint, abs/2405.02794. https://arxiv.org/abs/2405.02794

Zhang, H., Li, X., & Bing, L. (2023). Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. In Y. Feng & E. Lefever (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 543–553). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-demo.49

Zhang, X., Liu, H., Xu, K., Zhang, Q., Liu, D., Ahmed, B., & Epps, J. (2024). When LLMs meets acoustic landmarks: An efficient approach to integrate speech into large language models for depression detection. ArXiv Preprint, abs/2402.13276. https://arxiv.org/abs/2402.13276
