Multilinguality

Beyond capturing users' needs, values, and other preferences, models should let users interact in their preferred language. Although more than 7,000 languages are spoken worldwide, most LLM development focuses on English (Held et al., 2023). Expanding multilingual capabilities is therefore critical to democratizing access to language models, particularly by bridging the gap between high-resource and low-resource language technologies.

Current Approaches

Multilingual large language models (MLLMs) are systems capable of both understanding and generating text in multiple languages. While most LLMs perform best in English, there is a concerted effort to improve their multilingual capabilities. From a data-centric perspective, these efforts focus on curating diverse linguistic resources, including pre-training corpora and post-training datasets for supervised fine-tuning (SFT) and preference learning. Many pre-training corpora, such as RedPajama (Weber et al., 2024) and CC-100 (Conneau et al., 2020), are collected via web scraping. Furthermore, many existing pre-training corpora that are primarily in English already include a small percentage of non-English data; incorporating even this small amount of data during pre-training can improve cross-lingual capabilities (Blevins & Zettlemoyer, 2022). Nonetheless, a core challenge is that the Internet remains heavily English-centric, leading to performance gaps for less-resourced languages that appear infrequently online or, in some cases, lack standardized written forms. For an in-depth survey on multilingual LLMs, please refer to Qin et al. (2025).

Rather than relying solely on naturally occurring multilingual text, researchers have turned to translating high-resource English text into other languages. For example, Wang et al. (2025) translate the pre-training corpus FineWeb into multiple languages for pre-training. Similar approaches exist for creating post-training datasets, such as the Aya (Üstün et al., 2024) and CrossAlpaca (Ranaldi et al., 2024) corpora, which rely on either machine translation or existing MLLMs to assist with translation (Lai et al., 2023; Yue et al., 2024). However, translated datasets can introduce artifacts (referred to as translationese) that shift linguistic patterns and degrade data quality, particularly for downstream reasoning tasks and open-ended generation (Dang et al., 2024; Vanmassenhove et al., 2021). While translation is an efficient way to generate large amounts of multilingual data, we must also ask what gets lost in translation. For example, text translated from English may lack the cultural context and nuance found in text written by communities that speak the target language (Qin et al., 2025).

Finally, training interventions are complemented by non-training approaches based on prompting. A wide range of prompting strategies for multilingual use has been proposed, including those surveyed by Vatsal et al. (2025). One common strategy is “translate-test”: translating the user input into English before performing the task, which leverages the stronger English proficiency of most current LLMs (Artetxe et al., 2023; Etxaniz et al., 2024; H. Huang et al., 2023; L. Huang et al., 2022; Liu et al., 2025). While this approach often boosts performance, it can fall short for tasks requiring cultural nuance, idiomatic understanding, or language-specific world knowledge, where remaining in the original language is critical for faithful interpretation and generation (Liu et al., 2025).
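The translate-test strategy described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_translate_to_english` and `toy_english_model` are hypothetical stand-ins for a real machine translation system and a real LLM call, and the two-word lexicon is invented. The sketch only shows the order of operations (translate first, then run the task in English); it is not the implementation used in any of the cited papers.

```python
from typing import Callable

# Hypothetical toy lexicon standing in for a real machine translation system.
TOY_LEXICON = {"hola": "hello", "mundo": "world"}


def toy_translate_to_english(text: str) -> str:
    """Word-by-word lookup; a placeholder for a real MT model (assumption)."""
    return " ".join(TOY_LEXICON.get(word, word) for word in text.lower().split())


def toy_english_model(prompt: str) -> str:
    """Placeholder for an LLM call (assumption); it just echoes the prompt."""
    return f"[answer to: {prompt}]"


def translate_test(
    user_input: str,
    task_instruction: str,
    translate: Callable[[str], str],
    model: Callable[[str], str],
) -> str:
    """Translate the input into English, then run the task on the English text."""
    english_input = translate(user_input)
    prompt = f"{task_instruction}\n\n{english_input}"
    return model(prompt)


result = translate_test(
    "Hola mundo",
    "Classify the sentiment.",
    translate=toy_translate_to_english,
    model=toy_english_model,
)
print(result)
```

Because the model only ever sees the English rendering, anything the translation step drops (idiom, cultural reference, language-specific knowledge) is unavailable to it, which is the limitation noted by Liu et al. (2025).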

Future of Multilinguality for HCLLMs

Looking forward, efforts toward human-centered multilingual capabilities must consider the following areas. First, there is a growing interest in multilingual safety. As Yong et al. (2025) identify, multilingual safety remains massively underrepresented as a research domain, resulting in safety standards built for English that do not translate effectively to other linguistic contexts. Treating English as the universal reference point obscures sociolinguistic variation and produces a gap between how models behave and how safety norms should operate for real users across languages. Understanding and addressing multilingual safety thus remains an open frontier.

A second challenge involves determining what it means to represent language in ways that do not exploit the communities that speak it. The desire for multilingual NLP is not new. Projects such as Meta’s No Language Left Behind demonstrate longstanding investment in broad language coverage. However, as Bird (2024) argues, these efforts often treat language as a detached artifact rather than something rooted in communities, cultures, and social practices. When language is treated as a pure optimization target, scraped data, or a resource to be “unlocked,” the resulting systems risk extraction without contributing tangible value to the communities whose linguistic labor enables them.

Beyond technical considerations, achieving authentic multilinguality in human-centered systems demands rethinking how data is gathered, whose language practices are modeled, and for what ends. Scaling web pre-training and applying technical fixes may improve performance on multilingual benchmarks, but doing so does not confront the deeper question of whether these systems advance the needs, agency, and self-determination of speakers. Efforts to “democratize” access to LLMs echo earlier narratives that cast computing as a universal solution. As the Information and Communication Technology for Development (ICT4D) literature reminds us (Harris, 2016; Toyama, 2015), such narratives risk reproducing existing inequities when social, cultural, and political contexts are ignored.

Artetxe, M., Goswami, V., Bhosale, S., Fan, A., & Zettlemoyer, L. (2023). Revisiting Machine Translation for Cross-lingual Classification. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 6489–6499.

Bird, S. (2024). Must NLP be Extractive? Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 14915–14929.

Blevins, T., & Zettlemoyer, L. (2022). Language Contamination Helps Explains the Cross-lingual Capabilities of English Pretrained Models. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 3563–3574.

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised Cross-lingual Representation Learning at Scale. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 8440–8451). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.747

Dang, J., Ahmadian, A., Marchisio, K., Kreutzer, J., Üstün, A., & Hooker, S. (2024). RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 13134–13156.

Etxaniz, J., Azkune, G., Soroa, A., de Lacalle, O. L., & Artetxe, M. (2024). Do multilingual language models think better in English? Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), 550–564.

Harris, R. W. (2016). How ICT4D research fails the poor. Information Technology for Development, 22(1), 177–192.

Held, W., Harris, C., Best, M., & Yang, D. (2023). A material lens on coloniality in NLP. arXiv Preprint arXiv:2311.08391.

Huang, H., Tang, T., Zhang, D., Zhao, X., Song, T., Xia, Y., & Wei, F. (2023). Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting. In H. Bouamor, J. Pino, & K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 12365–12394). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-emnlp.826

Huang, L., Ma, S., Zhang, D., Wei, F., & Wang, H. (2022). Zero-shot Cross-lingual Transfer of Prompt-based Tuning with a Unified Multilingual Prompt. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 11488–11497.

Lai, V., Nguyen, C., Ngo, N., Nguyen, T., Dernoncourt, F., Rossi, R., & Nguyen, T. (2023). Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback. In Y. Feng & E. Lefever (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 318–327). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-demo.28

Liu, C., Zhang, W., Zhao, Y., Tuan, L. A., & Bing, L. (2025). Is translation all you need? A study on solving multilingual tasks with large language models. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 9594–9614.

Qin, L., Chen, Q., Zhou, Y., Chen, Z., Li, Y., Liao, L., Li, M., Che, W., & Yu, P. S. (2025). A survey of multilingual large language models. Patterns, 6(1).

Ranaldi, L., Pucci, G., & Freitas, A. (2024). Empowering cross-lingual abilities of instruction-tuned large language models by translation-following demonstrations. Findings of the Association for Computational Linguistics: ACL 2024, 7961–7973.

Toyama, K. (2015). Geek heresy: Rescuing social change from the cult of technology. PublicAffairs.

Üstün, A., Aryabumi, V., Yong, Z., Ko, W.-Y., D’souza, D., Onilude, G., Bhandari, N., Singh, S., Ooi, H.-L., Kayid, A., & others. (2024). Aya model: An instruction finetuned open-access multilingual language model. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 15894–15939.

Vanmassenhove, E., Shterionov, D., & Gwilliam, M. (2021). Machine Translationese: Effects of Algorithmic Bias on Linguistic Complexity in Machine Translation. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2203–2213.

Vatsal, S., Dubey, H., & Singh, A. (2025). Multilingual Prompt Engineering in Large Language Models: A Survey Across NLP Tasks. arXiv Preprint arXiv:2505.11665.

Wang, J., Lu, Y., Weber, M., Ryabinin, M., Adelani, D. I., Chen, Y., Tang, R., & Stenetorp, P. (2025). Multilingual language model pretraining using machine-translated data. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 28075–28095.

Weber, M., Fu, D., Anthony, Q., Oren, Y., Adams, S., Alexandrov, A., Lyu, X., Nguyen, H., Yao, X., Adams, V., & others. (2024). RedPajama: An open dataset for training large language models. Advances in Neural Information Processing Systems, 37, 116462–116492.

Yong, Z.-X., Ermis, B., Fadaee, M., Bach, S., & Kreutzer, J. (2025). The state of multilingual LLM safety research: From measuring the language gap to mitigating it. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 15856–15871.

Yue, X., Song, Y., Asai, A., Kim, S., de Dieu Nyandwi, J., Khanuja, S., Kantharuban, A., Sutawika, L., Ramamoorthy, S., & Neubig, G. (2024). Pangea: A fully open multilingual multimodal LLM for 39 languages. The Thirteenth International Conference on Learning Representations.
