In NLP, “scaling” refers to the relationship between a model’s performance and factors such as the number of parameters , dataset size , and computational resources (Kaplan et al., 2020ReferenceKaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models. ArXiv Preprint, abs/2001.08361. https://arxiv.org/abs/2001.08361). Understanding these scaling laws is important for developing HCLLMs that balance efficiency and performance with accessibility and fairness.
Scaling Laws in LLMs
Kaplan et al. (2020)ReferenceKaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models. ArXiv Preprint, abs/2001.08361. https://arxiv.org/abs/2001.08361 conducted foundational research on empirical scaling laws for language model performance, particularly focusing on cross-entropy loss. Their work established that model performance improves predictably with increases in , , and , following a power-law relationship. Importantly, they discovered that returns diminish when either or is held constant, underscoring the need for a strategic, balanced approach to scaling in NLP.
Tay et al. (2023)ReferenceTay, Y., Dehghani, M., Abnar, S., Chung, H., Fedus, W., Rao, J., Narang, S., Tran, V., Yogatama, D., & Metzler, D. (2023). Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? In H. Bouamor, J. Pino, & K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 12342–12364). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-emnlp.825 expand on @kaplan2020scaling’s work to investigate the effect of scaling properties of different inductive biases and model architectures. They find via extensive experiments that (1) architecture is an important consideration, (2) the best performing model can fluctuate at different scales, and (3) the choice of whether to scale depth (number of layers) or width (more neurons per layer) is important, especially in resource-constrained environments. Models often excel at pretraining but underperform on downstream tasks, underscoring the need to evaluate models based on human utility rather than just raw performance metrics. @hoffmann2022training introduced “Chinchilla scaling,” which showed that smaller models trained on larger datasets achieve better performance per compute budget. This finding is applicable in resource-constrained human-centered applications where compute and data availability may be limited due to ethical or logistical constraints.
Ivgi et al. (2022)ReferenceIvgi, M., Carmon, Y., & Berant, J. (2022). Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments. In Y. Goldberg, Z. Kozareva, & Y. Zhang (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2022 (pp. 7354–7371). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-emnlp.544 investigate the applicability of scaling laws for different NLP tasks and find that benefits vary. Tasks aligned with pretraining objectives, such as question answering, show clearer scaling behavior compared to specialized tasks like sentiment analysis. Thus, for human-centered applications, practitioners should assess whether scaling laws are applicable to their specific use case or if full-scale testing is necessary. While scaling can improve performance for some tasks, it may not always be the most efficient path - a point further developed by Liang et al. (2022)ReferenceLiang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., & others. (2022). Holistic evaluation of language models. ArXiv Preprint, abs/2211.09110. https://arxiv.org/abs/2211.09110, who show that similar or greater improvements to model accuracy can be achieved through more efficient human-centric means, such as training with human feedback.
Scaling in Human-Centered Domains
Although scaling improves the overall performance of LLMs, it does not proportionally improve performance at the same rate for all subpopulations and human-centered knowledge domains. Representational biases in training data (Data Provenance and Data Representation, Bias and Ethics) can lead to disparities in scaling (Rolf et al., 2021ReferenceRolf, E., Worledge, T. T., Recht, B., & Jordan, M. (2021). Representation matters: Assessing the importance of subgroup allocations in training data. International Conference on Machine Learning, 9040–9051.). However, data is not the only cause of relative disparities in scaling, and may not even be the principal cause. Held et al. (2025)ReferenceHeld, W., Hall, D., Liang, P., & Yang, D. (2025). Relative scaling laws for llms. arXiv Preprint arXiv:2510.24626. found that, when holding the data scale constant, model-size scaling is responsible for widening the performance gap between LLM performance on certain varieties of English relative to other varieties. In addition to dialect, the authors investigated AI risk behaviors (Perez et al., 2023ReferencePerez, E., Ringer, S., Lukosiute, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., & others. (2023). Discovering language model behaviors with model-written evaluations. Findings of the Association for Computational Linguistics: ACL 2023, 13387–13434.) and found that scaling mitigates some risks more than others.
Scaling model size alone cannot address human-domain-specific challenges such as cultural biases, as Bommasani et al. (2021)ReferenceBommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N., Chen, A., Creel, K., Davis, J. Q., Demszky, D., … Liang, P. (2021). On the Opportunities and Risks of Foundation Models. In ArXiv preprint: Vol. abs/2108.07258. https://arxiv.org/abs/2108.07258 highlight. The effectiveness of scaling laws varies across different domains, with some human-centered areas experiencing diminishing returns when either the number of parameters () or dataset size () is limited. For instance, Brown et al. (2020)ReferenceBrown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.-F. Balcan, & H.-T. Lin (Eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html observed that large models like GPT-3 show reduced performance gains on human-centered tasks unless they are fine-tuned with context-specific data, emphasizing the need for tailored approaches in these applications.
Gururangan et al. (2020)ReferenceGururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N. A. (2020). Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 8342–8360). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.740 conducted a study to investigate whether it is still helpful to tailor a pretrained model to the domain of a target task in a world where large-scale models, which form the foundation of today’s NLP landscape, have found much success in a broad-coverage approach trained on a wide variety of sources. Overall they found that multi-phase adaptive pretraining offers large gains in task performance, implying that the quality and treatment of the data is more important than the quantity, which is relevant when dealing with sensitive or specialized human domains.
Bender et al. (2021)ReferenceBender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3442188.3445922 contribute by arguing that increasing without considering diverse and ethical data sources can lead to biased or non-representative outcomes, negatively impacting human-centered goals such as equitable access and cultural inclusivity. Furthering the discussion on fine-tuning for specific human-centric domains, Zhang et al. (2024)ReferenceZhang, B., Liu, Z., Cherry, C., & Firat, O. (2024). When scaling meets llm finetuning: The effect of data, model and finetuning method. ArXiv Preprint, abs/2402.17193. https://arxiv.org/abs/2402.17193 examine how different scaling factors influence the fine-tuning performance of LLMs. The study aligns with Kaplan et al. (2020)ReferenceKaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models. ArXiv Preprint, abs/2001.08361. https://arxiv.org/abs/2001.08361 and Hoffmann et al. (2022)ReferenceHoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. de L., Hendricks, L. A., Welbl, J., Clark, A., & others. (2022). Training compute-optimal large language models. ArXiv Preprint, abs/2203.15556. https://arxiv.org/abs/2203.15556, exhibiting a multiplicative joint scaling law that links fine-tuning performance to model size, fine-tuning data size, and other scaling factors. It indicates that fine-tuning performance improves predictably when scaling both dataset size and model size and finds that LLM fine-tuning benefits more from LLM model scaling than pretraining data scaling. The work also highlights that the effectiveness of fine-tuning varies significantly based on the downstream task and the size and quality of the available fine-tuning data, underscoring the need for task-specific approaches in human-centered applications.
Scaling in Human-Centered Goals
Looking at specific human-centered evaluations, such as bias and fairness, there are potentially unexpected results when models are scaled. Ethayarajh & Jurafsky (2020)ReferenceEthayarajh, K., & Jurafsky, D. (2020). Utility is in the Eye of the User: A Critique of NLP Leaderboards. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4846–4853). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.393 show that leaderboard-driven scaling can create misaligned incentives in model development. By analyzing examples like the SNLI leaderboard, they demonstrate that focusing solely on state-of-the-art performance discourages practical models. For instance, with SNLI baselines at 78% (n-gram) and 81% (LSTM) versus a 92% BERT-based SOTA, there is little motivation to develop lightweight models with 85% accuracy, which could balance performance and computational efficiency and be more practically useful.
Ganguli et al. (2022)ReferenceGanguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., & others. (2022). Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv Preprint arXiv:2209.07858. explored model safety through red teaming, where testers try to provoke harmful outputs. Testing models from 2.7B to 52B parameters, they found that RLHF-trained models became harder to ‘break’ as they scaled, reducing the success of harmful attacks (the harmlessness score increased from approximately -0.5 with 2.7B parameters to 0.5 with 52B parameters). In contrast, other models showed no improvement in resisting such outputs with increased size. This study stresses the importance of incorporating human feedback during training to develop safer AI systems. Larger models exhibit heightened privacy risks. Hernandez et al. (2022)ReferenceHernandez, D., Brown, T., Conerly, T., DasSarma, N., Drain, D., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Henighan, T., Hume, T., & others. (2022). Scaling laws and interpretability of learning from repeated data. arXiv Preprint arXiv:2205.10487. demonstrate that an 800M parameter model could be degraded to that of a 400M model by repeating just 0.1% of the training data 100 times, suggesting that larger models aren’t automatically more robust to certain types of data-based attacks (see Consent and Ownership).
In regards to emulating human values, Biedma et al. (2024)ReferenceBiedma, P., Yi, X., Huang, L., Sun, M., & Xie, X. (2024). Beyond Human Norms: Unveiling Unique Values of Large Language Models through Interdisciplinary Approaches. ArXiv Preprint, abs/2404.12744. https://arxiv.org/abs/2404.12744 showed that as language models get larger, they show an increased preference for the task-oriented values like accuracy and factual consistency at the slight expense of social intelligence or moral fiber and adherence to ethical norms. While larger models may become more capable, their value systems have the potential to become increasingly misaligned with human values.
Transfer learning is often a prerequisite for the application of LLMs in human-centered tasks. Scaling laws suggest that a model’s ability to transfer knowledge improves as its performance increases. This relationship generally holds in human-centered evaluations, with studies showing that transfer learning benefits from scaling in broad tasks (Hernandez et al., 2021ReferenceHernandez, D., Kaplan, J., Henighan, T., & McCandlish, S. (2021). Scaling laws for transfer. ArXiv Preprint, abs/2102.01293. https://arxiv.org/abs/2102.01293; Raffel et al., 2020ReferenceRaffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res., 21, 140:1-140:67. http://jmlr.org/papers/v21/20-074.html). However, Hernandez et al. (2021)ReferenceHernandez, D., Kaplan, J., Henighan, T., & McCandlish, S. (2021). Scaling laws for transfer. ArXiv Preprint, abs/2102.01293. https://arxiv.org/abs/2102.01293 and Raffel et al. (2020)ReferenceRaffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res., 21, 140:1-140:67. http://jmlr.org/papers/v21/20-074.html highlight the importance of domain-specific fine-tuning and alignment techniques to achieve human-centered objectives in specialized or sensitive areas such as legal and medical contexts.
Inference time scaling
Recent advances in inference-time scaling offer pathways to improve HCLLMs without retraining. Now more targeted approaches to inference-time adaptation are emerging that specifically address human-centered concerns. (J. Zhang et al., 2024ReferenceZhang, J., Elgohary, A., Magooda, A., Khashabi, D., & Durme, B. V. (2024). Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements. https://arxiv.org/abs/2410.08968) introduce Controllable Safety Alignment (CoSA), a framework that enables inference-time adaptation to diverse safety requirements without model retraining. Rather than following a one-size-fits-all approach to safety alignment where models refuse any potentially unsafe content, CoSA allows authorized users to modify safety configurations at inference time through natural language descriptions of desired safety behaviors.
Despite these promising directions, there remain open questions about whether increased inference compute might undermine certain human-centered objectives. OpenAI’s emphasis on chain-of-thought (CoT) in their o-series of models, for example, underscores the prevailing focus on inference-time reasoning strategies (OpenAI, 2024ReferenceOpenAI. (2024). Learning To Reason With LLMs. https://openai.com/index/learning-to-reason-with-llms/). However, Shaikh et al. (2023)ReferenceShaikh, O., Zhang, H., Held, W., Bernstein, M., & Yang, D. (2023). On Second Thought, Let`s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 4454–4470). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.244 find that explicitly prompting models to “think step by step” can inadvertently increase harmful biases and toxic outputs, observing an 8.8% rise in biased responses and a 19.4% increase in toxicity across relevant benchmarks. Their study suggests that while inference-time reasoning holds promise for improved performance or controllable safety alignment, it may also expose underlying biases.