Scaling Human Centered LLMs

In NLP, “scaling” refers to the relationship between a model’s performance and factors such as the number of parameters $n$ , dataset size $d$ , and computational resources $c$ (Kaplan et al., 2020). Understanding these scaling laws is important for developing HCLLMs that balance efficiency and performance with accessibility and fairness.

Scaling Laws in LLMs

Kaplan et al. (2020) conducted foundational research on empirical scaling laws for language model performance, particularly focusing on cross-entropy loss. Their work established that model performance improves predictably with increases in $n$ , $d$ , and $c$ , following a power-law relationship. Importantly, they discovered that returns diminish when either $n$ or $d$ is held constant, underscoring the need for a strategic, balanced approach to scaling in NLP.

Tay et al. (2023) expand on @kaplan2020scaling’s work to investigate the effect of scaling properties of different inductive biases and model architectures. They find via extensive experiments that (1) architecture is an important consideration, (2) the best performing model can fluctuate at different scales, and (3) the choice of whether to scale depth (number of layers) or width (more neurons per layer) is important, especially in resource-constrained environments. Models often excel at pretraining but underperform on downstream tasks, underscoring the need to evaluate models based on human utility rather than just raw performance metrics. @hoffmann2022training introduced “Chinchilla scaling,” which showed that smaller models trained on larger datasets achieve better performance per compute budget. This finding is applicable in resource-constrained human-centered applications where compute and data availability may be limited due to ethical or logistical constraints.

Ivgi et al. (2022) investigate the applicability of scaling laws for different NLP tasks and find that benefits vary. Tasks aligned with pretraining objectives, such as question answering, show clearer scaling behavior compared to specialized tasks like sentiment analysis. Thus, for human-centered applications, practitioners should assess whether scaling laws are applicable to their specific use case or if full-scale testing is necessary. While scaling can improve performance for some tasks, it may not always be the most efficient path - a point further developed by Liang et al. (2022), who show that similar or greater improvements to model accuracy can be achieved through more efficient human-centric means, such as training with human feedback.

Scaling in Human-Centered Domains

Although scaling improves the overall performance of LLMs, it does not proportionally improve performance at the same rate for all subpopulations and human-centered knowledge domains. Representational biases in training data (Data Provenance and Data Representation, Bias and Ethics) can lead to disparities in scaling (Rolf et al., 2021). However, data is not the only cause of relative disparities in scaling, and may not even be the principal cause. Held et al. (2025) found that, when holding the data scale constant, model-size scaling is responsible for widening the performance gap between LLM performance on certain varieties of English relative to other varieties. In addition to dialect, the authors investigated AI risk behaviors (Perez et al., 2023) and found that scaling mitigates some risks more than others.

Scaling model size alone cannot address human-domain-specific challenges such as cultural biases, as Bommasani et al. (2021) highlight. The effectiveness of scaling laws varies across different domains, with some human-centered areas experiencing diminishing returns when either the number of parameters ( $n$ ) or dataset size ( $d$ ) is limited. For instance, Brown et al. (2020) observed that large models like GPT-3 show reduced performance gains on human-centered tasks unless they are fine-tuned with context-specific data, emphasizing the need for tailored approaches in these applications.

Gururangan et al. (2020) conducted a study to investigate whether it is still helpful to tailor a pretrained model to the domain of a target task in a world where large-scale models, which form the foundation of today’s NLP landscape, have found much success in a broad-coverage approach trained on a wide variety of sources. Overall they found that multi-phase adaptive pretraining offers large gains in task performance, implying that the quality and treatment of the data $d$ is more important than the quantity, which is relevant when dealing with sensitive or specialized human domains.

Bender et al. (2021) contribute by arguing that increasing $d$ without considering diverse and ethical data sources can lead to biased or non-representative outcomes, negatively impacting human-centered goals such as equitable access and cultural inclusivity. Furthering the discussion on fine-tuning for specific human-centric domains, Zhang et al. (2024) examine how different scaling factors influence the fine-tuning performance of LLMs. The study aligns with Kaplan et al. (2020) and Hoffmann et al. (2022), exhibiting a multiplicative joint scaling law that links fine-tuning performance to model size, fine-tuning data size, and other scaling factors. It indicates that fine-tuning performance improves predictably when scaling both dataset size $d$ and model size $n$ and finds that LLM fine-tuning benefits more from LLM model scaling than pretraining data scaling. The work also highlights that the effectiveness of fine-tuning varies significantly based on the downstream task and the size and quality of the available fine-tuning data, underscoring the need for task-specific approaches in human-centered applications.

Scaling in Human-Centered Goals

Looking at specific human-centered evaluations, such as bias and fairness, there are potentially unexpected results when models are scaled. Ethayarajh & Jurafsky (2020) show that leaderboard-driven scaling can create misaligned incentives in model development. By analyzing examples like the SNLI leaderboard, they demonstrate that focusing solely on state-of-the-art performance discourages practical models. For instance, with SNLI baselines at 78% (n-gram) and 81% (LSTM) versus a 92% BERT-based SOTA, there is little motivation to develop lightweight models with 85% accuracy, which could balance performance and computational efficiency and be more practically useful.

Ganguli et al. (2022) explored model safety through red teaming, where testers try to provoke harmful outputs. Testing models from 2.7B to 52B parameters, they found that RLHF-trained models became harder to ‘break’ as they scaled, reducing the success of harmful attacks (the harmlessness score increased from approximately -0.5 with 2.7B parameters to 0.5 with 52B parameters). In contrast, other models showed no improvement in resisting such outputs with increased size. This study stresses the importance of incorporating human feedback during training to develop safer AI systems. Larger models exhibit heightened privacy risks. Hernandez et al. (2022) demonstrate that an 800M parameter model could be degraded to that of a 400M model by repeating just 0.1% of the training data 100 times, suggesting that larger models aren’t automatically more robust to certain types of data-based attacks (see Consent and Ownership).

In regards to emulating human values, Biedma et al. (2024) showed that as language models get larger, they show an increased preference for the task-oriented values like accuracy and factual consistency at the slight expense of social intelligence or moral fiber and adherence to ethical norms. While larger models may become more capable, their value systems have the potential to become increasingly misaligned with human values.

Transfer learning is often a prerequisite for the application of LLMs in human-centered tasks. Scaling laws suggest that a model’s ability to transfer knowledge improves as its performance increases. This relationship generally holds in human-centered evaluations, with studies showing that transfer learning benefits from scaling in broad tasks (Hernandez et al., 2021; Raffel et al., 2020). However, Hernandez et al. (2021) and Raffel et al. (2020) highlight the importance of domain-specific fine-tuning and alignment techniques to achieve human-centered objectives in specialized or sensitive areas such as legal and medical contexts.

Inference time scaling

Recent advances in inference-time scaling offer pathways to improve HCLLMs without retraining. Now more targeted approaches to inference-time adaptation are emerging that specifically address human-centered concerns. (J. Zhang et al., 2024) introduce Controllable Safety Alignment (CoSA), a framework that enables inference-time adaptation to diverse safety requirements without model retraining. Rather than following a one-size-fits-all approach to safety alignment where models refuse any potentially unsafe content, CoSA allows authorized users to modify safety configurations at inference time through natural language descriptions of desired safety behaviors.

Despite these promising directions, there remain open questions about whether increased inference compute might undermine certain human-centered objectives. OpenAI’s emphasis on chain-of-thought (CoT) in their o-series of models, for example, underscores the prevailing focus on inference-time reasoning strategies (OpenAI, 2024). However, Shaikh et al. (2023) find that explicitly prompting models to “think step by step” can inadvertently increase harmful biases and toxic outputs, observing an 8.8% rise in biased responses and a 19.4% increase in toxicity across relevant benchmarks. Their study suggests that while inference-time reasoning holds promise for improved performance or controllable safety alignment, it may also expose underlying biases.

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3442188.3445922

Biedma, P., Yi, X., Huang, L., Sun, M., & Xie, X. (2024). Beyond Human Norms: Unveiling Unique Values of Large Language Models through Interdisciplinary Approaches. ArXiv Preprint, abs/2404.12744. https://arxiv.org/abs/2404.12744

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N., Chen, A., Creel, K., Davis, J. Q., Demszky, D., … Liang, P. (2021). On the Opportunities and Risks of Foundation Models. In ArXiv preprint: Vol. abs/2108.07258. https://arxiv.org/abs/2108.07258

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.-F. Balcan, & H.-T. Lin (Eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html

Ethayarajh, K., & Jurafsky, D. (2020). Utility is in the Eye of the User: A Critique of NLP Leaderboards. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4846–4853). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.393

Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., & others. (2022). Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv Preprint arXiv:2209.07858.

Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N. A. (2020). Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 8342–8360). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.740

Held, W., Hall, D., Liang, P., & Yang, D. (2025). Relative scaling laws for llms. arXiv Preprint arXiv:2510.24626.

Hernandez, D., Brown, T., Conerly, T., DasSarma, N., Drain, D., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Henighan, T., Hume, T., & others. (2022). Scaling laws and interpretability of learning from repeated data. arXiv Preprint arXiv:2205.10487.

Hernandez, D., Kaplan, J., Henighan, T., & McCandlish, S. (2021). Scaling laws for transfer. ArXiv Preprint, abs/2102.01293. https://arxiv.org/abs/2102.01293

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. de L., Hendricks, L. A., Welbl, J., Clark, A., & others. (2022). Training compute-optimal large language models. ArXiv Preprint, abs/2203.15556. https://arxiv.org/abs/2203.15556

Ivgi, M., Carmon, Y., & Berant, J. (2022). Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments. In Y. Goldberg, Z. Kozareva, & Y. Zhang (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2022 (pp. 7354–7371). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-emnlp.544

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models. ArXiv Preprint, abs/2001.08361. https://arxiv.org/abs/2001.08361

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., & others. (2022). Holistic evaluation of language models. ArXiv Preprint, abs/2211.09110. https://arxiv.org/abs/2211.09110

OpenAI. (2024). Learning To Reason With LLMs. https://openai.com/index/learning-to-reason-with-llms/

Perez, E., Ringer, S., Lukosiute, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., & others. (2023). Discovering language model behaviors with model-written evaluations. Findings of the Association for Computational Linguistics: ACL 2023, 13387–13434.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res., 21, 140:1-140:67. http://jmlr.org/papers/v21/20-074.html

Rolf, E., Worledge, T. T., Recht, B., & Jordan, M. (2021). Representation matters: Assessing the importance of subgroup allocations in training data. International Conference on Machine Learning, 9040–9051.

Shaikh, O., Zhang, H., Held, W., Bernstein, M., & Yang, D. (2023). On Second Thought, Let`s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 4454–4470). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.244

Tay, Y., Dehghani, M., Abnar, S., Chung, H., Fedus, W., Rao, J., Narang, S., Tran, V., Yogatama, D., & Metzler, D. (2023). Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? In H. Bouamor, J. Pino, & K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 12342–12364). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-emnlp.825

Zhang, B., Liu, Z., Cherry, C., & Firat, O. (2024). When scaling meets llm finetuning: The effect of data, model and finetuning method. ArXiv Preprint, abs/2402.17193. https://arxiv.org/abs/2402.17193

Zhang, J., Elgohary, A., Magooda, A., Khashabi, D., & Durme, B. V. (2024). Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements. https://arxiv.org/abs/2410.08968

Scaling Human Centered LLMs

Scaling Laws in LLMs

Scaling in Human-Centered Domains

Scaling in Human-Centered Goals

Inference time scaling

Graph View

Table of Contents

Backlinks

Literature NotesLiterature NotesLiterature NotesLiterature NotesLiterature NotesLiterature Notes

Scaling Laws in LLMs

Scaling in Human-Centered Domains

Scaling in Human-Centered Goals

Inference time scaling

Graph View

Table of Contents

Backlinks