In recent years, fine-tuning LLMs on human preferences has seen remarkable success in improving their behavior. Previously, it was believed that training on more samples and increasing model size was sufficient to increase performance. However, researchers found that these scaling rules ignored alignment, defined by Askell et al. (2021) as helpfulness, harmlessness, and honesty in model responses. To address this, researchers incorporated human feedback directly into the training process, which produced large gains in alignment with human preferences (Askell et al., 2021; Bai, Jones, et al., 2022; Leike et al., 2018). In this section we discuss these developments roughly chronologically, starting with reinforcement learning from human feedback (RLHF), then its spinoffs, including direct preference optimization (DPO), and finally recent frameworks like Constitutional AI, which aim to move toward self-supervised AI alignment.
RL-Based Methods
Much of RLHF is built upon landmark research by Christiano et al. (2017), Stiennon et al. (2020), and Ouyang et al. (2022). Together, these works demonstrated the feasibility of learning a reward function from human preferences and optimizing against it, first in simple robotics and Atari video games (Christiano et al., 2017), then in improving LLM performance on summarization tasks (Stiennon et al., 2020), and finally in improving LLM behavior on a wide breadth of tasks, including open-ended generation, chat, and question answering (Ouyang et al., 2022).
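As a minimal sketch of what learning a reward function from human preferences can look like, the snippet below uses a Bradley-Terry style pairwise loss, a formulation commonly used for reward modeling in this line of work. The function name `preference_loss` and the scalar reward inputs are illustrative assumptions, not code from the cited papers.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Assumes a Bradley-Terry model: P(chosen > rejected) = sigmoid(r_c - r_r).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the reward model scores the human-preferred response higher than the rejected one, which is the signal used to fit the reward function before any policy optimization takes place.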
Today, the canonical algorithm used to perform RLHF is proximal policy optimization (PPO) (Schulman et al., 2017), originally introduced as a simpler and more general improvement upon older RL methods like trust region policy optimization (TRPO) (Schulman et al., 2015). However, there is now considerable research exploring alternatives to PPO. To address PPO’s high computational cost and sensitivity to hyperparameter tuning, Ahmadian et al. (2024) break PPO into its component pieces and show that revisiting the formulation of human preferences in RL, discarding aspects that are unnecessarily complex for fine-tuning pre-trained LLMs, and returning to the most basic policy gradient algorithm (REINFORCE) yields notable performance and efficiency gains. Other works propose replacing PPO entirely, for example by moving to online iterative RLHF (Dong et al., 2024) or by scoring sampled responses from different sources and aligning these scores with human preferences through a ranking loss (RRHF) (Yuan et al., 2023).
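For reference, the core of PPO is its clipped surrogate objective. The sketch below evaluates that objective for a single sample; the function name `ppo_clipped_term` is hypothetical, and the default `clip_eps=0.2` is an assumed, commonly used value rather than a setting taken from any RLHF system discussed here.

```python
def ppo_clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate (to be maximized).

    ratio     -- pi_new(a|s) / pi_old(a|s) for the sampled action
    advantage -- estimated advantage of that action
    """
    # Restrict the probability ratio to [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective a pessimistic bound on the
    # unclipped term, removing the incentive for overly large policy updates.
    return min(ratio * advantage, clipped_ratio * advantage)
```

A plain policy-gradient method, by contrast, uses no such clipping, which is part of why it has fewer moving parts to tune.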
Other works on extending RLHF focus specifically on the data that goes into aligning LLMs, whether by improving accessibility through filling gaps in existing datasets or by addressing issues of scale. For example, Okapi (Lai et al., 2023) is introduced as the first system and dataset to focus on RLHF for multiple languages, covering 26 different languages. For issues of scale, researchers from Google DeepMind propose reinforced self-training (ReST) (Gulcehre et al., 2023), which takes inspiration from growing batch RL to produce a dataset of samples generated from the policy, which can then be used for offline training.
Beyond PPO and its proposed alternatives, there is also considerable discussion of other portions of the RLHF pipeline, targeting better alignment through task formulation and dataset augmentation. Safe RLHF, proposed by Dai et al. (2023), explicitly decouples human preferences around helpfulness and harmlessness into two separate optimization objectives, and uses the Lagrangian method to balance trade-offs between the two.
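To illustrate how a Lagrangian method can balance two such objectives, the block below writes the trade-off in generic constrained-optimization form. The symbols are assumptions introduced here for illustration: J_R is the expected helpfulness reward of the policy with parameters θ, J_C is its expected harmfulness cost, and λ is the Lagrange multiplier. The exact objective in Dai et al. (2023) may differ in detail.

```latex
% Constrained problem: maximize helpfulness subject to a harmlessness constraint.
\max_{\theta} \; J_R(\theta) \quad \text{subject to} \quad J_C(\theta) \le 0

% Lagrangian relaxation: the multiplier \lambda \ge 0 grows when the constraint
% is violated and shrinks toward zero when it is satisfied.
\min_{\theta} \; \max_{\lambda \ge 0} \; \bigl[ -J_R(\theta) + \lambda \, J_C(\theta) \bigr]
```

Under this reading, λ is a learned weight rather than a fixed hyperparameter, so the balance between the two objectives adapts during training.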
Non-RL Methods
DPO has gained significant attention as an RLHF alternative because it enables preference tuning without an explicit reward model. It does so by directly including in its loss function the probability ratio between preferred and dispreferred responses, measured relative to a reference model (Rafailov et al., 2023). The original authors report that, in their experiments, DPO-trained models generate responses that are preferred more frequently than those of PPO-trained models, and that DPO converges faster.
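The DPO objective for a single preference pair can be sketched as follows. The inputs are sequence-level log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the policy being trained and under a frozen reference model. The function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses."""
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # Implicit reward margin, scaled by beta.
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), computed in a numerically stable way.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

No separate reward model appears anywhere in the computation: the log-ratios themselves play the role of the reward, which is what removes the RL step from the pipeline.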
Another sample-efficient alternative, which not only avoids RL but also requires fewer than 10 demonstrations, is Demonstration Iterated Task Optimization (DITTO) (Shaikh et al., 2024). This method uses online imitation learning to create pairwise comparisons, treating user demonstrations as the gold standard and the model’s own outputs as dispreferred. DITTO improved win rates by an average of 19% compared to few-shot prompting and supervised fine-tuning on various human-centric tasks like news writing, emails, and blog posts.
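The way demonstrations can be turned into pairwise comparisons is sketched below. This covers only the data-construction idea described above, not the full DITTO procedure; `build_comparisons`, `sample_fn`, and `samples_per_demo` are hypothetical names, and `sample_fn` stands in for sampling from the current model.

```python
from typing import Callable, List, Tuple

# (prompt, preferred response, dispreferred response)
Comparison = Tuple[str, str, str]


def build_comparisons(
    prompt: str,
    demonstrations: List[str],
    sample_fn: Callable[[str], str],
    samples_per_demo: int = 2,
) -> List[Comparison]:
    """Pair each user demonstration with fresh model samples.

    The demonstration is always treated as preferred and the model's
    own output as dispreferred.
    """
    pairs: List[Comparison] = []
    for demo in demonstrations:
        for _ in range(samples_per_demo):
            pairs.append((prompt, demo, sample_fn(prompt)))
    return pairs
```

The resulting triples have the same shape as ordinary preference data, so they could be fed to a pairwise objective such as the one used by DPO.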
Beyond Human Feedback
Despite the success of methods like RLHF and DPO, recent research has sought to address the potential drawbacks of using only human-sourced feedback for LLM alignment. One drawback is the incompleteness of human feedback, which may represent only a partial view of collective human values (Kirk et al., 2023). Furthermore, alignment is difficult to specify with explicit objectives (Bommasani et al., 2021; Tamkin et al., 2021). Additionally, scaling high-quality, representative human feedback will become increasingly difficult with larger and more powerful LLMs (Casper et al., 2023; Santurkar et al., 2023).
Constitutional AI.
To resolve these intricate issues, recent work has shifted from relying on RLHF alone to enlisting AI assistance, together with human collaboration, to supervise other AIs and train helpful and harmless AI systems (Bai, Kadavath, et al., 2022; Bowman et al., 2022; Saunders et al., 2022). As previously mentioned, one issue with alignment is that it is not clearly defined.
Anthropic’s work on Constitutional AI establishes a framework designed to address this exact issue (Bai, Kadavath, et al., 2022). Rather than have humans provide explicit feedback, which may be inherently biased or incomplete, Constitutional AI includes human feedback only through the creation of a set of alignment principles (i.e., a “constitution”). Then, a model undergoes a training process similar to RLHF, but one in which the rewards are given by an LLM fine-tuned according to the values in the constitution (Bai, Jones, et al., 2022; Ouyang et al., 2022). This approach also uses chain-of-thought reasoning to make the most of the LLM’s reasoning capabilities throughout the entire process (Nye et al., 2021; Wei et al., 2022).
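The idea of replacing human preference labels with labels from an AI judge guided by a constitution can be sketched as follows. This is a toy illustration under stated assumptions, not Anthropic's implementation: `label_with_ai_feedback` is a hypothetical helper, and `judge` is a placeholder for a call to a feedback LLM that returns "a" or "b".

```python
import random
from typing import Callable, List, Tuple

# judge(principle, prompt, response_a, response_b) -> "a" or "b"
Judge = Callable[[str, str, str, str], str]


def label_with_ai_feedback(
    prompt: str,
    response_a: str,
    response_b: str,
    constitution: List[str],
    judge: Judge,
    rng: random.Random,
) -> Tuple[str, str, str]:
    """Return (prompt, preferred, dispreferred) as chosen by an AI judge."""
    # Sample one principle from the constitution to guide this comparison.
    principle = rng.choice(constitution)
    choice = judge(principle, prompt, response_a, response_b)
    if choice == "a":
        return (prompt, response_a, response_b)
    return (prompt, response_b, response_a)
```

Triples produced this way have the same form as human preference data, so the rest of an RLHF-style pipeline could, in principle, stay unchanged while the human labeling step is removed.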
The end goal of Constitutional AI is not to get rid of human involvement or supervision entirely, but rather to involve humans in only the most necessary aspects, moving towards a self-supervised approach to alignment. Although Constitutional AI helped resolve many lingering issues with RLHF, it also raises new questions for ongoing alignment research. First, how does the global AI research community arrive at a widely accepted constitution that incorporates the pluralistic values of human beings? Second, how do we ensure a universal understanding and interpretation of such a constitution? Third, how do we make sure there is a robust system for editing and improving the principles and rules as society evolves? Finally, when constitutional guidelines fail in ambiguous situations, how do we ensure that models operating with minimal human supervision still behave in a safe and useful way?