Learning from Human Preferences

In recent years, fine-tuning LLMs on human preferences has proven remarkably successful at improving their behavior. Previously, training on more samples and increasing model size were believed to be sufficient to improve performance. However, researchers found that these scaling rules ignored alignment, defined by Askell et al. (2021) as helpfulness, harmlessness, and honesty in model responses. Incorporating human feedback directly into the training process was then found to yield large gains in alignment with human preferences (Askell et al., 2021; Bai et al., 2022; Leike et al., 2018). In this section we discuss these developments roughly chronologically, starting with reinforcement learning from human feedback (RLHF), then its spinoffs, including direct preference optimization (DPO), and finally recent frameworks like Constitutional AI, which aim to move toward self-supervised AI alignment.

RL-Based Methods

Much of RLHF builds upon landmark research by Christiano et al. (2017), Stiennon et al. (2020), and Ouyang et al. (2022). Together, these works demonstrated the feasibility of learning a reward function from human preferences and then optimizing against it: first in simple simulated robotics and Atari games (Christiano et al., 2017), then in improving LLM performance on summarization (Stiennon et al., 2020), and finally in improving LLM behavior across a wide breadth of tasks, including open-ended generation, chat, and question answering (Ouyang et al., 2022).
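
To make the idea of "learning a reward function from human preferences" concrete, the following is a minimal sketch of the pairwise (Bradley-Terry style) preference loss commonly used to train reward models. The function names and the scalar rewards are illustrative assumptions; in practice the rewards would be produced by a neural network scoring each response.

```python
import math


def preference_probability(reward_a, reward_b):
    """Modelled probability that response A is preferred over response B.

    Under a Bradley-Terry style model this is the sigmoid of the
    difference between the two scalar rewards.
    """
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def preference_loss(reward_preferred, reward_rejected):
    """Negative log-likelihood of one human comparison.

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large otherwise.
    """
    return -math.log(preference_probability(reward_preferred, reward_rejected))
```

Minimizing this loss over many labeled comparisons pushes the reward model to rank responses the way annotators did; the learned reward can then be optimized with an RL algorithm.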

Today, the canonical algorithm used to perform RLHF is proximal policy optimization (PPO) (Schulman et al., 2017), originally introduced as a simpler and more general improvement upon older RL methods like trust region policy optimization (TRPO) (Schulman et al., 2015). However, there is now considerable research exploring alternatives to PPO. To address PPO’s high computational cost and sensitivity to hyperparameter tuning, Ahmadian et al. (2024) break PPO into its component pieces and revisit how human preferences are formulated in RL. They show that discarding aspects that are unnecessarily complex for fine-tuning pre-trained LLMs and returning to the most basic policy gradient algorithm, REINFORCE, yields notable performance and efficiency gains. Other works propose replacing PPO entirely, for example by bringing the process online with online iterative RLHF (Dong et al., 2024) or by scoring sampled responses from different sources and aligning them with human preferences through a ranking loss (RRHF) (Yuan et al., 2023).
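
The contrast between PPO and a basic policy gradient can be sketched for a single sampled response as follows. This is a simplified illustration rather than either paper's implementation: the function names, the scalar log-probabilities, and the default clipping value of 0.2 are assumptions, and a real system would average these terms over batches of tokens or sequences.

```python
import math


def reinforce_term(logprob, reward, baseline=0.0):
    """REINFORCE surrogate for one sample: (reward - baseline) * log-prob.

    Differentiating this with respect to the policy parameters gives the
    basic policy gradient estimate.
    """
    return (reward - baseline) * logprob


def ppo_clipped_term(logprob_new, logprob_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample.

    The probability ratio between the new and old policy is clipped to
    [1 - clip_eps, 1 + clip_eps], and the more pessimistic of the clipped
    and unclipped terms is kept, which limits the size of each update.
    """
    ratio = math.exp(logprob_new - logprob_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

The REINFORCE term needs only a reward and a log-probability, whereas PPO additionally tracks the old policy's log-probabilities and an advantage estimate, which is part of the extra machinery the alternatives above aim to avoid.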

Other works on extending RLHF focus specifically on the data used to align LLMs, whether by improving accessibility through filling gaps in existing datasets or by addressing issues of scale. For example, Okapi (Lai et al., 2023) was introduced as the first system and dataset to focus on RLHF for multiple languages, covering 26 languages. For issues of scale, researchers from Google DeepMind propose reinforced self-training (ReST) (Gulcehre et al., 2023), which takes inspiration from growing batch RL to produce a dataset of samples generated from the policy, which can then be used for offline training.

Beyond PPO and its proposed alternatives, there also exists considerable discussion on other portions of the RLHF pipeline, targeting better alignment through task formulation and dataset augmentation. Safe RLHF, proposed by Dai et al. (2023), explicitly decouples human preferences around helpfulness and harmlessness into two separate optimization objectives, and uses the Lagrangian method to balance trade-offs between the two.
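
The Lagrangian trade-off can be illustrated with a generic sketch of constrained optimization, in which a reward (helpfulness) is maximized subject to a cost (harmfulness) constraint. This is not the authors' implementation: the function names, the constraint convention "expected cost <= 0", and the step size are assumptions made for illustration.

```python
def lagrangian_objective(reward, cost, lam):
    """Scalar objective for the policy to maximize.

    The multiplier lam controls how strongly the cost (harmfulness)
    is penalized relative to the reward (helpfulness).
    """
    return reward - lam * cost


def update_multiplier(lam, cost, step_size=0.1):
    """Projected gradient ascent on the Lagrange multiplier.

    lam grows while the constraint is violated (cost > 0) and shrinks
    toward zero once it is satisfied; it is never allowed below zero.
    """
    return max(0.0, lam + step_size * cost)
```

Alternating policy updates on `lagrangian_objective` with calls to `update_multiplier` lets the penalty weight adapt during training instead of being fixed by hand.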

Non-RL Methods

DPO has gained significant attention as an RLHF alternative because it enables preference tuning without an explicit reward model. It does so by directly including the probability ratio between preferred and dispreferred responses in its loss function (Rafailov et al., 2023). The original authors report that DPO-trained models generate responses that are preferred more frequently than those of PPO-trained models, and that DPO converges faster.
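
As an illustration, the DPO loss for a single preference pair can be sketched as below. The function and argument names are assumptions, as is the default value of beta; the inputs are summed log-probabilities of each response under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each log-ratio measures how much more likely the policy makes a
    response than the reference model does. The loss rewards a larger
    log-ratio for the chosen response than for the rejected one.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    # Note: this simple form can overflow for very negative margins.
    return math.log1p(math.exp(-margin))
```

Because the loss depends only on log-probabilities from the two models, no separate reward model or RL loop is required, which is the property highlighted above.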

Another sample-efficient alternative, which not only avoids RL but also requires fewer than 10 demonstrations, is Demonstration Iterated Task Optimization (DITTO) by Shaikh et al. (2024). This method uses online imitation learning to create pairwise comparisons, treating user demonstrations as the gold standard and the model’s own outputs as dispreferred. DITTO improved win rates by an average of 19% compared to few-shot prompting and supervised fine-tuning on human-centric tasks such as news writing, emails, and blog posts.
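
The pairing step described above can be sketched as follows. This is a hypothetical helper covering only the comparison stated in the text (demonstrations preferred over model samples); the function name and the dictionary keys are assumptions, and the published method may construct further comparisons.

```python
def build_preference_pairs(demonstrations, sampled_outputs):
    """Pair every user demonstration with every model sample.

    Each demonstration is treated as the preferred ("chosen") response
    and each model-generated sample as the dispreferred ("rejected") one.
    """
    return [
        {"chosen": demo, "rejected": sample}
        for demo in demonstrations
        for sample in sampled_outputs
    ]
```

The resulting pairs have the same shape as ordinary preference data, so they could be passed to a pairwise objective without collecting any additional human comparisons.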

Beyond Human Feedback

Despite the success of methods like RLHF and DPO, recent research has sought to address the potential drawbacks of relying only on human-sourced feedback for LLM alignment. One drawback is the incompleteness of human feedback, which may represent only a partial view of collective human values (Kirk et al., 2023). Furthermore, alignment is difficult to specify with explicit objectives (Bommasani et al., 2021; Tamkin et al., 2021). Additionally, scaling high-quality, representative human feedback will become increasingly difficult as LLMs grow larger and more powerful (Casper et al., 2023; Santurkar et al., 2023).

Constitutional AI.

To resolve these intricate issues, recent work has shifted from relying on RLHF alone to using AI assistance alongside human collaboration to supervise other AI systems, with the goal of training helpful and harmless models (Bai, Kadavath, et al., 2022; Bowman et al., 2022; Saunders et al., 2022). As previously mentioned, one issue with alignment is that it is not clearly defined.

Anthropic’s work on Constitutional AI establishes a framework designed to address this exact issue (Bai, Kadavath, et al., 2022). Rather than have humans provide explicit feedback, which may be inherently biased or incomplete, Constitutional AI includes human feedback only through the creation of a set of alignment principles (i.e., a “constitution”). A model then undergoes a training process similar to RLHF, but one in which the rewards are given by an LLM fine-tuned according to the values in the constitution (Bai, Jones, et al., 2022; Ouyang et al., 2022). This approach also uses chain-of-thought reasoning to make the most of the LLM’s self-reasoning capabilities throughout the process (Nye et al., 2021; Wei et al., 2022).

The end goal of Constitutional AI is not to eliminate human involvement or supervision entirely, but rather to involve humans only where it is most necessary, moving towards a self-supervised approach to alignment. Although Constitutional AI helped resolve many lingering issues with RLHF, it also raises new questions for ongoing alignment research. First, how does the global AI research community arrive at a widely accepted constitution that incorporates the pluralistic values of human beings (Hendrycks et al., 2021)? Second, how do we ensure a universal understanding and interpretation of that constitution? Third, how do we make sure there is a robust system for editing and improving the principles and rules as society evolves? Finally, when constitutional guidelines fail in ambiguous situations, how do we ensure that models with minimal human supervision still behave in a safe and useful way?

Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., Üstün, A., & Hooker, S. (2024). Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. In ArXiv preprint: Vol. abs/2402.14740. https://arxiv.org/abs/2402.14740

Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Kernion, J., Ndousse, K., Olsson, C., Amodei, D., Brown, T., Clark, J., … Kaplan, J. (2021). A General Language Assistant as a Laboratory for Alignment. In ArXiv preprint: Vol. abs/2112.00861. https://arxiv.org/abs/2112.00861

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., … Kaplan, J. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. In ArXiv preprint: Vol. abs/2204.05862. https://arxiv.org/abs/2204.05862

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., … Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback. In ArXiv preprint: Vol. abs/2212.08073. https://arxiv.org/abs/2212.08073

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N., Chen, A., Creel, K., Davis, J. Q., Demszky, D., … Liang, P. (2021). On the Opportunities and Risks of Foundation Models. In ArXiv preprint: Vol. abs/2108.07258. https://arxiv.org/abs/2108.07258

Bowman, S. R., Hyun, J., Perez, E., Chen, E., Pettit, C., Heiner, S., Lukošiūtė, K., Askell, A., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Olah, C., Amodei, D., Amodei, D., Drain, D., Li, D., Tran-Johnson, E., … Kaplan, J. (2022). Measuring Progress on Scalable Oversight for Large Language Models. In ArXiv preprint: Vol. abs/2211.03540. https://arxiv.org/abs/2211.03540

Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P., Wang, T., Marks, S., Segerie, C.-R., Carroll, M., Peng, A., Christoffersen, P., Damani, M., Slocum, S., Anwar, U., … Hadfield-Menell, D. (2023). Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. In ArXiv preprint: Vol. abs/2307.15217. https://arxiv.org/abs/2307.15217

Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). Deep Reinforcement Learning from Human Preferences. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA (pp. 4299–4307). https://proceedings.neurips.cc/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html

Dai, J., Pan, X., Sun, R., Ji, J., Xu, X., Liu, M., Wang, Y., & Yang, Y. (2023). Safe RLHF: Safe Reinforcement Learning from Human Feedback. In ArXiv preprint: Vol. abs/2310.12773. https://arxiv.org/abs/2310.12773

Dong, H., Xiong, W., Pang, B., Wang, H., Zhao, H., Zhou, Y., Jiang, N., Sahoo, D., Xiong, C., & Zhang, T. (2024). RLHF Workflow: From Reward Modeling to Online RLHF. In ArXiv preprint: Vol. abs/2405.07863. https://arxiv.org/abs/2405.07863

Gulcehre, C., Paine, T. L., Srinivasan, S., Konyushkova, K., Weerts, L., Sharma, A., Siddhant, A., Ahern, A., Wang, M., Gu, C., Macherey, W., Doucet, A., Firat, O., & de Freitas, N. (2023). Reinforced Self-Training (ReST) for Language Modeling. In ArXiv preprint: Vol. abs/2308.08998. https://arxiv.org/abs/2308.08998

Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., & Steinhardt, J. (2021). Aligning AI With Shared Human Values. 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. https://openreview.net/forum?id=dNy_RKzJacY

Kirk, H., Bean, A., Vidgen, B., Rottger, P., & Hale, S. (2023). The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 2409–2430). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.148

Lai, V., Nguyen, C., Ngo, N., Nguyen, T., Dernoncourt, F., Rossi, R., & Nguyen, T. (2023). Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback. In Y. Feng & E. Lefever (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 318–327). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-demo.28

Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., & Legg, S. (2018). Scalable agent alignment via reward modeling: a research direction. In ArXiv preprint: Vol. abs/1811.07871. https://arxiv.org/abs/1811.07871

Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., Sutton, C., & Odena, A. (2021). Show Your Work: Scratchpads for Intermediate Computation with Language Models. In ArXiv preprint: Vol. abs/2112.00114. https://arxiv.org/abs/2112.00114

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html

Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P., & Hashimoto, T. (2023). Whose Opinions Do Language Models Reflect? In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, & J. Scarlett (Eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (Vol. 202, pp. 29971–30004). PMLR. https://proceedings.mlr.press/v202/santurkar23a.html

Saunders, W., Yeh, C., Wu, J., Bills, S., Ouyang, L., Ward, J., & Leike, J. (2022). Self-critiquing models for assisting human evaluators. In ArXiv preprint: Vol. abs/2206.05802. https://arxiv.org/abs/2206.05802

Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., & Moritz, P. (2015). Trust Region Policy Optimization. In F. R. Bach & D. M. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015 (Vol. 37, pp. 1889–1897). JMLR.org. http://proceedings.mlr.press/v37/schulman15.html

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. In ArXiv preprint: Vol. abs/1707.06347. https://arxiv.org/abs/1707.06347

Shaikh, O., Lam, M., Hejna, J., Shao, Y., Bernstein, M., & Yang, D. (2024). Show, Don’t Tell: Aligning Language Models with Demonstrated Feedback. In ArXiv preprint: Vol. abs/2406.00888. https://arxiv.org/abs/2406.00888

Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., & Christiano, P. (2020). Learning to summarize from human feedback. In ArXiv preprint: Vol. abs/2009.01325. https://arxiv.org/abs/2009.01325

Tamkin, A., Brundage, M., Clark, J., & Ganguli, D. (2021). Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models. In ArXiv preprint: Vol. abs/2102.02503. https://arxiv.org/abs/2102.02503

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html

Yuan, Z., Yuan, H., Tan, C., Wang, W., Huang, S., & Huang, F. (2023). RRHF: Rank Responses to Align Language Models with Human Feedback without tears. In ArXiv preprint: Vol. abs/2304.05302. https://arxiv.org/abs/2304.05302
