diff --git a/chapters/01-introduction.md b/chapters/01-introduction.md
index 3eccde3..eca5a8a 100644
--- a/chapters/01-introduction.md
+++ b/chapters/01-introduction.md
@@ -12,7 +12,7 @@ Finally, the language model can be optimized with a RL optimizer of choice, by s
 
 This book details key decisions and basic implementation examples for each step in this process.
 RLHF has been applied to many domains successfully, with complexity increasing as the techniques have matured.
-Early breakthrough experiments with RLHF were applied to deep reinforcement learning [@christiano2017deep], summarization [@stiennon2020learning], follow instructions [@ouyang2022training], parse web information for question answering [@nakano2021webgpt], and ``alignment'' [@bai2022training].
+Early breakthrough experiments with RLHF were applied to deep reinforcement learning [@christiano2017deep], summarization [@stiennon2020learning], following instructions [@ouyang2022training], parsing web information for question answering [@nakano2021webgpt], and "alignment" [@bai2022training].
 
 ## Scope of This Book
 
@@ -79,7 +79,7 @@ He has written extensively on RLHF, including [many blog posts](https://www.inte
 With the investment in language modeling, many variations on the traditional RLHF methods emerged.
 RLHF colloquially has become synonymous with multiple overlapping approaches.
 RLHF is a subset of preference fine-tuning (PreFT) techniques, including Direct Alignment Algorithms (See Chapter 12).
-RLHF is the tool most associated with rapid progress in ``post-training'' of language models, which encompasses all training after the large-scale autoregressive training on primarily web data.
+RLHF is the tool most associated with rapid progress in "post-training" of language models, which encompasses all training after the large-scale autoregressive training on primarily web data.
 This textbook is a broad overview of RLHF and its directly neighboring methods, such as instruction tuning and other implementation details needed to set up a model for RLHF training.
 As more successes of fine-tuning language models with RL emerge, such as OpenAI's o1 reasoning models, RLHF will be seen as the bridge that enabled further investment of RL methods for fine-tuning large base models.
 
diff --git a/chapters/02-preferences.md b/chapters/02-preferences.md
index a923ef4..be71b75 100644
--- a/chapters/02-preferences.md
+++ b/chapters/02-preferences.md
@@ -1,6 +1,35 @@
 # [Incomplete] Human Preferences for RLHF
 
-## Questioning the Ability of Preferences
+At its core, reinforcement learning from human feedback, also referred to as reinforcement learning from human preferences in early literature, is designed to optimize machine learning models in domains where directly specifying a reward function is hard.
+The motivation for using humans as the reward signal is to obtain an indirect metric for the target reward.
 
 
-TODO [@lambert2023entangled].
\ No newline at end of file
+The use of human-labeled feedback data draws on the history of many fields.
+Using human data alone is a well-studied problem, but in the context of RLHF it sits at the intersection of multiple long-standing fields of study [@lambert2023entangled].
+
+As an approximation, modern RLHF is the convergence of three areas of development:
+
+1. Philosophy, psychology, economics, decision theory, and the nature of human preferences;
+2. Optimal control, reinforcement learning, and maximizing utility; and
+3. Modern deep learning systems.
+
+Together, each of these areas brings specific assumptions about what a preference is and how it can be optimized, which dictate the motivations and design of RLHF problems.
+
+## The Origins of Reward Models: Costs vs. Rewards vs. Preferences
+
+### Specifying objectives: from logic of utility to reward functions
+
+### Implementing optimal utility
+
+### Steering preferences
+
+### Value alignment's role in RLHF
+
+## From Design to Implementation
+
+Many of the principles discussed earlier in this chapter are further specified in the process of implementing the modern RLHF stack, which in turn adjusts the meaning of RLHF.
+
+## Limitations of RLHF
+
+The specifics of obtaining data for RLHF are discussed further in Chapter 6.
+For an extended version of this chapter, see [@lambert2023entangled].
\ No newline at end of file
diff --git a/chapters/04-related-works.md b/chapters/04-related-works.md
index 2890232..6fc6010 100644
--- a/chapters/04-related-works.md
+++ b/chapters/04-related-works.md
@@ -1,18 +1,44 @@
-# [Incomplete] Key Related Works
+# Key Related Works
 
 In this chapter we detail the key papers and projects that got the RLHF field to where it is today.
 This is not intended to be a comprehensive review on RLHF and the related fields, but rather a starting point and retelling of how we got to today.
+It is intentionally focused on recent work that led to ChatGPT.
+There is substantial further work in the RL literature on learning from preferences [@wirth2017survey].
+For a more exhaustive list, you should consult a proper survey paper [@kaufmann2023survey], [@casper2023open].
 
-## Early RL on Preferences
+## Origins to 2018: RL on Preferences
 
-Christriano et al etc
+The field was recently popularized with the growth of deep reinforcement learning and has since grown into a broader study of the applications of LLMs at many large technology companies.
+Still, many of the techniques used today are deeply related to core techniques from early literature on RL from preferences.
 
-## RLHP on Language Models
+*TAMER: Training an Agent Manually via Evaluative Reinforcement* proposed a learned agent where humans iteratively provided scores on the actions taken in order to learn a reward model [@knox2008tamer]. Other work, concurrent or soon after, proposed an actor-critic algorithm, COACH, where human feedback (both positive and negative) is used to tune the advantage function [@macglashan2017interactive].
 
-Learning to summarize, first work on language models (zieglar et al)
+The primary reference, Christiano et al. 2017, is an application of RLHF to preferences between Atari trajectories [@christiano2017deep]. The work shows that humans choosing between trajectories can be more effective in some domains than directly interacting with the environment. This relies on some clever experimental conditions, but is impressive nonetheless.
+TAMER was adapted to deep learning with Deep TAMER just one year later [@warnell2018deep].
 
-## Pre Modern Models
+This era began to transition as reward models, as a general notion, were proposed as a method for studying alignment rather than just as a tool for solving RL problems [@leike2018scalable].
 
-InstructGPT, WebgGPT, Sparrow, Etc
+## 2019 to 2022: RL from Human Preferences on Language Models
 
-## ChatGPT
+Reinforcement learning from human feedback, also regularly referred to as reinforcement learning from human preferences in its early days, was quickly adopted by AI labs increasingly turning to scaling large language models.
+A large portion of this work began between GPT-2, in 2019, and GPT-3, in 2020.
+The earliest work in 2019, *Fine-Tuning Language Models from Human Preferences*, has many striking similarities to modern work on RLHF [@ziegler2019fine]. It already used learned reward models, KL distances, feedback diagrams, etc. -- only the evaluation tasks and model capabilities were different.
+From here, RLHF was applied to a variety of tasks.
+The popular applications were the ones that worked at the time.
+Important examples include general summarization [@stiennon2020learning], recursive summarization of books [@wu2021recursively], instruction following (InstructGPT) [@ouyang2022training], browser-assisted question-answering (WebGPT) [@nakano2021webgpt], supporting answers with citations (GopherCite) [@menick2022teaching], and general dialogue (Sparrow) [@glaese2022improving].
+
+Aside from applications, a number of seminal papers defined key areas for the future of RLHF, including those on:
+
+1. Reward model over-optimization [@gao2023scaling]: the tendency of RL optimizers to over-fit to reward models trained on preference data;
+2. Language models as a general area of study for alignment [@askell2021general]; and
+3. Red teaming [@ganguli2022red] -- the process of assessing the safety of a language model.
+
+Work continued on refining RLHF for application to chat models.
+Anthropic continued to use it extensively for early versions of Claude [@bai2022training], and early open-source RLHF tools emerged [@ramamurthy2022reinforcement], [@havrilla-etal-2023-trlx], [@vonwerra2022trl].
+
+## 2023 to Present: ChatGPT Era
+
+Since OpenAI launched ChatGPT [@openai2022chatgpt], RLHF has been used extensively in leading language models.
+It is well known to be used in Anthropic's Constitutional AI for Claude [@bai2022constitutional], Meta's Llama 2 [@touvron2023llama] and Llama 3 [@dubey2024llama], Nvidia's Nemotron [@adler2024nemotron], and more.
+
+Today, RLHF is growing into a broader field of preference fine-tuning (PreFT), including new applications such as process rewards for intermediate reasoning steps [@lightman2023let], direct alignment algorithms inspired by Direct Preference Optimization (DPO) [@rafailov2024direct], learning from execution feedback on code or math [@kumar2024training], [@singh2023beyond], and other online reasoning methods inspired by OpenAI's o1 [@openai2024o1].
diff --git a/chapters/08-regularization.md b/chapters/08-regularization.md
index d70e423..05fbdab 100644
--- a/chapters/08-regularization.md
+++ b/chapters/08-regularization.md
@@ -26,12 +26,14 @@ Recall that KL distance is defined as follows:
 
 $$ D_{KL}(P || Q) = \sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right) $$
 
+In RLHF, the two distributions of interest are often the distribution of the new model version, say $P(x)$, and a distribution of the reference policy, say $Q(x)$.
+
 ### Reference Model to Generations
 
-The most common implementation of KL penalities are by comparing the distance between the generated tokens during training to a static reference model.
+The most common implementation of KL penalties measures the distance between the generated tokens during training and a static reference model.
 The intuition is that the model you're training from has a style that you would like to stay close to.
 This reference model is most often the instruction tuned model, but can also be a previous RL checkpoint.
-With simple substitution, the model we are sampling from becomes $P(x)^{\text{RL}}$ and $P(x)^{\text{Ref.}}, shown above in @eq:kl_standard.
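+
+As a minimal illustrative sketch (the PyTorch usage and variable names below are assumptions for illustration, not the book's implementation example), the per-token penalty can be estimated directly from the log-probabilities of the sampled tokens under both models:
+
+```python
+import torch
+
+# Log-probabilities of the sampled tokens under the policy being trained and
+# under the frozen reference model, shape (batch_size, sequence_length).
+# Random values stand in for numbers gathered from real model outputs.
+logprobs_policy = torch.randn(2, 8)
+logprobs_ref = torch.randn(2, 8)
+
+# Sampled estimate of the per-token KL distance between policy and reference.
+per_token_kl = logprobs_policy - logprobs_ref
+
+# Scale by a coefficient and subtract from the reward during RL training.
+beta = 0.1
+kl_penalty = beta * per_token_kl
+sequence_penalty = kl_penalty.sum(dim=-1)  # one penalty value per sequence
+```
+
+This difference of log-probabilities is a simple sampled estimate of the KL distance defined above; see [@schulman2016klapprox] for more on such approximations.
+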
+With simple substitution, the model we are sampling from becomes $P^{\text{RL}}(x)$ and $P^{\text{Ref.}}(x)$, shown above in @eq:kl_standard. Such KL distance was first applied to dialogue agents well before the popularity of large language models [@jaques2017sequence], yet KL control was quickly established as a core technique for fine-tuning pretrained models [@jaques2020human]. ### Implementation Example diff --git a/chapters/bib.bib b/chapters/bib.bib index fbcac31..ebe1ebe 100644 --- a/chapters/bib.bib +++ b/chapters/bib.bib @@ -6,6 +6,15 @@ @article{lambert2023entangled year={2023} } +@article{wirth2017survey, + title={A survey of preference-based reinforcement learning methods}, + author={Wirth, Christian and Akrour, Riad and Neumann, Gerhard and F{\"u}rnkranz, Johannes}, + journal={Journal of Machine Learning Research}, + volume={18}, + number={136}, + pages={1--46}, + year={2017} +} ################################################################################################ # AI General #################################################################### @@ -18,7 +27,37 @@ @book{russell2016artificial ################################################################################################ - +# RL related lit +@inproceedings{knox2008tamer, + title={Tamer: Training an agent manually via evaluative reinforcement}, + author={Knox, W Bradley and Stone, Peter}, + booktitle={2008 7th IEEE international conference on development and learning}, + pages={292--297}, + year={2008}, + organization={IEEE} +} +@inproceedings{macglashan2017interactive, + title={Interactive learning from policy-dependent human feedback}, + author={MacGlashan, James and Ho, Mark K and Loftin, Robert and Peng, Bei and Wang, Guan and Roberts, David L and Taylor, Matthew E and Littman, Michael L}, + booktitle={International conference on machine learning}, + pages={2285--2294}, + year={2017}, + organization={PMLR} +} +@inproceedings{warnell2018deep, + title={Deep tamer: Interactive agent shaping in high-dimensional state spaces}, + author={Warnell, Garrett and Waytowich, Nicholas and Lawhern, Vernon and Stone, Peter}, + booktitle={Proceedings of the AAAI conference on artificial intelligence}, + volume={32}, + number={1}, + year={2018} +} +@article{kaufmann2023survey, + title={A survey of reinforcement learning from human feedback}, + author={Kaufmann, Timo and Weng, Paul and Bengs, Viktor and H{\"u}llermeier, Eyke}, + journal={arXiv preprint arXiv:2312.14925}, + year={2023} +} # RLHF Methods #################################################################### @article{gilks1992adaptive, title={Adaptive rejection sampling for Gibbs sampling}, @@ -48,6 +87,34 @@ @inproceedings{jaques2017sequence year={2017}, organization={PMLR} } +@inproceedings{havrilla-etal-2023-trlx, + title = "trl{X}: A Framework for Large Scale Reinforcement Learning from Human Feedback", + author = "Havrilla, Alexander and + Zhuravinskyi, Maksym and + Phung, Duy and + Tiwari, Aman and + Tow, Jonathan and + Biderman, Stella and + Anthony, Quentin and + Castricato, Louis", + booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing", + month = dec, + year = "2023", + address = "Singapore", + publisher = "Association for Computational Linguistics", + url = "https://aclanthology.org/2023.emnlp-main.530", + doi = "10.18653/v1/2023.emnlp-main.530", + pages = "8578--8595", +} +@misc{vonwerra2022trl, + author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan 
Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec}, + title = {TRL: Transformer Reinforcement Learning}, + year = {2020}, + publisher = {GitHub}, + journal = {GitHub repository}, + howpublished = {\url{https://github.com/huggingface/trl}} +} + # RLHF Core #################################################################### @article{christiano2017deep, title={Deep reinforcement learning from human preferences}, @@ -56,6 +123,18 @@ @article{christiano2017deep volume={30}, year={2017} } +@article{leike2018scalable, + title={Scalable agent alignment via reward modeling: a research direction}, + author={Leike, Jan and Krueger, David and Everitt, Tom and Martic, Miljan and Maini, Vishal and Legg, Shane}, + journal={arXiv preprint arXiv:1811.07871}, + year={2018} +} +@article{ziegler2019fine, + title={Fine-tuning language models from human preferences}, + author={Ziegler, Daniel M and Stiennon, Nisan and Wu, Jeffrey and Brown, Tom B and Radford, Alec and Amodei, Dario and Christiano, Paul and Irving, Geoffrey}, + journal={arXiv preprint arXiv:1909.08593}, + year={2019} +} @article{stiennon2020learning, title={Learning to summarize with human feedback}, author={Stiennon, Nisan and Ouyang, Long and Wu, Jeffrey and Ziegler, Daniel and Lowe, Ryan and Voss, Chelsea and Radford, Alec and Amodei, Dario and Christiano, Paul F}, @@ -64,6 +143,13 @@ @article{stiennon2020learning pages={3008--3021}, year={2020} } +@article{wu2021recursively, + title={Recursively summarizing books with human feedback}, + author={Wu, Jeff and Ouyang, Long and Ziegler, Daniel M and Stiennon, Nisan and Lowe, Ryan and Leike, Jan and Christiano, Paul}, + journal={arXiv preprint arXiv:2109.10862}, + year={2021} +} + @article{askell2021general, title={A general language assistant as a laboratory for alignment}, @@ -86,13 +172,42 @@ @article{ouyang2022training pages={27730--27744}, year={2022} } - +@article{askell2021general, + title={A general language assistant as a laboratory for alignment}, + author={Askell, Amanda and Bai, Yuntao and Chen, Anna and Drain, Dawn and Ganguli, Deep and Henighan, Tom and Jones, Andy and Joseph, Nicholas and Mann, Ben and DasSarma, Nova and others}, + journal={arXiv preprint arXiv:2112.00861}, + year={2021} +} @article{bai2022training, title={Training a helpful and harmless assistant with reinforcement learning from human feedback}, author={Bai, Yuntao and Jones, Andy and Ndousse, Kamal and Askell, Amanda and Chen, Anna and DasSarma, Nova and Drain, Dawn and Fort, Stanislav and Ganguli, Deep and Henighan, Tom and others}, journal={arXiv preprint arXiv:2204.05862}, year={2022} } +@article{ganguli2022red, + title={Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned}, + author={Ganguli, Deep and Lovitt, Liane and Kernion, Jackson and Askell, Amanda and Bai, Yuntao and Kadavath, Saurav and Mann, Ben and Perez, Ethan and Schiefer, Nicholas and Ndousse, Kamal and others}, + journal={arXiv preprint arXiv:2209.07858}, + year={2022} +} +@article{glaese2022improving, + title={Improving alignment of dialogue agents via targeted human judgements}, + author={Glaese, Amelia and McAleese, Nat and Tr{\k{e}}bacz, Maja and Aslanides, John and Firoiu, Vlad and Ewalds, Timo and Rauh, Maribeth and Weidinger, Laura and Chadwick, Martin and Thacker, Phoebe and others}, + journal={arXiv preprint arXiv:2209.14375}, + year={2022} +} +@article{menick2022teaching, + title={Teaching language models to support answers with verified quotes}, + 
author={Menick, Jacob and Trebacz, Maja and Mikulik, Vladimir and Aslanides, John and Song, Francis and Chadwick, Martin and Glaese, Mia and Young, Susannah and Campbell-Gillingham, Lucy and Irving, Geoffrey and others}, + journal={arXiv preprint arXiv:2203.11147}, + year={2022} +} +@article{bai2022constitutional, + title={Constitutional ai: Harmlessness from ai feedback}, + author={Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and others}, + journal={arXiv preprint arXiv:2212.08073}, + year={2022} +} @article{lightman2023let, title={Let's verify step by step}, @@ -107,6 +222,14 @@ @article{touvron2023llama journal={arXiv preprint arXiv:2307.09288}, year={2023} } +@inproceedings{gao2023scaling, + title={Scaling laws for reward model overoptimization}, + author={Gao, Leo and Schulman, John and Hilton, Jacob}, + booktitle={International Conference on Machine Learning}, + pages={10835--10866}, + year={2023}, + organization={PMLR} +} @article{adler2024nemotron, title={Nemotron-4 340B Technical Report}, @@ -114,7 +237,19 @@ @article{adler2024nemotron journal={arXiv preprint arXiv:2406.11704}, year={2024} } - +@article{dubey2024llama, + title={The llama 3 herd of models}, + author={Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Yang, Amy and Fan, Angela and others}, + journal={arXiv preprint arXiv:2407.21783}, + year={2024} +} +@article{rafailov2024direct, + title={Direct preference optimization: Your language model is secretly a reward model}, + author={Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D and Ermon, Stefano and Finn, Chelsea}, + journal={Advances in Neural Information Processing Systems}, + volume={36}, + year={2024} +} # RLHF More ######################################################################## @article{pang2024iterative, title={Iterative reasoning preference optimization}, @@ -122,6 +257,18 @@ @article{pang2024iterative journal={arXiv preprint arXiv:2404.19733}, year={2024} } +@article{cohen2022dynamic, + title={Dynamic planning in open-ended dialogue using reinforcement learning}, + author={Cohen, Deborah and Ryu, Moonkyung and Chow, Yinlam and Keller, Orgad and Greenberg, Ido and Hassidim, Avinatan and Fink, Michael and Matias, Yossi and Szpektor, Idan and Boutilier, Craig and others}, + journal={arXiv preprint arXiv:2208.02294}, + year={2022} +} +@article{ramamurthy2022reinforcement, + title={Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization}, + author={Ramamurthy, Rajkumar and Ammanabrolu, Prithviraj and Brantley, Kiant{\'e} and Hessel, Jack and Sifa, Rafet and Bauckhage, Christian and Hajishirzi, Hannaneh and Choi, Yejin}, + journal={arXiv preprint arXiv:2210.01241}, + year={2022} +} @article{gao2024rebel, title={Rebel: Reinforcement learning via regressing relative rewards}, @@ -129,7 +276,32 @@ @article{gao2024rebel journal={arXiv preprint arXiv:2404.16767}, year={2024} } - +@article{casper2023open, + title={Open problems and fundamental limitations of reinforcement learning from human feedback}, + author={Casper, Stephen and Davies, Xander and Shi, Claudia and Gilbert, Thomas Krendl and Scheurer, J{\'e}r{\'e}my and Rando, Javier and Freedman, Rachel and Korbak, Tomasz and Lindner, David 
and Freire, Pedro and others},
+  journal={arXiv preprint arXiv:2307.15217},
+  year={2023}
+}
+@article{kumar2024training,
+  title={Training language models to self-correct via reinforcement learning},
+  author={Kumar, Aviral and Zhuang, Vincent and Agarwal, Rishabh and Su, Yi and Co-Reyes, John D and Singh, Avi and Baumli, Kate and Iqbal, Shariq and Bishop, Colton and Roelofs, Rebecca and others},
+  journal={arXiv preprint arXiv:2409.12917},
+  year={2024}
+}
+@article{singh2023beyond,
+  title={Beyond human data: Scaling self-training for problem-solving with language models},
+  author={Singh, Avi and Co-Reyes, John D and Agarwal, Rishabh and Anand, Ankesh and Patil, Piyush and Liu, Peter J and Harrison, James and Lee, Jaehoon and Xu, Kelvin and Parisi, Aaron and others},
+  journal={arXiv preprint arXiv:2312.06585},
+  year={2023}
+}
+@misc{openai2024o1,
+  title = {Introducing OpenAI o1-preview},
+  author = {{OpenAI}},
+  year = {2024},
+  month = sep,
+  url = {https://openai.com/index/introducing-openai-o1-preview/},
+  note = {Accessed: 2024-10-18}
+}
 # LLM as a Judge ####################################################################
 @article{zheng2023judging,
   title={Judging llm-as-a-judge with mt-bench and chatbot arena},
@@ -159,4 +331,11 @@ @misc{schulman2016klapprox
   year = {2016},
   howpublished = {\url{http://joschu.net/blog/kl-approx.html}},
   note = {Accessed: 2024-10-01}
-}
\ No newline at end of file
+}
+@misc{openai2022chatgpt,
+  title = {ChatGPT: Optimizing Language Models for Dialogue},
+  author = {{OpenAI}},
+  year = {2022},
+  howpublished = {\url{https://openai.com/blog/chatgpt/}},
+  note = {Training a LM with RLHF for suitable use as an all-purpose chat bot.}
+}
diff --git a/metadata.yml b/metadata.yml
index 3e118a6..c575c02 100644
--- a/metadata.yml
+++ b/metadata.yml
@@ -8,7 +8,7 @@ lang: en-US
 mainlang: english
 otherlang: english
 tags: [rlhf, ebook, ai, ml]
-date: 21 September 2024
+date: 6 October 2024
 abstract: |
-  Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to the deploy of the lastest machine learning systems.
+  Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool in the deployment of the latest machine learning systems.
   In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background.