Preferences content, other nits (#19)
natolambert authored Oct 19, 2024
1 parent ad46209 commit e12e698
Showing 6 changed files with 255 additions and 19 deletions.
4 changes: 2 additions & 2 deletions chapters/01-introduction.md
@@ -12,7 +12,7 @@ Finally, the language model can be optimized with a RL optimizer of choice, by s
This book details key decisions and basic implementation examples for each step in this process.

RLHF has been applied to many domains successfully, with complexity increasing as the techniques have matured.
Early breakthrough experiments with RLHF were applied to deep reinforcement learning [@christiano2017deep], summarization [@stiennon2020learning], follow instructions [@ouyang2022training], parse web information for question answering [@nakano2021webgpt], and ``alignment'' [@bai2022training].
Early breakthrough experiments with RLHF were applied to deep reinforcement learning [@christiano2017deep], summarization [@stiennon2020learning], following instructions [@ouyang2022training], parsing web information for question answering [@nakano2021webgpt], and "alignment" [@bai2022training].

## Scope of This Book

@@ -79,7 +79,7 @@ He has written extensively on RLHF, including [many blog posts](https://www.inte
With the investment in language modeling, many variations on the traditional RLHF methods emerged.
RLHF colloquially has become synonymous with multiple overlapping approaches.
RLHF is a subset of preference fine-tuning (PreFT) techniques, including Direct Alignment Algorithms (See Chapter 12).
RLHF is the tool most associated with rapid progress in ``post-training'' of language models, which encompasses all training after the large-scale autoregressive training on primarily web data.
RLHF is the tool most associated with rapid progress in "post-training" of language models, which encompasses all training after the large-scale autoregressive training on primarily web data.
This textbook is a broad overview of RLHF and its directly neighboring methods, such as instruction tuning and other implementation details needed to set up a model for RLHF training.

As more successes of fine-tuning language models with RL emerge, such as OpenAI's o1 reasoning models, RLHF will be seen as the bridge that enabled further investment in RL methods for fine-tuning large base models.
33 changes: 31 additions & 2 deletions chapters/02-preferences.md
@@ -1,6 +1,35 @@

# [Incomplete] Human Preferences for RLHF

## Questioning the Ability of Preferences
Reinforcement learning from human feedback, also referred to as reinforcement learning from human preferences in early literature, is designed to optimize machine learning models in domains where explicitly designing a reward function is hard.
The motivation for using human feedback as the reward signal is to obtain an indirect metric for the target reward.

TODO [@lambert2023entangled].
The use of human-labeled feedback data integrates the history of many fields.
Using human data alone is a well-studied problem, but in the context of RLHF it sits at the intersection of multiple long-standing fields of study [@lambert2023entangled].

As an approximation, modern RLHF is the convergence of three areas of development:

1. Philosophy, psychology, economics, decision theory, and the nature of human preferences;
2. Optimal control, reinforcement learning, and maximizing utility; and
3. Modern deep learning systems.

Each of these areas brings specific assumptions about what a preference is and how it can be optimized, which together dictate the motivations and design of RLHF problems.

## The Origins of Reward Models: Costs vs. Rewards vs. Preferences

### Specifying objectives: from logic of utility to reward functions

### Implementing optimal utility

### Steering preferences

### Value alignment's role in RLHF

## From Design to Implementation

Many of the principles discussed earlier in this chapter are further specified in the process of implementing the modern RLHF stack, which adjusts the practical meaning of RLHF.

## Limitations of RLHF

The specifics of obtaining data for RLHF are discussed further in Chapter 6.
For an extended version of this chapter, see [@lambert2023entangled].
42 changes: 34 additions & 8 deletions chapters/04-related-works.md
@@ -1,18 +1,44 @@
# [Incomplete] Key Related Works
# Key Related Works

In this chapter we detail the key papers and projects that got the RLHF field to where it is today.
This is not intended to be a comprehensive review of RLHF and the related fields, but rather a starting point and a retelling of how we got to today.
It is intentionally focused on recent work that led to ChatGPT.
There is substantial further work in the RL literature on learning from preferences [@wirth2017survey].
For a more exhaustive list, you should consult a proper survey paper [@kaufmann2023survey], [@casper2023open].

## Early RL on Preferences
## Origins to 2018: RL on Preferences

Christriano et al etc
The field was recently popularized with the growth of Deep Reinforcement Learning and has since grown into a broader study of the applications of LLMs at many large technology companies.
Still, many of the techniques used today are deeply related to core techniques from early literature on RL from preferences.

## RLHP on Language Models
*TAMER: Training an Agent Manually via Evaluative Reinforcement* proposed a learned agent where humans provided scores on the actions taken iteratively to learn a reward model [@knox2008tamer]. Other concurrent or soon-after work proposed an actor-critic algorithm, COACH, where human feedback (both positive and negative) is used to tune the advantage function [@macglashan2017interactive].

Learning to summarize, first work on language models (zieglar et al)
The primary reference, Christiano et al. 2017, applied RLHF to preferences between Atari trajectories [@christiano2017deep]. The work shows that humans choosing between trajectories can be more effective in some domains than directly interacting with the environment. It uses some clever conditions, but is impressive nonetheless.
TAMER was adapted to deep learning with Deep TAMER just one year later [@warnell2018deep].

## Pre Modern Models
This era began to transition when reward models were proposed as a general method for studying alignment, rather than just as a tool for solving RL problems [@leike2018scalable].

InstructGPT, WebgGPT, Sparrow, Etc
## 2019 to 2022: RL from Human Preferences on Language Models

## ChatGPT
Reinforcement learning from human feedback, also regularly referred to as reinforcement learning from human preferences in its early days, was quickly adopted by AI labs increasingly turning to scaling large language models.
A large portion of this work began between GPT-2, in 2019, and GPT-3, in 2020.
The earliest work in 2019, *Fine-Tuning Language Models from Human Preferences*, has many striking similarities to modern work on RLHF [@ziegler2019fine]. It involved learning reward models, KL distances, feedback diagrams, etc.; just the evaluation tasks and capabilities were different.
From here, RLHF was applied to a variety of tasks.
The popular applications were the ones that worked at the time.
Important examples include general summarization [@stiennon2020learning], recursive summarization of books [@wu2021recursively], instruction following (InstructGPT) [@ouyang2022training], browser-assisted question-answering (WebGPT) [@nakano2021webgpt], supporting answers with citations (GopherCite) [@menick2022teaching], and general dialogue (Sparrow) [@glaese2022improving].

Aside from applications, a number of seminal papers defined key areas for the future of RLHF, including those on:

1. Reward model over-optimization [@gao2023scaling]: the tendency of RL optimizers to over-fit to reward models trained on preference data,
2. Language models as a general area of study for alignment [@askell2021general], and
3. Red teaming [@ganguli2022red], the process of assessing the safety of a language model.

Work continued on refining RLHF for application to chat models.
Anthropic continued to use it extensively for early versions of Claude [@bai2022training], and early open-source RLHF tools emerged [@ramamurthy2022reinforcement], [@havrilla-etal-2023-trlx], [@vonwerra2022trl].

## 2023 to Present: ChatGPT Era

Since OpenAI launched ChatGPT [@openai2022chatgpt], RLHF has been used extensively in leading language models.
It is well known to be used in Anthropic's Constitutional AI for Claude [@bai2022constitutional], Meta's Llama 2 [@touvron2023llama] and Llama 3 [@dubey2024llama], Nvidia's Nemotron [@adler2024nemotron], and more.

Today, RLHF is growing into a broader field of preference fine-tuning (PreFT), including new applications such as process rewards for intermediate reasoning steps [@lightman2023let], direct alignment algorithms inspired by Direct Preference Optimization (DPO) [@rafailov2024direct], learning from execution feedback on code or math [@kumar2024training], [@singh2023beyond], and other online reasoning methods inspired by OpenAI's o1 [@openai2024o1].
4 changes: 3 additions & 1 deletion chapters/08-regularization.md
@@ -26,12 +26,14 @@ Recall that KL distance is defined as follows:
$$ D_{KL}(P || Q) = \sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right) $$

In RLHF, the two distributions of interest are often the distribution of the new model version, say $P(x)$, and a distribution of the reference policy, say $Q(x)$.
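As a small worked example of the definition above, the following sketch (with made-up numbers standing in for model outputs) computes the KL distance between two toy next-token distributions:

```python
import math

# Hypothetical next-token distributions over a tiny vocabulary.
# P stands in for the model being trained, Q for the reference policy.
P = {"the": 0.5, "a": 0.3, "an": 0.2}
Q = {"the": 0.4, "a": 0.4, "an": 0.2}

# D_KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x))
kl = sum(p * math.log(p / Q[token]) for token, p in P.items())
print(f"KL(P || Q) = {kl:.4f}")  # ~0.0253 nats; exactly 0 only when P and Q match
```

Note that swapping $P$ and $Q$ gives a different value, since the KL distance is not symmetric.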

### Reference Model to Generations

The most common implementation of KL penalties compares the distance between the generated tokens during training and a static reference model.
The intuition is that the model you start training from has a style that you would like to stay close to.
This reference model is most often the instruction-tuned model, but can also be a previous RL checkpoint.
With simple substitution, the model we are sampling from becomes $P(x)^{\text{RL}}$ and $P(x)^{\text{Ref.}}, shown above in @eq:kl_standard.
With simple substitution, the model we are sampling from becomes $P^{\text{RL}}(x)$ and $P^{\text{Ref.}}(x)$, shown above in @eq:kl_standard.
Such KL distance was first applied to dialogue agents well before the popularity of large language models [@jaques2017sequence], yet KL control was quickly established as a core technique for fine-tuning pretrained models [@jaques2020human].
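As a rough sketch of how such a penalty can be computed on sampled generations, assuming access to per-token log-probabilities under both the RL policy and the frozen reference model (the function and coefficient names below are illustrative, not a specific library's API):

```python
import torch

def per_token_kl_penalty(policy_logprobs: torch.Tensor,
                         ref_logprobs: torch.Tensor,
                         kl_coef: float = 0.1) -> torch.Tensor:
    """Per-token KL penalty against a static reference model.

    Both arguments hold log-probabilities of the *sampled* tokens,
    shape (batch, seq_len). The log-ratio log P^RL(x) - log P^Ref(x),
    averaged over samples from the RL policy, estimates the KL distance
    in @eq:kl_standard.
    """
    log_ratio = policy_logprobs - ref_logprobs  # elementwise, one value per token
    return -kl_coef * log_ratio                 # added to the per-token reward, penalizing drift

# Illustrative numbers standing in for real model outputs.
policy_lp = torch.tensor([[-1.2, -0.4, -2.1]])  # log-probs of sampled tokens under the RL policy
ref_lp = torch.tensor([[-1.5, -0.9, -2.0]])     # log-probs of the same tokens under the reference
print(per_token_kl_penalty(policy_lp, ref_lp))  # negative where the policy has drifted above the reference
```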

### Implementation Example