Commit e12e698: Preferences content, other nits (#19)
1 parent: ad46209
Showing 6 changed files with 255 additions and 19 deletions.
# [Incomplete] Human Preferences for RLHF

## Questioning the Ability of Preferences

The core of reinforcement learning from human feedback, also referred to as reinforcement learning from human preferences in early literature, is to optimize machine learning models in domains where explicitly designing a reward function is hard.
The motivation for using humans as the reward signal is to obtain an indirect metric for the target reward.
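
As a minimal sketch of what "humans as the reward signal" can mean in practice (the notation here is illustrative, not this chapter's), much of the RLHF literature models a pairwise preference between two completions $y_1$ and $y_2$ for a prompt $x$ with a Bradley-Terry model over an underlying reward function $r$:

$$ P(y_1 \succ y_2 \mid x) = \frac{\exp\left(r(x, y_1)\right)}{\exp\left(r(x, y_1)\right) + \exp\left(r(x, y_2)\right)}, $$

so that collected preference labels act as indirect observations of a reward that would otherwise be hard to specify.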

The use of human-labeled feedback data draws on the history of many fields.
Using human data alone is a well-studied problem, but in the context of RLHF it sits at the intersection of multiple long-standing fields of study [@lambert2023entangled].

As an approximation, modern RLHF is the convergence of three areas of development:

1. Philosophy, psychology, economics, decision theory, and the nature of human preferences;
2. Optimal control, reinforcement learning, and maximizing utility; and
3. Modern deep learning systems.

Together, these areas bring specific assumptions about what a preference is and how it can be optimized, which dictate the motivations and design of RLHF problems.

## The Origins of Reward Models: Costs vs. Rewards vs. Preferences

### Specifying objectives: from logic of utility to reward functions

### Implementing optimal utility

### Steering preferences

### Value alignment's role in RLHF

## From Design to Implementation

Many of the principles discussed earlier in this chapter are further specified in the process of implementing the modern RLHF stack, which in turn shapes what RLHF means in practice.

## Limitations of RLHF

The specifics of obtaining data for RLHF are discussed further in Chapter 6.
For an extended version of this chapter, see [@lambert2023entangled].

# Key Related Works

In this chapter we detail the key papers and projects that got the RLHF field to where it is today.
This is not intended to be a comprehensive review of RLHF and the related fields, but rather a starting point and a retelling of how we got to today.
It is intentionally focused on recent work that led to ChatGPT.
There is substantial further work in the RL literature on learning from preferences [@wirth2017survey].
For a more exhaustive list, see a proper survey paper [@kaufmann2023survey], [@casper2023open].

## Origins to 2018: RL on Preferences

The field was recently popularized with the growth of Deep Reinforcement Learning and has since grown into a broader study of the applications of LLMs by many large technology companies.
Still, many of the techniques used today are deeply related to core techniques from the early literature on RL from preferences.

*TAMER: Training an Agent Manually via Evaluative Reinforcement* proposed a learned agent where humans iteratively provided scores on the actions taken in order to learn a reward model [@knox2008tamer].
Other work, concurrent or soon after, proposed an actor-critic algorithm, COACH, where human feedback (both positive and negative) is used to tune the advantage function [@macglashan2017interactive].

The primary reference, Christiano et al. 2017, applied RLHF to preferences between Atari trajectories [@christiano2017deep].
The work shows that humans choosing between trajectories can be more effective in some domains than directly interacting with the environment.
This uses some clever conditions, but is impressive nonetheless.
TAMER was adapted to deep learning with Deep TAMER just one year later [@warnell2018deep].
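
As a rough sketch of the core idea in that line of work (notation chosen here for illustration, not taken from this chapter), a reward estimate $\hat{r}$ over trajectory segments is trained so that the predicted probability of a human preferring segment $\sigma^1$ over $\sigma^2$ is a softmax over summed rewards:

$$ \hat{P}\left[\sigma^1 \succ \sigma^2\right] = \frac{\exp \sum_t \hat{r}\left(s^1_t, a^1_t\right)}{\exp \sum_t \hat{r}\left(s^1_t, a^1_t\right) + \exp \sum_t \hat{r}\left(s^2_t, a^2_t\right)}, $$

with $\hat{r}$ fit by cross-entropy against the human labels and then used as the reward signal for a standard RL algorithm.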

This era began to transition when reward models, as a general notion, were proposed as a method for studying alignment rather than just as a tool for solving RL problems [@leike2018scalable].

## 2019 to 2022: RL from Human Preferences on Language Models

Reinforcement learning from human feedback, also referred to regularly as reinforcement learning from human preferences in its early days, was quickly adopted by AI labs that were increasingly turning to scaling large language models.
A large portion of this work began between GPT-2, in 2018, and GPT-3, in 2020.
The earliest work in 2019, *Fine-Tuning Language Models from Human Preferences*, has many striking similarities to modern work on RLHF [@ziegler2019fine]: learned reward models, KL distances, feedback diagrams, and so on; just the evaluation tasks, and the capabilities, were different.
From here, RLHF was applied to a variety of tasks.
The popular applications were the ones that worked at the time.
Important examples include general summarization [@stiennon2020learning], recursive summarization of books [@wu2021recursively], instruction following (InstructGPT) [@ouyang2022training], browser-assisted question-answering (WebGPT) [@nakano2021webgpt], supporting answers with citations (GopherCite) [@menick2022teaching], and general dialogue (Sparrow) [@glaese2022improving].
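
As an illustrative sketch of the recipe shared across these works (exact notation and details vary by paper), a policy $\pi_\theta$ is optimized against a learned reward model $r_\phi$ while a KL penalty keeps it close to a reference model $\pi_{\mathrm{ref}}$:

$$ \max_{\pi_\theta} \ \mathbb{E}_{x \sim \mathcal{D},\ y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right], $$

where $\beta$ trades off reward maximization against staying near the reference model.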

Aside from applications, a number of seminal papers defined key areas for the future of RLHF, including those on:

1. Reward model over-optimization [@gao2023scaling]: the tendency of RL optimizers to over-fit to reward models trained on preference data;
2. Language models as a general area of study for alignment [@askell2021general]; and
3. Red teaming [@ganguli2022red]: the process of assessing the safety of a language model.

Work continued on refining RLHF for application to chat models.
Anthropic continued to use it extensively for early versions of Claude [@bai2022training], and early RLHF open-source tools emerged [@ramamurthy2022reinforcement], [@havrilla-etal-2023-trlx], [@vonwerra2022trl].

## 2023 to Present: ChatGPT Era

Since OpenAI launched ChatGPT [@openai2022chatgpt], RLHF has been used extensively in leading language models.
It is well known to be used in Anthropic's Constitutional AI for Claude [@bai2022constitutional], Meta's Llama 2 [@touvron2023llama] and Llama 3 [@dubey2024llama], Nvidia's Nemotron [@adler2024nemotron], and more.

Today, RLHF is growing into a broader field of preference fine-tuning (PreFT), including new applications such as process reward for intermediate reasoning steps [@lightman2023let], direct alignment algorithms inspired by Direct Preference Optimization (DPO) [@rafailov2024direct], learning from execution feedback from code or math [@kumar2024training], [@singh2023beyond], and other online reasoning methods inspired by OpenAI's o1 [@openai2024o1].
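
As a brief example of the direct alignment family mentioned above (shown in the commonly used notation, not this chapter's), DPO replaces the learned reward model and RL loop with a single classification-style loss on preference pairs, where $y_w$ and $y_l$ are the chosen and rejected completions and $\sigma$ is the logistic function [@rafailov2024direct]:

$$ \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = - \mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]. $$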