
Contrib16 (#30)
natolambert authored Jan 8, 2025
1 parent 969d4e4 commit 6b5bb9b
Showing 7 changed files with 58 additions and 10 deletions.
1 change: 1 addition & 0 deletions .github/workflows/static.yml
@@ -49,6 +49,7 @@ jobs:
      - name: Clean build directory
        run: |
          rm -rf build
          make clean
          mkdir -p build/html
      - name: Build book
34 changes: 31 additions & 3 deletions chapters/06-preference-data.md
@@ -5,7 +5,7 @@ next-chapter: "Reward Modeling"
next-url: "07-reward-models.html"
---

# [Incomplete] Preference Data
# Preference Data

## Collecting Preference Data

@@ -15,8 +15,36 @@ Given the sensitivity, processes that work and improve the models are extracted

## Rankings vs. Ratings




Preference data is typically collected either as rankings, where responses are ordered against each other, or as ratings, where the strength of preference is scored on a scale such as a Likert scale [@likert1932technique].

For example, a 5-point Likert scale would look like the following:

| A$>>$B | A$>$B | Tie | B$>$A | B$>>$A |
|:------:|:-----:|:-----:|:-----:|:------:|
| 1 | 2 | 3 | 4 | 5 |

Table: An example 5-point Likert scale between two responses, A and B. {#tbl:likert5}

Some early RLHF for language modeling works use an 8-point Likert scale with levels of preference between the two responses [@bai2022training].
An even-numbered scale removes the possibility of ties, as in the following example:

| A$>>>$B | | | A$>$B | B$>$A | | | B$>>>$A |
|:-------:|:-----:|:-----:|:-----:|:------:|:-----:|:-----:|:-------:|
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |

Table: An example 8-point Likert scale between two responses, A and B. {#tbl:likert8}

In this case [@bai2022training], as in other works, this more granular information is still reduced to a binary signal of a chosen and a rejected response for training the reward model.
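
To make this reduction concrete, below is a minimal sketch of mapping an 8-point Likert rating to the binary chosen/rejected format used for reward model training. This is an illustrative assumption rather than code from the works cited above; the function and field names (`likert_to_binary`, `prompt`, `chosen`, `rejected`) are hypothetical.

```python
# Illustrative sketch (hypothetical names): convert an 8-point Likert rating
# (1 = A strongly preferred, 8 = B strongly preferred, no tie option) into the
# binary chosen/rejected pair typically used to train a reward model.

def likert_to_binary(prompt: str, response_a: str, response_b: str, rating: int) -> dict:
    if not 1 <= rating <= 8:
        raise ValueError("rating must be an integer from 1 to 8")
    a_preferred = rating <= 4  # ratings 1-4 favor A, ratings 5-8 favor B
    return {
        "prompt": prompt,
        "chosen": response_a if a_preferred else response_b,
        "rejected": response_b if a_preferred else response_a,
    }

# Example usage: a rating of 2 (A > B) keeps response A as "chosen".
pair = likert_to_binary("What is RLHF?", "Response A ...", "Response B ...", rating=2)
```

Note that the strength of the preference (e.g. A$>>>$B versus A$>$B) is discarded in this conversion, which is exactly the information loss described above.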


TODO example of thumbs up / down with synthetic data or KTO


### Sourcing and Contracts

The first step is sourcing the vendor to provide data (or one's own annotators).
@@ -30,15 +30,15 @@ Once a contract is settled the data buyer and data provider agree upon instructi

An example interface is shown below from [@bai2022training]:

![Example preference data collection interface.](images/anthropic-interface.pdf){#fig:preference-interface}
![Example preference data collection interface.](images/anthropic-interface.png){#fig:preference-interface width=600px .center}

Depending on the domains of interest in the data, timelines for when the data can be labeled or curated vary. High-demand areas like mathematical reasoning or coding must be locked into a schedule weeks out. Simply delaying data collection does not always work, because vendors such as Scale AI manage their annotator workforces much like AI research labs manage the compute-intensive jobs on their clusters.

Once everything is agreed upon, the actual collection process is a high-stakes time for post-training teams. All the infrastructure, evaluation tools, and plans for how to use the data and make downstream decisions must be in place.

The data is delivered in weekly batches with more data coming later in the contract. For example, when we bought preference data for on-policy models we were training at HuggingFace, we had a 6-week delivery period. The first weeks were for further calibration and the later weeks were when we hoped to most improve our model.

![Overview of the multi-batch cycle for obtaining human preference data from a vendor.](images/pref-data-timeline.png){#fig:preferences}
![Overview of the multi-batch cycle for obtaining human preference data from a vendor.](images/pref-data-timeline.png){#fig:preferences width=600px .center}

The goal is that by week 4 or 5 we can see the data improving our model. This cadence has been mentioned in some frontier model reports, such as the 14 stages of data collection for Llama 2 [@touvron2023llama], but it doesn't always go well. At HuggingFace, trying to do this for the first time with human preferences, we didn't have the RLHF preparedness to get meaningful bumps on our evaluations. The last weeks came and we were forced to continue collecting preference data generated from endpoints we weren't confident in.

Binary file removed images/anthropic-interface.pdf
Binary file added images/anthropic-interface.png
2 changes: 1 addition & 1 deletion metadata.yml
@@ -8,7 +8,7 @@ lang: en-US
mainlang: english
otherlang: english
tags: [rlhf, ebook, ai, ml]
date: 19 December 2024
date: 5 January 2025
abstract: |
Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems.
In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background.
12 changes: 6 additions & 6 deletions templates/nav.js
@@ -30,25 +30,25 @@ class NavigationDropdown extends HTMLElement {
<h3>Introductions</h3>
<ol>
<li><a href="https://rlhfbook.com/c/01-introduction.html">Introduction</a></li>
<li><a href="https://rlhfbook.com/c/02-preferences.html">What are preferences?</a></li>
<li><a href="https://rlhfbook.com/c/03-optimization.html">Optimization and RL</a></li>
<li><a href="https://rlhfbook.com/c/04-related-works.html">Seminal (Recent) Works</a></li>
<li><a href="https://rlhfbook.com/c/02-related-works.html">Seminal (Recent) Works</a></li>
<li><a href="https://rlhfbook.com/c/03-setup.html">Definitions</a></li>
</ol>
</div>
<div class="section">
<h3>Problem Setup</h3>
<ol>
<li><a href="https://rlhfbook.com/c/05-setup.html">Definitions</a></li>
<li><a href="https://rlhfbook.com/c/04-optimization.html">Optimization and RL</a></li>
<li><a href="https://rlhfbook.com/c/05-preferences.html">What are preferences?</a></li>
<li><a href="https://rlhfbook.com/c/06-preference-data.html">Preference Data</a></li>
<li><a href="https://rlhfbook.com/c/07-reward-models.html">Reward Modeling</a></li>
<li><a href="https://rlhfbook.com/c/08-regularization.html">Regularization</a></li>
</ol>
</div>
<div class="section">
<h3>Optimization</h3>
<ol>
<li><a href="https://rlhfbook.com/c/07-reward-models.html">Reward Modeling</a></li>
<li><a href="https://rlhfbook.com/c/08-regularization.html">Regularization</a></li>
<li><a href="https://rlhfbook.com/c/09-instruction-tuning.html">Instruction Tuning</a></li>
<li><a href="https://rlhfbook.com/c/10-rejection-sampling.html">Rejection Sampling</a></li>
<li><a href="https://rlhfbook.com/c/11-policy-gradients.html">Policy Gradients</a></li>
19 changes: 19 additions & 0 deletions templates/style.css
@@ -345,6 +345,25 @@ table td {
display: none;
}

/* for html tables */
table {
  margin-left: auto;
  margin-right: auto;
  border-collapse: collapse;
  box-shadow: none;
  border: none;
}
th, td {
  padding: 8px;
  text-align: center;
  border: 1px solid #ddd;
}

thead {
  background-color: #f5f5f5;
}


.dropdown-content.open {
display: block;
}
