
Contrib16 (#30)
natolambert authored Jan 8, 2025
1 parent 969d4e4 commit 6b5bb9b
Showing 7 changed files with 58 additions and 10 deletions.
1 change: 1 addition & 0 deletions .github/workflows/static.yml
@@ -49,6 +49,7 @@ jobs:
      - name: Clean build directory
        run: |
          rm -rf build
          make clean
          mkdir -p build/html
      - name: Build book
34 changes: 31 additions & 3 deletions chapters/06-preference-data.md
@@ -5,7 +5,7 @@ next-chapter: "Reward Modeling"
next-url: "07-reward-models.html"
---

# [Incomplete] Preference Data
# Preference Data

## Collecting Preference Data

@@ -15,8 +15,36 @@ Given the sensitivity, processes that work and improve the models are extracted

## Rankings vs. Ratings




Preference data is typically collected either as rankings, where responses are ordered against each other, or as ratings, where the strength of preference is scored on a scale such as a Likert scale [@likert1932technique].

For example, a 5-point Likert scale would look like the following:

| A$>>$B | A$>$B | Tie | B$>$A | B$>>$A |
|:------:|:-----:|:-----:|:-----:|:------:|
| 1 | 2 | 3 | 4 | 5 |

Table: An example 5-point Likert scale between two responses, A and B. {#tbl:likert5}

Some early RLHF for language modeling works use an 8-point Likert scale with levels of preference between the two responses [@bai2022training].
An even-numbered scale removes the possibility of ties, as in the following example:

| A$>>>$B | | | A$>$B | B$>$A | | | B$>>>$A |
|:-------:|:-----:|:-----:|:-----:|:------:|:-----:|:-----:|:-------:|
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |

Table: An example 8-point Likert scale between two responses, A and B. {#tbl:likert8}

In this case [@bai2022training], as in other works, this more granular information is still reduced to a binary signal of a chosen and a rejected response for training the reward model.
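
To make this reduction concrete, below is a minimal sketch of mapping an 8-point Likert rating to the binary chosen/rejected format used for reward model training. This is an illustrative assumption rather than code from the works cited above; the function and field names (`likert_to_binary`, `prompt`, `chosen`, `rejected`) are hypothetical.

```python
# Illustrative sketch (hypothetical names): convert an 8-point Likert rating
# (1 = A strongly preferred, 8 = B strongly preferred, no tie option) into the
# binary chosen/rejected pair typically used to train a reward model.

def likert_to_binary(prompt: str, response_a: str, response_b: str, rating: int) -> dict:
    if not 1 <= rating <= 8:
        raise ValueError("rating must be an integer from 1 to 8")
    a_preferred = rating <= 4  # ratings 1-4 favor A, ratings 5-8 favor B
    return {
        "prompt": prompt,
        "chosen": response_a if a_preferred else response_b,
        "rejected": response_b if a_preferred else response_a,
    }

# Example usage: a rating of 2 (A > B) keeps response A as "chosen".
pair = likert_to_binary("What is RLHF?", "Response A ...", "Response B ...", rating=2)
```

Note that the strength of the preference (e.g. A$>>>$B versus A$>$B) is discarded in this conversion, which is exactly the information loss described above.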


TODO example of thumbs up / down with synthetic data or KTO


### Sourcing and Contracts

The first step is sourcing the vendor to provide data (or one's own annotators).
@@ -30,15 +30,15 @@ Once a contract is settled the data buyer and data provider agree upon instructi

An example interface is shown below from [@bai2022training]:

![Example preference data collection interface.](images/anthropic-interface.pdf){#fig:preference-interface}
![Example preference data collection interface.](images/anthropic-interface.png){#fig:preference-interface width=600px .center}

Depending on the domains of interest in the data, timelines for when the data can be labeled or curated vary. High-demand areas like mathematical reasoning or coding must be locked into a schedule weeks out. Simply delaying data collection does not always work, because vendors such as Scale AI manage their annotator workforces much like AI research labs manage the compute-intensive jobs on their clusters.

Once everything is agreed upon, the actual collection process is a high-stakes time for post-training teams. All the infrastructure, evaluation tools, and plans for how to use the data and make downstream decisions must be in place.

The data is delivered in weekly batches with more data coming later in the contract. For example, when we bought preference data for on-policy models we were training at HuggingFace, we had a 6-week delivery period. The first weeks were for further calibration and the later weeks were when we hoped to most improve our model.

![Overview of the multi-batch cycle for obtaining human preference data from a vendor.](images/pref-data-timeline.png){#fig:preferences}
![Overview of the multi-batch cycle for obtaining human preference data from a vendor.](images/pref-data-timeline.png){#fig:preferences width=600px .center}

The goal is that by week 4 or 5 we can see the data improving our model. This cadence has been mentioned in some frontier model reports, such as the 14 stages of data collection for Llama 2 [@touvron2023llama], but it doesn't always go well. At HuggingFace, trying to do this for the first time with human preferences, we didn't have the RLHF preparedness to get meaningful bumps on our evaluations. The last weeks came and we were forced to continue collecting preference data generated from endpoints we weren't confident in.

Binary file removed images/anthropic-interface.pdf
Binary file added images/anthropic-interface.png
2 changes: 1 addition & 1 deletion metadata.yml
@@ -8,7 +8,7 @@ lang: en-US
mainlang: english
otherlang: english
tags: [rlhf, ebook, ai, ml]
date: 19 December 2024
date: 5 January 2025
abstract: |
Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems.
In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background.
12 changes: 6 additions & 6 deletions templates/nav.js
@@ -30,25 +30,25 @@ class NavigationDropdown extends HTMLElement {
<h3>Introductions</h3>
<ol>
<li><a href="https://rlhfbook.com/c/01-introduction.html">Introduction</a></li>
<li><a href="https://rlhfbook.com/c/02-preferences.html">What are preferences?</a></li>
<li><a href="https://rlhfbook.com/c/03-optimization.html">Optimization and RL</a></li>
<li><a href="https://rlhfbook.com/c/04-related-works.html">Seminal (Recent) Works</a></li>
<li><a href="https://rlhfbook.com/c/02-related-works.html">Seminal (Recent) Works</a></li>
<li><a href="https://rlhfbook.com/c/03-setup.html">Definitions</a></li>
</ol>
</div>
<div class="section">
<h3>Problem Setup</h3>
<ol>
<li><a href="https://rlhfbook.com/c/05-setup.html">Definitions</a></li>
<li><a href="https://rlhfbook.com/c/04-optimization.html">Optimization and RL</a></li>
<li><a href="https://rlhfbook.com/c/05-preferences.html">What are preferences?</a></li>
<li><a href="https://rlhfbook.com/c/06-preference-data.html">Preference Data</a></li>
<li><a href="https://rlhfbook.com/c/07-reward-models.html">Reward Modeling</a></li>
<li><a href="https://rlhfbook.com/c/08-regularization.html">Regularization</a></li>
</ol>
</div>
<div class="section">
<h3>Optimization</h3>
<ol>
<li><a href="https://rlhfbook.com/c/07-reward-models.html">Reward Modeling</a></li>
<li><a href="https://rlhfbook.com/c/08-regularization.html">Regularization</a></li>
<li><a href="https://rlhfbook.com/c/09-instruction-tuning.html">Instruction Tuning</a></li>
<li><a href="https://rlhfbook.com/c/10-rejection-sampling.html">Rejection Sampling</a></li>
<li><a href="https://rlhfbook.com/c/11-policy-gradients.html">Policy Gradients</a></li>
19 changes: 19 additions & 0 deletions templates/style.css
@@ -345,6 +345,25 @@ table td {
display: none;
}

/* for html tables */
table {
  margin-left: auto;
  margin-right: auto;
  border-collapse: collapse;
  box-shadow: none;
  border: none;
}
th, td {
  padding: 8px;
  text-align: center;
  border: 1px solid #ddd;
}

thead {
  background-color: #f5f5f5;
}


.dropdown-content.open {
display: block;
}
