<h2 id="relora">ReLoRA: High-Rank Training Through Low-Rank Updates</h2>

<p>To increase the total rank of the update, ReLoRA uses a property of the rank of the sum of two matrices: rank(A + B) ≤ rank(A) + rank(B). ReLoRA merges the LoRA matrices into the original weight matrices multiple times during training, so the individual low-rank updates accumulate into a high total rank for the overall update (&Delta;W).</p>
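<p>A minimal sketch of this merge-and-reinitialize loop for a single weight matrix (the shapes, helper names, and the stand-in "training" steps are illustrative, not ReLoRA's actual implementation):</p>

<pre><code>import torch

def new_lora_pair(d, r):
    # Standard LoRA init: A is random, B is zero, so the update AB starts at zero.
    return torch.randn(d, r) * 0.02, torch.zeros(r, d)

d, r = 512, 8                        # hidden size and LoRA rank (illustrative values)
W = torch.randn(d, d) * 0.02         # base weight matrix
W0 = W.clone()                       # keep a copy to measure the total update
A, B = new_lora_pair(d, r)

for segment in range(3):
    # Stand-in for training A and B for some steps (real optimization elided);
    # each segment contributes an update of rank at most r.
    A = A + 0.01 * torch.randn_like(A)
    B = B + 0.01 * torch.randn_like(B)
    W = W + A @ B                    # merge the low-rank update into W
    A, B = new_lora_pair(d, r)       # reinitialize for the next segment

# By rank(A + B) ≤ rank(A) + rank(B), the accumulated update over k segments
# can reach rank up to k * r (here 3 * 8 = 24).
print(torch.linalg.matrix_rank(W - W0))
</code></pre>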

<figure>
<img src="/blog/assets/images/galore_sum_of_updates.png" />
<figcaption> </figcaption>
</figure>

<p>While ReLoRA's low-rank updates and merge-and-reinitialize approach offer efficiency gains and high-rank updates, there are a few challenges. Since the optimizer is Adam, consecutive updates remain highly correlated. To break this correlation and keep optimization stable, ReLoRA performs a partial reset (>90%) of the optimizer state, pruning by magnitude. However, resetting the optimizer state led to an exploding loss. The solution is a jagged learning rate scheduler: on every optimizer reset, the learning rate is set to zero and a quick (50-100 steps) warm-up brings it back to the cosine schedule (Figure 1). This prevents the loss from diverging after each optimizer reset. Additionally, ReLoRA uses a warm start for better performance: the model trains at full rank for a portion of the training process (typically around 25%) before switching to low-rank training.</p>
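<p>A sketch of the jagged schedule, assuming a cosine base schedule and a linear warm-up after each reset (function and parameter names are ours, not the paper's):</p>

<pre><code>import math

def jagged_cosine_lr(step, total_steps, peak_lr, reset_steps, warmup=100):
    # Base cosine decay over the whole run.
    lr = 0.5 * peak_lr * (1 + math.cos(math.pi * step / total_steps))
    # Find the most recent optimizer reset at or before this step.
    last_reset = max((s for s in reset_steps if s <= step), default=None)
    if last_reset is not None and step - last_reset < warmup:
        # Drop to zero at the reset, then warm back up to the cosine curve.
        lr = lr * (step - last_reset) / warmup
    return lr

# Example: resets every 5,000 steps in a 20,000-step run.
schedule = [jagged_cosine_lr(t, 20_000, 1e-3, [5_000, 10_000, 15_000])
            for t in range(20_000)]
</code></pre>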

<figure>
<img src="/blog/assets/images/relora-jagged lr.png" />
<figcaption> </figcaption>
</figure>
<h2 id="galore">GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection</h2>
<p>GaLore is a memory-efficient pre-training technique in which the gradients of the weight matrices are projected into low-rank form, updated using an optimizer, and projected back to the original gradient shape, which is then used to update the model weights. The technique rests on lemmas and theorems stated in the paper; the main lemma and theorem are described below.</p>
<ul>
<li><b>Lemma: Gradient becomes low-rank during training</b><br />
If the gradient is of the form G<sub>t</sub> = A - BW<sub>t</sub>C, with constant A and PSD matrices B and C, then G converges to rank 1 exponentially, suggesting that gradients of this form become low rank during training.</li>
<li><b>Theorem: Gradient Form of reversible models</b><br />
A reversible network with an L2 objective has a gradient of the form G<sub>t</sub> = A - BW<sub>t</sub>C. The definition of reversible networks is given in the paper, which proves that feed-forward networks and the softmax loss function are reversible and thus have gradients of this form. The paper does not discuss whether attention is a reversible network.</li>
</ul>
<p>As LLMs are built from feed-forward networks and activation functions, the lemma and theorem above and their proofs imply that LLM gradients have the form G<sub>t</sub> = A - BW<sub>t</sub>C and become low rank as training progresses.</p>
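<p>The lemma is easy to sanity-check numerically. A small illustrative experiment (our construction, not the paper's code): take random PSD matrices B and C normalized to unit spectral norm, update W with its gradient, and watch the spectrum of G<sub>t</sub> concentrate in its top singular value:</p>

<pre><code>import torch

torch.manual_seed(0)
n = 32
A = torch.randn(n, n)
B = torch.randn(n, n); B = B @ B.T            # PSD
C = torch.randn(n, n); C = C @ C.T            # PSD
B = B / torch.linalg.matrix_norm(B, ord=2)    # unit spectral norm keeps lr easy to pick
C = C / torch.linalg.matrix_norm(C, ord=2)
W = torch.zeros(n, n)
lr = 0.5

# With W_{t+1} = W_t + lr * G_t, the gradient evolves as
# G_{t+1} = G_t - lr * B @ G_t @ C, so components along large
# eigen-directions of B and C decay fastest.
for t in range(1, 20_001):
    G = A - B @ W @ C
    W = W + lr * G
    if t % 5_000 == 0:
        s = torch.linalg.svdvals(G)
        # Fraction of the spectrum in the top singular value grows toward 1.
        print(t, (s[0] / s.sum()).item())
</code></pre>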

<figure>
<img src="/blog/assets/images/galore-decomposition.png" />
<figcaption></figcaption>
</figure>

<p>GaLore decomposes the gradient G into a projection matrix P (m x r), a low-rank gradient, and a projection matrix Q (r x n) using SVD, where r is the rank. At every step, only one of P or Q is used, depending on whether m ≤ n. The low-rank gradient is updated using an optimizer (e.g. AdamW), and the updated low-rank gradient is then projected back to the original gradient shape using the corresponding projection matrix.</p>
<p>GaLore switches subspaces by reinitializing the projection matrices after a certain number of steps, i.e., the update frequency. The idea is that the model learns in one subspace for a certain number of steps and then switches to another subspace by reinitializing the projection matrices from the current gradient. Figure 2 shows a geometric interpretation of the low-rank subspace updates.</p>
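<p>Putting the two paragraphs above together, here is a minimal single-matrix sketch of the update (illustrative names and simplifications, not the official GaLore implementation; we assume the m ≤ n case, so only P is used, and we reset the optimizer moments on each subspace switch for simplicity):</p>

<pre><code>import torch

def galore_step(W, grad, state, rank=4, lr=1e-3, update_freq=200,
                beta1=0.9, beta2=0.999, eps=1e-8):
    # One GaLore-style update for a single weight matrix (m x n, m ≤ n).
    state["step"] = state.get("step", 0) + 1
    if state["step"] % update_freq == 1:
        # Subspace switch: re-derive P from the SVD of the current gradient.
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                        # m x rank projection
        state["m"] = torch.zeros(rank, grad.shape[1])   # Adam moments live in
        state["v"] = torch.zeros(rank, grad.shape[1])   # the low-rank space
    P = state["P"]
    R = P.T @ grad                                      # project: rank x n
    state["m"] = beta1 * state["m"] + (1 - beta1) * R   # Adam on the small matrix
    state["v"] = beta2 * state["v"] + (1 - beta2) * R * R
    update = state["m"] / (state["v"].sqrt() + eps)     # bias correction omitted
    return W - lr * (P @ update)                        # project back and apply
</code></pre>

<p>Each call projects the full gradient into the r-dimensional subspace, takes an optimizer step there, and projects the result back; every update_freq steps the projection is rebuilt from the current gradient, which is the subspace switch described above.</p>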


<h2 id="comparison">Comparison between ReLoRA and GaLore</h2>


<table>
<thead>
<tr>
<th></th>
<th><b>ReLoRA</b></th>
<th><b>GaLore</b></th>
</tr>
</thead>
<tbody>
<tr>
<td>Decomposition</td>
<td>LoRA decomposition</td>
<td>SVD</td>
</tr>
<tr>
<td>Perplexity difference (Full rank vs. Low rank)</td>
<td>0.44 (1.3B model)</td>
<td>0.08 (1B model)</td>
</tr>
<tr>
<td>Weight equation</td>
<td>W<sub>t</sub> = W<sub>t-1</sub> + AB</td>
<td>W<sub>t</sub> = W<sub>t-1</sub> + PG if m ≤ n</td>
</tr>
<tr>
<td>Gradient form</td>
<td>No specific form required</td>
<td>G<sub>t</sub> = A - BW<sub>t</sub>C</td>
</tr>
<p>This table summarizes the key differences between ReLoRA and GaLore, two parameter-efficient pre-training techniques discussed earlier. Here's a breakdown of the table:</p>

<ul>
<li><b>Decomposition</b>: ReLoRA uses LoRA decomposition to approximate low-rank updates, while GaLore uses Singular Value Decomposition (SVD).</li>
<li><b>Perplexity difference (Full rank vs. Low rank)</b>: Perplexity measures how well the model predicts the next word in a sequence; lower is better. The table shows the difference in perplexity between full-rank training and each low-rank method. ReLoRA shows a larger difference (0.44) for a 1.3B parameter model, while GaLore shows a smaller difference (0.08) for a 1B parameter model.</li>
<li><b>Tokens trained on</b>: The number of tokens used to train the models; the perplexity comparison above holds when the 1B-scale models are trained on the given number of tokens.</li>
<li><b>Weight equation</b>: How the model weights are updated during training under each decomposition technique.</li>
<li><b>Gradient form</b>: ReLoRA places no conditions on the gradient form, while GaLore requires the gradient to be of the form G<sub>t</sub> = A - BW<sub>t</sub>C.</li>
<li><b>Changes subspace using</b>: ReLoRA changes the subspace by resetting the optimizer state, while GaLore does so by re-initializing the projection matrix (P).</li>
<li><b>Number of matrices trained</b>: ReLoRA trains two matrices (A and B), while GaLore trains one matrix (G). GaLore can potentially use a higher rank for this matrix since it only trains one.</li>
<li><b>Additional hyperparameters</b>: These are tuning knobs that control the training process. Both methods add three additional hyperparameters.</li>
<li><b>Memory required</b>: This shows the amount of memory needed to train the model with each method (for a 1 billion parameter model). GaLore requires less memory than ReLoRA.</li>
<li><b>Throughput</b>: Throughput refers to the number of examples the model can process per second. This is measured on specific hardware (one RTX 3090 with 25G network bandwidth). ReLoRA shows higher throughput in this case.</li>
<li><b>Warmup required</b>: Whether a full-rank training phase is needed before switching to low-rank training. ReLoRA requires a warmup, while GaLore does not.</li>
<li><b>Optimizers</b>: These are the optimization algorithms used to train the models. GaLore offers a wider range of compatible optimizers.</li>
</ul>

<p>Both ReLoRA and GaLore have advantages and disadvantages for pre-training LLMs. Overall, GaLore saves more memory, whereas ReLoRA provides a greater speed-up when pre-training LLMs.</p>
