Updates to support u-muP, as the new default behaviour #58
Conversation
@lyprince, thank you for agreeing to review! @thecharlieblake FYI: this is still WIP, but (I think/hope) the major changes are in.
)
self.is_causal = is_causal
self.mult = mult
self.linear_qkv = Linear(hidden_size, 3 * hidden_size, constraint=constraint)
Without a `linear` specialisation (like `linear_readout`), this will have an extra sqrt(3) factor in the scale. Accordingly, our default implementation does not fuse the qkv matmul.
Ah, yes, I wondered about this; the cheeky thing I was thinking is that when using the new-default constraint "to_output_scale" it shouldn't matter, as the scale just depends on fan_in. But maybe this is a bit cheeky as it can be overridden.
> the scale just depends on fan_in

Won't the output scale of this op be `3 * fan_in` rather than `fan_in`, though? I see in the demo notebook that `attn_qkv.weight.grad.std = 0.62`, which is pretty close to `1/sqrt(3) = 0.58`.
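To make the fused-qkv scale question concrete, here is a minimal plain-PyTorch sketch (the scaling factor is written out by hand rather than using `uu.Linear`, and the sizes are illustrative): with a scale based only on `fan_in`, the forward pass stays unit-scaled but the backward pass picks up a sqrt(3) factor from the 3x wider output.

```python
import torch

hidden, batch = 1024, 4096  # illustrative sizes

# Unit-normal input and an unscaled, unit-normal weight, as in unit scaling.
x = torch.randn(batch, hidden, requires_grad=True)
w_fused = torch.randn(3 * hidden, hidden)  # fused qkv: fan_out = 3 * hidden

# "to_output_scale"-style rule: share a single 1/sqrt(fan_in) scale.
y = (x @ w_fused.t()) * hidden ** -0.5
print(y.std())  # ~1.0: the forward output is unit-scaled

# Backward with a unit-normal incoming gradient.
y.backward(torch.randn_like(y))
print(x.grad.std())  # ~sqrt(3) ≈ 1.73: the backward fan is 3 * hidden, not hidden
```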
Looks good! I've done my best to check for correctness. I have some uncertainty around residual scaling; a test that unit scale is preserved across depth under the residual scaling scheme would give me more confidence (a rough sketch of such a test is below the list).

Other than that, my main comments are around:

- Documentation that points to the relevant parts of the paper justifying design choices / changes.
- Doubts over whether `nn.Sequential` is the right choice for keeping track of depth, given common patterns for implementing masks and positional embeddings.
- Clarification on whether it is intentional to fuse the qkv projection in MHSA.
- Testing compatibility of the Parameter class with `torch.compile` (does FakeTensor trigger a `deepcopy` failure?).
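For the residual-scaling point, this is roughly the kind of test I have in mind. It uses a hand-rolled combine whose squared coefficients sum to one as a stand-in for the library's `residual_split` / `residual_add` with `tau`, and a scaled matmul as a stand-in for a unit-scaled layer; the real test should use the `uu.*` ops directly.

```python
import torch

def test_unit_scale_preserved_across_depth() -> None:
    torch.manual_seed(0)
    depth, width, batch, tau = 64, 1024, 256, 0.5

    x = torch.randn(batch, width)
    for _ in range(depth):
        # Stand-in for a unit-scaled residual branch (e.g. a transformer layer):
        # a plain matmul with 1/sqrt(fan_in) scaling keeps the branch near unit scale.
        w = torch.randn(width, width)
        branch = (x @ w.t()) * width ** -0.5
        # Combine skip and branch with coefficients whose squares sum to one, so the
        # mix stays near unit scale when the two are roughly independent.
        x = (1 - tau) ** 0.5 * x + tau ** 0.5 * branch

    assert abs(x.std().item() - 1.0) < 0.1
```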
examples/demo.ipynb
"# Config & helpers\n", | ||
"torch.backends.cuda.matmul.allow_tf32 = True\n", | ||
"torch.backends.cudnn.allow_tf32 = True\n", | ||
"def show_layer_stats(layer: nn.Module, input_shape: Tuple[int, ...]) -> None:\n", |
Demo notebook looks good! The only thing that broke my attention was searching for `show_layer_stats`; it was easy to miss when colocated with the imports. My preference would be to have it defined next to its first use, or at least in its own cell.
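For context, `show_layer_stats` only appears by signature in the diff above; an illustrative stand-in for what such a helper typically does (not the notebook's actual implementation) is:

```python
from typing import Tuple

import torch
from torch import nn

def show_layer_stats(layer: nn.Module, input_shape: Tuple[int, ...]) -> None:
    """Print forward/backward scale statistics for a layer fed unit-normal input."""
    x = torch.randn(*input_shape, requires_grad=True)
    y = layer(x)
    y.backward(torch.randn_like(y))
    print(f"output.std     = {y.std().item():.3f}")
    print(f"input.grad.std = {x.grad.std().item():.3f}")
    for name, param in layer.named_parameters():
        print(f"{name}.std = {param.std().item():.3f}")
        if param.grad is not None:
            print(f"{name}.grad.std = {param.grad.std().item():.3f}")
```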
- Demo notebook.
- README header.

Changes:

- `U.*` ops and `uu.*` modules:
  - `U.readout_linear`, `uu.ReadoutLinear` to apply 1/fan_in scaling
  - `U.rms_norm`, `uu.RMSNorm`, `U.silu`, `uu.SiLU`, `U.silu_glu`, `U.mse_loss`
  - `uu.Parameter`, `uu.optim.*` to provide LR scaling
  - `uu.Trunk`, `uu.TransformerStack` to set `mup_scaling_depth` and (for `TransformerStack`) apply appropriate residual mults
- `examples/demo.ipynb`
- `uu.Linear` to `bias=False` & `uu.*Norm` to `elementwise_affine=False`
- `U.softmax` and `U.scaled_dot_product_attention` scaling rules to use the empirical fit
- `tau` in `residual_split`/`residual_add`
- `uu.MHSA` to use `U.scaled_dot_product_attention`
- `uu.MLP` to use SwiGLU
- `uu.TransformerLayer`, `uu.TransformerDecoder` to use `RMSNorm` and various other tweaks
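As a quick illustration of the surface described above (the `uu` / `U` import aliases follow the convention in the change list; the constructor arguments are assumptions for illustration, not the exact signatures):

```python
import torch
import unit_scaling as uu            # modules: uu.*
import unit_scaling.functional as U  # ops: U.*

hidden, vocab = 256, 1024            # illustrative sizes

linear = uu.Linear(hidden, hidden)          # bias=False is now the default
readout = uu.ReadoutLinear(hidden, vocab)   # applies 1/fan_in output scaling

x = torch.randn(8, hidden)
h = U.silu(linear(x))                       # unit-scaled SiLU op
logits = readout(h)
```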