Hey, thanks for this great work. I'm currently trying to integrate the unit-scaling library in my DiT-like codebase. Unfortunately, training is not really working: it trains extremely slowly, if at all. I looked at the standard deviation of all the intermediate activations and I think I have localized the issue to the attention operation.
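For context, I'm collecting those statistics with something like the following (a minimal sketch; register_std_hooks and the stats dict are just my own helpers, not part of unit-scaling):

import torch

def register_std_hooks(model, stats):
    # Record the std of each submodule's output under its qualified name.
    def make_hook(name):
        def hook(module, inputs, output):
            if torch.is_tensor(output):
                stats[name] = output.detach().float().std().item()
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

stats = {}
register_std_hooks(model, stats)  # `model` is my DiT-like network
# ... run one forward pass, then inspect `stats` ...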
The code looks like this:
x, skip = self.residual_split(x)
x = self.norm(x)
qkv = self.qkv_proj(x)
q, k, v = rearrange(qkv, "n l (t nh e) -> t n nh l e", t=3, e=self.d_head)
print(q.std(), k.std(), v.std())
x = U.scaled_dot_product_attention(q, k, v)
print(x.std())
When I look at the std of the q, k, and v activations, I get values very close to 1, which is what I expect. However, when I look at the std of the attention output, I often get values larger than 10. Any idea why this could be? Thanks!
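In case it helps, an isolated check like the one below (shapes are made up, and I'm assuming U is unit_scaling.functional, as behind the alias in the snippet above) should show whether the op itself preserves unit scale when fed plain unit-normal inputs, as opposed to the q/k/v produced by my projections:

import torch
import unit_scaling.functional as U  # assumed import behind the `U` alias above

# Illustrative shapes: (batch, heads, seq_len, head_dim).
n, nh, l, e = 4, 8, 256, 64
q = torch.randn(n, nh, l, e)
k = torch.randn(n, nh, l, e)
v = torch.randn(n, nh, l, e)

out = U.scaled_dot_product_attention(q, k, v)
print(q.std(), k.std(), v.std())  # all ~1.0 by construction
print(out.std())                  # in my real model this is where I see values > 10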