Hey, thanks for this great work. I'm currently trying to integrate the unit-scaling library in my DiT-like codebase. Unfortunately, training is not really working: it trains extremely slowly, if at all. I looked at the standard deviation of all the intermediate activations and I think I have localized the issue to the attention operation.
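For context, I'm collecting those statistics with something like the following (a minimal sketch; register_std_hooks and the stats dict are just my own helpers, not part of unit-scaling):

import torch

def register_std_hooks(model, stats):
    # Record the std of each submodule's output under its qualified name.
    def make_hook(name):
        def hook(module, inputs, output):
            if torch.is_tensor(output):
                stats[name] = output.detach().float().std().item()
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

stats = {}
register_std_hooks(model, stats)  # `model` is my DiT-like network
# ... run one forward pass, then inspect `stats` ...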
The code looks like this:
x, skip = self.residual_split(x)
x = self.norm(x)
qkv = self.qkv_proj(x)
q, k, v = rearrange(qkv, "n l (t nh e) -> t n nh l e", t=3, e=self.d_head)
print(q.std(), k.std(), v.std())
x = U.scaled_dot_product_attention(q, k, v)
print(x.std())
When I look at the std of the q, k, and v activations, I get values very close to 1, which is what I expect. However, when I look at the std of the attention output, I often get values larger than 10. Any idea why this could be? Thanks!
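In case it helps, an isolated check like the one below (shapes are made up, and I'm assuming U is unit_scaling.functional, as behind the alias in the snippet above) should show whether the op itself preserves unit scale when fed plain unit-normal inputs, as opposed to the q/k/v produced by my projections:

import torch
import unit_scaling.functional as U  # assumed import behind the `U` alias above

# Illustrative shapes: (batch, heads, seq_len, head_dim).
n, nh, l, e = 4, 8, 256, 64
q = torch.randn(n, nh, l, e)
k = torch.randn(n, nh, l, e)
v = torch.randn(n, nh, l, e)

out = U.scaled_dot_product_attention(q, k, v)
print(q.std(), k.std(), v.std())  # all ~1.0 by construction
print(out.std())                  # in my real model this is where I see values > 10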