I've been working on a unit-scaled mamba block and wanted to share my work as well as ask a couple of questions. I used the https://github.com/johnma2006/mamba-minimal implementation as a skeleton.
All outputs appear to be properly scaled except deltaA, whose scale is thrown off by the torch.exp operation. When torch.exp is not used, deltaA is properly scaled, since it uses UF.add. How would you recommend I handle this? Thank you very much for your time.
Note: unit_scaling is imported as U and unit_scaling.functional as UF.
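To illustrate why torch.exp can break the scale, here is a minimal sketch using only the Python standard library. It assumes the input to exp is unit-variance Gaussian noise, in which case the output is log-normal with RMS exp(sigma^2), about 2.72 for sigma = 1. The helper names (`rms`, `exp_rms_for_gaussian`) and the idea of dividing by that constant are assumptions for illustration, not part of unit_scaling. In Mamba the argument of exp (delta * A) is negative, so the real statistics will differ and the constant would need to be re-derived or measured.

```python
import math
import random


def rms(xs):
    """Root-mean-square of a list of floats."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))


def exp_rms_for_gaussian(sigma=1.0):
    """Analytic RMS of exp(x) for x ~ N(0, sigma^2).

    E[exp(2x)] = exp(2 * sigma^2), so the RMS is exp(sigma^2).
    """
    return math.exp(sigma ** 2)


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(200_000)]  # unit-scaled input
y = [math.exp(v) for v in x]                          # output of exp

empirical = rms(y)                  # measured output scale, well above 1
analytic = exp_rms_for_gaussian()   # e, roughly 2.718

# Hypothetical compensation: divide by the analytic RMS so that the
# output is back near unit scale (under the Gaussian assumption only).
y_scaled = [v / analytic for v in y]
scaled_rms = rms(y_scaled)

print(f"input RMS:       {rms(x):.3f}")
print(f"exp output RMS:  {empirical:.3f} (analytic {analytic:.3f})")
print(f"rescaled RMS:    {scaled_rms:.3f}")
```

The same measurement can be repeated on the actual delta * A tensor to pick an empirical constant instead of the Gaussian one.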
This part seems to be properly scaled. Using plain UF.add instead of weighted-add scaling seems to work well for the forward scale. Would using the weighted-add rule for scale, described in the first unit-scaling paper, lead to better-scaled outputs?
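For reference, a weighted add can be sketched as follows. This is a plain-Python illustration of the forward scale only, assuming the two inputs are independent and unit-variance; `weighted_add` is a hypothetical helper and not the unit_scaling implementation. The weighted sum is divided by sqrt(wa^2 + wb^2), which keeps the output at unit variance for any choice of weights, whereas an unscaled sum of two unit-variance inputs has standard deviation sqrt(2).

```python
import math
import random


def std(xs):
    """Population standard deviation of a list of floats."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))


def weighted_add(a, b, wa=1.0, wb=1.0):
    """Hypothetical weighted add: (wa*a + wb*b) / sqrt(wa^2 + wb^2).

    For independent unit-variance inputs the result is unit-variance.
    """
    norm = math.sqrt(wa * wa + wb * wb)
    return [(wa * x + wb * y) / norm for x, y in zip(a, b)]


random.seed(0)
n = 100_000
a = [random.gauss(0.0, 1.0) for _ in range(n)]
b = [random.gauss(0.0, 1.0) for _ in range(n)]

plain = [x + y for x, y in zip(a, b)]         # unscaled sum, std near sqrt(2)
equal = weighted_add(a, b)                    # equal weights, std near 1
skewed = weighted_add(a, b, wa=1.0, wb=0.25)  # unequal weights, std near 1

print(f"plain sum std:    {std(plain):.3f}")
print(f"equal weights:    {std(equal):.3f}")
print(f"unequal weights:  {std(skewed):.3f}")
```

With equal weights this reduces to dividing the sum by sqrt(2), so any difference from a plain scaled add would only show up when the two branches are meant to contribute unequally.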
These are all the components needed for Mamba that aren't already implemented in the unit-scaling library. Thanks for making all of this possible. I will check how well the backward-pass scales behave before working on the full model.
Softplus:
SSM:
part 1