Intrinsic Dimensionality #4
Comments
How familiar are you with the Intrinsic Dimension paper? It's been a while, but I seem to recall the basic idea is that one can replace an existing network parameterisation "W" with one that looks like W = W_0 + P·V (with W_0 the initial value and P a fixed random matrix). We then optimise the new network, but only alter V. If we can get the network to train 'well', then we know that 'V' is big enough - so we can try smaller sizes of 'V'. At some point, the network won't train well, and we know we've gone "too far" in restricting the size of V. Just before that, the size of 'V' is what we'll call the intrinsic_dimension.

So: the `IntrinsicDimensionWrapper` takes a module (in the notebook I tested on a single Linear layer first, and then a whole MNIST CNN), and goes through all the parameter blocks, replacing them with their initial value, and a dependency on a single 'V'. It then cleans out all the old parameters, so that when PyTorch thinks about optimisation, it only sees 'V'.

Does this make sense? I made the notebook for a presentation I gave in Singapore, a short while after the paper came out: https://blog.mdda.net/ai/2018/05/15/presentation-at-pytorch

Hope this helps
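For illustration, here is a minimal, self-contained sketch of a wrapper along those lines (this is not the notebook's exact code; the `intrinsic_dim` argument, the `_locate` helper, and the `1/sqrt(intrinsic_dim)` scaling are illustrative choices). Each parameter block is frozen at its initial value, given its own fixed random projection of the shared vector `V`, and the original `Parameter`s are removed so the optimiser only sees `V`:

```python
import torch
import torch.nn as nn

def _locate(root, dotted_name):
    """Walk a dotted parameter name (e.g. 'conv1.weight') down to
    (owning submodule, attribute name)."""
    *path, attr = dotted_name.split('.')
    mod = root
    for part in path:
        mod = getattr(mod, part)
    return mod, attr

class IntrinsicDimensionWrapper(nn.Module):
    """Reparameterise every parameter block of `module` as W = W_0 + P @ V,
    where the shared vector V (length `intrinsic_dim`) is the only trained tensor."""

    def __init__(self, module, intrinsic_dim):
        super().__init__()
        self.module = module
        self.V = nn.Parameter(torch.zeros(intrinsic_dim))
        self.initial, self.projections, self.names = {}, {}, []
        for name, p in list(module.named_parameters()):
            w0 = p.detach().clone()
            # Fixed (untrained) dense random projection for this block,
            # downscaled so the offset starts at a sensible magnitude.
            proj = torch.randn(p.numel(), intrinsic_dim) / intrinsic_dim ** 0.5
            self.initial[name], self.projections[name] = w0, proj
            self.names.append(name)
            owner, attr = _locate(module, name)
            delattr(owner, attr)        # drop the real nn.Parameter ...
            setattr(owner, attr, w0)    # ... and leave a plain tensor in its place
        # At this point self.parameters() yields only V.
        # (Real code would also register the frozen tensors as buffers
        #  so they follow .to(device) / .cuda().)

    def forward(self, *args, **kwargs):
        # Rebuild every weight from its frozen initial value plus the V-dependent offset.
        for name in self.names:
            w0 = self.initial[name]
            offset = (self.projections[name] @ self.V).view(w0.shape)
            owner, attr = _locate(self.module, name)
            setattr(owner, attr, w0 + offset)
        return self.module(*args, **kwargs)
```

Wrapping e.g. `nn.Linear(784, 10)` with `intrinsic_dim=100` and optimising only `wrapped.parameters()` is then a 100-dimensional training problem, however many native weights the layer has.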
@mdda Thank you so much for the explanation. This solves most of my doubts. In the paper, they have mentioned 3 ways of generating the random matrix (dense, sparse, and Fastfood).
From your code, I can understand that you went with the naive dense method for random matrix generation. You have also scaled the random matrix down - could you explain why that scaling is needed? Also, after the wrapper is applied, the model seems to have only 1 parameter (the 'V' vector) - is that expected?
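For reference, the "naive dense" option amounts to something like the sketch below for a single `nn.Linear` block (a hand-written illustration, not the repo's code; the sizes and the `1/sqrt(d)` scaling are assumptions, and the paper's sparse and Fastfood variants are not shown):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 100                                  # intrinsic dimension being probed
layer = nn.Linear(784, 10)               # 7850 native parameters
D = sum(p.numel() for p in layer.parameters())

V = torch.zeros(d, requires_grad=True)   # the single trainable vector
# Naive dense generation: one fixed Gaussian matrix per parameter block.
P_weight = torch.randn(layer.weight.numel(), d) / d ** 0.5
P_bias = torch.randn(layer.bias.numel(), d) / d ** 0.5

# Each block is its frozen initial value plus a V-dependent offset.
weight = layer.weight.detach() + (P_weight @ V).view(layer.weight.shape)
bias = layer.bias.detach() + (P_bias @ V).view(layer.bias.shape)

out = F.linear(torch.randn(32, 784), weight, bias)   # forward pass on dummy data
print(D, "native parameters, but only", V.numel(), "are trainable")
```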
I guess I should first point out that this was hacked together just a few hours before I gave the talk... But my self-justification for this is that if I've got a vector, and I multiply it by a matrix, there's a kind of 'impedance mismatch' in terms of scaling. To some extent, I'll be adding together terms of O(V_i) * N(0,1), size_of_V times over. So if the elements of V are "about the right size", then I need to downscale the matrix by the square-root of something relevant... (the same would go for the attention-head factor in Transformers). I'm not claiming this is exactly right, but the factor would be irrelevant after training anyway: I was just trying to slice off an approximate scale factor to enable easier optimisation.
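A quick numerical check of that square-root argument (the sizes here are made up): each entry of the projected offset sums `size_of_V` terms of roughly O(V_i) * N(0,1), so its standard deviation grows like sqrt(size_of_V); dividing the random matrix by that square root brings the result back onto the scale of the elements of V.

```python
import torch

d = 4096                           # illustrative size_of_V
V = torch.randn(d)                 # elements "about the right size" (~1)
P = torch.randn(2000, d)           # unscaled dense random matrix

raw = P @ V                        # each entry sums d terms of O(1)
scaled = (P / d ** 0.5) @ V        # matrix downscaled by sqrt(d)

print(raw.std().item())            # roughly sqrt(4096) = 64: much larger than V's elements
print(scaled.std().item())         # roughly 1: back on the scale of V
```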
@mdda This repo is amazing. Thank you so much for that. I am trying to play with the Intrinsic Dimensionality code (link). I am not quite able to understand the `class IntrinsicDimensionWrapper(torch.nn.Module)`. Can you walk me through it, especially the for loop in `__init__` and `forward`?