Inaccuracy in randomized SVD at moderate ranks #58
Comments
Thanks for looking into this @stanleyjs. This is a hectic week for me but I'll make a note to come back to this. Fundamentally I have no issue with directly calling …
@dburkhardt Here is a plot of the errors: https://user-images.githubusercontent.com/16860172/142485537-583b4b42-6b5b-4814-b214-bb7517a6b142.png

And you can see the notebook that created it here: https://gist.github.com/stanleyjs/cb223cedb913942c4f9349b53f800ced

Clearly it's an issue. However, I was thinking of maybe just submitting a PR upstream in sklearn to add `n_oversamples` to `TruncatedSVD`.
TBH I think sklearn might be the right place to fix this. If the PR is rejected then we should add it here, but no reason why more people shouldn't benefit from this fix.
@scottgigante @dburkhardt it appears that sklearn will probably fix this gap, but not without some internal discussion over the details of the API. I am wondering if we should go ahead and patch in the randomized SVD kwargs. Also, I notice that we'd only have to patch in a workaround for sparse matrices / `TruncatedSVD`: it looks like `PCA` (the dense matrix class) has the `n_oversamples` argument.
Fine by me if you want to write the patch. Probably easiest is to monkey-patch with a maximum version on sklearn (set to the current version + 1).
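For reference, a rough sketch of what that monkey-patch could look like, assuming sklearn's `TruncatedSVD` module imports `randomized_svd` at module level; the `"1.1"` version ceiling and the `2 * n_components` oversampling value are placeholders, not decisions made in this thread:

```python
# Sketch of the monkey-patch idea: rebind the randomized_svd name imported by
# TruncatedSVD's module, but only on sklearn versions that do not yet expose
# n_oversamples on the estimator itself.
from packaging import version

import sklearn
import sklearn.decomposition._truncated_svd as _tsvd  # internal module path; an assumption
from sklearn.utils.extmath import randomized_svd

if version.parse(sklearn.__version__) < version.parse("1.1"):  # placeholder ceiling

    def _patched_randomized_svd(M, n_components, **kwargs):
        # Use a larger oversampling than sklearn's default of 10 unless the
        # caller already set one; 2 * n_components is a placeholder choice.
        kwargs.setdefault("n_oversamples", 2 * n_components)
        return randomized_svd(M, n_components, **kwargs)

    _tsvd.randomized_svd = _patched_randomized_svd
```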
I agree with Scott here. Thanks for doing this @stanleyjs! Let me know if you need any help.
Hi,
Randomized SVD is not accurate
Currently most (if not all) of the PCA / linear dimensionality reduction / SVD is first routed through either `TruncatedSVD` or `PCA(svd_solver='randomized')`. It turns out that this solver can be pretty bad at computing even moderate rank SVDs. Consider this pathological example in which we create a 1000 x 500 matrix with `np.hstack([np.zeros(249,), np.arange(250, 501)])` as its spectrum. The matrix is rank 250. We will also consider its rank-50 reconstruction and its rank-1 approximation.

It is clear that there is a problem
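For concreteness, here is a minimal sketch of that setup (the actual figures are in the notebook gist linked elsewhere in this thread); the random seed and the `k = 50` choice are illustrative:

```python
# Build a 1000 x 500 matrix with the spectrum described above and compare the
# randomized solver's singular values against the known truth.
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)

spectrum = np.hstack([np.zeros(249,), np.arange(250, 501)])
U, _ = np.linalg.qr(rng.standard_normal((1000, 500)))  # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((500, 500)))   # orthonormal
A = U @ np.diag(spectrum) @ V.T

exact = np.sort(spectrum)[::-1]  # true singular values, descending

k = 50
tsvd = TruncatedSVD(n_components=k, algorithm="randomized", random_state=42)
tsvd.fit(A)

# Relative error of the estimated top-k singular values; with the default
# solver settings this is nowhere near machine precision.
rel_err = np.abs(tsvd.singular_values_ - exact[:k]) / exact[:k]
print(rel_err.max())
```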
It turns out that we can increase `k` and our estimate gets better. We can also decrease the rank of the underlying approximation to get better accuracy. What is happening here is that `randomized_svd` gets more accurate when more singular vectors are requested, relative to the rank of the matrix. As `n_components` gets closer to (and larger than) the rank of the matrix, the algorithm gets more accurate. Let's finally look at the extreme case and compare our rank-1 approximation. The task here is to only estimate a single singular pair.

We can make the algorithm more accurate
It turns out that there are a lot of edge cases and examples where randomized SVD will fail, whether because the matrix is too large or ill-conditioned, the rank is too high, etc. However, there are a few parameters that can be tweaked in the inner function of randomized SVD, `sklearn.utils.extmath.randomized_svd`, to make things more accurate. The biggest one is `n_oversamples`, and then `n_iter`.
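Continuing the sketch above, calling the inner routine directly lets us raise both knobs. The particular values here are illustrative; `n_oversamples=450` follows the `2*k - n_components` heuristic discussed in the next section, with `k = 250`:

```python
# Call the inner routine directly so we can control oversampling and the
# number of power iterations.
from sklearn.utils.extmath import randomized_svd

U50, s50, Vt50 = randomized_svd(
    A,                   # matrix from the sketch above
    n_components=50,
    n_oversamples=450,   # ~ 2*rank - n_components with rank = 250; illustrative
    n_iter=10,           # extra power iterations sharpen the range estimate
    random_state=42,
)

rel_err_tuned = np.abs(s50 - exact[:50]) / exact[:50]
print(rel_err_tuned.max())  # much closer to the true singular values
```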
How to change graphtools
I propose that we replace all calls to `PCA` and `TruncatedSVD` with explicit calls to `randomized_svd`, and that we set a sensible `n_oversamples` as a factor of the requested `n_pca`. The default is not very good. The sklearn documentation suggests that for a rank-`k` matrix, `n_oversamples` should be `2*k - n_components`, or simply `n_components` when `n_components >= k`, but I have found that for hard problems this is not enough. We can also add an `svd_kwargs` keyword argument to the graph constructors to allow passing kwargs through to randomized SVD, to increase accuracy or trade accuracy for performance.

@scottgigante @dburkhardt
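To make the proposal concrete, here is a hypothetical sketch of the kind of helper the graph constructors could call; `reduce_dimension`, `oversample_factor`, and `svd_kwargs` are illustrative names, not existing graphtools API:

```python
# Hypothetical helper: replace TruncatedSVD / PCA calls with randomized_svd,
# scaling n_oversamples with the requested rank and passing extra kwargs through.
from sklearn.utils.extmath import randomized_svd


def reduce_dimension(X, n_pca, oversample_factor=2, svd_kwargs=None):
    """Project X onto its top n_pca right singular vectors via randomized_svd."""
    svd_kwargs = {} if svd_kwargs is None else dict(svd_kwargs)
    # Scale oversampling with the requested rank instead of relying on the default of 10.
    svd_kwargs.setdefault("n_oversamples", oversample_factor * n_pca)
    U, s, Vt = randomized_svd(X, n_components=n_pca, **svd_kwargs)
    return X @ Vt.T  # works for dense arrays and scipy.sparse inputs


# Callers could pass accuracy knobs straight through, e.g.:
# X_reduced = reduce_dimension(X, n_pca=50, svd_kwargs={"n_iter": 10, "random_state": 0})
```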