
Inaccuracy in randomized SVD at moderate ranks #58

Open · stanleyjs opened this issue Nov 17, 2021 · 7 comments

stanleyjs (Collaborator) commented Nov 17, 2021

Hi,

Randomized SVD is not accurate

Currently most (if not all) of the PCA / linear dimensionality reduction / SVD is routed through either TruncatedSVD or PCA(svd_solver='randomized'). It turns out that this solver can be quite inaccurate even at moderate ranks. Consider this pathological example, in which we create a 1000 x 500 matrix with np.hstack([np.zeros(249,),np.arange(250,501)]) as its spectrum, i.e. singular values 250 through 500 padded with zeros, so the matrix has rank 251. We will compare a rank-50 TruncatedSVD reconstruction against a reference that keeps the top 51 exact singular values, and at the end consider a rank-1 approximation.

import numpy as np
from sklearn.decomposition import TruncatedSVD

L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 50
r = 250
lowrank_r = 50
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m,n)
    ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # ux is 1000 x 500 with orthonormal columns, vhx is 500 x 500 orthogonal
    sx = np.hstack([np.zeros(r-1,),np.arange(250,501)])[::-1] # the ground truth singular values we will use
    sx_lowrank = np.hstack([np.zeros(449,),np.arange(450,501)])[::-1] # the spectrum of the reference low-rank approximation (keeps the top 51 exact values)
    y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux*sx_lowrank) @ vhx # the reference low-rank approximation built from the exact spectrum
    pca_operator = TruncatedSVD(k) 
    pca_operator.fit(y)
    sy = pca_operator.singular_values_ # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed@pca_operator.components_ # equivalent to pca_operator.inverse_transform(pca_operator.transform(y)); this is our estimated rank-50 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy) # how close did we get to estimating the top 50 singular values in L2?
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank 50 approximation?
>>> print(np.mean(L2_sv_error))
55.77929710543534
>>> print(np.mean(F_lowrank_error))
1930.877814905553

It is clear that there is a problem: the top 50 true singular values all lie between 451 and 500, yet our estimate of them is off by roughly 56 in L2, and the reconstruction error is enormous.

It turns out that if we increase k (requesting more components than we need) and then truncate the reconstruction back to rank 50, our estimate gets better:

L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 100
r = 250
lowrank_r = 50
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m,n)
    ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # ux is 1000 x 500 with orthonormal columns, vhx is 500 x 500 orthogonal
    sx = np.hstack([np.zeros(r-1,),np.arange(r,n+1)])[::-1] # the ground truth singular values we will use
    sx_lowrank = np.hstack([np.zeros(n-lowrank_r-1,),np.arange(n-lowrank_r,n+1)])[::-1] # the spectrum of the reference low-rank approximation (keeps the top lowrank_r+1 exact values)
    y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux*sx_lowrank) @ vhx # the reference low-rank approximation built from the exact spectrum
    pca_operator = TruncatedSVD(k) 
    pca_operator.fit(y)
    sy = pca_operator.singular_values_ # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed[:,:lowrank_r]@pca_operator.components_[:lowrank_r,:] # truncate to the top 50 components; this is our estimated rank-50 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy[:lowrank_r]) # how close did we get to estimating the top 50 singular values in L2?
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank 50 approximation?
>>> print(np.mean(L2_sv_error))
7.755560034045646
>>> print(np.mean(F_lowrank_error))
729.3025096061267

We could also decrease the rank of the underlying matrix to get better accuracy. What is happening here is that randomized_svd gets more accurate as the number of requested singular vectors grows relative to the rank of the matrix: as n_components gets closer to (and exceeds) the rank, the algorithm improves. Finally, let's look at the extreme case and compare rank-1 approximations, where the task is to estimate only a single singular pair.

L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 1
r = 250
lowrank_r = 1
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m,n)
    ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # ux is 1000 x 500 with orthonormal columns, vhx is 500 x 500 orthogonal
    sx = np.hstack([np.zeros(r-1,),np.arange(r,n+1)])[::-1] # the ground truth singular values we will use
    sx_lowrank = np.hstack([np.zeros(n-lowrank_r-1,),np.arange(n-lowrank_r,n+1)])[::-1] # the spectrum of the reference approximation (keeps the top lowrank_r+1 exact values)
    y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux*sx_lowrank) @ vhx # the reference approximation built from the exact spectrum
    pca_operator = TruncatedSVD(k) 
    pca_operator.fit(y)
    sy = pca_operator.singular_values_ # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed[:,:lowrank_r]@pca_operator.components_[:lowrank_r,:] # truncate to the top component; this is our estimated rank-1 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy[:lowrank_r]) # how close did we get to estimating the top singular value?
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank-1 approximation?
>>> print(np.mean(L2_sv_error))
10.74597935589037
>>> print(np.mean(F_lowrank_error))
772.1048200820816

We can make the algorithm more accurate

It turns out that there are a lot of edge cases where randomized SVD will fail: the matrix is too large or ill-conditioned, the requested rank is too close to the true rank, and so on. However, there are a few parameters of the inner function, sklearn.utils.extmath.randomized_svd, that can be tweaked to make things more accurate. The biggest one is n_oversamples, followed by n_iter.
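For illustration, here is a minimal sketch on the same pathological matrix as the first example, comparing the default oversampling against a much more generous value (n_oversamples and n_iter are real randomized_svd parameters; the particular values of 450 and 10 are just illustrative choices, not recommendations):

import numpy as np
from sklearn.utils.extmath import randomized_svd

m, n, r, k = 1000, 500, 250, 50
x = np.random.randn(m, n)
ux, _, vhx = np.linalg.svd(x, full_matrices=False)
sx = np.hstack([np.zeros(r-1,), np.arange(250, 501)])[::-1] # same spectrum as above
y = (ux * sx) @ vhx

for n_over in (10, 450):  # 10 is the sklearn default; 450 brings the sketch size up to n
    _, sy, _ = randomized_svd(y, n_components=k, n_oversamples=n_over,
                              n_iter=10, random_state=0)
    print(n_over, np.linalg.norm(sx[:k] - sy))  # L2 error on the top-k singular values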

How to change graphtools

I propose that we replace all calls to PCA and TruncatedSVD with explicit calls to randomized_svd, and set a sensible n_oversamples as a factor of the requested n_pca; the default is not very good. The sklearn documentation suggests that for a rank-k matrix n_oversamples should be 2*k - n_components, or simply n_components when n_components >= k, but I have found that for hard problems this is not enough. We could also add an svd_kwargs keyword argument to the graph constructors to pass kwargs through to randomized_svd, to increase accuracy or trade accuracy for performance. A rough sketch is below.
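Something like this, as a minimal sketch (the helper name and the default values here are hypothetical, not existing graphtools API):

from sklearn.utils.extmath import randomized_svd

def _randomized_pca(X, n_pca, svd_kwargs=None):
    # Hypothetical helper: route all SVD/PCA calls through randomized_svd,
    # with oversampling scaled to the requested number of components.
    svd_kwargs = dict(svd_kwargs or {})
    svd_kwargs.setdefault("n_oversamples", 2 * n_pca)  # heuristic default, larger than sklearn's 10
    svd_kwargs.setdefault("n_iter", 10)
    u, s, vt = randomized_svd(X, n_components=n_pca, **svd_kwargs)
    return u * s  # PCA-style scores; note randomized_svd does not center X the way PCA does

Graph constructors would accept svd_kwargs and forward it straight to this helper.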

@scottgigante @dburkhardt

dburkhardt (Member) commented

Thanks for looking into this @stanleyjs. This is a hectic week for me, but I'll make a note to come back to this. Fundamentally I have no issue with directly calling randomized_svd and setting parameters to provide better estimates, so long as it doesn't lead to dramatically more memory use or compute time. Have you looked into the resource usage with higher n_oversamples or n_iter?

stanleyjs (Collaborator, Author) commented Nov 18, 2021

@dburkhardt Here is a plot of the errors:

[plot of the errors omitted]

You can see the notebook that created it here: https://gist.github.com/stanleyjs/cb223cedb913942c4f9349b53f800ced

Clearly it's an issue. However, I was thinking of maybe just submitting a PR upstream in sklearn to add n_oversamples to TruncatedSVD

scottgigante (Contributor) commented Nov 18, 2021 via email

stanleyjs (Collaborator, Author) commented

@scottgigante @dburkhardt it appears that sklearn will probably fix this gap, but not without some internal discussion over the details of the API.

I am wondering if we should go ahead and patch in the randomized_svd kwargs. Also, I notice that we'd only have to patch in a workaround for sparse matrices / TruncatedSVD; it looks like PCA (the dense-matrix class) already has the n_oversamples argument. Roughly, the split would look like the sketch below.
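A rough sketch of that split (assuming a scikit-learn version in which PCA exposes n_oversamples; the data matrix here is a stand-in):

import scipy.sparse as sp
from sklearn.decomposition import PCA
from sklearn.utils.extmath import randomized_svd

X = sp.random(1000, 500, density=0.01, format="csr", random_state=0)  # stand-in data

if sp.issparse(X):
    # sparse path: TruncatedSVD has no oversampling knob, so call randomized_svd directly
    u, s, vt = randomized_svd(X, n_components=50, n_oversamples=100,
                              n_iter=10, random_state=0)
    scores = u * s
else:
    # dense path: PCA exposes n_oversamples directly
    scores = PCA(n_components=50, svd_solver="randomized",
                 n_oversamples=100, random_state=0).fit_transform(X)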

scottgigante (Contributor) commented Dec 20, 2021 via email

dburkhardt (Member) commented

I agree with Scott here. Thanks for doing this @stanleyjs! Let me know if you need any help.
