
Inaccuracy in randomized SVD at moderate ranks #58

Open · stanleyjs opened this issue Nov 17, 2021 · 7 comments

stanleyjs (Collaborator) commented Nov 17, 2021

Hi,

Randomized SVD is not accurate

Currently most (if not all) of the PCA / linear dimensionality reduction / SVD is routed through either TruncatedSVD or PCA(svd_solver='randomized'). It turns out that this solver can be quite inaccurate even at moderate ranks. Consider this pathological example, in which we create a 1000 x 500 matrix with np.hstack([np.zeros(249,),np.arange(250,501)]) as its spectrum, i.e. singular values 250 through 500 padded with zeros, so the matrix has rank 251. We will compare a rank-50 TruncatedSVD reconstruction against a reference that keeps the top 51 exact singular values, and at the end consider a rank-1 approximation.

import numpy as np
from sklearn.decomposition import TruncatedSVD

L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 50
r = 250
lowrank_r = 50
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m,n)
    ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # ux is 1000 x 500 with orthonormal columns, vhx is 500 x 500 orthogonal
    sx = np.hstack([np.zeros(r-1,),np.arange(250,501)])[::-1] # the ground truth singular values we will use
    sx_lowrank = np.hstack([np.zeros(449,),np.arange(450,501)])[::-1] # the spectrum of the reference low-rank approximation (keeps the top 51 exact values)
    y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux*sx_lowrank) @ vhx # the reference low-rank approximation built from the exact spectrum
    pca_operator = TruncatedSVD(k) 
    pca_operator.fit(y)
    sy = pca_operator.singular_values_ # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed@pca_operator.components_ # equivalent to pca_operator.inverse_transform(pca_operator.transform(y)); this is our estimated rank-50 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy) # how close did we get to estimating the top 50 singular values in L2?
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank 50 approximation?
>>> print(np.mean(L2_sv_error))
55.77929710543534
>>> print(np.mean(F_lowrank_error))
1930.877814905553

It is clear that there is a problem: the top 50 true singular values all lie between 451 and 500, yet our estimate of them is off by roughly 56 in L2, and the reconstruction error is enormous.

It turns out that if we increase k (requesting more components than we need) and then truncate the reconstruction back to rank 50, our estimate gets better:

L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 100
r = 250
lowrank_r = 50
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m,n)
    ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # ux is 1000 x 500 with orthonormal columns, vhx is 500 x 500 orthogonal
    sx = np.hstack([np.zeros(r-1,),np.arange(r,n+1)])[::-1] # the ground truth singular values we will use
    sx_lowrank = np.hstack([np.zeros(n-lowrank_r-1,),np.arange(n-lowrank_r,n+1)])[::-1] # the spectrum of the reference low-rank approximation (keeps the top lowrank_r+1 exact values)
    y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux*sx_lowrank) @ vhx # the reference low-rank approximation built from the exact spectrum
    pca_operator = TruncatedSVD(k) 
    pca_operator.fit(y)
    sy = pca_operator.singular_values_ # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed[:,:lowrank_r]@pca_operator.components_[:lowrank_r,:] # truncate to the top 50 components; this is our estimated rank-50 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy[:lowrank_r]) # how close did we get to estimating the top 50 singular values in L2?
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank 50 approximation?
>>> print(np.mean(L2_sv_error))
7.755560034045646
>>> print(np.mean(F_lowrank_error))
729.3025096061267

We could also decrease the rank of the underlying matrix to get better accuracy. What is happening here is that randomized_svd gets more accurate as the number of requested singular vectors grows relative to the rank of the matrix: as n_components gets closer to (and exceeds) the rank, the algorithm improves. Finally, let's look at the extreme case and compare rank-1 approximations, where the task is to estimate only a single singular pair.

L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 1
r = 250
lowrank_r = 1
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m,n)
    ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # ux is 1000 x 500 with orthonormal columns, vhx is 500 x 500 orthogonal
    sx = np.hstack([np.zeros(r-1,),np.arange(r,n+1)])[::-1] # the ground truth singular values we will use
    sx_lowrank = np.hstack([np.zeros(n-lowrank_r-1,),np.arange(n-lowrank_r,n+1)])[::-1] # the spectrum of the reference approximation (keeps the top lowrank_r+1 exact values)
    y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux*sx_lowrank) @ vhx # the reference approximation built from the exact spectrum
    pca_operator = TruncatedSVD(k) 
    pca_operator.fit(y)
    sy = pca_operator.singular_values_ # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed[:,:lowrank_r]@pca_operator.components_[:lowrank_r,:] # truncate to the top component; this is our estimated rank-1 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy[:lowrank_r]) # how close did we get to estimating the top singular value?
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank-1 approximation?
>>> print(np.mean(L2_sv_error))
10.74597935589037
>>> print(np.mean(F_lowrank_error))
772.1048200820816

We can make the algorithm more accurate

It turns out that there are a lot of edge cases where randomized SVD will fail: the matrix is too large or ill-conditioned, the requested rank is too close to the true rank, and so on. However, there are a few parameters of the inner function, sklearn.utils.extmath.randomized_svd, that can be tweaked to make things more accurate. The biggest one is n_oversamples, followed by n_iter.
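For illustration, here is a minimal sketch on the same pathological matrix as the first example, comparing the default oversampling against a much more generous value (n_oversamples and n_iter are real randomized_svd parameters; the particular values of 450 and 10 are just illustrative choices, not recommendations):

import numpy as np
from sklearn.utils.extmath import randomized_svd

m, n, r, k = 1000, 500, 250, 50
x = np.random.randn(m, n)
ux, _, vhx = np.linalg.svd(x, full_matrices=False)
sx = np.hstack([np.zeros(r-1,), np.arange(250, 501)])[::-1] # same spectrum as above
y = (ux * sx) @ vhx

for n_over in (10, 450):  # 10 is the sklearn default; 450 brings the sketch size up to n
    _, sy, _ = randomized_svd(y, n_components=k, n_oversamples=n_over,
                              n_iter=10, random_state=0)
    print(n_over, np.linalg.norm(sx[:k] - sy))  # L2 error on the top-k singular values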

How to change graphtools

I propose that we replace all calls to PCA and TruncatedSVD with explicit calls to randomized_svd, and set a sensible n_oversamples as a factor of the requested n_pca; the default is not very good. The sklearn documentation suggests that for a rank-k matrix n_oversamples should be 2*k - n_components, or simply n_components when n_components >= k, but I have found that for hard problems this is not enough. We could also add an svd_kwargs keyword argument to the graph constructors to pass kwargs through to randomized_svd, to increase accuracy or trade accuracy for performance. A rough sketch is below.
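Something like this, as a minimal sketch (the helper name and the default values here are hypothetical, not existing graphtools API):

from sklearn.utils.extmath import randomized_svd

def _randomized_pca(X, n_pca, svd_kwargs=None):
    # Hypothetical helper: route all SVD/PCA calls through randomized_svd,
    # with oversampling scaled to the requested number of components.
    svd_kwargs = dict(svd_kwargs or {})
    svd_kwargs.setdefault("n_oversamples", 2 * n_pca)  # heuristic default, larger than sklearn's 10
    svd_kwargs.setdefault("n_iter", 10)
    u, s, vt = randomized_svd(X, n_components=n_pca, **svd_kwargs)
    return u * s  # PCA-style scores; note randomized_svd does not center X the way PCA does

Graph constructors would accept svd_kwargs and forward it straight to this helper.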

@scottgigante @dburkhardt

dburkhardt (Member) commented

Thanks for looking into this @stanleyjs. This is a hectic week for me, but I'll make a note to come back to this. Fundamentally I have no issue with directly calling randomized_svd and setting parameters to provide better estimates, so long as it doesn't lead to dramatically more memory use or compute time. Have you looked into the resource usage with higher n_oversamples or n_iter?

stanleyjs (Collaborator, Author) commented Nov 18, 2021

@dburkhardt Here is a plot of the errors:

[plot of the errors omitted]

You can see the notebook that created it here: https://gist.github.com/stanleyjs/cb223cedb913942c4f9349b53f800ced

Clearly it's an issue. However, I was thinking of maybe just submitting a PR upstream in sklearn to add n_oversamples to TruncatedSVD

scottgigante (Contributor) commented Nov 18, 2021 via email

stanleyjs (Collaborator, Author) commented

@scottgigante @dburkhardt it appears that sklearn will probably fix this gap, but not without some internal discussion over the details of the API.

I am wondering if we should go ahead and patch in the randomized_svd kwargs. Also, I notice that we'd only have to patch in a workaround for sparse matrices / TruncatedSVD; it looks like PCA (the dense-matrix class) already has the n_oversamples argument. Roughly, the split would look like the sketch below.
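A rough sketch of that split (assuming a scikit-learn version in which PCA exposes n_oversamples; the data matrix here is a stand-in):

import scipy.sparse as sp
from sklearn.decomposition import PCA
from sklearn.utils.extmath import randomized_svd

X = sp.random(1000, 500, density=0.01, format="csr", random_state=0)  # stand-in data

if sp.issparse(X):
    # sparse path: TruncatedSVD has no oversampling knob, so call randomized_svd directly
    u, s, vt = randomized_svd(X, n_components=50, n_oversamples=100,
                              n_iter=10, random_state=0)
    scores = u * s
else:
    # dense path: PCA exposes n_oversamples directly
    scores = PCA(n_components=50, svd_solver="randomized",
                 n_oversamples=100, random_state=0).fit_transform(X)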

scottgigante (Contributor) commented Dec 20, 2021 via email

dburkhardt (Member) commented

I agree with Scott here. Thanks for doing this @stanleyjs! Let me know if you need any help.
