You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hey there!
First, fabulous work on speeding up Sklearn! I'm the author of Hyperlearn (https://github.com/danielhanchen/hyperlearn), where I also tried investigating how to speed up algos on the CPU (all the way from SSE4 to AVX512) and GPUs (was previously at NVIDIA's RAPIDS team - mainly responsible for 2000x faster TSNE, 50% less memory for SVD, and was working on sparse randomized svd, fast NMF etc)
I was trying to understand the main methods behind some of the algorithms.
Linear regression Tall skinny matrices can be dramatically sped up using the normal equations inv(X.T @ X) @ (X.T @ y). Sklearn sadly just uses the slow SVD approach gelsd. Sklearn guarantees stability though. During my investigation, approximately fast linear regression for tall skinny matrices is:
XTX = X.T @ X = ssyrk(X)
XTy = X.T @ y = sgemm(X.T, y) / sgemv(X.T, y)
U = spotrf(XTX)
z = U@b = strtrs(U.T, Xty) [U.T@(U@b) = XTy]
b = strtrs(U, z) [U@b = z]
Cholesky can cause issues, so Ridge Regularization can help. This approach is probably the least amount of FLOPs possible to find a solution. Some workspace memory can be reduced as well.
However, in my findings, pivoted cholesky (spstrf) can make the solution in fact solve all solutions, even ill conditioned ones, albeit with a higher condition number. The pivoting is much more tricky to handle.
Is this approximately what Intel's version is using to speedup linear regression?
PCA PCA requires (X - mean(X)) first. However, we show surprisingly that rather if you use syevd / syevr to find the eigenvectors, you can in fact bypass copying the matrix X for mean removal. You can use a smart matrix trick after expanding the formula to save O(NP) memory. [N = rows, P = columns]
PCA Scipy's syed / syevr comparison funnily was one my old hyperlearn results. EIGH very very slow --> suggesting an easy fix scipy/scipy#9212 A long time agao I mentioned how large matrices must use syevd for superior performance, and syevr is good for smaller matrices. We found a heuristic of selecting which to use (syevr / syevd).
KNN. Nearest neighbors is actually quite interesting to improve on CPUs. Pairwise distances can easily be made faster directly for Euclidean and Cosine similarity / distance using some matrix algebra. Both require the computation of X @ X.T, which you can use again ssyrk again.
Randomized SVD. Not sure if Intel's package has dramatically sped this one up. We spent a lot of time into the weeds. We found by replacing the range finder from QR (we can even toggle the finding of R off) to stable LU decomposition then using stable cholesky + some other sneaking tricks, huge speedups was possible with few loss in accuracy. Likewise Randomized PCA on sparse data was 100% possible using our special PCA trick I said.
Anyways I was just curious on the details behind the algorithms Intel has devised :)) All my code is available for use - though now I'm running a startup Moonshot (www.moonshotai.org) were our goal is to make all software faster!
Thanks for your time!
The text was updated successfully, but these errors were encountered:
Hey there!
First, fabulous work on speeding up Sklearn! I'm the author of Hyperlearn (https://github.com/danielhanchen/hyperlearn), where I also tried investigating how to speed up algos on the CPU (all the way from SSE4 to AVX512) and GPUs (was previously at NVIDIA's RAPIDS team - mainly responsible for 2000x faster TSNE, 50% less memory for SVD, and was working on sparse randomized svd, fast NMF etc)
I was trying to understand the main methods behind some of the algorithms.
inv(X.T @ X) @ (X.T @ y)
. Sklearn sadly just uses the slow SVD approach gelsd. Sklearn guarantees stability though. During my investigation, approximately fast linear regression for tall skinny matrices is:Cholesky can cause issues, so Ridge Regularization can help. This approach is probably the least amount of FLOPs possible to find a solution. Some workspace memory can be reduced as well.
However, in my findings, pivoted cholesky (spstrf) can make the solution in fact solve all solutions, even ill conditioned ones, albeit with a higher condition number. The pivoting is much more tricky to handle.
Is this approximately what Intel's version is using to speedup linear regression?
PCA PCA requires (X - mean(X)) first. However, we show surprisingly that rather if you use syevd / syevr to find the eigenvectors, you can in fact bypass copying the matrix X for mean removal. You can use a smart matrix trick after expanding the formula to save O(NP) memory. [N = rows, P = columns]
PCA Scipy's syed / syevr comparison funnily was one my old hyperlearn results. EIGH very very slow --> suggesting an easy fix scipy/scipy#9212 A long time agao I mentioned how large matrices must use syevd for superior performance, and syevr is good for smaller matrices. We found a heuristic of selecting which to use (syevr / syevd).
KNN. Nearest neighbors is actually quite interesting to improve on CPUs. Pairwise distances can easily be made faster directly for Euclidean and Cosine similarity / distance using some matrix algebra. Both require the computation of X @ X.T, which you can use again ssyrk again.
Randomized SVD. Not sure if Intel's package has dramatically sped this one up. We spent a lot of time into the weeds. We found by replacing the range finder from QR (we can even toggle the finding of R off) to stable LU decomposition then using stable cholesky + some other sneaking tricks, huge speedups was possible with few loss in accuracy. Likewise Randomized PCA on sparse data was 100% possible using our special PCA trick I said.
Anyways I was just curious on the details behind the algorithms Intel has devised :)) All my code is available for use - though now I'm running a startup Moonshot (www.moonshotai.org) were our goal is to make all software faster!
Thanks for your time!
The text was updated successfully, but these errors were encountered: