svt: Singular Value Thresholding in MATLAB

Many statistical learning methods, such as matrix completion, matrix regression, and multiple response regression, estimate a matrix of parameters. Nuclear norm regularization is frequently employed to achieve shrinkage and low rank solutions. In minimizing a nuclear norm regularized loss function, a key and often the most time-consuming step is singular value thresholding, which seeks the singular values of a large matrix that exceed a threshold and their associated singular vectors. Currently MATLAB lacks a function for singular value thresholding. Its built-in svds function computes the top r singular values/vectors by the Lanczos iterative method but is efficient only for sparse matrix input, whereas the aforementioned statistical learning algorithms perform singular value thresholding on dense but structured matrices. To address this issue, we provide a MATLAB wrapper function svt that implements singular value thresholding. It encompasses both top singular value decomposition and thresholding, handles both large sparse matrices and structured matrices, and reduces the computational cost in matrix learning algorithms.


Introduction
Many modern statistical learning problems concern estimating a matrix-valued parameter. Examples include matrix completion, regression with matrix covariates, and multivariate response regression. Matrix completion (Candès and Recht 2009; Mazumder, Hastie, and Tibshirani 2010) aims to recover a large matrix of which only a small fraction of entries are observed. The problem has sparked intensive research in recent years and enjoys a broad range of applications, such as personalized recommendation systems (ACM SIGKDD and Netflix 2007) and imputation of massive genomics data (Chi, Zhou, Chen, Del Vecchyo, and Lange 2013). In matrix regression (Zhou and Li 2014), the predictors are two-dimensional arrays such as images or measurements on a regular grid, which requires a regression coefficient array of the same size to completely capture the effects of matrix predictors. Another example is regression with multiple responses (Yuan, Ekici, Lu, and Monteiro 2007; Zhang, Zhou, Zhou, and Sun 2017), which involves a matrix of regression coefficients instead of a regression coefficient vector.
In these matrix estimation problems, nuclear norm regularization is often employed to achieve a low rank solution and shrinkage simultaneously. This leads to the general optimization problem

minimize ℓ(B) + λ‖B‖_*,  (1)

where ℓ is a relevant loss function, B ∈ ℝ^{m×n} is a matrix parameter, ‖B‖_* = ∑_i σ_i(B) = ‖σ(B)‖_1 (the sum of the singular values of B) is the nuclear norm of B, and λ is a positive tuning parameter that balances the trade-off between model fit and model parsimony. The nuclear norm plays the same role in low-rank matrix approximation that the ℓ_1 norm plays in sparse regression. Generic optimization methods such as the accelerated proximal gradient algorithm, the majorization-minimization (MM) algorithm, and the alternating direction method of multipliers (ADMM) have been invoked to solve optimization problem (1). See, e.g., Mazumder et al. (2010); Boyd, Parikh, Chu, Peleato, and Eckstein (2011); Parikh and Boyd (2013); Chi et al. (2013); Lange, Chi, and Zhou (2014) for matrix completion algorithms and Zhou and Li (2014); Zhang et al. (2017) for the accelerated proximal gradient method for solving nuclear norm penalized regression. All these algorithms involve repeated singular value thresholding, which is the proximal mapping associated with the nuclear norm regularization term

A ↦ argmin_X (1/2)‖X − A‖_F² + λ‖X‖_*.  (2)

Let the singular value decomposition of A be U diag(σ_i) V^⊤ = ∑_i σ_i u_i v_i^⊤. The solution of (2) is given by ∑_i (σ_i − λ)_+ u_i v_i^⊤ (Cai, Candès, and Shen 2010). Some common features characterize the singular value thresholding operator in applications. First, the involved matrices are often large; for matrix completion problems, m and n can be on the order of 10^3 to 10^6. Second, only the singular values that exceed λ and their associated singular vectors are needed. Third, the involved matrix is often structured. In this article, we say a matrix is structured if matrix-vector multiplication is fast.
For example, in matrix completion problems, A is of the form "sparse + low rank". That is, A = M + LR^⊤, where M is sparse and L ∈ ℝ^{m×r} and R ∈ ℝ^{n×r} with rank r ≪ min{m, n}. Although A itself is not sparse, the matrix-vector multiplications Av and w^⊤A cost O(nnz(M) + r(m + n)) flops instead of O(mn). Storing the sparse matrix M together with L and R also takes much less memory than storing the full matrix A. All these characteristics favor iterative algorithms for singular value decomposition, such as the Lanczos bidiagonalization method (Golub and Van Loan 1996).
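This flop-count argument is easy to see in MATLAB. The following sketch (with arbitrary illustrative sizes of our own choosing) times a matrix-vector product computed through the sparse and low rank factors against one computed through the explicitly formed dense matrix:

```matlab
% Compare A*v via the "sparse + low rank" factors versus the dense A.
m = 5000; n = 5000; r = 10;
M = sprandn(m, n, 0.01);             % sparse part, ~1% nonzeros
L = randn(m, r); R = randn(n, r);    % low rank part
v = randn(n, 1);

tic; y1 = M*v + L*(R'*v); t_struct = toc;  % O(nnz(M) + r(m+n)) flops
A = full(M) + L*R';                        % forming A costs O(mn) memory
tic; y2 = A*v; t_dense = toc;              % O(mn) flops

fprintf('structured: %.4fs, dense: %.4fs, difference: %g\n', ...
    t_struct, t_dense, norm(y1 - y2));
```

The two products agree to machine precision, while the structured version avoids ever materializing the dense m-by-n matrix.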
Most algorithms for the aforementioned applications are developed in MATLAB (The MathWorks Inc. 2013), which, however, lacks a convenient singular value thresholding functionality. The most direct approach to SVT is computing the full SVD via svd and then soft-thresholding the singular values. This approach is used in practice in many matrix learning problems according to the distributed code, e.g., Kalofolias, Bresson, Bronstein, and Vandergheynst (2014); Chi et al. (2013); Parikh and Boyd (2013); Yang, Wang, Zhang, and Zhao (2013); Zhou, Liu, Wan, and Yu (2014); Zhou and Li (2014); Zhang et al. (2017); Otazo, Candès, and Sodickson (2015); Goldstein, Studer, and Baraniuk (2015), to name a few. However, the built-in function svd computes the full SVD of a dense matrix and hence is very time-consuming for large-scale problems. Another built-in function, svds, wraps the eigs function to calculate the top singular triplets by iterative algorithms. However, the current implementation of svds is efficient only for sparse matrix input, while matrix estimation algorithms involve singular value thresholding of dense but structured matrices. Another layer of difficulty is that the number of singular values exceeding a threshold is often unknown. Therefore singular value thresholding involves successively computing more and more top singular values and vectors until hitting one below the threshold.
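For reference, the naive full-decomposition approach described above amounts to a few lines of MATLAB:

```matlab
% Singular value thresholding via full SVD (the baseline approach).
% Simple, but performs O(mn*min(m,n)) work even when only a few
% singular values exceed the threshold lambda.
A = randn(200, 100);             % illustrative dense matrix
lambda = 10;
[U, S, V] = svd(A, 'econ');
s = max(diag(S) - lambda, 0);    % soft-threshold the singular values
r = nnz(s);                      % number of surviving singular values
X = U(:, 1:r) * diag(s(1:r)) * V(:, 1:r)';   % thresholded matrix
```

The wastefulness is evident: the full decomposition is computed even though only the r surviving triplets contribute to the result.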
To address these issues, we develop a MATLAB wrapper function svt for the SVT computation. It is compatible with MATLAB's svds function in terms of computing a fixed number of top singular values and vectors of sparse matrices. However, it is able to take function handle input, offering the flexibility to exploit matrix structure. More importantly, it automatically performs singular value thresholding with a user-supplied threshold and can be easily used as a plug-in subroutine in many matrix learning algorithms.
We discuss implementation details in Section 2 and describe syntax and example usage in Section 3. Section 4 evaluates numerical performance of the svt function in various situations. We conclude with a discussion in Section 5.

Algorithm and implementation
Our implementation hinges upon a well-known relationship between the singular value decomposition of a matrix A ∈ ℝ^{m×n}, m ≥ n, and the eigenvalue decomposition of the symmetric augmented matrix

[0 A; A^⊤ 0]  (3)

(Golub and Van Loan 1996, Section 8.6). Let the singular value decomposition of A be UΣV^⊤, where U ∈ ℝ^{m×n}, Σ ∈ ℝ^{n×n}, and V ∈ ℝ^{n×n}. Then the nonzero eigenvalues of (3) are ±σ_i(A), i = 1,…,n, with unit eigenvectors 2^{−1/2}(u_i^⊤, ±v_i^⊤)^⊤. Therefore the SVD of A can be computed via the eigendecomposition of the augmented matrix. Our wrapper function utilizes MATLAB's built-in eigs function for computing the top eigenvalues and eigenvectors of large sparse or structured matrices.
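A minimal sketch of this reduction (for a sparse matrix, without the function-handle machinery and thresholding logic that svt adds on top):

```matlab
% Top-k singular triplets of A via eigs on the augmented matrix (3).
A = sprandn(1000, 800, 0.01);
[m, n] = size(A);
k = 6;
Aug = [sparse(m, m), A; A', sparse(n, n)];   % symmetric (m+n)x(m+n)
[W, D] = eigs(Aug, k, 'la');     % k largest eigenvalues equal sigma_i
sigma = diag(D);
U = sqrt(2) * W(1:m, :);         % recover left singular vectors
V = sqrt(2) * W(m+1:end, :);     % recover right singular vectors
```

The sqrt(2) rescaling undoes the 2^{−1/2} normalization of the augmented eigenvectors noted above.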
In the absence of a threshold, svt is similar to svds and calculates the top singular values and vectors. Since we allow function handle input, users can always take advantage of special structure in a matrix by writing a user-defined function for matrix-vector multiplication. This is one merit of svt compared with MATLAB's svds.
With a user-input threshold, svt performs singular value thresholding in a sequential manner. It first computes the top k (default 6) singular values and vectors. Two methods have been implemented to gradually build up the requested subspace. Let σ_i, i = 1,…,r, and U_r, V_r be the singular values and vectors accrued so far. In the deflation method (Algorithm 1), we obtain the next batch of incre (default 5) singular values and vectors by working on the deflated matrix A − U_r diag(σ_1,…,σ_r) V_r^⊤. In the succession method (Algorithm 2), originally hinted at in Cai et al. (2010), we work on A directly and retrieve the top k, k + incre, k + 2·incre, … singular values and vectors of the original matrix A successively. Both algorithms terminate as soon as a singular value below the threshold is identified. The efficiency of these two algorithms is compared in Section 4.5.
Li and Zhou, Page 4. J Stat Softw. Author manuscript; available in PMC 2020 June 10.
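The deflation strategy can be sketched as follows. This is a simplified serial version with variable names and termination details of our own; the package itself works through function handles rather than forming the deflated matrix explicitly:

```matlab
% Sketch of sequential singular value thresholding by deflation.
% Requests batches of singular triplets of the deflated matrix
% until a singular value falls below the threshold lambda.
A = sprandn(2000, 2000, 0.01);
lambda = 0.5; k = 6; incre = 5;
U = []; V = []; s = [];
[Uk, Sk, Vk] = svds(A, k);                  % initial top-k batch
while true
    U = [U, Uk]; V = [V, Vk]; s = [s; diag(Sk)];
    if min(s) < lambda, break; end          % threshold reached
    D = A - U * diag(s) * V';               % deflate accrued subspace
    [Uk, Sk, Vk] = svds(D, incre);          % next batch of triplets
end
keep = s > lambda;                          % survivors above threshold
U = U(:, keep); V = V(:, keep); s = s(keep);
% The proximal map (2) then uses the soft-thresholded values s - lambda.
```

Note that D is dense here; avoiding its explicit formation via a function handle is precisely what makes the package efficient.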

The MATLAB function aspect
We demonstrate various usages of svt in this section. A complete demonstration script with output is available on the software web page http://hua-zhou.github.io/svt/. Users can also supply a function handle, instead of the matrix itself, that computes matrix-vector multiplication. This allows svt to utilize special structure other than sparsity. For example, suppose A is a 1000-by-1000 "sparse plus low rank" matrix M + LR^⊤, where M is sparse and L, R ∈ ℝ^{1000×5} are two tall and skinny matrices. To compute the top 15 singular values and vectors, we first define a function that computes Av or w^⊤A for arbitrary vectors. Note that the function Afun needs access to the variables M, L, and R and is best declared as a subfunction in the main computation routine. The matrix dimensions are required when using a function handle: m is the number of rows and n is the number of columns.
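The original demonstration code is not reproduced here; a sketch of such a multiplication routine, written in the two-argument style that svds uses for function handles (consult the package page for the exact convention svt expects), might look like:

```matlab
function C = Afun(v, trans)
% Multiplication by A = M + L*R' without ever forming A.
% M, L, and R are captured from the enclosing workspace.
if strcmp(trans, 'notransp')
    C = M*v + L*(R'*v);      % computes A*v
else
    C = M'*v + R*(L'*v);     % computes A'*v, i.e., (v'*A)'
end
end
```

Declaring Afun as a nested or subfunction is what gives it access to M, L, and R without global variables.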

To find the top k singular values and vectors of a matrix, svt can be called just like svds.
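The calls below illustrate this usage. They assume the name-value interface suggested by the text (in particular the m and n arguments for function-handle input); consult the package documentation for the exact syntax:

```matlab
% Top 15 singular values/vectors of a sparse matrix A:
[U, S, V] = svt(A, 'k', 15);

% The same computation through a function handle for A = M + L*R';
% dimensions must be supplied since svt cannot infer them:
[U, S, V] = svt(@Afun, 'm', 1000, 'n', 1000, 'k', 15);
```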
A great convenience of svt comes from singular value thresholding, that is, computing the singular values that exceed a threshold λ and their associated singular vectors.
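A sketch of this call, again under the assumed name-value syntax, with lambda denoting the threshold:

```matlab
% Singular values of A exceeding the threshold, with their vectors:
[U, S, V] = svt(A, 'lambda', 2);

% The same with function handle input:
[U, S, V] = svt(@Afun, 'm', 1000, 'n', 1000, 'lambda', 2);
```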

Numerical experiments
In this section, we evaluate the numerical performance of svt in different scenarios and compare it with the MATLAB built-in functions svd and svds. We conduct these experiments on a desktop with an Intel Quad Core CPU @ 3.20 GHz and 12 GB of RAM. The computing environment is 64-bit MATLAB R2013a on Linux. For testing purposes, we use 5 square sparse matrices and 4 rectangular sparse matrices of varying sizes downloaded from the University of Florida sparse matrix collection (Davis and Hu 2011). For each numerical task, 10 replicate runs are performed and the average run time and standard error are reported, unless stated otherwise. The sparsity of a matrix A is defined as the proportion of zero entries, 1 − nnz(A)/numel(A). Table 1 reports the run times of svt, svds, and svd for computing the top 6 singular values and associated vectors of sparse matrices. In this case, svt internally calls svds, so their run times should be indistinguishable. The huge gain of svt/svds on large sparse matrices simply demonstrates the advantage of the iterative method over the full decomposition method implemented in svd.

Top k singular values and vectors of "sparse + low rank" matrices
This example tests the capability of svt to take function handle input. We generate structured matrices by adding a low rank perturbation to a sparse matrix. Let M ∈ ℝ^{n×n} be a sparse test matrix. We form a "sparse + low rank" matrix A = M + LR^⊤, where L, R ∈ ℝ^{n×10} have independent standard normal entries. Table 2 shows the average run times of svt with function handle input and svds with the full matrix A as input for computing the top 6 singular values and vectors, based on 10 simulation replicates. It clearly shows the advantage of exploiting the special matrix structure over applying the iterative algorithm to the full matrix directly. The speed-up is up to 100-fold for large matrices.

Singular value thresholding of sparse matrices
In this example we compare the singular value thresholding capability of svt with the strategy of full singular value decomposition by svd followed by thresholding, on sparse test matrices. The threshold value is pre-determined such that the top 50 singular values are above the threshold. By default, svt starts with k = 6 singular values and then adds incre = 5 more in each subsequent iteration. Results are presented in Table 3. For matrices of size less than 1000, svt is less efficient due to the overhead of repeatedly calling the iterative algorithm until hitting the threshold. For large matrices, svt shows 100- to 1000-fold speed-ups.
The run times for singular value thresholding of "sparse + low rank" structured matrices show roughly the same pattern as in Table 3; the speed-up of svt is most pronounced for large matrices. To evaluate the effectiveness of exploiting structure in singular value thresholding, we also call svt with the full matrix A as input, which clearly compromises efficiency. Table 5 compares the efficiency of the deflation and succession strategies for singular value thresholding of sparse test matrices. The threshold value is pre-determined such that the top 50 singular values are above the threshold. Both methods start with k = 6 singular values and then add 5 more in each subsequent iteration. The deflation method is in general faster than the succession method.

Deflation versus succession method for singular value thresholding
A similar comparison is done on "sparse + low rank" structured matrices, which are generated in the same way as in Section 4.2. The threshold is again set at the 50th singular value of each matrix. The average run time and standard error are reported in Table 6. We found non-convergence of the underlying ARPACK routine when applying the deflation method to the rdb800l and mhd4800 matrices. The non-convergence is caused by clustered eigenvalues. It is well known that ARPACK works best for finding eigenvalues that are well separated from the unwanted ones, and non-convergence is typical when dealing with ill-conditioned matrices (Lehoucq and Sorensen 1996). When this happens, we restart with the succession method and continue from the current subspace.

Large-scale singular value thresholding
The purpose of this section is to demonstrate the performance of svt on large rectangular matrices. For the first two test matrices (bibd_20_10 and bibd_22_8), "sparse + low rank" matrices are generated by the same mechanism as in Section 4.2. For the other two matrices (stormG2_1000 and tp-6), singular value thresholding is performed on the original sparse matrices. The threshold is set at the 5th, 20th, and 50th singular value of each matrix, respectively. Table 7 displays the run time of svt from one replicate. The full singular value decomposition by svd takes an excessively long time for these 4 problems, so its results are not reported.

Application to matrix completion problem
To demonstrate the effectiveness of svt as a plug-in computational routine in practice, we conduct a numerical experiment on the spectral regularization algorithm for matrix completion (Mazumder et al. 2010), which minimizes

(1/2) ∑_{(i,j)∈Ω} (y_ij − x_ij)² + λ‖X‖_*  (4)

at a grid of tuning parameter values λ.
Here Ω indexes the observed entries y_ij and X = (x_ij) is the completed matrix. Algorithm 3 lists the computational algorithm, which involves repeated singular value thresholding (lines 4-6). See Chi et al. (2013) for a derivation from the majorization-minimization (MM) point of view. Although A^(t) is a dense matrix, it can be written as

A^(t) = P_Ω(Y − X^(t)) + X^(t),

where X^(t) is a low rank matrix at large values of λ (only a few singular values survive thresholding), and P_Ω(·) is the projection operator onto the observed entries. Fortunately, in many applications, large values of λ are the regime of interest, as they encourage low rank solutions. That means most of the time A^(t) is of the special form "sparse + low rank", which enables extremely fast matrix-vector multiplication.
In the numerical experiment, we generate a rank-5 matrix by multiplying two matrices, M = LR^⊤, where L, R ∈ ℝ^{n×5} have independent standard normal entries. Then we add independent standard Gaussian noise to corrupt the original parameter matrix M, that is, Y = M + ϵ. 5% of the entries of Y are randomly chosen to be observed. The dimension n of our synthetic data ranges from 500 to 5000. For each n, we minimize (4) at a grid of 20 points.
The grid is set up in a linear manner as in Mazumder et al. (2010):

lambdas = linspace(maxlambda * 0.9, maxlambda / 5, 20)

Here maxlambda is the largest singular value of the input matrix Y with missing entries set to 0. A warm start strategy is used; that is, the solution at the previous λ is used as the starting point for the next λ. Path following is terminated when all 20 grid points are exhausted or the rank of the solution exceeds 10 (twice the true rank). Three methods for singular value thresholding are tested: svt using function handle input, svt using matrix input A^(t), and full singular value decomposition by svd followed by thresholding. Table 8 shows the run time in minutes for obtaining the whole solution path. The speed-up of svt increases with matrix size, and utilizing the "sparse + low rank" structure via a function handle boosts performance further.
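The experiment above can be sketched as follows, with svt as the plug-in thresholding subroutine. The data generation mirrors the description in the text, while the svt call syntax, inner iteration cap, and convergence tolerance are our assumptions rather than the package's exact interface:

```matlab
% Soft-impute sketch (assumes the svt package is on the MATLAB path).
n = 500; r = 5;
L = randn(n, r); R = randn(n, r);
Y = L*R' + randn(n);                  % noisy rank-5 matrix
Omega = rand(n) < 0.05;               % 5% of entries observed
X = zeros(n);                         % initial completed matrix
maxlambda = svds(Y .* Omega, 1);      % largest singular value of P_Omega(Y)
lambdas = linspace(maxlambda * 0.9, maxlambda / 5, 20);
for lambda = lambdas                  % warm start across the grid
    for iter = 1:100
        A = Omega .* (Y - X) + X;     % "sparse + low rank" iterate A^(t)
        [U, S, V] = svt(A, 'lambda', lambda);   % assumed call syntax
        Xnew = U * S * V';
        if norm(Xnew - X, 'fro') < 1e-4 * norm(X, 'fro') + eps
            X = Xnew; break
        end
        X = Xnew;
    end
end
```

In a full implementation one would pass a function handle exploiting the P_Ω(Y − X^(t)) + X^(t) structure instead of forming A densely, which is exactly the comparison reported in Table 8.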

Discussion
We develop a MATLAB wrapper function svt for singular value thresholding. When a fixed number of top singular values and vectors is requested, svt expands the capability of MATLAB's built-in function svds by allowing function handle input. This enables application of the iterative method to dense but structured large matrices. More conveniently, svt provides a simple interface for singular value thresholding, the key step in many matrix learning algorithms. Our numerical examples have demonstrated the efficiency of svt in various situations. The svt package is continuously developed and maintained on GitHub at http://hua-zhou.github.io/svt/.
We describe a few future directions here. Our wrapper function utilizes the well-known relationship between the SVD and the eigendecomposition of the augmented matrix (3) and builds on MATLAB's eigs function, which in turn calls the ARPACK subroutines (Lehoucq, Sorensen, and Yang 1997) for solving large scale eigenproblems. An alternative is to use the PROPACK library (Larsen 1998), an efficient package for singular value decomposition of sparse or structured matrices. This involves distributing extra source code or compiled programs but may further improve efficiency. Both ARPACK and PROPACK implement Krylov subspace methods and compute a fixed number of top eigenvalues or singular values; thus singular value thresholding has to be done in a sequential manner. The recent FEAST package (Polizzi and Kestyn 2012) is an innovative method for solving standard or generalized eigenvalue problems, and is able to compute all the eigenvalues and eigenvectors within a given search interval, which is particularly attractive for the singular value thresholding task. However, users must provide an initial guess for the number of eigenvalues in the search interval; if the guess is too small, the program will exit. In real applications of singular value thresholding, such an estimate may be hard to obtain. Further investigation of the feasibility of using FEAST for singular value thresholding is underway.

[Table caption] Singular value thresholding of "sparse + low rank" matrices. Reported are the average run time (in seconds) and standard error (in parentheses) based on 10 simulation replicates. Structured matrices are formed by adding a random rank-10 matrix to the original sparse test matrix. The threshold value is pre-determined to catch the top 50 singular values.

[Table caption] Run time of the matrix completion problem using different singular value thresholding methods. Reported are the run time (in minutes) for the whole solution path. Path following is terminated when 20 grid points are exhausted or the rank of the solution exceeds 10 (twice the true rank).