Simulated Data for Linear Regression with Structured and Sparse Penalties: Introducing pylearn-simulate

A currently very active ﬁeld of research is how to incorporate structure and prior knowledge in machine learning methods. It has lead to numerous developments in the ﬁeld of non-smooth convex minimization. With recently developed methods it is possible to perform an analysis in which the computed model can be linked to a given structure of the data and simultaneously do variable selection to ﬁnd a few important features in the data. However, there is still no way to unambiguously simulate data to test proposed algorithms, since the exact solutions to such problems are unknown. The main aim of this paper is to present a theoretical framework for generating simulated data. These simulated data are appropriate when comparing optimization algorithms in the context of linear regression problems with sparse and structured penalties. Additionally, this approach allows the user to control the signal-to-noise ratio, the correlation structure of the data and the optimization problem to which they are the solution. The traditional approach is to simulate random data without taking into account the actual model that will be ﬁt to the data. But when using such an approach it is not possible to know the exact solution of the underlying optimization problem. With our contribution, it is possible to know the exact theoretical solution of a penalized linear regression problem, and it is thus possible to compare algorithms without the need to use, e.g., cross-validation.Wealsopresentour implementation, the Python package pylearn-simulate , available at https://github.com/neurospin/pylearn-simulate and released under the BSD 3-clause license. We describe the package and give examples at the end of the paper.


Introduction
Simulated data are widely used to assess optimization methods. This is because of their ability to evaluate certain aspects of the methods under study; and these aspects are impossible to look into when using real data sets. In the context of convex optimization, it is never possible to know the exact solution of the minimization problem with real data and it is a difficult problem even with simulated data. We propose to generalize an approach originally published by Nesterov (2013), for LASSO regression, to a broader family of penalized regression problems.
We would like to generate simulated data for which we know the exact solution of a given function. The inputs are: the minimizer β * (p × 1), a candidate data set X 0 (n × p), a residual vector ε (n × 1), regularization parameters (in our case there are up to three of these), the signal-to-noise ratio σ, and the expression of a function, f (β), that is to be minimized.
The candidate version of the data set may for instance be X 0 ∼ N (µ, Σ), and the residual vector may be ε ∼ N (0, 1). X 0 can alternatively be a data set relevant for the application in mind: a set of images, microarrays, etc. Using real data as input could provide a more realistic correlation structure, and this correlation structure is not affected by our method.
The procedure proposed here outputs X and y such that with f a convex function of β that depends on the parameters that define the simulated data, (X 0 , β, ε, σ), and on the regularization parameters. In a nutshell, X is obtained by scaling each column j = 1, . . . , p of X 0 by a factor ω j . We give the expression of ω j as a function of the inputs and the subdifferential of f in Section 3.
In linear regression, simulated data are often generated such that If we want to evaluate an algorithm to minimize a penalized regression problem f (β) = 1 2 Xβ − y 2 2 + P (β), then we need to use, e.g., cross-validation to find an acceptable value for any regularization parameter that could be hidden in the penalty P . But the found values are very likely suboptimal, and in any case, we are forced to compare the solution to Equation 2.
Our contribution is thus to give the solution that actually minimizes Equation 3, instead of Equation 2, namely β * from Equation 1. This means we are able to compare speed, sensitivity to noise, correlation, etc., and the actual solutions found by different minimization algorithms.
An R markdown document (Example_in_R.rmd) is also provided as a supplementary file in order to illustrate to the reader familiar with R (R Core Team 2018) the simplicity of our method and its impact on a realistic gene expression data set. This R markdown example was developed using R version 3.1.1 and RStudio version 0.98.1062 (RStudio 2013).

Installation
pylearn-simulate requires some external packages in order to work. They are, however, very few, and are freely available online.
• NumPy version 1.6.1 or newer. This is the fundamental package for scientific computing in Python and contains, among other things, linear algebra functions and matrix/vector objects (van der Walt, Colbert, and Varoquaux 2011).
• SciPy version 0.9.0 or newer. A library of open-source software for mathematics, science and engineering (Jones, Oliphant, and Peterson 2001).
• pylearn-parsimony version 0.2.1 or newer; optional. A library for structured and sparse machine learning. Contains several algorithms for minimizing loss functions with complex structured penalties. While optional, the examples require this package in order to run fully (Löfstedt, Guillemot, Hadj-Selem, Li, Frouin, and Duchesnay 2014b).
Once the dependencies are installed, pylearn-simulate can be installed by obtaining a release version from https://github.com/neurospin/pylearn-simulate. Unpack the file, go to the pylearn-simulate directory and type: $ python setup.py install --user for a local installation in the user's userbase directory (usually in ∼/.local/lib/python2.7/site-packages on Unix-like/-based operating systems, such as Linux and OS X, and in %AppData%\Python\Python27\site-packages on Windows), or $ sudo python setup.py install for a global installation accessible to all users. You will need to have administrator rights on your computer in order to install software for all users.

Notations
We place ourselves in the context of linear regression models. Let X ∈ R n×p be a matrix of n samples, where each sample lies in a p-dimensional space; and let y ∈ R n denote the ndimensional response vector. The p-vector that contains the regression coefficients is denoted β. In the following, we denote by · q the standard q -norm on R p with dual norm · q .
For a smooth real function f , we denote by ∇f (β) its gradient. Finally, the subdifferential, i.e. the set of subgradients, of a convex function f is denoted ∂f (β).

Method
The objective is to generate X and y such that where P is a penalty that can be expressed in the form in which Π is the set of all our penalties, λ π > 0 is the regularization parameter associated with penalty π, and ∂π(β * ), used below, is the subdifferential of penalty π at β * . This is a general notation to represent the fact that we may have several different penalties.
If the subdifferential of each penalty is known, the scaling factors, ω j , that we are looking for can each be expressed as the jth component of π∈Π −λ π ∂π(β * ) divided by X 0,j ε. More details are given below.
The penalties that we consider in the examples in this work are where TV is the total variation function (Rudin, Osher, Fatemi, and Monica 1992), and where GL is the overlapping group LASSO penalty (Yuan and Lin 2006) and, as we show below, we know their subdifferential. Nesterov (2013) addressed how to simulate data for the simpler case of the LASSO regression, and we will therefore not go into the details here. The algorithm given in Nesterov (2013) is adapted to our notations and is detailed in Appendix A.

Algorithm
The principle behind Nesterov's idea, that we generalize for complex penalties, is as follows: First, define the residual to be ε = Xβ − y, in the model between Xβ and y, such that it is independent from β * . Then, select acceptable values for the columns of X such that zero belongs to the subdifferential of f at point β * . It requires then to have a first unscaled version of the dataset, X 0 , whose columns X 0,j (j = 1, . . . , p) are each scaled by a factor ω j .
The algorithm is detailed in Algorithm 1. This algorithm is used to generate a simulated data set that is the solution to a complex optimization problem.
The vector r π belongs to the subdifferential of penalty π at β * . We chose to generate such vectors randomly in our implementation, but it is up to the user to generate them according to certain additional constraints on the obtained data set. When X 0 is a real-world data set, the purpose of such constraints could be to keep certain properties that are not directly in relation to the optimization problem. For example, when transforming images, certain desirable shapes have to remain visible in the resulting data set.
Algorithm 1 Generate a simulated data set. Require: β * , X 0 , ε Ensure: X and y such that β * = arg min β f (β) 1: for π ∈ Π do 2: Generate r π ∈ ∂π(β * ) 3: end for 4: for i = 1, . . . , p do 5: While this algorithm is fairly general, it can only be used when the subdifferential of f is explicit. We show in the following that it is possible to have an analytical expression for a wide variety of complex convex penalties.

Subdifferential of complex penalties
The complex penalties that we consider in this work can be written in the form While any q-norm is possible, we will in this work only be interested in the case when q = q = 2, i.e., the Euclidean norm. This is the case when π is, e.g., the total variation or (overlapping) group LASSO penalties.
We need the following two lemmas in order to derive the subdifferential of the complex penalties.
Lemma 4.1 (Subdifferential of the sum). If f 1 and f 2 are convex functions with domain R p , then Theorem 4.3 (Subdifferential of π). If π has the form given in Equation 5, then Proof.
Before we show the application to some actual penalties we will mention that the subdifferential of the 2 -norm is i.e., the second case (when x 2 = 0) corresponds to the set of points in the unit 2 ball.

Smoothed penalties
The authors have recently published an article detailing an algorithm that utilizes a smoothing technique proposed by Nesterov (2004). The authors described an efficient minimization of a convex non-differentiable function involving the total variation penalty (Hadj-Selem, Löfstedt, Dohmatob, Frouin, Dubois, Guillemot, and Duchesnay 2018).
Nesterov's smoothing technique is a very efficient way of approximating non-smooth functions by a smoothed function. The smoothed function is regularized such that when the regularization, or smoothing, parameter approaches zero, the approximation approaches the original function. We will not describe this technique here, but refer to Nesterov (2004) or Hadj-Selem et al. (2018) for details.
When a complex penalty function, π, is smoothed using Nesterov's method, the smoothed function is defined as where µ is a parameter that controls the smoothing, the operator ·, · denotes the inner product of the arguments, K is a compact convex set in a finite-dimensional vector space, and A is a linear operator that transforms between two finite-dimensional vector spaces.
The gradient of π µ (β) is defined as It is thus straight-forward to simulate such data. The A matrices for total variation and (overlapping) group LASSO are defined in Appendix B. The smoothing, or regularization parameter µ is user-defined, but usually selected to be small. The computation of subgradients for TV and (overlapping) group LASSO are detailed in Appendix B.

Application
We apply the aforementioned algorithm to generate a data set and associate it to the exact solution of a linear regression problem with elastic net and a Nesterov-smoothed complex convex penalty.

Linear regression with elastic net and a complex penalty
We will here give an example with elastic net and a complex penalty, such as, e.g., the total variation or group LASSO penalties. The function we are working with is where π is a complex penalty, and λ 2 , λ 1 and λ π are regularization parameters. The subdifferential in this case has where ε is distributed like the residuals, Xβ − y. We rearrange and note that we seek where we denote by (·) j the jth component of the vector within parentheses. Furthermore, since we define X j = ω j X 0,j , we have We note that in the case when β j = 0, adding the smooth ridge penalty to the LASSO has no effect on the generated data.
We can show that both complex penalties, i.e. total variation and group LASSO, can be formulated as in Equation 5 (detailed in Appendix B). It is then possible to use Theorem 4.3 to obtain an expression for the sub-differential of π as Replacing this expression in Equation 10, we obtain for each variable j = 1, . . . , p and with X j = ω j X 0,j . We also note from Equation 6 that ∂|x| Further, if the non-smooth complex penalty, π, be substituted for a Nesterov-smoothed complex penalty, π µ , the subdifferential of π in Equation 10 could simply be replaced by the gradient of π µ , defined in Equation 7.

Intercept
It is very common to include an intercept term in the model, and thus instead of Equation 4 solve β * = arg min where β 0 is the intercept term, and P defines all penalties. The intercept term is usually included in the original problem by extending β and X such that where 1 n is an n × 1 vector of ones.
We note that the penalties do not include the intercept term, and, therefore, that the gradient of the penalties, with respect to β 0 , is zero. We also note that the intercept column may not change from X 0 to X by this procedure, and thus that ω 0 = 1. We remember Equation 9 and that the intercept column is a column of ones. Then, we write Equation 9 for the intercept as Thus, we have properly included the intercept term if the residual terms sum to zero. Therefore, to handle the intercept, we change β and X as described above and make sure that n i=1 ε i = 0. If this is not the case, we make it so by subtracting the mean; i.e., we let

Signal-to-noise ratio
We use the same definition of signal-to-noise ratio as Bach, Jenatton, Mairal, and Obozinski (2011), namely that where X(β) is the data generated from β when using the simulation process described above.
With this definition of signal-to-noise ratio, and with the definition of the simulated data given above we may scale the regression vector such that If the user provides a desired signal-to-noise ratio, σ, it is reasonable to ask if we are able to find an a such that SNR(a) = σ. We have the following theorem.
Proof. We rearrange the signal-to-noise ratio as and square both sides to get We let X j be the jth column of X(βa), remember that X j = ω j X 0,j , and let β j be the jth element of β. The left-hand side of Equation 16 is then Using Equation 6 and a > 0, noting that where by ∂ x we denote the subdifferential with respect to x, and hence the partial subdifferential of the 1 and complex penalties are independent of any scaling factor a.
If we add all the penalties described above, i.e., 1 , 2 and a complex penalty such as TV (or GL), as has been the case throughout, and compute the partial derivatives of Equation 11 with respect to aβ, we obtain Hence, we may thus write ω j as a function of a by We continue to expand Equation 17 and obtain We note that this is a fourth order polynomial of a and write it in the generic form Now, since we seek a solution a > 0 such that X(βa)βa 2 2 = s > 0, we seek positive roots of the quartic equation This fourth order polynomial, in the left-hand side of Equation 19, has a minimum of −s when a → 0, since Equation 18 is non-negative for all values of a.
We note that the coefficient for the quartic term, A, is non-negative, since where the elements of β k are β j k j , for j = 1, . . . , p.
In case A = β k X 0 X 0 β k = 0, then X 0 β k 2 = 0 and we note that B is also equal to 0, because where the elements of β m are β j m j , for j = 1, . . . , p. Hence, in order for there to be roots when A = 0, the coefficient of the quadratic must be positive. We note that since s > 0, if A = 0, then we must have that C > 0 which implies that X 0 β m 2 > 0.
To sum up, if A > 0, the quartic equation tends to infinity when a tends to infinity. Further, if A = 0, then B = 0 and C > 0, which means that the equation will also tend to infinity when a tends to infinity.
Thus, by the intermediate value theorem there is a value of a for which X(βa)βa 2 2 − s = 0 and thus also that SNR(a) = σ.
We may use Equation 19 above to find the roots of this fourth order polynomial analytically. This may, however, be tedious because of the many terms of the function. Instead, because of the above theorem, we know that we can successfully apply the bisection method to Equation 15 to find a root of this function. The authors have tested this successfully, even with data sets with hundreds of thousands of variables. Also, we may use either root, if there are more than one, since they all give SNR(a) = σ, although we may want to find the one that minimizes |a − 1|.
Thus, we would encapsulate Algorithm 1 in a bisection loop in order to control the signal-tonoise ratio.

Correlation
We control the correlation structure of X 0 by, e.g., letting X 0 ∼ N (µ, Σ). Since we let X j = ω j X 0,j for all 1 ≤ i ≤ p, it follows that cor(X l , X m ) = cor(ω l X 0,l , ω m X 0,m ).

The package pylearn-simulate
The package pylearn-simulate contains five main modules. This section describes the five modules, and how they implement the theory described above.
For the sake of clarity, the modules are presented in alphabetical order. However, we advise a reader interested in directly using our package to jump directly to Section 6.4, presenting pylearn-simulate's main module, or to Section 7 in which three examples are detailed.

beta.py
The first module allows the generation of random (weight) vectors. It can, e.g., be used to generate the true minimizer, β * , mentioned in Section 1.
Its main function is defined as: def random(shape, density = 1.0, rng = utils.RandomUniform(0, 1), sort = False, normalise = False): where shape is the shape of the underlying data, e.g., p-by-1 would be shape = (p, 1); density is a value between 0 and 1 that determines the fraction of non-zero elements of the returned vector (default is density = 1.0, a completely dense vector; density = 0.0 would be a vector of zeros); rng determines the random number generator to use (rng is a function or callable that takes a number of integers as input, the shape of the array of random numbers to return); sort is a Boolean that determines whether or not to sort the output vector (will sort each axis in turn in ascending order); normalise is a Boolean that determines whether or not to normalize the output vector. If normalise = True, the output vector will have unit 2 -norm.

correlation_matrices.py
The second module is used to generate random correlation matrices with a particular structure. When applied to the generation of a correlation matrix Σ, the initial matrix, X 0 , can be sampled from N (µ, Σ), for some mean vector µ.
This module contains two ways to generate correlation matrices, described by Hardin, Ramon Garcia, and Golan (2013). The two ways to generate correlation matrices are: 1. A correlation matrix with a constant correlation structure, i.e., such that where Σ k is a block diagonal element of Σ, and ρ k is the average correlation.
2. A correlation matrix with a Toeplitz correlation structure, i.e., such that where Σ k is a block diagonal element of Σ, and ρ k is the average correlation between adjacent variables. The correlation thus decreases exponentially as a function of the distance between the variables.
Background noise and off-block-diagonal noise may also be added to the correlation matrices. The k stands for the kth group of variables; the module indeed allows the user to generate correlation matrices with blocks of variables, each one of them having a different correlation structure. See Hardin et al. (2013) for details.
This module contains the following functions: def constant_correlation(p = [100], rho = [0.05], delta = 0.05, eps = 0.5, random_state = None): where p is a list or tuple with the number of variables for each group; rho is a list or tuple with the average correlation between variables in a group such that rho[k]∈ [0, 1); delta ∈ [0, min(rho)) defines the baseline noise between groups; eps ∈ [0, 1−max(rho)) is the entry-wise random noise with mean delta and variance eps 2 /10 (where 10 is the dimension of the noise space, selected arbitrarily); and random_state is an instance of numpy.random.RandomState (if omitted, or None, the default NumPy random number generator will be used instead). where p is a list or tuple with the number of variables for each group; rho is a list or tuple with the average correlation between variables in a group such that rho[k]∈ [0, 1); eps ∈ (0, (1−max(rho))/(1+max(rho)) defines the maximum entry-wise random noise. The noise is approximately normally distributed with zero mean and variance eps 2 /10; and random_state is an instance of numpy.random.RandomState (if omitted, or None, the default NumPy random number generator will be used instead).

functions.py
This module defines a set of penalties that can be combined with (added to) the loss function. The functions are implemented as classes and they all inherit from Function, which requires them to implement a function grad that returns the gradient (or a subgradient).
The different penalties currently implemented are detailed below.
The class class L1(Function): def __init__(self, l, rng = utils.RandomUniform(-1, 1)) represents l β 1 , the 1 -norm penalty, where l is the regularization constant (denoted λ 1 above); rng is a random number generator to use when computing a subgradient. rng is a function or callable that takes a number of integers as input, the shape of the array of random numbers to return. Default rng is random uniform numbers between −1 and 1. represents l β, α * − mu 2 α * 2 2 , the Nesterov-smoothed 1 penalty, where l is the regularization constant and mu is the regularization constant for the Nesterov smoothing.
The class class L2(Function): def __init__(self, l, rng = utils.RandomUniform(0, 1)): represents l β 2 , the 2 -norm penalty, where l is the regularization constant; rng is a random number generator that is used when a subgradient is computed. rng is a callable that takes a number of integers as input, the shape of the array of random numbers to return. Default rng is random uniform numbers between 0 and 1.

The class
class TotalVariation(Function): def __init__(self, l, A, rng = utils.RandomUniform(0, 1), **kwargs): represents l p j=1 ∇β j 2 , the total variation penalty, where l is the regularization constant (denoted λ π above); A is the linear operator for the total variation penalty, as specified in Section B.1, and obtained from the static methods TotalVariation.A_from_shape or TotalVariation.A_from_subset_mask; rng is a random number generator that is used when a subgradient is computed. rng is a callable that takes a number of integers as input, the shape of the array of random numbers to return. Default rng is random uniform numbers between 0 and 1.

The class
class GroupLasso(Function): def __init__(self, l, A, rng = utils.RandomUniform(-1, 1), **kwargs): the overlapping group LASSO, where l is the regularization constant (denoted λ π above); A is the linear operator for the overlapping group LASSO penalty, as specified in Section B.2, and obtained from GroupLasso.A_from_groups; rng is a random number generator that is used when a subgradient is computed. rng is a function or callable that takes a number of integers as input, the shape of the array of random numbers to return. Default rng is random uniform numbers between −1 and 1. weights is a list or tuple provided to GroupLasso.A_from_groups that gives a weight to each group; and β g is a vector with the variables of group g.
Note that SmoothedTotalVariation inherits from NesterovFunction, a base class for Nesterovsmoothed complex penalties.

simulate.py
This is the main module of pylearn-simulate, and the starting point when generating simulated data. It contains a class LinearRegressionData that generates simulated data with a linear regression (linear least squares) objective function, and accepts any combination of functions (from functions.py) to penalize the loss.
LinearRegressionData is defined as class LinearRegressionData(SimulatedData): def __init__(self, penalties, X0, e, snr = None, intercept = False): where penalties is a list or tuple with the penalties; X0 is the initial candidate data set, as explained in Section 1; e = Xβ − y is the residual vector, as explained in Section 4.1; snr is the desired signal-to-noise ratio; intercept is a Boolean that determines whether or not to account for an intercept. If intercept = True, then the first column of X0 must contain only ones.

utils.py
This module contains some helper functions, such as the random number generators. It contains the following classes and functions: class RandomUniform(RandomNumberGenerator): def __init__(self, a = 0, b = 1, random_state = None): where a and b are the limits within which the random uniform numbers are sampled, and random_state is an optional numpy.random.RandomState object to use when sampling points (this is useful for setting a seed, for instance).
class ConstantValue(RandomNumberGenerator): def __init__(self, val, random_state = None): where val is a single value that is returned for every call to this number generator. The value of random_state is never used here.

Examples
The main benefit of generating data with pylearn-simulate is that we know everything about the generated data. In particular, we know the true minimizer, β * , and we know the Lagrange multipliers, or regularization parameters, λ 1 , λ 2 and λ π . There is no need to use, e.g., crossvalidation to find the parameters, and we know directly if the β (k) that our minimizing algorithm finds is close to the true β * or not.
We illustrate this main point by three simulations, in which we also demonstrate how to use pylearn-simulate. Note that due to differences in core packages such as NumPy or SciPy slightly different random numbers and corresponding plots may be obtained. An example in R is available in the supplementary material and in the GitHub repository (Löfstedt et al. 2014a).

Comparison to the classical approach
Traditionally, simulated data for linear regression problems is generated as follows: The independent data matrix, X, is sampled from some multivariate distribution, a regression vector β * is generated either with or without sparsity, structure, etc. The noise vector, ε, is usually Gaussian, ε ∼ N (0, σ 2 I), for some value of σ. Finally, the response variable, y, is computed as The problem then is to find a β such that a certain penalized linear regression problem is minimized.
The main issue is that the generated or sampled data, X, will not adhere to any sparsity or structure constraints that we may have on the solution β * . In essence, the data generated does not fit the mathematical model when penalties are involved. A computational problem also arises, because the regularization parameters for the penalties are not known in advance, and finding the minimum is thus going to be computationally expensive, since the parameters will have to be found by, e.g., cross-validation.
We will in this example simulate data by using pylearn-simulate for linear regression with an elastic net penalty. We will then compare the solution where the regularization constants are found by grid search and cross-validation to that of the theoretical solution.
The prediction error is plotted against the different values of the regularization parameter in Figure 1 (upper panel). The minimum cross-validated prediction error was found when lambda_l1 = 0.0, i.e., no 1 penalization, and a strong 2 penalization. The cross-validated prediction error was where y k was the predicted y from the left out cross-validation group. There were K = 10 cross-validation groups.
The regression vector that corresponds to the smallest cross-validated prediction error of y, denoted β CV , is presented in Figure 1 (lower panel) together with β * , the true regression vector, and the regression vector found when using the true regularization parameters, denoted β Sim . Note how different the regression vector found by cross-validation is compared to the true β * ; note also that β Sim is indistinguishable from β * .
The Python code for this example is contained in the supplementary material, in the file comparison_to_classical_approach.py.
In conclusion, the proposed way of simulating data, strictly adhering to a given optimization problem, presents advantages that are highly valuable when conducting a simulated experiment. First, since the values of the regularization parameters are known and set by the users, they can be given to an optimization algorithm and the result be compared to the known minimizer of the function. Apart from the practicality of doing so, large amounts of time will be saved by avoiding techniques such as cross-validation to select optimal values of the regularization parameters. Second, one can test in a controlled manner the robustness, or sensitivity of an optimization method to the "wrong" values of the regularization parameters, since one would know exactly how far the regularization parameter values are from the true values. Third, knowing exactly what is minimized (i.e., the exact expression of the function) and its minimizer allows us to know exactly how a minimization method fares: The experimenter knows at each iteration how far an algorithm is from the minimum of the function and, at the end of the optimization process, he/she will be able to derive the exact speed at which the optimum was found.
These advantages could benefit many current high level scientific projects. E.g., the one by Dohmatob, Gramfort, Thirion, and Varoquaux (2014), in which the authors compared the performance of several optimization algorithms for solving a logistic regression problem with TV and 1 penalties. It is widely known that this problem has no explicit solution. This is a typical case in which the proposed approach would have saved time and improved the quality of the comparison. Indeed, the authors studied the convergence of the algorithms for parameters close to the optimal parameters, set by 10-fold cross-validation. The proposed approach would have allowed them to directly plug the optimal parameters in the optimization algorithms and/or test the robustness of the chosen optimization algorithms to values of the parameters that would have been chosen to differ from the optimal values.

Elastic net and total variation
In this example we simulate data to fit the function in which λ 2 = 0.5 and λ TV = 1.0. This is thus elastic net with a TV penalty. We let the 1 parameter be 1 − λ 2 . We will vary the values of the regularization parameters in an interval around their "true" values and compute f (β (k) ) − f (β * ) for each of these values.
We use pylearn-simulate to generate data that fit this loss function as follows.
The complete Python code for this example is found in the supplementary material, in the file linear_regression_elastic_net_total_variation.py. The minimum solution to Equation 21 is found when using the regularization parameters used in the construction of the simulated data. These 48 × 64 data had no correlation between variables and the following characteristics: 50 % sparsity, signal-to-noise ratio 3, λ 2 = 0.5, and λ TV = 1.0.

Elastic net and smoothed group LASSO
In the third example we simulate data for the function where GL µ is the smoothed (overlapping) group LASSO function; and in which λ 1 = 0.618 and λ GL = 1.618 (in this example we let the 2 parameter be 1 − λ 1 ). We will vary the values of the regularization parameters in an interval around their "true" values and compute f (β (k) ) − f (β * ) for each of these values. We use pylearn-simulate to generate data that fit this loss function as follows.
The result is shown in Figure 3, and we again see that the solution that gives the smallest function value is at precisely the true values of λ 1 and λ GL .
The complete Python code for this example is found in the supplementary material, in the file linear_regression_elastic_net_group_lasso.py.

Discussion and conclusions
The technique that we have used in order to generate the simulated data in this paper is very useful when testing new optimization methods. Indeed, it allows us to clearly measure the performance of the optimization method while avoiding an arbitrary choice of the penalization constants. This means the choice of penalization will not impact the quality of the solution.
In fact, this completely avoids the use of cross-validation, or other resampling techniques, something that may be computationally expensive, and that does not guarantee optimality.
Additionally, if the candidate matrix, X 0 , is a real data set, it is possible to use the proposed technique to generate a quasi-real data set and thus under quasi-real conditions control all the parameters of the problem. In particular, since we can generate different choices of β * , we can test different structured and sparse penalties and see how they perform on the particular data.
Also, since we know the problem exactly, we are able to numerically study the behavior of cross-validation, or other resampling techniques, on our data. This may help us understand how cross-validation performs on different types of data under different model hypotheses.
The presented Python package, pylearn-simulate, offers a simple object oriented interface with which it is possible to generate data that fits any combination of penalties with a linear least-squares loss function. It is easy to extend the package and add new penalties, thanks to the object oriented interface. It is also possible to extend the package, as discussed below, and to add other loss functions.
The aim of this paper was to discuss how to generate simulated data for optimization problems with the linear least-squares loss with 1 , 2 , and a complex penalty such as e.g. TV. We note that Theorem 5.1 only covers these penalties, but also that the formulation of the complex penalties is very general, and therefore covers a large class of penalties. Theorem 5.1 can likely be generalized to other losses and penalties as well, but that was out of the scope of this paper.
Finally, we note that other machine learning methods, such as, e.g., logistic regression, are different from the one with linear regression presented here. In logistic regression, the y vector is implicitly related to the X matrix. It is possible to find an exact solution for the minimization problem, but without any control of the correlation structure or on the signalto-noise ratio. Therefore, a different approach is needed in this context and for other machine learning methods as well. This is something that will occupy the authors' attention in future research.

A. LASSO
The function f (β) = 1 2 Xβ − y 2 2 + λ 1 β 1 is known as the LASSO problem. Nesterov (2013) addressed how to simulate data for this case, and we will therefore not go into details. Instead we will simply adapt it to our notation and explain some steps that are not obvious.
The principle behind Nesterov's idea is as follows: First, define the error to be ε = Xβ − y, in the model between Xβ and y, such that it is independent from β * . Then, select acceptable values for the columns of X such that zero belongs to the subdifferential of f at β * , with subdifferential and we stress again that X ε does not depend on β * . We distinguish two cases: First case: We consider a variable β * j = 0, the jth element of β * . With β * j = 0 it follows that ∂|β * j | = {sign(β * j )}, and thus that we can write 0 = X j ε + λ 1 sign(β * j ), because of Equation 23, with X j the jth column of X.

Solution
The candidate matrix X 0 will serve as a first unscaled version of X and we have such that X j = ω j X 0,j , for all 1 ≤ j ≤ p.

pylearn-simulate: Simulated Data with Structured and Sparse Penalties
If β * j = 0, we use Equation 24 and have Thus, with X j = ω j X 0,j we obtain or equivalently Once X is generated, we let y = Xβ * − ε.

Gradient of non-smooth total variation
The TV penalty, for a discrete β, is defined as where grad(β j ) is the discrete spatial gradient at point β j . It is usually expressed as a forward difference discrete gradient, i.e., such that where p d is the number of variables in the dth dimension, for d = 1, . . . , D, with D dimensions.
We will first illustrate this in the 1-dimensional case. In this case x 2 = √ x 2 = |x|, since x ∈ R. We thus have We note that if we define A, a (p − 1) × p matrix, as where G = p − 1 and A g is the gth row of A.
Thus, we use Theorem 4.3 and obtain in which we recall Equation 6 and obtain that The general case will be illustrated with a small example using a 3-dimensional image. A 24-dimensional regression vector β is generated, that represents a 2 × 3 × 4 image. The image, with linear indices indicated, is given in Figure 4.
We note, when using the linear indices, that β 1 and β 2 are neighbors in the 1st dimension, that β 1 and β 5 are neighbors in the 2nd dimension and that β 1 and β 13 are neighbors in the 3rd dimension. Using 3-dimensional indices (i.e., such as β i,j,k ) the penalty becomes in which p 1 = 4, p 2 = 3 and p 3 = 2. In total, there are as many groups as 3-dimensional voxels, there are then G = p 1 p 2 p 3 = 24 groups.
We thus construct the A matrix to reflect this penalty. The first group is Thus, for matrix A g (in group g) we have a −1 in the gth column in all dimensions, a 1 in the (g + 1)th column for the 1st dimension, a 1 in the (p 1 + g)th column for the 2nd dimension and a 1 in the (p 1 · p 2 + g)th column for the 3rd dimension. Note that when these indices fall outside of the A matrix (i.e., the indices are greater than p 1 , p 2 , or p 3 , respectively) then the whole row (but not the group!) must be set to zero (or handled in some other way not specified here).
We have and use Equation 6 to obtain We note that A is very sparse, which greatly helps to speed up and save memory in the implementation.
We now illustrate how to use pylearn-simulate to generate the A matrix.

>>> import simulate
The size of the underlying 2 × 3 × 4 image is then defined as >>> shape = (2, 3, 4) Defining the shape allows then to generate the A matrix: The output of the above code is: [<24x24 sparse matrix of type <type numpy.float64 > with 36 stored elements in Compressed Sparse Row format>, <24x24 sparse matrix of type <type numpy.float64 > with 32 stored elements in Compressed Sparse Row format>, <24x24 sparse matrix of type <type numpy. The function simulate.functions.TotalVariation.A_from_shape thus returns a 3-vector with a sparse matrix in each element. This is for computational reasons (to avoid looping over p groups). The first element of A contains the first dimension, the second element the second dimension and the third element the third dimension. We note that the first group (as seen in the second print) is precisely as given above.

Gradient of non-smooth overlapping group LASSO
The group LASSO penalty is defined as where η g is a group weight that accounts for varying group sizes, β g is a sub-vector of β. Note that any two sub-vectors β a and β b are allowed to overlap, i.e., it is possible that a variable β j is present in both subvectors β a and β b . We will define matrices A g , one for each group, such that We will illustrate the construction of A g by an example. We are working with 6 variables, i.e., β ∈ R 6 . We define two groups, such that variables 1, 3, 4 and 6 belong to group 1 and variables 2, 4 and 6 belong to group 2. We define A 1 as The matrices A g are thus |g| × p, where |g| is the number of variables in group g. Each row of A g selects one variable to the group; if variable β j belongs to group g, then A g has a row with all zeros except for the element in the jth position which is η g . Thus, A g selects the variables that belong to group g and thus η g β g = A g β.
We again use Theorem 4.3 and obtain We now illustrate how to use pylearn-simulate to generate the A matrix for overlapping group LASSO.
>>> import simulate The function simulate.functions.SmoothedGroupLasso.A_from_groups thus returns a vector with two sparse matrices. The elements of this vector are the A matrices for each group. We note that the printed A matrices correspond to those in the example above. Note also that the variable indices are zero-based.

Gradient of smoothed group LASSO
The gradient of Nesterov-smoothed complex penalties is given in Equation 7. Thus, in order to compute the gradient we need the linear operator A and the dual variable α * . The A matrix is described above; the computation of the dual variable for group LASSO is given by Chen and Liu (2011), but without derivation. We therefore derive the solution here.
We compute the parts of the dual variable that corresponds to each group, i.e., α * g . We