SQUAREM: An R Package for Off-the-Shelf Acceleration of EM, MM and Other EM-like Monotone Algorithms

We discuss R package SQUAREM for accelerating iterative algorithms which exhibit slow, monotone convergence. These include the well-known expectation-maximization algorithm, majorize-minimize (MM), and other EM-like algorithms such as expectation conditional maximization, and generalized EM algorithms. We demonstrate the simplicity, generality, and power of SQUAREM through a wide array of applications of EM/MM problems, including binary Poisson mixture, factor analysis, interval censoring, genetics admixture, and logistic regression maximum likelihood estimation (an MM problem). We show that SQUAREM is easy to apply, and can accelerate any fixed-point, smooth, contraction mapping with linear convergence rate. Squared iterative scheme (Squarem) algorithm provides significant speed-up of EM-like algorithms. The margin of the advantage for Squarem is especially huge for high-dimensional problems or when EM step is relatively time-consuming to evaluate. Squarem can be used off-the-shelf since there is no need for the user to tweak any control parameters to optimize performance. Given its remarkable ease of use, Squarem may be considered as a default accelerator for slowly converging EM-like algorithms. All the comparisons of CPU computing time in the paper are made on a quad-core 2.3 GHz Intel Core i7 Mac computer. R Package SQUAREM can be downloaded at https://cran.r-project.org/web/packages/SQUAREM/index.html.


Introduction
The R package SQUAREM provides convergence acceleration techniques for speeding-up slow, monotone iterative algorithms. These include the well-known expectation-maximization (EM) algorithm [Dempster et al., 1977], majorizeminimize (MM) [Lange et al., 2000], and other algorithmic variants such as expectation-conditional maximization (ECM) [Meng and Rubin, 1993], expectation-conditional maximization or either (ECME) [Liu and Rubin, 1998], among others. Dempster et al. [1977] refer to such variants as "generalized EM (GEM)" when M step is only partially implemented. In this paper, we term these "EM-like algorithms", because they all have a contractive fixed-point mapping with linear rate of convergence, like EM. For the definition of linear rate of convergence and contractive mapping, please refer to [Ortega and Rheinboldt, 1970a, Chapter 5]. All of these algorithms are essentially based on the idea that a relatively difficult optimization problem can be converted to a much simpler iterative algorithm with Two convergence criteria can be applied and are both satisfied by the EM algorithm : 1) for parameter estimates θ n , as n − → ∞, ||θ n − θ || − → 0 (||.|| is the Euclidean Norm); 2) the convergence is defined by the sequence of the likelihood function of the parameter estimates, L(θ n ), such that |L(θ n ) − L(θ )| − → 0, as n − → ∞. The EM algorithm guarantees to produce monotone convergence such that L(θ n+1 ) ≥ L(θ n ). By Taylor's theorem under regularity conditions, expand F (θ n ) around θ : where J(θ ) is the Jacobian matrix of F evaluated at θ . Dempster et al. [1977] showed that J(θ ), the Jacobian matrix, measures the fraction of missing information. Under weak regularity conditions, the eigenvalues of J(θ ) lie on [0, 1). Thus, the largest eigenvalue of J(θ ) governs the rate of convergence for EM. The closer this is to unity, indicating a large fraction of missing information, the slower EM converges.
Motivated by the Cauchy-Barzilai-Borwein(CBB) method [Raydan and Svaiter, 2002], Roland and Varadhan [2005] and Varadhan and Roland [2008] constructed Squarem by defining the following recursive error relation: where e n = θ n − θ , I is the identity matrix and α n is the steplength that takes into account the larger eigenvalues of J(θ ). The pseudocode for Squarem algorithm is listed in Table 1 [Varadhan and Roland, 2008], which demonstrates the remarkable simplicity of the proposed algorithm.
There are three choices for α, the steplength as described in Varadhan and Roland [2008]. It is our experience that α = − ||r|| ||v|| generally works the best, and hence it is the default steplength used in SQUAREM. Varadhan and Roland [2008] also showed global convergence of Squarem algorithm, i.e., Squarem can converge to a stationary point from any starting value in the parameter space, or at least, in a large part of it by modifying steplength to ensure monotonicity. Note that when steplength is equal to −1, a Squarem evaluation is the same as two EM updates. Thus, each iteration of Squarem involves 2 or 3 evaluations of EM. Hence, when we compare EM to Squarem, we use the number of EM steps rather than number of iterations. Apart from the EM steps, there is minimal cost in computing the Squarem parameter updates, including the computation of the value of likelihood functions. In addition to the convergence criteria provided earlier, we give a definition of convergence acceleration as follows: suppose {θ n } is the sequence of estimates produced by Algorithm 1, while {θ n } is that given by Algorithm 2, then we say that Algorithm 2 accelerates Algorithm 1 if ||θ n −θ || ||θn−θ || − → 0 as n − → ∞.

Description of R package SQUAREM
R Package SQUAREM is available on CRAN. It can be downloaded via https://cran.r-project.org/web/ packages/SQUAREM/index.html. SQUAREM works for any smooth, contraction mapping with a linear convergence rate (e.g., EM-like algorithms). We describe below the two main functions, squarem() and fpiter(). Obviously, squarem() is the featured function in the package.
• squarem(), for squared iterative scheme squarem() is a function to accelerate any smooth, contractive, fixed-point iteration algorithm including EM/MM and other EM-like algorithms. The main arguments include par, fixptfn, objfn and control. par denotes the starting value of parameters. The argument fixptfn defines a function F representing the fixed-point iteration: θ k+1 = F (θ k ). fixptfn encodes a single step of any EM-like algorithm.
R> fixptfn <-function(par, data, ...) { + pnew <-F(par, data, ...) + return(pnew) + } objfn is the objective function we want to minimize. In the case of EM-like algorithms, it would be the negative log-likelihood function of data. It is not essential to supply the objective function in order for Squarem to work, but its provision guarantees global convergence. control specifies a list of algorithm options including maxiter, maximum number of iterations, and tol, tolerance, among others. If ||F (θ k ) − θ k || ≤ tol, the algorithm shall declare convergence at the (k+1) th iteration (||.|| shows the Euclidean Norm). Under regularity conditions given by Wu [1983], the satisfaction of the convergence does imply a local optimum. There are 3 important control parameters in squarem(), namely, K, method and objfn.inc. method specifies the choice of steplength and K specifies the order of the squared iterative scheme. The default values method = 3 and K = 1 generally work well. objfn.inc guides the monotonicity of the objective function. Setting objfn.inc = 0 ensures strict monotonicity, while objfn.inc = Inf results in an unguarded acceleration scheme, where the objective function is not evaluated at all. The default is objfn.inc = 1, which results in a nearly-monotone acceleration scheme. We can also set this to equal the average log-likelihood i.e., log-likelihood per individual sample. To summarize, the default usage of squarem() is such that squarem(par, fixptfn, objfn, ..., control = list()).
• fpiter(), for fixed-point iteration scheme fpiter() is a function to implement the fixed-point iteration algorithm including EM, MM and other EM-like algorithms, without any acceleration. The main arguments include par, fixptfn, objfn and control, which work the same way as in the squarem() except that there are no Squarem specific control parameters in the argument control.
In the next section, we demonstrate a detailed illustration of how to use SQUAREM.

How to apply SQUAREM acceleration
Imagine that an EM-like algorithm is used to estimate a model, with slow, linear rate of convergence. In order to speed up the algorithm using R Package SQUAREM, there are two main steps to be prepared. • Step 1 Write the part from the used EM-like algorithm into an R function such that it does only one step of the EM-like algorithm. This function corresponds to the argument fixptfn in function squarem().

• Step 2
Write an associated merit function to minimize, for example, the negative log likelihood function. This function corresponds to the argument objfn in function squarem().
There are also other arguments in function squarem() as specified in Section 3, such as starting values, tolerance and maximum number of iterations, but the default choices often work well. Once we have these arguments ready, we can launch the function squarem() to do the acceleration, simple and easy. We now illustrate this usage of R Package SQUAREM in detail with a simple example of mixture problem introduced below, which was also discussed in Varadhan and Roland [2008]. Here we revisit this example, mainly to illustrate how remarkably easy it is to apply the squarem() function, starting from an existing EM algorithm function.
In many studies, the study sample comes from a population which is a mix of two or more types of units, each with varying characteristics. Finite mixture models are ideally suited to account for this kind of heterogeneity. A finite mixture model estimates parameters describing each subpopulation and their mixing probabilities. EM algorithm is a popular technique to compute the maximum likelihood estimates for mixture models, but is notorious for its slow convergence. Here, we use a two-component Poisson mixture to illustrate the usage and power of Squarem compared to the EM algorithm. We use the data of the number of deaths of women 80 years and older during the years 1910-1912 from The London Times [Hasselblad, 1969].
We use p to denote the mixing probability and let µ 1 , µ 2 be the mean of Poisson distribution from population 1 and 2, respectively. Let i be the number of death, i = 0, 1, . . . , 9 and n i be the number of days when death i occurred.
EM algorithm For derivation of EM step, see Appendix A.1.
The EM update is such that are the (k + 1) th iteration of estimates and p (k) i is defined in Appendix A.1.
Next, we demonstrate how easy it is to set up Squarem acceleration of an EM-like algorithm. We implement the EM algorithm using function EM.poisson.mixture() below. In order to implement squared iterative scheme(Squarem) using function squarem(), we extract the part in the above EM function that corresponds to one EM step and put it into a separate function poissmix.em(). This function corresponds to the argument fixptfn in the squarem() function. Cutting and pasting the code chunk from the above function, we create the function for fixptfn and complete step 1 in applying Squarem. <-sum(y * i * (1 -zi)) / sum(y * (1 -zi)) + p <-pnew + return(pnew) + }

R>
Step 2 is to write an associated merit function to minimize, in this case, the negative log likelihood function. The log likelihood of observed data i, n i is such that: Therefore, the negative log likelihood is coded into function poissmix.loglik(). This function corresponds to the argument objfn in function squarem().
In the next section, we continue to demonstrate the utility of SQUAREM through a wide application of EM/MM problems, including interval censoring, genetics admixture, and logistic regression maximum likelihood estimation (an MM problem).

Factor analysis
Factor analysis is a statistical modeling approach that aims to explain the variability among observed variables in terms of a smaller set of unobserved factors. Factor analysis is widely applied in areas where observed variables may be conceptualized as manifesting from some unobserved latent factors, such as psychometrics, behavioral sciences, social sciences, and marketing. The latent factors can be regarded as missing data in a multivariate normal model and the EM algorithm [Dempster et al., 1977], therefore, becomes a natural way to compute the maximum likelihood estimates. We will illustrate using two examples the dramatic accelerations of EM by Squarem and also compare with ECME [Liu and Rubin, 1998], an acceleration of EM. One example comes from real data, as used by Liu and Rubin [1998] and Rubin and Thayer [1982], while the other is a simulation example.
Notations Following the notation in Rubin and Thayer [1982], let Y be n × p observed data matrix and Z be n × q unobserved factor matrix where q ≤ p. (Y i , Z i ), i = 1, 2, . . . , n are independently and identically distributed vectors following multivariate normal distribution. The marginal distribution of . Let the variance of each component of Z i , i = 1, 2, . . . , n be 1, so R is also the correlation matrix for Z i . Factor analysis model assumes that given the factors Z i , the components of vector Y i become independent and Y ip×1 |Z iq×1 ∼ multivariate normal α p×1 + β q×p Z iq×1 , τ 2 p×p where τ 2 = diag{τ 2 1 , τ 2 2 , . . . , τ 2 p }. β q×p is called factor loading matrix while the diagonal variances in τ 2 are called uniquenesses in factor analysis. In general, maximum likelihood factor analysis involves estimating α, β, τ 2 and R. But R is often considered identity matrix (orthogonal factor model) and the maximum likelihood estimator of α is alwaysȲ , the column means of observed data matrix Y . Suppose we center matrix Y by its column means, α is always vector zero. Therefore, we are left with only β, τ 2 to estimate. Given β, τ 2 , the marginal distribution of Y i , i = 1, 2, . . . , n is multivariate normal with mean zero vector and covariance matrix τ 2 + β β. Thus we can write the log likelihood of observed data matrix Y , where C yy is the sample covariance of Y . The negative log likelihood to be minimized is coded in function factor.loglik(), to check monotonicity in Squarem algorithm.
EM algorithm For derivation of EM step, see Appendix A.2.
If the loading matrix β is unrestriced, the EM update is such that where β (k+1) , τ 2 (k+1) are the (k + 1) th iteration of estimates and δ, ∆ are defined in the Appendix A.2.
Similarly, if the loading matrix β has a priori zeroes, the EM update is such that where j refers to the j th variable in vector Y i , i = 1, 2, . . . , n, subscript 1j corresponds to the factors with nonzero loadings for the j th variable and C yyj is the j th diagonal element of C yy . Next we consider two data examples to illustrate the simplicity and stability of Squarem to accelerate EM algorithm.
Real data example We use the real data from Jöreskog [1967] as in Rubin and Thayer [1982] and Liu and Rubin [1998]. The data consists of 9 variables, 4 factors, and 2 patterns of a priori zeroes for the loadings such that one a priori zero loadings on factor 4 for variables 1 through 4, and a different a priori zero loadings on factor 3 for variables 5-9. There is otherwise no restrictions. The sample covariance matrix C yy is given below: We use the starting values of β and τ 2 as in Liu and Rubin [1998], where and τ 2 j start = 10 −8 for j = 1, 2, . . . , 9.
The negative log likelihood is given by the function factor.loglik() below, which corresponds to the argument objfn} in \verbsquarem()+.

• EM
In order to perform the EM algorithm, we use function fpiter() in SQUAREM Package. The arguments consist of a starting value, function factor.em() that encodes one EM update, negative log likelihood function factor.loglik(), other variables as needed by these functions, and a control list to specify changes to default values. The starting value for β and τ 2 comes from Liu and Rubin [1998].
R> system.time(f2 <-fpiter(par = param.start, cyy = cyy, + fixptfn = factor.ecme, objfn = factor.loglik, + control = list(tol = 10^(-8), maxiter = 20000))) It only takes 876 iterations of EM updates to converge, which is faster by a factor of 17 and 7 in terms of the number of fixed point evaluations when compared to the EM and ECME, respectively. Moreover, Squarem only uses 0.233 seconds to converge, 12 times faster than the EM and 6 times faster than ECME. • Squared ECME Squarem algorithm can even be used to accelerate ECME, which is already a faster version of the EM algorithm. Let us call this Squared ECME.
In order to accommodate the randomness of CPU time, we run the above 4 schemes 100 times and summarize the mean and standard deviation of CPU running time into Table 4, along with the number of fixed point evaluations needed.  Table 4: The comparison of EM, ECME, Squarem and Squared ECME algorithms on real data from Jöreskog [1967]. Note the CPU time is in seconds.
Simulation example In the simulation example, we generate 200 observations of 32 subject scores and we assume that there are 4 latent factors. In this case, we do not impose any priori zero loadings for convenience of comparison. The function used to generate data and compare the performance of EM to Squarem is coded into simulate.FAEM(), accessible from the replication script.
The results of comparison of CPU running time and the number of EM evaluations between EM algorithm and Squarem are summarized in Figure 2. It can be seen that Squarem performs consistently better than EM algorithm for both criteria, by a factor of at least 10 in most cases.

Interval censoring
Interval censoring is a common phenomenon in survival analysis, where we do not observe the precise time of an event for each individual, but we only know the time interval during which the individual's event occurs. Following the notations in Gentleman and Geyer [1994], we assume that survival time, X, also known as failure time, come from a distribution F . Each individual i goes through a sequence of inspection times t i,1 , t i,2 , . . .. The survival time x i for individual i is not observed, however, the last inspection time prior to x i and the first inspection time after are recorded. An example of interval censored data is displayed in Table 5.
Individual n 3 4 Table 5: The example of interval censored data(unit: year). Therefore, data consists of time intervals I i = (L i , R i ) for each individual i, i = 1, 2, . . . , n and the event for individual i is known to happen during that interval. Let {s j } m j=0 be the unique ordered times of {0, {L i } n i=1 , {R i } n i=1 }, and α ij , i = 1, 2, . . . , n, j = 1, 2, . . . , m, the ij cell of an α matrix, be such that α ij = 1 if (s j−1 , s j ) ⊆ I i , the event for individual i can occur in (s j−1 , s j ) 0 otherwise and p j = F (s j −) − F (s j−1 ), p = (p 1 , p 2 , . . . , p m ) . The log likelihood of data is therefore The negative log likelihood is coded in function loglik(), corresponding to the argument objfn in squarem(). A in function loglik() refers to the alpha matrix α and pvec is the vector of probabilities, p.  where p (k+1) is the (k + 1) th iteration of estimates and µ ij = αij pj s αisps , i = 1, 2, . . . , n, j = 1, 2, . . . , m. Such one EM update is written in function intEM(), corresponding to the argument fixptfn in squarem().
Real data example The real data comes from Finkelstein and Wolfe [1985], which gives the interval when cosmetic deterioration occurred in 46 individuals with early breast cancer under radiotherapy. We use function Aintmap in R Package interval to produce matrix α and then generate starting values. We modified the function EMICM() in R Package interval in order to keep the same starting values across all algorithms, a uniform starting value where p i = 1/m, i = 1, 2, . . . , m.
Next, we compare the performance of the above three algorithms, EM, Squarem and EM-ICM. We did not include EM-ICM for comparison in the number of EM evaluations because intrinsically EM-ICM algorithm is a hybrid algorithm where each EM-ICM step is different from an EM evaluation. The tolerance for convergence is set at 10 −8 , the same across all algorithms.
• EM algorithm R> system.time(ans1 <-fpiter(par = pvec, fixptfn = intEM, + objfn = loglik, A = A, control = list(tol = 1e-8))) [1] 0 R> max(abs(ans2$par -ans3$pf)) [1] 4.707805e-05 All three algorithms converge to the same point as evidenced by the maximum difference in absolute value between algorithms-produced parameter estimates. The EM algorithm performs fairly well on this real dataset, perhaps due to the small sample size. Even so, Squarem still outperforms the EM by a factor of 5 in terms of the number of EM evaluations and a factor of 4 in CPU running time. We show in the following section that Squarem and EM-ICM algorithms are more advantageous than the EM algorithm using a simulated example as sample size increases.
Simulation example For each individual, we randomly generate censored intervals by creating a survival time (event) and a stochastic sequence of inspection times. The left end of interval is the last inspection time before the event while the right end is the first inspection time after. The function we use to generate interval censored data is coded in gendata(). It can be seen from Figure 3 that Squarem and EM-ICM algorithms are both, on average, approximately 10 times faster than the EM, for a moderate sample size n = 200. The performance of both algorithms are comparable. Although Squarem has a lower median CPU running time, its distribution is more variable than EM-ICM algorithm. In order to show the improvement of the number of EM evaluations for Squarem over EM algorithm, we summarize the mean and standard deviation of the number of EM evaluations for both algorithms in Table 7 for this simulation study.

EM Squarem
Mean 6745 446 Standard deviation 4737 306 Table 7: The comparison of mean and standard deviation of the number of EM evaluations between EM algorithm and Squarem on simulated data example for a moderate sample size n = 200.
On average, EM algorithm takes 15 times more EM steps to converge than Squarem for this simulation study with a moderate sample size n = 200. Next, we increase the sample size from n = 200 to n = 2000 and evaluate the performance of the EM, Squarem and EM-ICM algorithms again on 100 simulated interval censored datasets. As sample size expands to n = 2000, Figure 4 shows that the advantage of Squarem and EM-ICM algorithms becomes greater. Both algorithms on average converge at least 17 times faster than the EM algorithm. EM-ICM is specifically tailored to interval censoring maximum likelihood estimation, hence it is not surprising that it outperforms the EM algorithm by a large margin. However, it is noteworthy that Squarem, which is a general purpose EM-like algorithm accelerator, is very competitive with EM-ICM algorithm as shown in this simulation study.  Table 8: The comparison of mean and standard deviation of the number of EM evaluations between EM algorithm and Squarem on simulated data example for a large sample size n = 2000.

Genetics global ancestry estimation problem
Here we demonstrate the use of Squarem to solve an important problem in quantitative genetics that is notoriously computationally challenging. Suppose our study population is an admixed population with K ancestral populations. The goal is to estimate the proportion of ancestry from each contributing population for each individual's entire genome and simultaneously estimate the allele frequencies of the K ancestral populations. Let us use q i = (q i1 , q i2 , . . . , q iK ) to denote such admixture proportions for individual i, i = 1, 2, . . . , n where q ik is the proportion of subject i genome that is attributed to the ancestral population k,k = 1, 2, . . . , K and n is the number of subjects. Let Q be the n × K admixture proportions matrix. We assume that all p genome-wide markers are bi-allelic (either allele 1 or allele 2). Let F be the p × K population allele frequency matrix with f jk being the frequency of allele 1 at marker j, j = 1, 2, . . . , p in population k. Matrix F and Q consist of parameters we are interested in estimating. The data consists of genetic polymorphism data sampled from n diploid individuals. Specifically, we have recorded the genotype at p genetic polymorphisms ("markers") for each individual. Genotype at marker j for individual i is represented as allele 1 counts, x ij = 0, 1, 2. We assume that individuals are independent and under ADMIXTURE model, the log likelihood of data is: where C is a constant that does not contain parameters from F and Q. See Alexander et al. [2009] for full description of model.
The negative log likelihood of data is coded in function loglike(), corresponding to the argument objfn in squarem().

The EM update of matrix F and Q is such that
jk , m ik are defined in the Appendix A.4. This one EM evaluation is written in function admixture.em(), corresponding to the argument fixptfn in squarem(). We adapt the code provided on Peter Carbonetto github account [Carbonetto, 2016].
• EM algorithm R> load("geno.sim.RData") R> set.seed(413) R> p <-100 R> n <-150 R> K <-3 R> F <-matrix(runif(p * K), p, K) R> Q <-matrix(1/K, n, K) R> param.start <-c(as.vector(F), as.vector(Q)) R> system.time(f1 <-fpiter(par = param.start, fixptfn = admixture.em, + objfn = loglike, control = list(tol = 1e-4), X = geno, K = 3)) In this example, Squarem outperforms the EM algorithm by a factor of 4 in terms of CPU running time as well as the number of EM evaluations. For large genetic datasets, the E-step is by far the most computationally intensive part of the algorithm. For a faster implementation of the E-step using C (and interfaced to R using the .Call() function), see Carbonetto [2016]. Although the admixutre problem is naturally framed using EM, its convergence is very slow. Alexander et al. [2009] implemented faster solution to this problem (using a block relaxation optimization method), which has permitted application of ADMIXTURE to very large genetic datasets. Our EM-based implementation in R is much slower than the ADMIXTURE software, but, nevertheless it serves to illustrate the benefits of Squarem in a difficult optimization problem from genetics.

MM algorithm -logistic regression maximum likelihood estimation
In this section, we discuss a quadratic majorization algorithm (an MM algorithm) for computing the maximum likelihood estimates of logistic regression coefficients. Minorize and maximize or equivalently, majorize and minimize (MM) algorithms typically exhibit slow linear convergence just like the EM algorithm. We show that Squarem can provide significant acceleration of MM algorithms.
EM algorithm may be viewed as a special case of MM algorithms [Zhou and Zhang, 2012]. The majorization algorithms are widely applied, for example, in the work of De Leeuw [1994], Heiser [1995], Lange et al. [2000], among others. Suppose we want to minimize function f over X ⊆ R n . We construct a majorization function g on X × X such that f (x (k) ) = g(x (k) , x (k) ) ∀x (k) ∈ X, where k denotes the k th iteration, k = 0, 1, . . .. Therefore, instead of minimizing f , we minimize g such that We repeat the updates of x until convergence and this completes the majorization algorithm. Note that in the EM algorithm, the Q(θ; θ k ) function plays the role of the minorizing function.
Quadratic majorization algorithm Taylor's theorem often leads to quadratic majorization algorithms [Böhning and Lindsay, 1988] where the majorization function g is quadratic. By Taylor's theorem, expand f (x) at x (k) , where ξ is on the line between x and x (k) . The majorization function g is constructed by constructing a matrix B such that B − ∂ 2 f (ξ) is always positive semi-definite regardless of ξ. So, is a majorization function for f . Let us define a clever variable z (k) such that z (k) = x (k) − B −1 ∂f (x (k) ) and majorization function g is equivalent to the following: At the k th iteration, to minimize g(x, x (k) ) over x ∈ X is simply to minimize (x − z (k) ) B(x − z (k) ), thus the majorization algorithm becomes: ). Therefore, in order to implement quadratic majorization algorithm, we need to construct the matrix B and compute the gradient of function f . Let us consider logistic regression maximum likelihood estimation. Suppose we have an n × p design matrix X where there are n subjects and p predictors. Let y i be the number of successes for subject i ,i = 1, 2, . . . , n given the overall number of experiments, N i . We use β to denote the regression coefficients. The goal is to derive the maximum likelihood estimates of β.
The negative log likelihood of data is: .
The gradient of f (β) is such that: ). Based on the fact that p i (β)(1 − p i (β)) ≤ 1 4 , the matrix B can be constructed such that B = 1 4 X N X where N is the diagonal matrix consisting of elements N i . Thus the quadratic algorithm becomes: Let us denote the above algorithm by uniform bound quadratic algorithm since p i (β)(1 − p i (β)) ≤ 1 4 uniformly for any β and each subject i. Jaakkola and Jordan [2000] and Groenen et al. [2003] developed a non-uniform bound, X W (β)X, where W (β) is a diagonal matrix that consists of elements w i (β) = N i 2pi(β)−1 2x i β , i = 1, 2, . . . , n. Thus the non-uniform bound quadratic algorithm becomes: Real data example We use the cancer remission data in Lee [1974]. The outcome is a binary indicator of whether cancer remission occurred for the subject. Column 1 is the intercept and variables V 2, V 3, . . ., V 7 are results of six medical tests. The first five lines of data are as follows: R> ld <-read.  R> binom.loglike <-function(par, Z, y) { + zb <-c(Z %*% par) + pib <-1 / (1 + exp(-zb)) + return(as.numeric(-t(y) %*% (Z %*% par) -sum(log(1 -pib)))) + } The uniform bound quadratic majorization algorithm update and the non-uniform one are written in function qmub.update(), qmvb.update(), respectively, corresponding to the argument fixptfn in squarem().
• Uniform bound quadratic majorization algorithm R> library("SQUAREM") R> Z <-as.matrix(ld[, 1:7]) R> y <-ld[, 8]p0 <-rep(10, 7) R> system.time(ans1 <-fpiter(par = p0, fixptfn = qmub.update, + objfn = binom.loglike, + control = list(maxiter = 20000), Z = Z, y = y)) All four algorithms converge to the same maximum likelihood estimates but Squarem improves on both uniform and non-uniform bound quadratic majorization algorithms in terms of the number of quadratic majorization updates and CPU running time (in seconds). For uniform bound, Squarem converges around 5 times faster and saves the number of quadratic majorization updates by a factor of 10. The non-uniform bound quadratic majorization improves on the uniform bound one, but its Squared version provides further acceleration. Compared to non-uniform bound algorithm, Squarem shortens the computing time by a factor of 3 and cuts the number of quadratic majorization updates by a factor of 5.
We randomly generate 500 starting values β (0) = (U (0, 10), U (0, 10), . . . , U (0, 10)) where U (0, 10) is a uniform random variable in the range of (0, 10). We then summarize the performance for these four algorithms in terms of CPU running time and the number of QM (quadratic majorization) updates in Figure 5 and Figure 6.  Figure 6 clearly show that Squarem provides substantial acceleration for both uniform bound and nonuniform bound quadratic majorization algorithms. Table 9 displays that for different starting values, on average, compared to the uniform bound QM algorithm, its corresponding Squared version saves the number of QM updates by a factor of 7 and runs 5 times faster while compared to the non-uniform bound QM algorithm, an already faster version over the uniform bound QM algorithm, the Squared version further improves by a factor of 3 in CPU running time and a factor of 5 in the number of QM updates.   Table 9: The comparison among basic and Squared uniform bound quadratic majorization algorithms, and basic and Squared non-uniform bound quadratic majorization algorithms in terms of the mean and standard deviation of CPU running time and the number of quadratic majorization (QM) updates for cancer remission data with 500 randomly selected starting values.

Discussion
Since the seminal work of Dempster et al. [1977], EM and its variants have become the workhorse of computational statistics. More broadly, there are iterative algorithms which do not fit into the missing-data framework, but which are EM-like in the sense that they exhibit slow, monotone, global convergence like the EM algorithm. These include the minorize and maximize (MM) algorithm. Even more broadly, we can include all fixed-point iterations which are contractive [Ortega and Rheinboldt, 1970b] and linearly convergent. They are broader in the sense that there need not be an objective function (e.g., log-likelihood) associated with the contraction mapping. The remarkable fact is that it is extremely easy to use Squarem to try and accelerate these iterative algorithms. All that the user has to do is to create a function, fixptfn(), that implements a single step of the fixed-point iteration, whether it is EM/ECM/ECME/GEM/MM or any other contractive mapping. The objective function, objfn(), is optional, although we recommend that it be provided, if it is easy to code. The convergence criterion used in function squarem() is stringent for high-dimensional problems, and in future versions, we will incorporate other parameter based criteria and criteria based on the objective function. In addition, there are other features of Squarem not illustrated in this paper, including the options of higher-order Squarem schemes and the tracking of algorithm's progress. For full features of Squarem, see https://cran.r-project.org/web/packages/SQUAREM/SQUAREM.pdf.
The main theme of the paper is that existing modeling problems based on EM-like algorithms can potentially be made computationally more efficient by using the convergence acceleration provided by Squarem. This is particularly true in high-dimensional problems where EM-like algorithms can be excruciatingly slow. There are several examples in the literature where Squarem has been effectively used. To name a few, Matthew Stephens lab at University of Chicago has been using SQUAREM in many of their models pertaining to genetic studies where a very large number of parameters are estimated [Shiraishi et al., 2015;Raj et al., 2015, among others]. Patro et al. [2014] incorporated Squarem in the development of their new computational method, "Sailfish", to substantially increase the efficiency of processing sequencing reads. More recently, Chiou et al. [2017] have used SQUAREM to speed up the estimation in a semi-parametric model for panel recurrent event count data. There are more such examples.
Squarem is computationally efficient. It requires little effort beyond that of the basic fixed-point iteration. The computation of Squarem parameter update is trivially easy since it only requires a couple of vector products. The only place where some inefficiency could occur is in the evaluation of the objective function to assess whether the Squarem step can be accepted. When the Squarem step results in non-monotonicity, it is rejected and instead the most recent EM update is retained. This results in some wasted effort. This is typical for algorithms with fast local convergence, for example, the Newton's method in unconstrained optimization which needs to be safe-guarded with line-search or trust-region approach to ensure global convergence. However, Squarem is efficient when the objective function evaluation is relatively cheaper than that of the fixed-point iteration, which is often the case.
Squarem can be used off-the-shelf since there is no need for the user to tweak any control parameters to optimize its performance. Given its ease of application, Squarem may be considered as a default accelerator for EM-like algorithms. We invite the reader to try Squarem acceleration on his/her own EM-like algorithm or any slowly convergent, contractive fixed-point iteration.

A.1 Poisson mixtures
Let us define the missing variable Z i ,i = 0, 1, . . . , 9 such that Z i = 1 if death number i comes from population 1 0 if death number i comes from population 2 .
Let P (Z i = 1) = p,i = 0, 1, . . . , 9. Given Z i = 1, the death number i comes from population 1, following Poisson distribution with mean µ 1 while otherwise, the death number i comes from population 2, following Poisson distribution with mean µ 2 .
• E-step We take the derivative of Q function with respect to p, µ 1 , µ 2 and set to zero in order to derive the estimates of (k + 1) th iteration. Let Similarly we can derive that .

A.2 Factor analysis
• E-step where C is a constant that does not depend on parameters and k denotes the current estimates of parameters.
Therefore, E[C yy |Y, τ 2 (k) , β (k) ] = C yy E[C yz |Y, τ 2 (k) , β (k) • M-step If the loading matrix β is unrestriced: In order to obtain the maximizer of Q function in E-step, we take the derivative of Q function with respect to β, τ 2 −1 (for convenience) and set to zero.

A.3 Interval censoring
• E-step α ij tells us the possibility that the event for individual i can occur in interval (s j−1 , s j ) but we do not observe whether it actually occurs. Let us encode this missing information in a new defined variable Z ij such that: Z ij = 1 if the event for individual i occurs in (s j−1 , s j ) 0 otherwise .
With this missing variable defined, we now write the Q function in E-step.

So,
Q(p|p (k) ) = i j log p j µ ij • M-step The probability vector that maximizes Q function serves as the new estimates p (k+1) . We introduce a lagrange multiplier λ to incorporate the constraint that j p j = 1. Take derivative with respect to p j , dQ(p|p (k) ) dp j = d( i j log p j µ ij + λ(1 − j p j )) dp j (1 − f jk ) I(u a ij =0) ) I(z a ij =k) Thus, n (u) jk ,u = 0, 1 and m ik for the t th , t = 0, 1, . . . , iteration are computed by summing the joint posterior probabilities using F (t) and Q (t) .
• M-step Take the derivative of Q function with respect to f jk and q ik gives us the next iteration of matrix F and Q that f jk = n