teigen: An R Package for Model-Based Clustering and Classification via the Multivariate t Distribution

The teigen R package is introduced and utilized for model-based clustering and classification. The tEIGEN family of mixtures of multivariate t distributions is formed via an eigen-decomposition of the component covariance matrices and subsequent component-wise constraints. The teigen package implements all previously published tEIGEN family members as well as eight additional models: four multivariate and four univariate. The resulting family of 32 mixture models is implemented in both serial and parallel, with useful dedicated functions. Methodology and examples that illustrate teigen's functionality are presented.


Introduction
The usage of mixture models for cluster analysis is commonly referred to as model-based clustering. A random vector X arises from a parametric finite mixture model if its density can be written

f(x | ϑ) = ∑_{g=1}^G π_g p_g(x | θ_g),

where ϑ denotes all model parameters, G is the number of components, the π_g are the mixing weights such that ∑_{g=1}^G π_g = 1 and all π_g > 0, and the p_g(x | θ_g) are parametric densities with parameters θ_g.
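To make the definition concrete, the sketch below builds a two-component univariate mixture (with t components, anticipating the models that follow; all parameter values are made up for illustration) and verifies that it integrates to one.

```r
# A minimal sketch of a finite mixture density (illustrative parameter
# values, not estimates from any data set). Each component is a scaled
# univariate t density with its own location, scale, and df.
pig <- c(0.4, 0.6)                  # mixing weights, sum to 1
mu <- c(-2, 3); sigma <- c(1, 1.5)  # component locations and scales
nu <- c(5, 10)                      # degrees of freedom

mixture_density <- function(x) {
  Reduce(`+`, lapply(seq_along(pig), function(g)
    pig[g] * dt((x - mu[g]) / sigma[g], df = nu[g]) / sigma[g]))
}

# A valid density integrates to 1 over the real line
integrate(mixture_density, -Inf, Inf)$value
```

The same construction carries over to the multivariate case by replacing the component densities accordingly.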
The multivariate Gaussian distribution has received the bulk of researchers' attention over the past couple of decades (e.g., Banfield and Raftery 1993; Celeux and Govaert 1995).

The tEIGEN family

Andrews and McNicholas (2012) introduce the 24-member tEIGEN mixture model family for model-based clustering and classification. The tEIGEN family is derived from mixtures of multivariate t distributions, whose density is

f(x | ϑ) = ∑_{g=1}^G π_g f_t(x | µ_g, Σ_g, ν_g),

where

f_t(x | µ_g, Σ_g, ν_g) = Γ((ν_g + p)/2) |Σ_g|^{−1/2} / [(π ν_g)^{p/2} Γ(ν_g/2) (1 + δ(x, µ_g | Σ_g)/ν_g)^{(ν_g + p)/2}],

with mean vector µ_g, scale matrix Σ_g, degrees of freedom ν_g, and squared Mahalanobis distance δ(x, µ_g | Σ_g) = (x − µ_g)′ Σ_g^{−1} (x − µ_g).
Following the work of Banfield and Raftery (1993) and Celeux and Govaert (1995), an eigendecomposition is imposed on the scale matrix: Σ_g = λ_g D_g A_g D_g′, where D_g is the matrix of eigenvectors, A_g is the diagonal matrix of eigenvalues with |A_g| = 1, and λ_g is the associated constant of proportionality. These individual scalars/matrices can then be constrained to be equal across mixture components, or in some cases constrained to be the identity matrix: Σ_g = λ_g I A I = λ_g A, for instance. It is important to note that under certain constraints on this covariance matrix decomposition, the model being fit loses the property of scale invariance. As such, the teigen package will scale variables to have mean 0 and variance 1 by default (see Section 3 for controlling this option). In addition, constraints are also imposed on the degrees of freedom ν_g, following Andrews and McNicholas (2011a). Thus, taking all possible combinations of these constraints into consideration results in a 28-model family (see Table 1), 24 of which were originally developed for the tEIGEN family (Andrews and McNicholas 2012). Those 24 models correspond to 12 of the 14 decompositions of Σ_g used in the GPCM family along with the constraint for ν_g. The four remaining models, and the methodology required for their estimation, are discussed in Section 2.1.
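The decomposition above can be computed for any positive-definite scale matrix with base R's eigen(); the sketch below (with an arbitrary example matrix) recovers λ, D, and A under the constraint |A| = 1.

```r
# Sketch: eigen-decompose a scale matrix into lambda * D %*% A %*% t(D),
# where D holds eigenvectors, A is diagonal with det(A) = 1, and lambda
# is the constant of proportionality. The example matrix is arbitrary.
Sigma <- matrix(c(4, 1, 1, 2), nrow = 2)
p <- nrow(Sigma)

eig <- eigen(Sigma)
D <- eig$vectors
lambda <- prod(eig$values)^(1 / p)  # |Sigma|^(1/p), so that |A| = 1
A <- diag(eig$values / lambda)

# Check: the decomposition reproduces Sigma, and |A| = 1
max(abs(lambda * D %*% A %*% t(D) - Sigma))
det(A)
```

Constraining λ, D, or A across components, or fixing D or A to the identity, gives the various family members.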
All of the previously considered tEIGEN models are applicable only to multivariate data. In order for the package introduced herein to be applicable to univariate data, we introduce four more models into the tEIGEN family. The univariate mixture takes the form

f(x | ϑ) = ∑_{g=1}^G π_g f_t(x | µ_g, σ_g², ν_g),

where µ_g is the component mean and σ_g² is the scale parameter. We once again allow constraints on the scale and the degrees of freedom, by permitting them to be equal across groups. The four resulting models are summarized in Table 2. Hereafter, we refer to the combination of all 32 univariate and multivariate models as the 'tEIGEN family'.

Parameter estimation
As in the majority of mixture modelling implementations, we use a variant of the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) to perform parameter estimation. The EM algorithm consists of two steps (expectation and maximization) performed iteratively until convergence. On the E-step, we compute the expected value of the complete-data log-likelihood and, on the M-step, the parameters are updated by maximizing the expected complete-data log-likelihood. The expectation-conditional maximization (ECM) algorithm (Meng and Rubin 1993) replaces the M-step with several, more efficient, conditional maximization (CM) steps.
To implement an EM algorithm, we need to be able to compute the likelihood for our mixture models. First, we define a random variable Z_ig such that z_ig = 1 if observation i belongs to group g and z_ig = 0 otherwise. In a model-based classification (or semi-supervised) scenario, we observe the first k of n total observations with known group membership; the remaining n − k observations have unknown group membership. The likelihood can thus be written

L(ϑ | x, z) = ∏_{i=1}^k ∏_{g=1}^G [π_g f_t(x_i | µ_g, Σ_g, ν_g)]^{z_ig} × ∏_{j=k+1}^n ∑_{h=1}^H π_h f_t(x_j | µ_h, Σ_h, ν_h),   (1)

where H ≥ G is the number of components used for the observations with unknown memberships. In a model-based clustering scenario, none of the group memberships are known, and therefore the likelihood is simply the right-hand side of Equation 1 with k = 0 and H = G.

Table 1: tEIGEN model names and the number of covariance and degrees of freedom parameters requiring estimation. 'C' denotes constrained, 'U' denotes unconstrained, and 'I' denotes the identity matrix. '*' denotes the models being introduced in this manuscript.

Note that
there are alternative formulations which include information that falls somewhere between clustering and classification -for example, semi-supervised clustering (Melnykov, Melnykov, and Michael 2015) and fractionally supervised clustering (Vrbik and McNicholas 2015) -but these paradigms are not addressed with the teigen software.
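For intuition, the classification likelihood can be coded directly; the sketch below uses univariate t components with made-up parameters, and it reduces to the ordinary mixture log-likelihood when k = 0.

```r
# Sketch: evaluating the semi-supervised log-likelihood for a univariate
# t mixture with illustrative (not fitted) parameters. The first k
# observations carry known labels z; the rest enter via the mixture sum.
pig <- c(0.5, 0.5); mu <- c(0, 4); sigma <- c(1, 1); nu <- c(7, 7)
dcomp <- function(x, g) dt((x - mu[g]) / sigma[g], df = nu[g]) / sigma[g]

loglik <- function(x, z, k) {
  lab <- 0
  if (k > 0) {
    for (i in 1:k) for (g in seq_along(pig)) {
      lab <- lab + z[i, g] * log(pig[g] * dcomp(x[i], g))
    }
  }
  unlab_x <- x[seq_along(x) > k]
  unlab <- sum(log(sapply(unlab_x, function(xi)
    sum(pig * sapply(seq_along(pig), function(g) dcomp(xi, g))))))
  lab + unlab
}

x <- c(-0.2, 0.1, 3.8, 4.3)
z <- rbind(c(1, 0), c(1, 0))  # known labels for the first k = 2 points
loglik(x, z, k = 2)
```

Setting k = 0 (and ignoring z) recovers the model-based clustering likelihood.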
The CCUC, CCUU, UCUC, and UCUU models were not previously introduced due to issues with their estimation procedures. Following recent work by Browne and McNicholas (2014) that is implemented in the mixture package, an iterative majorize-minimize (MM) algorithm is used in the estimation of the common eigenvectors. As an illustration, we provide the specifics for the CCUC model, where Σ_g = λ D A_g D′.

E-step (CCUC)
The E-step involves computing the conditional expected values of the component indicator Z_ig and of the characteristic weight W_ig:

ẑ_ig = π̂_g f_t(x_i | µ̂_g, Σ̂_g, ν̂_g) / ∑_{h=1}^G π̂_h f_t(x_i | µ̂_h, Σ̂_h, ν̂_h)   and   ŵ_ig = (ν̂_g + p) / (ν̂_g + δ(x_i, µ̂_g | Σ̂_g)).

Table 2: Univariate tEIGEN model names and the number of free variance and degrees of freedom parameters requiring estimation.

The ẑ_ig represent the probability that observation i belongs to group g given the current component parameters. The ŵ_ig can be thought of as a weight for how much influence observation i has on the estimation of µ_g and Σ_g. This interpretation may become clearer in the section that follows, when viewing the updates for µ̂_g and the sample covariance matrix S_g.
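These expectations are straightforward to compute; the sketch below does so for a univariate (p = 1) t mixture with illustrative parameter values, not teigen's internal code.

```r
# Sketch of the E-step quantities for a univariate t mixture (p = 1):
# z_hat[i, g] is the posterior membership probability; w_hat[i, g] is
# the characteristic weight (nu + p) / (nu + delta), where delta is the
# squared scaled distance. Parameter values are illustrative.
x <- c(-1.2, 0.3, 4.1, 5.0)
pig <- c(0.5, 0.5); mu <- c(0, 4.5); sigma2 <- c(1, 1); nu <- c(6, 6)
p <- 1

dens <- sapply(seq_along(pig), function(g)
  pig[g] * dt((x - mu[g]) / sqrt(sigma2[g]), df = nu[g]) / sqrt(sigma2[g]))
z_hat <- dens / rowSums(dens)

delta <- sapply(seq_along(pig), function(g) (x - mu[g])^2 / sigma2[g])
w_hat <- sapply(seq_along(pig), function(g) (nu[g] + p) / (nu[g] + delta[, g]))

round(z_hat, 3)
```

Observations far from a component's location receive small ŵ, which is what gives the t mixture its robustness to outliers.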

CM-steps (CCUC)
The CM-steps involve conditionally maximizing the parameters with respect to the complete-data log-likelihood. In the first of two CM-steps, the mixing proportions, component means, and degrees of freedom are updated:

π̂_g = n_g / n   and   µ̂_g = ∑_{i=1}^n ẑ_ig ŵ_ig x_i / ∑_{i=1}^n ẑ_ig ŵ_ig,

where n_g = ∑_{i=1}^n ẑ_ig. We discuss a novel alternative estimation method for the degrees of freedom in Section 2.2. In the second CM-step, the decomposed elements of the covariance matrix are updated according to the following MM algorithm:

1. Initialize the estimates and set the iteration counter t = 1.
2. Update the common eigenvector matrix D̂ using the current dummy matrix U.
3. If t is odd, update the dummy matrix U according to (a), else update according to (b). Here s_g is the largest eigenvalue of n_g S_g and a_g is the largest eigenvalue of (λ̂Â_g)^{−1}.
4. Update Â_g and λ̂.
5. Check whether the increase in the objective function exceeds ε_1. If TRUE, set t = t + 1 and return to step 2, else end.

Note that ε_1 is user defined with a default value of 0.001. See the discussion of eps in Section 3 for setting this value.
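The first CM-step updates can be sketched directly; the code below uses made-up ẑ and ŵ matrices in place of real E-step output (an illustration of the updates, not teigen's internal code).

```r
# Sketch of the first CM-step: mixing proportions and weighted means,
# using made-up z_hat / w_hat matrices in place of real E-step output.
x <- c(-1.0, 0.2, 3.9, 5.1)
z_hat <- rbind(c(.95, .05), c(.90, .10), c(.05, .95), c(.02, .98))
w_hat <- matrix(1.1, nrow = 4, ncol = 2)  # weights near 1 for clean data

n_g <- colSums(z_hat)
pi_hat <- n_g / length(x)                              # mixing proportions
mu_hat <- colSums(z_hat * w_hat * x) / colSums(z_hat * w_hat)

round(c(pi_hat, mu_hat), 3)
```

Note how each mean is a doubly weighted average: ẑ allocates observations to components and ŵ downweights outlying points.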

Degrees of freedom
In recent work (Andrews, McNicholas, and Subedi 2011; Andrews and McNicholas 2011a,b, 2012), the degrees of freedom are updated using numeric estimation of the following non-closed-form equations. For unconstrained degrees of freedom,

log(ν̂_g/2) − ϕ(ν̂_g/2) + 1 + (1/n_g) ∑_{i=1}^n ẑ_ig (log ŵ_ig − ŵ_ig) + ϕ((ν_g^old + p)/2) − log((ν_g^old + p)/2) = 0

is solved for ν̂_g, where ϕ(·) denotes the digamma function, while constraining across groups leads to the following equation that is solved for ν̂:

log(ν̂/2) − ϕ(ν̂/2) + 1 + (1/n) ∑_{g=1}^G ∑_{i=1}^n ẑ_ig (log ŵ_ig − ŵ_ig) + ϕ((ν^old + p)/2) − log((ν^old + p)/2) = 0.   (2)

In terms of the numeric solution, the uniroot function from R is used, which is based on the zeroin subroutine of Brent (1973).
Herein, we introduce a novel closed-form approximation for constrained degrees of freedom.
Defining

k = −1 − (1/n) ∑_{g=1}^G ∑_{i=1}^n ẑ_ig (log ŵ_ig − ŵ_ig) − ϕ((ν^old + p)/2) + log((ν^old + p)/2),

we can then find an approximation with

ν̂ = −exp(k) / (1 − exp(k)).

If we alternatively define

k = −1 − (1/n_g) ∑_{i=1}^n ẑ_ig (log ŵ_ig − ŵ_ig) − ϕ((ν_g^old + p)/2) + log((ν_g^old + p)/2),

then the approximation holds for ν̂_g (unconstrained degrees of freedom) instead. Further details and justification for this approximation can be found in Appendix A.

Initialization
EM algorithms for both clustering and classification require either the initialization of the unknown z_ig (which uses maximum likelihood estimation to initialize the parameters) or initialization of the model parameters (which uses expected value computations to initialize the z_ig). The teigen package makes use of the former, with several built-in options for initialization: "kmeans", "hard" random, "soft" random, "emem" (which is described in the paragraph that follows), and "uniform". The "uniform" option, as described by Andrews et al. (2011), sets the initial ẑ_ig = 1/G and is only available for classification scenarios because otherwise the algorithm could not progress. The difference between hard and soft random is the nature of the randomly initialized z_ig: hard refers to 0's and 1's, while soft takes on random values between 0 and 1, inclusive. As a further alternative, the user can give specific cluster memberships as an initialization, as described in Section 3. The "emem" option is based on the "emEM" approach introduced by Biernacki, Celeux, and Govaert (2003). If this initialization is used, then one of the teigen models (chosen by the user) is run for a set number of iterations (chosen by the user) for a set number of starts (chosen by the user) based on one of the other initialization methods ("soft", "hard", or "kmeans"). Then, the resulting fit that maximizes the log-likelihood is used to initialize the main algorithm.
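The two random initializations can be sketched in a few lines; this illustrates the idea and is not teigen's internal code.

```r
# Sketch: hard vs soft random initialization of the n x G matrix z.
# Hard: each row is a 0/1 indicator. Soft: each row is a random point
# on the simplex (non-negative entries summing to 1).
set.seed(1)
n <- 6; G <- 3

hard_init <- function(n, G) {
  z <- matrix(0, n, G)
  z[cbind(1:n, sample(G, n, replace = TRUE))] <- 1  # one 1 per row
  z
}

soft_init <- function(n, G) {
  z <- matrix(runif(n * G), n, G)
  z / rowSums(z)  # normalize each row to sum to 1
}

hard_init(n, G)
soft_init(n, G)
```

In either case, each row of z sums to one, so the M-step can proceed immediately from the initialized memberships.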
We note here that the EM algorithm, in the context of multivariate mixture models and particularly as the number of groups increases, is quite prone to converging on local maxima. This, in turn, means that teigen can be quite sensitive to starting values. For this reason, the default "kmeans" initialization uses 50 random starting points, and the "soft" or "hard" (completely random) initialization methods are not recommended unless used within the "emem" approach. Users familiar with the previously noted mclust package should keep in mind that the Mclust() command uses model-based hierarchical clustering to initialize the EM algorithm, thus leading to deterministic results. This behaviour can be replicated (as in Section 4.4) by using a deterministic clustering method as a custom initialization.
Note that the degrees of freedom must be initialized, and this value is also user-specified; the default value is 50.

Estimated time remaining
The user has the option of having the function return, on a continual basis, the time the algorithm has run thus far, the estimated time remaining, and the percent complete on a single line. An underlying procedure updates the time estimates after each model has run; 'each model' here refers to each of the G × 28 models individually (28 being the number of multivariate models in the family). The computational overhead for the time estimates is minimal: in the range of a second or two under default (9 × 28 = 252 models) settings. Because the estimate is not updated during each EM iteration, longer model fittings will not lead to longer overhead for the time estimates. Setting verbose = FALSE will silence the output, should the user so desire.

Convergence
Convergence in our algorithm is determined by Aitken's acceleration (Aitken 1926), which at iteration t is given by

a^(t) = (l^(t+1) − l^(t)) / (l^(t) − l^(t−1)),

where l^(t−1) refers to the log-likelihood at iteration t − 1, and so on. Böhning, Dietz, Schaub, Schlattmann, and Lindsay (1994) propose the usage of a^(t) to compute an asymptotic estimate of the log-likelihood at iteration t + 1 by

l_∞^(t+1) = l^(t) + (l^(t+1) − l^(t)) / (1 − a^(t)).

We use the stopping criterion l_∞^(t+1) − l^(t) < ε_2 from Lindsay (1995) for user-specified ε_2, with ε_2 = 0.1 as the default. See the discussion of argument eps in Section 3 for setting this value.
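The stopping rule can be sketched as follows; the function name is ours, not part of teigen's API.

```r
# Sketch of Aitken-based convergence checking. l is a vector of
# successive log-likelihood values; the newest value is l[t + 1].
aitken_converged <- function(l, eps2 = 0.1) {
  t <- length(l) - 1
  a <- (l[t + 1] - l[t]) / (l[t] - l[t - 1])      # Aitken's acceleration
  l_inf <- l[t] + (l[t + 1] - l[t]) / (1 - a)     # asymptotic estimate
  l_inf - l[t] < eps2                             # Lindsay (1995) criterion
}

# Early on, large jumps in the log-likelihood keep the check FALSE;
# once gains flatten geometrically, the criterion is satisfied.
aitken_converged(c(-1210, -1150, -1131))
aitken_converged(c(-1210, -1150, -1131, -1130.9, -1130.89))
```

Comparing the asymptotic estimate to the current log-likelihood, rather than comparing successive log-likelihoods directly, guards against stopping during a slow but steady climb.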

Model selection
Selecting the "best" model is a challenge within model-based clustering or classification applications, but the common practice is to use the Bayesian information criterion (BIC; Schwarz 1978),

BIC = 2 l(x, θ̂) − r log n,

where l(x, θ̂) is the maximized log-likelihood, θ̂ is the maximum likelihood estimate of ϑ, r is the number of free parameters in the model, and n is the total number of observations.
Another model selection technique permitted within the teigen package is the integrated completed likelihood (ICL; Biernacki, Celeux, and Govaert 2000). The ICL makes use of the concept of uncertainty arising from the probabilistic nature of model-based clustering. Generally, and also in the case of the teigen package, the main result of a clustering/classification algorithm is a vector of group memberships. These are typically hardened to give maximum a posteriori (MAP) classifications via

MAP{ẑ_ig} = 1 if max_h {ẑ_ih} occurs at component h = g, and MAP{ẑ_ig} = 0 otherwise.

The ICL is then calculated via

ICL ≈ BIC + 2 ∑_{i=1}^n ∑_{g=1}^G MAP{ẑ_ig} log ẑ_ig,

which is essentially the BIC penalized by the amount of classification uncertainty contained in the model.
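Both criteria are simple functions of the maximized log-likelihood and the ẑ matrix; the sketch below uses illustrative numbers and the entropy-style ICL formula described above.

```r
# Sketch: BIC and ICL from a maximized log-likelihood, parameter count,
# and posterior membership matrix z_hat. All values are illustrative.
bic <- function(loglik, r, n) 2 * loglik - r * log(n)

icl <- function(loglik, r, n, z_hat) {
  map <- t(apply(z_hat, 1, function(z) as.numeric(z == max(z))))  # MAP{z}
  bic(loglik, r, n) + 2 * sum(map * log(z_hat))
}

z_hat <- rbind(c(.99, .01), c(.95, .05), c(.10, .90), c(.02, .98))
bic(-12.3, r = 5, n = nrow(z_hat))
icl(-12.3, r = 5, n = nrow(z_hat), z_hat)
```

Since log ẑ_ig ≤ 0, the ICL can never exceed the BIC; the two agree only when every observation is classified with certainty.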

Plots
Three graphics are available when plotting an object of class teigen. The first is a bivariate marginal contour plot, where the user can specify the desired variates as well as the resolution of the contours. The second plot is an uncertainty plot, where large dots signify large uncertainty in the classification. Because the ẑ_ig provide the probability that observation i belongs to group g, we often interpret max_g {ẑ_ig} as the 'certainty' of our classification of observation i. Conversely, we can interpret 1 − max_g {ẑ_ig} as the 'uncertainty' of our classification of observation i. The uncertainty plot is generated by using this number to determine the point size through the argument cex. If the user passes cex to the plot, it is used for the size of the smallest point on the graph. Furthermore, the uncmult argument can magnify the size differences for better readability if the user desires.
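The quantity driving the point sizes is simply one minus the largest posterior probability for each observation:

```r
# Sketch: classification uncertainty from the posterior matrix z_hat,
# as used to scale point sizes in the uncertainty plot.
z_hat <- rbind(c(.98, .02), c(.60, .40), c(.51, .49))
uncertainty <- 1 - apply(z_hat, 1, max)
round(uncertainty, 2)  # 0.02 0.40 0.49
```

With G = 2 the uncertainty is bounded above by 0.5, attained when an observation is split evenly between the components.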
The two plots mentioned above are only available if the teigen object was generated with multivariate data. The options xmarg and ymarg specify which variables from the data set to plot. By default the plot type is a character vector containing both "contour" and "uncertainty", but the user may choose to specify just one of these types to plot just one graph. If both types are provided, an interactive menu will be displayed so the user may switch back and forth between both graphs.
The third plot, the univariate density plot, is the default plot for univariate data and an optional plot for multivariate data. For multivariate data, if the user specifies ymarg = NULL, the function will plot a marginal univariate density using xmarg as the single variable. This plot includes the kernel density estimate from density(), the mixture distribution, and the color-coded component densities.

Code specifics
The teigen package contains the function teigen(), which allows the user to perform model-based clustering or classification in serial or parallel with some flexibility on the specifics. The function outputs an object of class teigen, for which dedicated print, plot, summary, and predict methods are also included. Most code was written and developed in R 3.3.2 (R Core Team 2017), with a couple of functions outsourced to C. We also note that running teigen in parallel depends on the parallel package (which is now part of the R core software). The main arguments to teigen() are as follows:
• Gs: A number or vector indicating the number of groups to fit. Default is 1-9.
• models: A character vector specifying the models to fit. Models can be chosen using the terminology from Tables 1 and 2. Alternatively, notation from the popular mclust package can be used, using "V" for variable and "E" for equal across groups. In this case, the first letter refers to volume, the second to shape, the third to orientation, and fourth to degrees of freedom. Furthermore, the user can specify common groups of models such as "all", "dfconstrained", "dfunconstrained", "univariate", and "gaussian". When "gaussian" is specified, the 14 multivariate Gaussian equivalents are used.
• init: A list of initializing classifications of the form init[[G]] that contains the initializing vector for all G groups considered (see example in Section 4.4). Alternatively, the user can specify a character string indicating an initialization method. Currently, the user can choose from "kmeans" (default), "hard" random, "soft" random, "uniform", and "emem". See Section 2.3 for further details on these options.
• scale: Logical indicating whether or not the function should scale the data. Default is TRUE -note that teigen models are not scale invariant.
• dfstart: The initialized value for the degrees of freedom. The default is 50.
• known: A vector of known classifications that can be numeric or character -must be the same length as the number of rows in the data set. If using in a true classification sense, give samples with unknown classification the value NA within known (see Section 4.3 below).
• training: Optional indexing vector for the observations whose classification is taken to be known.
• gauss: Logical indicating if the algorithm should use the Gaussian distribution. If models = "gaussian" then gauss = TRUE is forced.
• dfupdate: Character string or logical indicating how the degrees of freedom should be estimated. The default is "approx", indicating a closed-form approximation be used. Alternatively, "numeric" can be specified, which makes use of uniroot(). If FALSE, the value from dfstart is used and the degrees of freedom are not updated. If TRUE, "numeric" will be used for backward-compatibility.
• eps: Vector (of size 2) giving tolerance values for the convergence criterion. First value is the tolerance level for iterated CM-steps. Second value is tolerance for the EM algorithm: convergence is based on Aitken's acceleration (default) or lack of progress. See Sections 2.1 and 2.5 for relevant details.
• verbose: Logical indicating whether the running output discussed in Section 2.4 should be displayed. This option is not available in parallel. The output displayed depends on the width of the R window. With a width of 80 or larger: time run, estimated time remaining, and percent complete are all displayed.
• maxit: Vector (of size 2) giving maximum iteration number for the iterated CM-steps and EM algorithm, respectively. A warning is displayed if either of these maximums are met.
• convstyle: Character string specifying the method of determining convergence. Default is "aitkens", which uses the criterion based on Aitken's acceleration, but lack of progress "lop" may be specified instead.
• parallel.cores: Logical indicating whether to run teigen in parallel or not. If TRUE, then the function discerns the number of cores available and uses all of them. Alternatively, a positive integer may be provided indicating the number of cores the user wishes to use for running in parallel.
• ememargs: A list of the controls for the emEM initialization: numstart - number of starts (default 25); iter - number of EM iterations (default 5); model - character string for the model name to be used (default "UUUU", in the C, U, I nomenclature of Tables 1 and 2); init - character string for the initialization method used within emEM (one of "hard", "soft", or "kmeans"). The emEM initialization runs multiple randomized initialization attempts for a limited number of iterations, and then continues the model-fitting process from the best of them.
Output from teigen() is an object of class teigen, which can be manipulated as a two-pronged list object. The main contents are the results from the model chosen by the BIC, with an additional list containing the results from the model chosen by the ICL:

• x: Data used for clustering/classification.
• index: Indexing vector giving observations taken to be known (only available when clas is set greater than 0 or training is given).
• classification: Vector of group classifications as determined by the BIC.
• bic: BIC of the best fitted model.
• modelname: Name of the best model according to the BIC.
• allbic: Matrix of BIC values according to model and G. A value of -Inf is returned when a model does not converge.
• G: Value corresponding to the number of components chosen by the BIC.
• tab: Classification table for BIC model (only available when known is given). When classification is used, the "known" observations are left out of the table.
• fuzzy: The fuzzy clustering matrix for the model selected by the BIC.
• logl: The log-likelihood corresponding to the model with the best BIC.
• iter: The number of iterations until convergence for the model with the best BIC.
• parameters: List containing the fitted parameters: mean - matrix of means where the rows correspond to the components and the columns to the variables; sigma - array of scale covariance matrices (multivariate) or scale variances (univariate); lambda - vector of scale parameters, or constants of proportionality; d - array of eigenvectors, or orientation matrices; a - array of diagonal matrices proportional to the eigenvalues, or shape matrices; df - vector containing the degrees of freedom for each component; weights - matrix of the expected values of the characteristic weights; pig - vector giving the mixing proportions.
• iclresults: List containing all the previous outputs, except x and index, pertaining to the model chosen by the best ICL (all under the same name except allicl and icl are the equivalent of allbic and bic, respectively).
• info: List containing a few of the original user inputs, for use by other dedicated methods of the teigen class.
The plot method for objects of class teigen accepts the following arguments:

• xmarg: Scalar argument giving the number of the variable to be used on the x-axis.
• ymarg: Scalar argument giving the number of the variable to be used on the y-axis. Can be set to NULL for a univariate marginal density of xmarg (using x[, xmarg] as data).
• res: Scalar argument giving the resolution for the calculation grid required for the contour plot. Default is 200, which results in a 200 × 200 grid. Also determines how smooth the univariate density curves are (higher res, smoother curves). Ignored for uncertainty plots.
• what: Only used if the model provided by x is multivariate, this argument is a character vector stating which plots should be sent to the graphics device -choices are "contour" or "uncertainty". If not provided, an interactive prompt will appear and provide these options.
• alpha: A factor modifying the opacity for the plotted points. Typically provided on the interval [0, 1].
• col: A specification for the default plotting color; see section 'Colour Specification' in the par documentation. Note that the number of colors provided must equal the number of groups in the teigen object (extra colors are ignored).
• pch: Either an integer specifying a symbol or a single character to be used as the default in plotting points -see points documentation for possible values and their interpretation. If pch is one of 21:25, see bg for coloring.
• cex: A numerical value specifying the amount by which plotting text and symbols should be magnified relative to the default. For uncertainty plots, cex sets the size of the smallest point; the relative sizes amongst the points remain the same, so the sizes of all the points change accordingly.
• bg: Background (fill) color for the open plot symbols if pch is one of 21 : 25. If no bg is provided to color the inside of the points, then col will be used instead.
• lty: The line type for univariate plotting; see the par documentation for more information. Only updates the group curves, not the density or mixture curves.
• uncmult: A multiplier for the points on the uncertainty plot. The larger the number, the more the size difference becomes exaggerated.
• levels: Numeric vector giving the levels at which contours should be drawn. Default is to draw a contour in 0.25 steps, plus a contour at 0.001. This may result in more/less contours than desired depending on the resulting density.
• main: Optional character string for the title of the plot. A sensible default is used if left as NULL.
• xlab: Optional character string for x-axis label.
• draw.legend: Logical for a default generation of a legend to the right of the plot.
• ...: Options to be passed to plot.
Note that if ymarg is NULL or the model provided by x is univariate, then plot() will provide a univariate marginal density.

Examples
Herein, we present a number of clustering/classification examples to illustrate the teigen package. We make use of a number of data sets from the base R distribution, as well as a couple available in the gclus library (Hurley 2012). Note that all teigen() calls in this illustration will force verbose = FALSE as the output produced will be computer specific; the reader is encouraged to set this argument to TRUE (the default value) if desired. In the interest of reproducibility, we use the set.seed() function when appropriate.

Model-based clustering: geometric example
To begin, we provide a simple bivariate simulation to illustrate the geometric differences between some of the teigen models. First, we generate the data using clusterGeneration (Qiu and Joe 2015) - a package that can be used to simulate groups of data.
R> library("clusterGeneration")
R> set.seed(542687)
R> sim <- genRandomClust(2, sepVal = .35, numReplicate = 1,
+    outputDatFlag = FALSE, outputLogFlag = FALSE, outputEmpirical = FALSE)

The genRandomClust command above specifies simulating 2 groups, with a sepVal of 0.35 (well-separated groups); numReplicate indicates that we only seek one data set for this illustration. The remaining arguments are used to avoid writing files to the user's working directory. Now we fit several teigen models to this simulation and plot the results in Figure 1.

Model-based clustering: The basics on Old Faithful
The Old Faithful data set is available in base R distributions as faithful. It is bivariate data measuring the time to eruption and length of eruption, both in minutes. In this example, we use the entire tEIGEN family initialized with soft random starting values.

R> plot(teigen_faith, what = "contour")
The plot is shown in Figure 2. As noted in Section 2.3, using the "soft" initialization is not recommended due to the sensitivity of the EM algorithm. We can illustrate this sensitivity by re-running with a different seed.

Clustering Table:
  1   2   3
 47 128  97

---- ICL RESULTS ----
Loglik: -1130.19
ICL: -2328.266
Model: UUUC
# Groups: 2
Clustering Table:
  1   2
 97 175

It is worth noting that the ICL criterion still suggests the same two-group model. In order to illustrate the uncertainty plot, we plot the three-group solution in Figure 3 via

R> plot(teigen_faith2, what = "uncertainty")
With Figure 3, we can see the areas of high uncertainty in the classification results by focusing our attention on the larger points -the larger the point, the more uncertainty associated with its classification.
We can also illustrate the predict.teigen S3 method here by creating a new observation. By default, the BIC is used to select the model, but we can alternatively use the ICL via the argument modelselect.
R> predict(teigen_faith2, newdata = data.frame(eruptions = 2, waiting = 70))

Note that if the user specified that the data be scaled via scale = TRUE (the default), then predict.teigen will take care of the scaling when predicting new values. As such, issues will arise if the user inputs newdata that is not consistent with the original teigen call.

Semi-supervised model-based classification: the basics on iris
The famous iris data set is available in base R distributions as iris. It contains four measurements on 150 irises, hailing from three different species. We demonstrate the various ways of performing model-based classification with the teigen package. First, we use only the models from tEIGEN with unconstrained degrees of freedom. In a true classification scenario, there exists a subset of the data whose group membership is unknown. In this scenario, the known classification vector should have NA inputted for those values. The following example simulates these circumstances. We randomly take 50% of the data to have unknown membership and use the uniform initialization. The index of the observations that are taken to be known is stored in $index. Therefore, as an illustration, we can run the exact same analysis as before, but this time using a training index rather than specifying unknown classifications in the known vector.
R> trainingset <- tclass_iris3$index
R> tclass_iris2 <- teigen(iris[, -5], models = "dfunconstrained",
+    init = "uniform", known = iris[, 5], training = trainingset,
+    verbose = FALSE)
R> tclass_iris2$tab

Because we actually know the true classification of the irises, we can verify the accuracy of the algorithm this way. In this case, we have misclassified one of the 'unknown' observations as Iris versicolor, when in fact it is from the Iris virginica species.
The classification table above represents the results from one particular classification run, and therefore may not be indicative of teigen's classification performance on the iris data set. Therefore, we now fit teigen models using 100 different training sets, each with 2/3 of the data considered to have known group membership. We emphasize that this analysis is still in a semi-supervised context, so care needs to be taken in the interpretation of 'training' versus 'testing' sets, since all observations are used in the model fitting.
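The resampling protocol itself is generic; the sketch below illustrates it with MASS::lda standing in for teigen (so the loop runs quickly), drawing 100 random training sets containing 2/3 of the iris observations.

```r
# Sketch of the repeated-random-split protocol, with MASS::lda standing
# in for teigen as the classifier so that 100 fits run quickly.
library(MASS)
set.seed(1)
n <- nrow(iris)
misclass <- replicate(100, {
  train <- sample(n, size = round(2 * n / 3))        # 2/3 "known" labels
  fit <- lda(Species ~ ., data = iris, subset = train)
  pred <- predict(fit, iris[-train, ])$class          # classify the rest
  mean(pred != iris$Species[-train])                  # misclassification
})
mean(misclass)
sd(misclass)
```

With teigen in place of lda, the 'testing' observations would still enter the model fit (unlabelled), which is what distinguishes the semi-supervised setting from ordinary train/test validation.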

R> sd(misclass)
[1] 0.02076808

We can see an average misclassification rate of 3.1%, with a standard deviation of around 2.1%. We can also aggregate the classification tables from all of the runs. This shows us, for instance, that the Iris setosa species is never misclassified throughout the 100 runs.

Model-based clustering: custom initialization on wine
The wine data set is available in the gclus library as wine. It contains 13 chemical measurements on 178 samples of Italian red wine. Herein, we analyze the data using model-based clustering with the entire tEIGEN family and illustrate how to perform user-specified initializations. We initialize using hierarchical clustering via the hclust function in R. We also illustrate the usage of the ICL as model selection and show how to find model parameters.

Model-based clustering: emEM initialization on ckd
In the teigen package, we have included a cleaned-up version of the chronic kidney disease (ckd) data set that can be found in the UCI Machine Learning Repository (Lichman 2013). In this version, we have removed the categorical variables and any observations with missing values on the remaining variables. After these adjustments, the data set contains 203 patients with 11 diagnostic measures each and the classifying variable indicating whether or not they have chronic kidney disease.
We now apply teigen to this data set while using the "emem" initialization (described in Section 2.3).

) is UIUU with G=3
We can compare this to a similar fitting which uses only one instance of a soft random initialization.

) is UIUU with G=4
As we can see, the more thorough emem initialization results in a better model fit (BIC of −4086 versus −4103) when compared to a single random initialization.

Model-based clustering: univariate clustering on bank
The bank notes data set is available in the gclus library as bank. It contains a number of measurements on both counterfeit and genuine Swiss bank notes. We select the Diagonal variable to perform univariate clustering on.
R> plot(teigen_bank, xlab = "Diagonal (in mm)")

Note that the res argument determines the number of values evaluated for the internal curve() call for a univariate plot; thus it, again, affects resolution (see Figure 5).

Model-based clustering on a multi-core CPU
The parallelized method is now illustrated on the body data set from the gclus library. It contains a variety of physical measurements on male and female participants. Once again, we initialize via hierarchical clustering. The machine used for the demo analysis contains an eight-core CPU clocked at 3.5 GHz per core (AMD Phenom FX-8350). Note that if parallel.cores = TRUE, then the detectCores() function from the parallel package will detect the number of cores available and use all of them.

Model-based clustering: Comparison with mclust on body
In the final example, we compare teigen clustering results with those of a popular multivariate Gaussian mixture model package that contains the equivalent eigen-decomposed covariance structure: version 5.2 of the mclust R package (Fraley et al. 2016). Care must be taken to make the comparison valid, so first the body data set from the gclus package (Hurley 2012) is scaled and stored as sdata. Next, the same initialization that Mclust makes use of must be supplied to the teigen call, so a list is supplied with the hcVVV results. We note here that we reduce the number of groups considered from the default (1-9) to 1-4 for both methods. We assure the reader that neither method selects more than 4 groups, so this is only to reduce computation time.
R> mclustinit <- list()
R> hcfit <- hcVVV(data = sdata)
R> for (i in 1:4) {
+    mclustinit[[i]] <- hclass(hcfit, i)
+  }

Finally, the convergence criteria and tolerance levels for both the M-step and the EM cycles need to be matched up. We specify the same tolerance levels and convergence criteria in teigen as the defaults for Mclust.

Summary
The tEIGEN family was developed further by introducing the four remaining multivariate mixture models based on an eigendecomposition of the scale matrix, as well as four univariate mixture models, and by including a novel closed-form approximation for the degrees of freedom. Coinciding with this progress, the teigen R package was introduced and described in detail. It allows the user to fit the tEIGEN family of multivariate and univariate t distribution mixture models to numeric data in serial or parallel, with a number of built-in options and error-catches. Examples further illustrated how teigen can be used and showed its effectiveness as a clustering technique.

A. Closed-form degrees of freedom
Here we provide further details and justification for the approximation for the degrees of freedom introduced in Section 2.2. Rearranging Equation 2, we get

log(ν̂/2) − ϕ(ν̂/2) = −1 − (1/n) ∑_{g=1}^G ∑_{i=1}^n ẑ_ig (log ŵ_ig − ŵ_ig) − ϕ((ν^old + p)/2) + log((ν^old + p)/2).

The right-hand side is constant with respect to the newest estimate ν̂; for ease of what follows, we define this constant as k. Taking the exponential, we obtain

(ν̂/2) exp{−ϕ(ν̂/2)} = exp(k).

Now, exp{ϕ(ν̂/2)} may be approximated by ν̂/2 − 1/2 for ν̂ > 2; solving (ν̂/2)/(ν̂/2 − 1/2) = exp(k) for ν̂ yields the closed-form approximation ν̂ = −exp(k)/(1 − exp(k)). We plot the difference for this approximation over the interval (2, 200] in Figure 6. In terms of estimates, we can check the differences between the numeric estimate and this approximation (plotting results in Figure 7).

R> numeric_est <- function(x) {
+    sapply(x, function(w) uniroot(function(j) -w + log(j / 2) - digamma(j / 2),
+      interval = c(1, 300))$root)
+  }
R> curve(numeric_est(x) - (-exp(x) / (1 - exp(x))), from = 0.0051, to = 0.69,
+    xlab = "k")

Along the x-axis in Figure 7 are the constants k for the calculations. As such, the estimates along the left-hand side correspond to approximately 200 degrees of freedom, and those on the right to approximately 2 degrees of freedom. The magnitude of the difference between the closed-form approximation and the numeric estimate is relatively consistent, and the approximation is closer (within 0.17) at higher degrees of freedom. Because the error falls in a narrow band (magnitude between 0.168 and 0.300 for all relevant estimates), we introduce a simple corrective term that makes use of the previous estimate. Finally, we note that if we alternatively define

k = −1 − (1/n_g) ∑_{i=1}^n ẑ_ig (log ŵ_ig − ŵ_ig) − ϕ((ν_g^old + p)/2) + log((ν_g^old + p)/2),

then the same argument yields the approximation for the unconstrained ν̂_g.