LMest : An R Package for Latent Markov Models for Longitudinal Categorical Data

Latent Markov (LM) models represent an important class of models for the analysis of longitudinal data, especially when the response variables are categorical. These models have a great potential of application in many fields, such as economics and medicine. We illustrate the R package LMest, which is tailored to deal with the basic LM model and some extended formulations accounting for individual covariates and for the presence of unobserved clusters of units having the same initial and transition probabilities (mixed LM model). The main functions of the package perform parameter estimation through the expectation-maximization algorithm, which is based on suitable forward-backward recursions. The package also permits local and global decoding and provides standard errors for the parameter estimates. We illustrate the use of the package and its main features through some empirical examples in the fields of labour market, health, and criminology.


Introduction
In this paper we illustrate the R (R Core Team 2017) package LMest (Bartolucci and Pandolfi 2017), available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=LMest, which provides a collection of functions that can be used to estimate latent Markov (LM) models for longitudinal categorical data. The package is related to the book by Bartolucci, Farcomeni, and Pennoni (2013), where these models are illustrated in detail from the methodological point of view. Additional insights are given in the discussion paper by Bartolucci, Farcomeni, and Pennoni (2014b).
LM models are designed for the analysis of univariate and multivariate longitudinal/panel data based on the repeated observation of a sample of units across time. These models are specially tailored to study the evolution of an individual characteristic of interest, when this characteristic is not directly observable. For this aim, the models at issue rely on a latent process following a Markov chain. Another reason for using LM models is to account for time-varying unobserved heterogeneity in addition to the effect of observable covariates on the response variables.
The initial LM formulation introduced by Wiggins (1973), see also Wiggins (1955), has been developed in several directions and in connection with applications in many fields, such as economics, medicine, psychology, and sociology. In particular, the basic LM model, which relies on a homogeneous Markov chain of first order, has been extended in several ways on the basis of parameterizations that allow us to incorporate certain hypotheses in the model. The most relevant extended version includes individual covariates that may affect the distribution of the latent process (Vermunt, Langeheine, and Böckenholt 1999; Bartolucci, Pennoni, and Francis 2007) or the conditional distribution of the response variables given this process (Bartolucci and Farcomeni 2009). LM models may also be formulated to take into account certain types of unobserved heterogeneity. In particular, we are referring to the mixed LM model (Van de Pol and Langeheine 1990) and to the LM model with random effects (Altman 2007; Maruotti 2011; Bartolucci, Pennoni, and Vittadini 2011). Within the first formulation, the initial and transition probabilities of the latent process are allowed to vary across different latent subpopulations.
LM models are conceived quite similarly to hidden Markov (HM) models (Zucchini and MacDonald 2009) for time-series data, but they are tailored to longitudinal data, where many individuals are observed at only a few occasions, typically no more than ten. Differently from the LM literature, HM models with covariates or for complex data structures are rarely applied, because such structures are typical of longitudinal studies rather than of time series.
Some R packages already exist that can handle LM and related models. In particular, HM models can be estimated by using packages HMM (Himmelmann 2010), HiddenMarkov (Harte 2017), or depmixS4 (Visser and Speekenbrink 2010). The last one is the most closely related to our package LMest as it is tailored to deal with HM models based on a generalized linear formulation that can include individual covariates. On the other hand, package depmixS4 is designed to deal with repeated measurements on a single unit, as in a time-series context. Packages mhsmm (O'Connell and Højsgaard 2011) and hsmm (Bulla and Bulla 2013) may be used to estimate hidden semi-Markov models. Package msm (Jackson 2011) is tailored to deal with HM and related continuous time models. Finally, it is worth mentioning package hmm.discnp (Turner 2016), which can be used to fit multiple hidden Markov models, and package seqHMM (Helske and Helske 2017), which also includes graphical tools to visualize sequence data and categorical time series with multiple units and covariates. The latter one is related to package LMest as it can be used to estimate certain versions of mixed hidden Markov models. A commercial software to perform data analyses based on certain types of LM models is Latent GOLD (Vermunt and Magidson 2016). Additionally, some MATLAB (The MathWorks Inc. 2014) toolboxes are available for this aim; see for example the HMM toolbox implemented by Murphy (1998).
The distinguishing features of the LMest package with respect to the packages mentioned above are the following:
• LMest is designed to work with longitudinal data, that is, with (even many) i.i.d. replicates of (usually short) sequences of data.
• It can deal with univariate and multivariate categorical outcomes.
• It allows for missing responses, drop-out, and non-monotonic missingness, under the missing-at-random assumption (Little and Rubin 2002).
• Standard errors for the parameter estimates are obtained by exact computation or through reliable approximations of the observed information matrix.
• Individual covariates are included through suitable parameterizations.
• Additional discrete random effects can be used to formulate mixed LM models.
• Computationally efficient algorithms are implemented for estimation and prediction of the latent states, also by relying on certain Fortran routines.
The present article is organized as follows. Section 2 briefly outlines the general formulation of LM models and deals with their maximum likelihood estimation. Section 3 describes the use of the LMest package to estimate the basic LM model without covariates. Section 4 is focused on LM models with individual covariates included in the measurement model, while Section 5 is focused on the case of individual covariates affecting the distribution of the latent process. In Section 6 we introduce the mixed LM model and we describe the R function for its estimation. Finally, Section 7 summarizes the main conclusions.

Latent Markov models for longitudinal data
In the following we provide a brief review of the statistical methodology related to LM models. The illustration closely follows the recent paper by Bartolucci et al. (2014b). We also focus on maximum likelihood estimation of these models on the basis of the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977). Moreover, we deal with more advanced topics which are important for applications, such as selection of the number of latent states and prediction of these states via local or global decoding (Viterbi 1967; Juang and Rabiner 1991).

The general latent Markov model formulation
Consider the multivariate case where for a generic unit we observe a vector Y^(t) of r categorical response variables at T occasions, so that t = 1, ..., T. Each response variable is denoted by Y_j^(t) and has c_j categories, labeled from 0 to c_j − 1, with j = 1, ..., r. Also let Ỹ be the vector obtained by stacking Y^(t) for t = 1, ..., T; this vector then has rT elements. Obviously, in the univariate case we have a single response variable Y^(t) for each time occasion, and Ỹ is composed of T elements. When available, we also denote by X^(t) the vector of individual covariates available at the t-th time occasion and by X̃ the vector of all the individual covariates, which is obtained by stacking the vectors X^(t) for t = 1, ..., T. The general LM model formulation assumes the existence of a latent process, denoted by U = (U^(1), ..., U^(T)), which affects the distribution of the response variables. Such a process is assumed to follow a first-order Markov chain with state space {1, ..., k}, where k is the number of latent states. Under the local independence assumption, the response vectors Y^(1), ..., Y^(T) are assumed to be conditionally independent given the latent process U. Moreover, the elements Y_j^(t) of Y^(t), with t = 1, ..., T, are conditionally independent given U^(t). This assumption leads to a strong simplification of the model, but it can be relaxed by allowing serial dependence through the inclusion of the lagged response variable among the covariates, as in Bartolucci and Farcomeni (2009).
The parameters of the measurement model are the conditional response probabilities, that is,

φ^(t)_{jy|ux} = P(Y_j^(t) = y | U^(t) = u, X^(t) = x)

in the multivariate case and

φ^(t)_{y|ux} = P(Y^(t) = y | U^(t) = u, X^(t) = x)

in the univariate case, with t = 1, ..., T and u = 1, ..., k. The parameters of the latent model are the initial probabilities π_{u|x} = P(U^(1) = u | X^(1) = x) and the transition probabilities π^(t)_{u|ūx} = P(U^(t) = u | U^(t−1) = ū, X^(t) = x), with t = 2, ..., T and ū, u = 1, ..., k.
On the basis of the above parameters, the conditional distribution of U given X̃ may be expressed as

P(U = u | X̃ = x̃) = π_{u^(1)|x^(1)} ∏_{t=2}^{T} π^(t)_{u^(t)|u^(t−1) x^(t)},

where u = (u^(1), ..., u^(T)) and x̃ denotes a realization of the vector of all the covariates X̃. Moreover, the conditional distribution of Ỹ given U and X̃ may be expressed as

P(Ỹ = ỹ | U = u, X̃ = x̃) = ∏_{t=1}^{T} φ^(t)_{y^(t)|u^(t) x^(t)},

where, in general, we define φ^(t)_{y|ux} = P(Y^(t) = y | U^(t) = u, X^(t) = x) and, due to the assumption of local independence, we have

φ^(t)_{y|ux} = ∏_{j=1}^{r} φ^(t)_{j y_j|ux}.

In the above expressions, ỹ is a realization of Ỹ made by the subvectors y^(t) = (y_1^(t), ..., y_r^(t)), whereas y is a realization of Y^(t) with elements y_j, j = 1, ..., r.
In the presence of individual covariates, the manifest distribution of the response variables corresponds to the conditional distribution of Ỹ given X̃, which is defined as

P(ỹ|x̃) = P(Ỹ = ỹ | X̃ = x̃) = Σ_u P(Ỹ = ỹ | U = u, X̃ = x̃) P(U = u | X̃ = x̃).   (1)

In the basic version of the model, individual covariates are ruled out; therefore, we use the symbol P(ỹ) to refer to the manifest distribution. Moreover, when these covariates are available, we suggest avoiding that they simultaneously affect the distribution of the latent process and the conditional distribution of the response variables given this process. In fact, the two formulations have different interpretations, as explained in more detail in the following, and the resulting model would be difficult to interpret and estimate.
Finally, it is important to note that computing P(ỹ|x̃), or P(ỹ) for the basic LM model, involves a sum extended to all the possible configurations of the vector u, which are k^T in number; this typically requires a considerable computational effort. However, in order to efficiently compute such a probability we can use a forward recursion due to Baum, Petrie, Soules, and Weiss (1970), as illustrated in Bartolucci et al. (2013, Chapter 3).
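To make the recursion concrete, the following is a minimal R sketch of the forward recursion for the basic LM model without covariates. The function name and argument layout are ours for illustration only; they do not correspond to the internal LMest routines (which are partly coded in Fortran).

```r
# Forward recursion (Baum et al. 1970) for the basic LM model:
# computes the manifest probability P(y~) in O(T k^2) operations
# instead of summing over all k^T latent configurations.
lm_manifest_prob <- function(y, piv, Pi, Psi) {
  # y:   vector of T observed responses (categories coded from 1 here)
  # piv: k-vector of initial probabilities
  # Pi:  k x k transition matrix (row = state at t-1, column = state at t)
  # Psi: c x k matrix of conditional response probabilities P(Y = y | U = u)
  q <- piv * Psi[y[1], ]                  # forward variables at t = 1
  for (t in 2:length(y)) {
    q <- as.vector(t(Pi) %*% q) * Psi[y[t], ]
  }
  sum(q)                                  # manifest probability P(y~)
}
```

For a small example with k = 2 states and binary responses, the result coincides with the brute-force sum over all latent configurations, and the probabilities of all possible response sequences sum to one.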

Maximum likelihood estimation
We illustrate maximum likelihood estimation in the general case in which covariates are available. In this case, for a sample of n independent units that provide the response vectors ỹ_1, ..., ỹ_n and given the corresponding vectors of covariates x̃_1, ..., x̃_n, the model log-likelihood has the following expression:

ℓ(θ) = Σ_{i=1}^{n} log P(ỹ_i|x̃_i).

Each vector ỹ_i is a realization of Ỹ that, in the multivariate case, is made up of the subvectors y_i^(t), t = 1, ..., T, having elements y_ij^(t), j = 1, ..., r; similarly, x̃_i may be decomposed into the time-specific subvectors x_i^(1), ..., x_i^(T). Moreover, P(ỹ_i|x̃_i) corresponds to the manifest probability of the responses provided by subject i, see Equation 1, and θ is the vector of all the free parameters affecting P(ỹ_i|x̃_i).
The above log-likelihood function can be maximized by the EM algorithm (Baum et al. 1970; Dempster et al. 1977), as described in the following section.

Expectation-maximization algorithm
The EM algorithm is based on the complete data log-likelihood that, with multivariate categorical data, has the following expression:

ℓ_c(θ) = Σ_{t=1}^{T} Σ_{j=1}^{r} Σ_{u=1}^{k} Σ_x Σ_y a^(t)_{juxy} log φ^(t)_{jy|ux} + Σ_{u=1}^{k} Σ_x b^(1)_{ux} log π_{u|x} + Σ_{t=2}^{T} Σ_{ū=1}^{k} Σ_{u=1}^{k} Σ_x b^(t)_{ūux} log π^(t)_{u|ūx},   (2)

where, with reference to occasion t and covariate configuration x, a^(t)_{juxy} is the number of individuals that are in latent state u and provide response y to variable Y_j^(t), b^(1)_{ux} is the frequency of latent state u at the first occasion, and b^(t)_{ūux} is the number of transitions from state ū to state u. The EM algorithm alternates the following two steps until convergence:
• E-step: This step consists in computing the posterior (given the observed data) expected value of each frequency involved in Equation 2 by suitable forward-backward recursions (Baum et al. 1970); these expected values are denoted by â^(t)_{juxy}, b̂^(1)_{ux}, and b̂^(t)_{ūux}.
• M-step: This step consists in maximizing the complete data log-likelihood expressed as in Equation 2, with each frequency substituted by the corresponding expected value. How to maximize this function depends on the specific formulation of the model and, in particular, on whether the covariates are included in the measurement or in the latent model.
The convergence of the EM algorithm is checked on the basis of the relative log-likelihood difference, that is,

|ℓ(θ^(s)) − ℓ(θ^(s−1))| / |ℓ(θ^(s−1))| < ε,   (3)

where θ^(s) is the parameter estimate obtained at the end of the s-th M-step and ε is a suitable tolerance level (e.g., 10^−8).
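In code, the stopping rule above amounts to a one-line check on two consecutive log-likelihood values. The helper below is illustrative only and not part of the LMest API.

```r
# Stopping rule for the EM algorithm: relative change of the
# log-likelihood between consecutive M-steps, compared with a tolerance.
em_converged <- function(lk_new, lk_old, tol = 1e-8) {
  abs(lk_new - lk_old) / abs(lk_old) < tol
}
```

For instance, a change of 1e-6 on a log-likelihood of about −1000 gives a relative difference of 1e-9, which is below the default tolerance, while a change of 10 clearly is not.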
The EM algorithm could converge to a mode of the log-likelihood that does not correspond to the global maximum, due to the multimodality of this function. In order to avoid this problem, we suggest using different initializations of this algorithm, either deterministic or random, and taking as final estimate the one corresponding to the highest log-likelihood; this estimate is denoted by θ̂. In particular, for LM models without covariates, the random initialization is based on suitably rescaled random numbers drawn from a uniform distribution from 0 to 1 for the initial and transition probabilities of the Markov chain and for the conditional response probabilities.
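The random initialization described above can be sketched as follows; this mimics the behavior of the start = 1 option for the basic model, but the function and argument names are ours for illustration.

```r
# Illustrative random initialization of the basic LM model parameters:
# uniform draws rescaled so that each probability distribution sums to one
# (initial vector, rows of the transition matrix, columns of the
# conditional response probability matrix).
random_start <- function(k, ncat) {
  piv <- runif(k)
  piv <- piv / sum(piv)                       # initial probabilities
  Pi <- matrix(runif(k * k), k, k)
  Pi <- Pi / rowSums(Pi)                      # transition probabilities
  Psi <- matrix(runif(ncat * k), ncat, k)
  Psi <- sweep(Psi, 2, colSums(Psi), "/")     # conditional response prob.
  list(piv = piv, Pi = Pi, Psi = Psi)
}
```

Repeating the fit from several such starting points and keeping the solution with the highest log-likelihood reduces the risk of reporting a local mode.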
We refer the reader to Bartolucci et al. (2013) for a detailed description of the EM algorithm and its initialization.

Standard errors
After the model is estimated, standard errors for the parameter estimates may be obtained on the basis of the observed information matrix, denoted by J(θ). In particular, each standard error is obtained as the square root of the corresponding diagonal element of the inverse of this matrix evaluated at the maximum likelihood estimate, J(θ̂)^−1. The LMest package computes the observed information matrix, and then provides the standard errors, by using either the exact computation method proposed by Bartolucci and Farcomeni (2015) or the numerical method proposed by Bartolucci and Farcomeni (2009), depending on the complexity of the model of interest.
The exact computation of J(θ̂) is based on the Oakes' identity (Oakes 1999). This method uses the complete data information matrix, produced by the EM algorithm, and a correction matrix computed on the basis of the first derivative of the posterior probabilities obtained from the backward-forward recursions. On the other hand, with the approximate method, J(θ̂) is obtained as minus the numerical derivative of the score vector s(θ) at convergence. The score vector, in turn, is computed as the first derivative of the conditional expected value of the complete data log-likelihood, which is based on the expected frequencies â^(t)_{juxy}, b̂^(1)_{ux}, and b̂^(t)_{ūux} corresponding to the final parameter estimates θ̂, that is,

s(θ) = ∂ E[ℓ_c(θ) | X, Y, θ̂] / ∂θ,

where X and Y stand for the observed data; see Bartolucci et al. (2013) and Pennoni (2014) for details.
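The approximate method can be sketched as follows, assuming a generic function score_fun that returns s(θ); this is a placeholder of our own, not an LMest routine. The information matrix is obtained as minus a central-difference derivative of the score, and the standard errors follow from its inverse.

```r
# Sketch of the numerical method: J(theta) as minus the central-difference
# derivative of the score vector, then standard errors as the square roots
# of the diagonal of the inverse. Illustrative code, not the LMest internals.
info_numeric <- function(score_fun, theta_hat, h = 1e-6) {
  p <- length(theta_hat)
  J <- matrix(0, p, p)
  for (j in seq_len(p)) {
    e <- numeric(p)
    e[j] <- h
    J[, j] <- -(score_fun(theta_hat + e) - score_fun(theta_hat - e)) / (2 * h)
  }
  (J + t(J)) / 2   # symmetrize to reduce numerical error
}
se_from_info <- function(J) sqrt(diag(solve(J)))
```

As a sanity check, for a quadratic log-likelihood ℓ(θ) = −θ'Aθ/2 the score is s(θ) = −Aθ and the procedure recovers J = A exactly (up to roundoff).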
For the basic LM model and for the model with individual covariates affecting the distribution of the latent process, the LMest package also provides functions to obtain standard errors by parametric bootstrap (Davison and Hinkley 1997).

Criteria for selecting the number of latent states
In certain applications, the number of latent states, k, can be a priori defined, as in the univariate case in which it is reasonable to fix k equal to the number of response categories. Otherwise, the following criteria are typically used to select the number of latent states: the Akaike information criterion (AIC) of Akaike (1973) and the Bayesian information criterion (BIC) of Schwarz (1978). They are based on the indices

AIC = −2 ℓ̂ + 2 #par,
BIC = −2 ℓ̂ + log(n) #par,

where ℓ̂ denotes the maximum of the log-likelihood of the model of interest and #par denotes the number of free parameters.
According to each of the above criteria, the optimal number of latent states is the one corresponding to the minimum value of AIC or BIC; this model represents the best compromise between goodness-of-fit and complexity. If the two criteria lead to selecting a different number of states, the second one is usually preferred. However, other criteria may be used, such as those taking into account the quality of the classification; for a review see Bacci, Pandolfi, and Pennoni (2014) and Bartolucci, Bacci, and Pennoni (2014a).
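The selection rule can be sketched in a few lines of R; select_k below is an illustrative helper of our own, not an LMest function. It takes the maximized log-likelihoods and numbers of free parameters for k = 1, ..., K (in that order) and returns the value of k minimizing each criterion.

```r
# Model selection sketch: AIC = -2*lk + 2*npar, BIC = -2*lk + log(n)*npar,
# computed for each candidate number of latent states; the optimal k is
# the index of the minimum of each criterion.
select_k <- function(lk, npar, n) {
  aic <- -2 * lk + 2 * npar
  bic <- -2 * lk + log(n) * npar
  list(aic = aic, bic = bic,
       k_aic = which.min(aic), k_bic = which.min(bic))
}
```

Note that, since BIC penalizes model complexity more heavily than AIC for log(n) > 2, the two criteria may disagree, with BIC typically favoring a smaller k.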

Local and global decoding
The LMest package allows us to perform decoding, that is, prediction of the sequence of the latent states for a certain sample unit on the basis of the data observed for this unit.
In particular, the EM algorithm directly provides the estimated posterior probabilities of U^(t), denoted by P̂(U^(t) = u | X̃ = x̃, Ỹ = ỹ), for t = 1, ..., T, u = 1, ..., k, and for every covariate and response configuration (x̃, ỹ) observed at least once. These probabilities can be directly maximized to obtain a prediction of the latent state of each subject at each time occasion t; this is the so-called local decoding. Note that this type of decoding minimizes the classification error at each time occasion, but may yield sub-optimal predictions of the overall sequence of states U^(1), ..., U^(T).
In order to track the latent state of a subject across time, the most a posteriori likely sequence of states must be obtained through the so-called global decoding, which is based on an adaptation of the Viterbi (1967) algorithm; see also Juang and Rabiner (1991). The algorithm proceeds through a forward-backward recursion of a complexity similar to the recursions adopted for maximum likelihood estimation within the EM algorithm, so that global decoding may be even performed for long sequences of data; see Bartolucci et al. (2013, Chapter 7).
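The following is an illustrative R implementation of the Viterbi recursion on the log scale for the basic LM model; the names and argument conventions are ours and do not correspond to the LMest internals.

```r
# Global decoding via the Viterbi (1967) algorithm: finds the a posteriori
# most likely sequence of latent states in O(T k^2). Log-scale recursion
# for numerical stability; responses are coded from 1 here.
viterbi_decode <- function(y, piv, Pi, Psi) {
  TT <- length(y)
  k <- length(piv)
  delta <- matrix(-Inf, TT, k)   # max log-probability of paths ending in u
  back <- matrix(0L, TT, k)      # backpointers
  delta[1, ] <- log(piv) + log(Psi[y[1], ])
  for (t in 2:TT) {
    for (u in 1:k) {
      cand <- delta[t - 1, ] + log(Pi[, u])
      back[t, u] <- which.max(cand)
      delta[t, u] <- max(cand) + log(Psi[y[t], u])
    }
  }
  states <- integer(TT)
  states[TT] <- which.max(delta[TT, ])
  for (t in (TT - 1):1) states[t] <- back[t + 1, states[t + 1]]
  states
}
```

With persistent transition probabilities and informative conditional response probabilities, the decoded path follows the switch in the observed responses, as one would expect.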

Basic latent Markov model
Following Bartolucci et al. (2013), and as already mentioned above, the basic LM model rules out individual covariates and assumes that the conditional response probabilities are time homogeneous. In symbols, we have that φ^(t)_{jy|u} = φ_{jy|u} for t = 1, ..., T in the multivariate case, and φ^(t)_{y|u} = φ_{y|u} in the univariate case. In order to fit this model, we use function est_lm_basic as illustrated, through a specific application, in the following.

Application to job satisfaction data
The illustration is based on data coming from the Russia Longitudinal Monitoring Survey (RLMS). The data are obtained from the "adult questionnaire", which is also focused on aspects concerning primary and secondary employment. In particular, we consider the question concerning job satisfaction related to the primary work. The resulting response variable, named IKSJ, has five ordered categories: "absolutely satisfied", "mostly satisfied", "neutral", "not very satisfied", "absolutely unsatisfied", which are coded from 1 to 5. The data we use refer to a sample of n = 1,718 individuals followed for T = 7 years, from 2008 to 2014. According to the economic theory proposed by Stiglitz, Sen, and Fitoussi (2010), job satisfaction connects two main latent factors concerning individual characteristics and the working environment that may evolve over time. Therefore, the LM approach is particularly suitable for the analysis of the data at issue.
The data are already contained in the data frame RLMSdat included in the LMest package. As illustrated in the following, each line of this data frame refers to an individual and each column contains the observed responses from the first to the last year of interview. Function est_lm_basic requires the following main input arguments:
• S: Design array for the response configurations (of dimension n × TT × r, where TT corresponds to the number of time occasions) with categories starting from 0; missing responses are allowed, coded as NA.
• yv: Vector of frequencies of the available configurations.
• k: Number of latent states.
• mod: Model on the transition probabilities; mod = 0 when these probabilities depend on time, mod = 1 when they are independent of time (i.e., the latent Markov chain is time homogeneous), and mod from 2 to TT when the Markov chain is partially homogeneous of order equal to mod.
• tol: Tolerance level for checking convergence, which corresponds to ε in definition (3); the default value is 1e-8.
• maxit: Maximum number of iterations of the algorithm; the default value is 1000.
• start: Equal to 0 for deterministic starting values of the model parameters (default value), to 1 for random starting values, and to 2 for initial values provided as input arguments.
• piv, Pi, Psi: Initial values of the initial probability vector, of the transition probability matrix, and of the conditional response probabilities, respectively, when start = 2.
• out_se: Equal to TRUE to require the computation of the information matrix and the standard errors; FALSE is the default option.
In order to obtain the estimates for the data reported above, we use function aggr_data that, starting from a unit-by-unit data frame, returns the set of distinct observed response patterns and the corresponding frequencies:

R> out <- aggr_data(RLMSdat)
R> yv <- out$freq
R> S <- 5 - out$data_dis

Note that the response categories must start from 0 in order to be used in est_lm_basic; therefore, these categories are rescaled in a way that also accounts for their reverse order in the initial dataset. In this way, level 0 corresponds to category "absolutely unsatisfied" and level 4 corresponds to category "absolutely satisfied".
In this illustrative application, we estimate the basic LM model under the assumption of time homogeneous transition probabilities (mod = 1) with a fixed number of states, k = 3, so as to obtain three groups of individuals clustered on the basis of the level of job satisfaction. For this aim, we use the following command:

R> mod1 <- est_lm_basic(S, yv, k = 3, mod = 1, start = 0, out_se = TRUE)

In this application we use the deterministic initialization (start = 0). Moreover, option out_se = TRUE is used to obtain the standard errors for the parameter estimates on the basis of the observed information matrix, which is exactly computed as described in Section 2.2.
The running time of the above command is around 13 seconds when run, as for the other code illustrated in the paper, on a 2.7 GHz Intel Core i7 processor.
In the following, we show the estimation results by using the summary method, which provides the maximum log-likelihood, the number of free parameters, the values of the AIC and BIC indices, and the parameter estimates:

R> summary(mod1)
Call:
est_lm_basic(S = S, yv = yv, k = 3, start = 0, mod = 1, out_se = TRUE)

Coefficients:

According to the estimated conditional probabilities φ̂_{y|u} (returned in Psi), we can interpret the first latent state as the one corresponding to a low level of satisfaction (high probability of responding "not very satisfied" and "neutral"), the second state as corresponding to an intermediate level of this characteristic (probability of around 0.82 of responding "mostly satisfied"), whereas the last state corresponds to the highest level of satisfaction. The output displayed above also contains the estimated initial probability vector (piv), with elements π̂_u, u = 1, 2, 3, that may be easily interpreted as quantities proportional to the size of each latent state at the beginning of the period of observation. Accordingly, we conclude that in 2008 most individuals belong to the second latent state, 36% of them belong to the first state, and only 17% to the last latent state. Moreover, according to the estimated transition probabilities π̂_{u|ū} (returned in Pi), the selected model leads to the conclusion that there is quite a high persistence in the same state during the years of the survey.

Selection of the number of states
It is important to recall that, when the value of k is not a priori known, it must be selected on the basis of the observed data. Moreover, different initializations of the EM algorithm must be attempted in order to prevent the problem of multimodality of the likelihood function. Both issues, that is, model selection and multimodality, can be addressed by using function search.model.LM, which may also be used for the more complex models that will be illustrated later on. In the present application, we use this function to estimate the basic LM model for increasing values of k from 1 to 5 so as to select the optimal number of latent states using BIC. Moreover, considering that the likelihood function may be multimodal, search.model.LM uses one deterministic initialization (start = 0) and a number of random initializations (start = 1) proportional to the number of latent states. In this preliminary exploration, the tolerance level is set equal to 1e-5 to reduce the computing time. Starting from the best solution obtained in this way, a final run is performed (start = 2), with a default tolerance level equal to 1e-10. Function search.model.LM requires the following main input arguments (for additional details we refer to the help page of the function):
• version: Model to be estimated ("basic" = basic LM model, estimated by function est_lm_basic; "manifest" = LM model with covariates in the measurement model, estimated by function est_lm_cov_manifest; "latent" = LM model with covariates in the distribution of the latent process, estimated by function est_lm_cov_latent).
• kv: Vector of possible number of latent states.
• nrep: Fixes the number of random initializations for each element of kv; this number is equal to nrep × (k - 1) and the default value is nrep = 2.
Using the following commands, we obtain the results of the model selection strategy illustrated above:

R> set.seed(14326)
R> res1 <- search.model.LM(version = "basic", kv = 1:5, S, yv, mod = 1)

The specific output for the selected model with 4 classes may be interpreted starting from the estimated conditional response probabilities, to identify the different latent states, and then considering the initial and transition probabilities to obtain a picture of the distribution of these states across time.

Covariates in the measurement model
When the individual covariates are included in the measurement model, the conditional distribution of the response variables given the latent states may be parameterized by generalized logits. In such a situation, the latent variables account for the unobserved heterogeneity, that is, the heterogeneity between individuals that we cannot explain on the basis of the observable covariates. The advantage with respect to a standard random-effects or latent class model with covariates is that the unobservable heterogeneity is allowed to be time-varying; for a deeper discussion see Bartolucci and Farcomeni (2009) and Pennoni and Vittadini (2013).

Assumptions
In dealing with univariate data in which each response variable has an ordinal nature, as in the next illustrative example, we denote the number of its response categories by c. In formulating the model we rely on a parameterization based on global logits of the following type:

log [ P(Y^(t) ≥ y | U^(t) = u, X^(t) = x) / P(Y^(t) < y | U^(t) = u, X^(t) = x) ] = μ_y + α_u + x'β,   (4)

with t = 1, ..., T, u = 1, ..., k, and y = 1, ..., c − 1. Note that these logits reduce to standard logits in the case of binary variables, that is, when c = 2. In the above expression, the μ_y are the cut-points, the α_u are the support points corresponding to each latent state, and β is the vector of regression parameters for the covariates.
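As a small numerical illustration of this parameterization, the following helper (hypothetical, not part of LMest) computes the global logits implied by a vector of conditional response probabilities; for c = 2 it returns the ordinary logit.

```r
# Global (cumulative) logits of a probability vector over the ordered
# categories 0, ..., c - 1: logit of P(Y >= y) for y = 1, ..., c - 1.
# In the model in the text, each of these logits equals mu_y + alpha_u + x'beta.
global_logits <- function(p) {
  cum <- rev(cumsum(rev(p)))           # cum[y + 1] = P(Y >= y)
  log(cum[-1] / (1 - cum[-1]))         # one logit for each y = 1, ..., c - 1
}
```

For a uniform distribution over four categories, the three global logits are log 3, 0, and −log 3, reflecting the symmetry of the cumulative probabilities.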
As mentioned in Section 2.1, the inclusion of individual covariates in the measurement model is typically combined with the constraints π_{u|x} = π_u and π^(t)_{u|ūx} = π_{u|ū}, t = 2, ..., T, ū, u = 1, ..., k, in order to avoid interpretability problems of the resulting model. Also note that, under these constraints, the transition probabilities are assumed to be time homogeneous so as to reduce the number of free parameters.
The LM model with individual covariates in the measurement model may be estimated by function est_lm_cov_manifest that is illustrated in the following.

Application to health related data
We provide an illustration based on data collected within the Health and Retirement Study (HRS) conducted by the University of Michigan. The data concern aspects related to retirement and health among elderly individuals in the USA. The sample is nationally representative of the population aged over 50 years, whereas the response variable is the Self-Reported Health Status (named SRHS), measured on a scale based on five ordered categories: "excellent", "very good", "good", "fair", and "poor". The sample we use includes n = 7,074 individuals interviewed at T = 8 occasions, every two years. Therefore, it is reasonable to expect that unobserved factors affecting the health status may change over such a long period, and then time-invariant latent variables are not suitable to represent these factors. The LM model with covariates directly takes this issue into account.
The data are reported in long format and, therefore, for each subject the number of records is equal to the number of occasions. There are no missing responses or dropout in the dataset.
Function est_lm_cov_manifest requires the following main input arguments (see the help page of the function for additional arguments):
• S: Design array for the response configurations (of dimension n × TT) with categories starting from 0.
• X: Array of covariates (of dimension n × TT × nc, where nc corresponds to the number of covariates).
• k: Number of latent states.
• mod: Type of model to be estimated, coded as mod = "LM" for the model illustrated in Section 4.1, based on parameterization (4). In such a context, the latent process is of first order with initial probabilities equal to those of the stationary distribution of the chain. When mod = "FM", the function estimates a model relying on the assumption that the distribution of the latent process is a mixture of AR(1) processes with common variance σ^2 and specific correlation coefficients ρ_u. This model strictly follows the one proposed by Bartolucci et al. (2014a); see also Pennoni and Vittadini (2013) for a comparison between the two types of model and the help page of the function for further details.
• q: Number of support points of the AR(1) structure described above.
• tol: Tolerance level for checking convergence; the default value is 1e-8.
• maxit: Maximum number of iterations of the algorithm; the default value is 1000.
• start: Equal to 0 for deterministic starting values of the model parameters (default value), to 1 for random starting values, and to 2 for initial values in input.
• mu, al, be, la, PI: Initial values of the model parameters when start = 2 (vector of cut-points, vector of support points for the latent states, vector of regression parameters, vector of initial probabilities, and transition probability matrix, respectively).
• output: Equal to TRUE to print additional output; FALSE is the default option.
• out_se: Equal to TRUE to calculate the information matrix and the standard errors; FALSE is the default option.
Function est_lm_cov_manifest, as well as the other estimation functions for LM models, requires the data to be in array format. To give this structure to data that are originally in long format, we can use function long2matrices that is included in the package.
R> out <- with(data_SRHS, long2matrices(id = id, X = cbind(gender - 1,
+    race == 2 | race == 3, education == 4, education == 5, age - 50,
+    (age - 50)^2/100), Y = srhs))

Function long2matrices mainly requires, as input arguments, the vector of the individual labels, id, the matrix of the covariates, X, and the vector of the responses, Y; see the help page of the function for details. Note that, in using the previous command, gender is included as a dummy variable equal to 1 for female, race is included as a dummy variable equal to 1 for non-white, and two dummy variables are used for the educational level: the first is equal to 1 for some college education and the second is equal to 1 for college education and above. Moreover, age is scaled by 50, and age squared is also included (suitably rescaled). An alternative formulation, which may result in a simpler interpretation of the parameter estimates, is based on considering as covariates the baseline age and the time since the beginning of the longitudinal study.
Function long2matrices generates the array of responses YY; these are then rescaled to vary from 0 ("poor") to 4 ("excellent"), as required by est_lm_cov_manifest. The individual considered above is female (first covariate column), white (second column), with a high school diploma (third and fourth columns); she is 51 years old at the first interview, 53 years old at the second, and so on (fifth column).

R> S <- 5 - out$YY
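The reshaping performed by long2matrices can be mimicked in base R; the following self-contained sketch (using a toy dataset, not the SRHS data) shows the idea of turning long-format records into an n × T response matrix:

```r
# Toy long-format data: 2 subjects, 3 occasions each (illustrative values)
long <- data.frame(
  id   = rep(1:2, each = 3),
  time = rep(1:3, times = 2),
  y    = c(0, 1, 1, 2, 2, 3)
)

n  <- length(unique(long$id))   # number of subjects
TT <- length(unique(long$time)) # number of time occasions

# One row per subject, one column per occasion
YY <- matrix(NA, n, TT)
for (i in seq_len(nrow(long))) {
  YY[long$id[i], long$time[i]] <- long$y[i]
}
YY
# row 1: 0 1 1 ; row 2: 2 2 3
```

In practice long2matrices also handles the covariates and unbalanced panels; the loop above only conveys the basic mapping from (id, time) pairs to array cells.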
As a preliminary example, to estimate the model at issue with a given number of states, say k = 3, the command is

R> mod2 <- est_lm_cov_manifest(S, X, k = 3, mod = "LM", tol = 1e-8,
+    start = 1, output = TRUE, out_se = TRUE)

However, instead of showing and commenting on the output produced in this way, we prefer to illustrate directly how to deal with this model through function search.model.LM, which addresses the problem of model selection, in terms of k, and that of multimodality of the likelihood function. Moreover, as the sample size is particularly large and the model selection strategy may require a considerable amount of computing time, we extract a subset of observations to make the results easily reproducible. Accordingly, we consider only those individuals who are 70 years old at the third interview, so as to account for the influence of the covariates on the oldest individuals. Then, we run function search.model.LM for a number of states k from 1 to 5.

The above output contains the estimated cut-points (mu), corresponding to μ̂_y, the estimated support points for the latent states (al), corresponding to α̂_u, and the estimated vector of regression parameters (be), β̂, as in expression (4). Note that the support points could be sorted so that the latent states result ordered from the worst to the best perceived health conditions. The estimated coefficients in β̂ are reported in the same order adopted to define the array X of covariates. The list of objects returned by the function, contained in res2$out.single[[4]], may also be displayed in the usual way; for a complete list of the returned arguments, we refer to the help pages of the package. As an example, the standard errors for the estimated regression coefficients (sebe) may be obtained as

R> round(res2$out.single[[4]]$sebe, 3)
[1] 0.326 0.394 0.379 0.349 0.142 0.273

On the basis of the estimated regression parameters, we can evaluate the effect of the covariates on the probability of reporting a certain level of the health status. In particular, women tend to report worse health status than men (the odds ratio for females versus males is equal to exp(−0.967) = 0.380), whereas individuals having a higher number of years of schooling tend to have a better opinion about their health status than subjects having lower education (the odds ratio for college education and above is equal to exp(2.458) = 11.682). We also observe that white individuals have a lower probability of reporting a good health status with respect to nonwhite individuals, but the coefficient is not significant. Among the selected individuals aged 70 years and over, the effect of age is positive but not significant, and the same holds for its quadratic term.
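The odds ratios quoted above are obtained by simply exponentiating the estimated coefficients on the logit scale; for instance:

```r
# Odds ratio for females versus males (estimated coefficient -0.967)
or_gender  <- exp(-0.967)
# Odds ratio for college education and above (estimated coefficient 2.458)
or_college <- exp(2.458)
round(c(or_gender, or_college), 3)
```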
In this example, the time-varying random effects are used to account for the unobserved heterogeneity and the interpretation of the latent distribution is not of primary interest. The fact that the optimal number of states is k = 4 provides evidence for the presence of this type of heterogeneity, that is, that SRHS cannot be only explained on the basis of the few covariates we have used.
From the estimated initial probabilities π̂_u, returned in the vector la, we observe that many individuals start in the third latent class (44%), which corresponds to subjects with a quite good perceived health status. The estimated transition matrix (PI), with elements corresponding to π̂_{u|ū}, ū, u = 1, . . . , 4, shows a quite high persistence in the same state. The highest transition probability is 0.18 and is observed from the second state, corresponding to the worst health condition, to the first state. The remaining transition probabilities are always lower than 0.07. For another application of the LM model to ordinal longitudinal data see Pennoni and Vittadini (2013).

Covariates in the latent model
When the covariates are included in the latent model, we suppose that the response variables measure the individual characteristic of interest (e.g., the quality of life) that is represented by the latent variables. This characteristic is not directly observable and may evolve over time. In such a case, the main research interest is in modeling the effect of covariates on the latent distribution, as is illustrated in the following; see also Bartolucci, Lupparelli, and Montanari (2009).

Assumptions
A natural way to allow the initial and transition probabilities of the LM chain to depend on individual covariates is by adopting a multinomial logit parameterization as follows:

log[π_u(x) / π_1(x)] = β_{0u} + x'β_{1u},  u = 2, . . . , k,  (5)

log[π_{u|ū}(x) / π_{ū|ū}(x)] = γ_{0ūu} + x'γ_{1ūu},  (6)

for t = 2, . . . , T and ū, u = 1, . . . , k, with ū ≠ u. In the above expressions, β_u = (β_{0u}, β_{1u}')' and γ_{ūu} = (γ_{0ūu}, γ_{1ūu}')' are parameter vectors to be estimated, which are collected in the matrices β and Γ.
For a more parsimonious model, instead of using (6) we can rely on a different parameterization for the transition probabilities, that is, a multinomial logit parameterization based on the difference between two sets of parameters, referred to in the following as (7), where γ_{11} = 0 is imposed to ensure model identifiability. The parameterization used for modeling the initial probabilities is again based on standard multinomial logits, as defined in (5).
As already mentioned, when the covariates affect the distribution of the latent process, they are typically excluded from the measurement model and we adopt the constraint φ^(t)_{y|ux} = φ_{y|u} in the univariate case or φ^(t)_{jy|ux} = φ_{jy|u} in the multivariate case. We also rely on the assumption of time homogeneity for the conditional response probabilities. Parameterizations based on (5) and (6) or (7) are implemented in the R function est_lm_cov_latent, which allows us to estimate the resulting LM models.
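To fix ideas, the mapping from regression parameters to initial probabilities in a parameterization of the kind of (5) can be sketched in plain R; all values below are illustrative and not taken from the package output:

```r
# Multinomial logit parameterization of the initial probabilities,
# with state 1 as the reference category (illustrative values)
k  <- 3                     # number of latent states
x  <- c(0.5, -1)            # covariate vector for one subject
b0 <- c(0.2, -0.1)          # intercepts for states 2 and 3
b1 <- cbind(c(-0.4, 0.3),   # slopes for state 2
            c(0.1, 0.2))    # slopes for state 3
eta <- c(0, b0 + as.vector(t(b1) %*% x))  # logits relative to state 1
piv <- exp(eta) / sum(exp(eta))           # resulting initial probabilities
round(sum(piv), 10)                       # probabilities sum to 1
```

The transition probabilities in (6) follow the same scheme, with the state of origin ū acting as the reference category of each row of the transition matrix.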

Application to health related data
To illustrate function est_lm_cov_latent, we consider the HRS data introduced in Section 4.2. In such a context, an interesting research question concerns the relationship between SRHS and the covariates. When the latter are included in the latent model, the initial and transition probabilities are estimated accounting for the covariate configurations, which may be useful to identify clusters of individuals with specific needs.
The R function is based on the following main input arguments:
• S: Array of observed response configurations (of dimension n × TT × r) with categories starting from 0; missing responses are allowed, coded as NA.
• X1: Matrix of covariates affecting the initial probabilities (of dimension n × nc1, where nc1 is the number of corresponding covariates).
• X2: Array of covariates affecting the transition probabilities (of dimension n × (TT − 1) × nc2, where nc2 is the number of corresponding covariates).
• k: Number of latent states.
• start: Equal to 0 for deterministic starting values of the model parameters (default), to 1 for random starting values, and to 2 to define initial values as input arguments.
• tol: Tolerance level for checking convergence; the default value is 1e-8.
• maxit: Maximum number of iterations of the algorithm; the default value is 1000.
• param: Type of parameterization for the transition probabilities, coded as param = "multilogit" (default) for the model parameterization defined in (6) and as param = "difflogit" for the parameterization defined in (7).
• Psi, Be, Ga: Initial values of the matrix of the conditional response probabilities and of the parameters affecting the logits for the initial and transition probabilities, respectively, when start = 2.
• output: Equal to TRUE to obtain additional output arguments; FALSE is the default option.
• out_se: Equal to TRUE to compute the information matrix and standard errors; FALSE is the default option.
• fixPsi: Equal to TRUE if the matrix of conditional response probabilities is given in the input and is kept fixed during the estimation process; FALSE is the default option.
We illustrate the function considering the full sample of n = 7,074 individuals, with T = 8 time occasions. In particular, we fit the model defined in Section 5.1, with a number of latent states equal to the number of response categories (i.e., k = 5), by using the following command:

R> mod3 <- est_lm_cov_latent(S = S, X1 = X1, X2 = X2, k = 5, start = 0,
+    param = "multilogit", fort = TRUE, output = TRUE)

Here, we rely on a deterministic initialization of the estimation algorithm. The computing time required to run the above function, again on an Intel Core i7, is around 30 seconds.
The results can be displayed by using the summary method, which returns the main output arguments:

R> summary(mod3)
Call:
est_lm_cov_latent(S = S, X1 = X1, X2 = X2, k = 5, start = 0,
    param = "multilogit", fort = TRUE, output = TRUE)

Coefficients:
Be - Parameters affecting the logit for the initial probabilities:

The estimated conditional response probabilities (Psi), corresponding to φ̂_{y|u}, allow us to characterize the latent states: individuals in the first latent state have the highest probability of reporting the worst health status, whereas the last state corresponds to subjects having the highest probability of reporting the best health status. Individuals in the remaining states show intermediate levels of perceived health status.
The argument Be returned by the function contains the estimated regression parameters affecting the distribution of the initial probabilities, and corresponds to β̂; see definition (5). The estimated positive intercepts indicate a general tendency to report a good health status at the beginning of the survey. The gender log-odds (second row of Be) are all negative, indicating that females report a worse health status than males at the first time occasion. The two log-odds corresponding to the educational level are positive, indicating that a higher educational level is associated with better reported health. Finally, the negative estimates for age indicate that, at the beginning of the study, older individuals generally tend to report a poor health status.
The output Ga contains the estimated parameters affecting the distribution of the transition probabilities, and corresponds to matrix Γ̂; see definition (6). To offer some insights for the interpretation of this output, note that the parameters in γ_{1ūu} refer to the transition from state ū to state u of the latent process. Therefore, as an example, the first column of Ga[,,5] contains the parameter estimates measuring the influence of each covariate on the transition from the fifth state, corresponding to the best health conditions, to the first state, corresponding to the worst conditions. On the basis of these results, we notice that the influence of gender is positive, meaning that for females the probability of this transition is higher with respect to males. The influence of education, measured on the logit scale, is negative, meaning that individuals with a higher level of education tend to move from the fifth to the first state less frequently than those with a lower education. At the same time, age has a positive effect on the chance of transition from the highest to the lowest state.
Using option output = TRUE, the function also returns some additional output arguments. In particular, it is possible to obtain the estimated initial probability matrix, Piv, and the estimated transition probability matrices, PI. On the basis of these estimates, it is possible to compute the average initial and transition probabilities for a group of individuals of interest. For example, if we select white females with a high level of education (college education and above), we obtain the corresponding average estimated initial and transition probabilities by means of the following commands. On the basis of the results we conclude that, at the beginning of the period of observation, white females have a probability approximately equal to 0.82 of being in the last two states, corresponding to very good or excellent health conditions. On the other hand, nonwhite females have a higher probability than white females of being in the first three states, corresponding to worse health conditions. Moreover, the estimated transition probability matrices show a quite high persistence in the same latent state for both groups of individuals. However, the lower diagonal elements of the transition matrix for nonwhite females are, with few exceptions, higher than those of white females, showing a worse perception of their health status.
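The averaging step described above reduces to taking column means of the rows of Piv that correspond to the selected subgroup. A minimal sketch, with a random placeholder standing in for the fitted mod3$Piv (the subgroup index sel is also illustrative):

```r
set.seed(1)
n <- 100; k <- 5
# Random placeholder for the n x k matrix of estimated initial probabilities:
# each row is normalized so that it is a proper probability vector
Piv <- t(apply(matrix(rexp(n * k), n, k), 1, function(p) p / sum(p)))
sel <- 1:20                      # indices of the subgroup of interest
piv_avg <- colMeans(Piv[sel, ])  # average initial probabilities in the subgroup
round(sum(piv_avg), 10)          # the average is still a probability vector
```

The same logic applies to the transition probabilities, averaging the individual-specific transition matrices in PI over the units in the subgroup.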

Decoding
Local and global decoding are implemented in the R function decoding, which allows us to predict the sequence of latent states for a certain sample unit on the basis of the output of the main estimation functions in the package. Function decoding requires the following input arguments:
• est: Object containing the output of one of the following functions: est_lm_basic, est_lm_cov_latent, est_lm_cov_manifest, or est_lm_mixed.
• Y: Vector or matrix of responses.
• X1: Matrix of covariates affecting the initial probabilities (for est_lm_cov_latent) or affecting the distribution of the responses (for est_lm_cov_manifest).
For the application above, the most likely sequence of latent states for all sample units may be obtained by the following command. For instance, we conclude that for the first subject there is only one transition, at the third time occasion, from the second to the third latent state.
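Global decoding of this kind is based on the Viterbi algorithm. The following self-contained sketch (with illustrative parameter values, not the fitted ones) reproduces its logic for a basic LM model with one unit:

```r
# Viterbi (global) decoding for one unit of a basic LM model.
# piv: initial probabilities; PI: transition matrix PI[ubar, u];
# Psi: conditional response probabilities Psi[u, y + 1]; y: responses from 0
viterbi <- function(y, piv, PI, Psi) {
  k <- length(piv); TT <- length(y)
  V <- matrix(0, k, TT)   # log-probabilities of the best partial paths
  B <- matrix(0L, k, TT)  # backpointers
  V[, 1] <- log(piv) + log(Psi[, y[1] + 1])
  for (t in 2:TT) {
    for (u in 1:k) {
      s <- V[, t - 1] + log(PI[, u])
      B[u, t] <- which.max(s)
      V[u, t] <- max(s) + log(Psi[u, y[t] + 1])
    }
  }
  path <- integer(TT)
  path[TT] <- which.max(V[, TT])
  for (t in (TT - 1):1) path[t] <- B[path[t + 1], t + 1]
  path
}

piv <- c(0.9, 0.1)
PI  <- matrix(c(0.95, 0.10, 0.05, 0.90), 2, 2)
Psi <- matrix(c(0.8, 0.2, 0.2, 0.8), 2, 2)
viterbi(c(0, 1, 1, 1), piv, PI, Psi)  # 1 2 2 2: one transition after t = 1
```

Local decoding instead maximizes the posterior probability of each state separately at every occasion, so the two criteria may return different sequences.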

Mixed latent Markov model
Another relevant extension of LM models may be formulated to take into account additional sources of (time-fixed) dependence in the data. In this paper, we provide an illustration of the mixed LM model (Van de Pol and Langeheine 1990), in which the parameters of the latent process are allowed to vary across latent subpopulations defined by an additional discrete latent variable.

Assumptions
Let U be a (time-invariant) discrete latent variable that defines unobserved clusters (or latent classes) of units having the same initial and transition probabilities. The latent process is here denoted by V = (V (1) , . . . , V (T ) ), which substitutes the symbol U used in the previous sections. In such a context, the variables in V follow a first-order Markov chain only conditionally on U . This additional latent variable has k 1 support points (corresponding to the latent classes) and mass probabilities denoted by λ u , u = 1, . . . , k 1 . Accordingly, we denote by k 2 the number of latent states, corresponding to the number of support points of every latent variable V (t) , t = 1, . . . , T .
Note that, under this approach, which may be useful from a perspective of clustering, the initial and transition probabilities of the latent Markov chain differ between sample units in a way that does not depend on the observable covariates.
The parameters to be estimated are the conditional response probabilities, the initial and transition probabilities of the latent process within each latent class, and the mass probabilities λ_u. This model relies on the assumption that the conditional response probabilities and the transition probabilities are time-homogeneous. Obviously, this formulation may be extended by also including observable covariates, as illustrated in the previous sections; see Bartolucci et al. (2013, Chapter 6) for a detailed description.
We derive the manifest distribution of Ỹ by extending the rules given in Section 2. In particular, the conditional distribution of V given U is equal to

P(V = v | U = u) = π_{v^(1)|u} ∏_{t=2,...,T} π_{v^(t)|v^(t−1),u},

where v = (v^(1), . . . , v^(T)) denotes a realization of V. Given the assumption of local independence that is maintained under this model, the conditional distribution of Ỹ given U and V reduces to

P(Ỹ = ỹ | U = u, V = v) = ∏_{t=1,...,T} ∏_{j=1,...,r} φ_{j,y_j^(t)|v^(t)},

whereas the conditional distribution of Ỹ given U is expressed as

P(ỹ | u) = Σ_v P(V = v | U = u) P(Ỹ = ỹ | U = u, V = v).

Finally, the manifest distribution of Ỹ is obtained by the sum

P(ỹ) = Σ_{u=1,...,k1} λ_u P(ỹ | u),

which depends on the mass probabilities for the distribution of the latent variable U. Even in this case, P(ỹ) may be computed through a forward recursion (Baum et al. 1970).
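The manifest probability P(ỹ) above can thus be computed with one forward recursion per latent class, followed by a mixture over the weights λ_u. A self-contained sketch with illustrative parameters (univariate binary responses, k1 = k2 = 2):

```r
# Manifest probability of a response sequence under a mixed LM model:
# one forward recursion per latent class u, mixed with weights la[u].
manifest_prob <- function(y, la, Piv, PI, Psi) {
  # la: k1 class probabilities; Piv[u, ]: initial probabilities in class u;
  # PI[, , u]: transition matrix of class u; Psi[v, y + 1]: response probabilities
  k1 <- length(la); TT <- length(y)
  p <- 0
  for (u in 1:k1) {
    q <- Piv[u, ] * Psi[, y[1] + 1]  # forward variables at t = 1
    for (t in 2:TT) {
      q <- as.vector(q %*% PI[, , u]) * Psi[, y[t] + 1]
    }
    p <- p + la[u] * sum(q)
  }
  p
}

la  <- c(0.3, 0.7)
Piv <- rbind(c(0.9, 0.1), c(0.5, 0.5))
PI  <- array(c(0.95, 0.2, 0.05, 0.8,   # transition matrix, class 1
               0.7, 0.4, 0.3, 0.6),    # transition matrix, class 2
             dim = c(2, 2, 2))
Psi <- matrix(c(0.8, 0.2, 0.2, 0.8), 2, 2)
manifest_prob(c(0, 0, 1), la, Piv, PI, Psi)
```

Summing manifest_prob over all possible response sequences returns 1, which is a convenient sanity check on the recursion.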
With reference to the maximum likelihood estimation of the mixed LM model formulated above, we can extend the procedure illustrated in Section 2.2, where the complete data log-likelihood now has the following expression:

ℓ*(θ) = Σ_u c_u log λ_u + Σ_u Σ_v b^(1)_{uv} log π_{v|u} + Σ_u Σ_{t=2,...,T} Σ_{v̄,v} b^(t)_{uv̄v} log π_{v|v̄,u} + Σ_j Σ_t Σ_v Σ_y a^(t)_{jvy} log φ_{jy|v}.

In the previous expression, a^(t)_{jvy} is the number of sample units that are in latent state v at occasion t and provide response y to variable j. Moreover, with reference to latent class u and occasion t, b^(t)_{uv} is the number of sample units in latent state v, and b^(t)_{uv̄v} is the number of transitions from state v̄ to state v. Finally, c_u is the overall number of sample units that are in latent class u.

Application to data from criminology
The mixed LM model is illustrated by using a simulated dataset similar to the one analyzed in Bartolucci et al. (2007); see also Francis, Liu, and Soothill (2010) and Pennoni (2014). The data are related to the complete conviction histories of a cohort of offenders followed from the age of criminal responsibility, 10 years. The offense codes have been reduced to 73 major offenses, grouped according to the Research Development and Statistics Directorate (1998) on the basis of the following ten typologies: "violence against the person", "sexual offenses", "burglary", "robbery", "theft and handling stolen goods", "fraud and forgery", "criminal damage", "drug offenses", "motoring offenses", and "other offenses". The main interest is in evaluating the patterns of criminal behavior among individuals.
For the simulated data, we consider n = 10,000 individuals (including the proportion of nonoffenders): 4,800 females and 5,200 males. We also consider T = 6 age bands of length equal to five years (10-15, 16-20, 21-25, 26-30, 31-35, and 36-40 years) and r = 10 binary response variables, corresponding to the typologies of offenses defined above. For every age band, each response variable is equal to 1 if the subject has been convicted for a crime of the corresponding offense group and to 0 otherwise. The data matrix, reported below in long format, has been simulated on the basis of the same parameter estimates reported in Bartolucci et al. (2007):

R> data("data_criminal_sim", package = "LMest")
R> head(data_criminal_sim)
  id sex time y1 y2 y3 y4 y5 y6 y7 y8 y9 y10

The first column of the data matrix contains the id code of each subject, the covariate gender (second column, named sex) is coded as 1 for male and 2 for female, the column named time refers to the age band, and the last ten columns are related to the binary response variables.
The R function aimed at estimating the mixed LM models is est_lm_mixed, which requires the following input arguments: • S: Array of response configurations (n × TT × r) with categories starting from 0.
• yv: Vector of frequencies of the configurations.
• k1: Number of support points, corresponding to latent classes, of the distribution of the latent variable U .
• k2: Number of support points, corresponding to latent states, of the distribution of the latent process V .
• start: Equal to 0 for deterministic starting values of the model parameters (default value) and to 1 for random starting values.
• tol: Tolerance level for checking convergence; the default value is 1e-8.
• maxit: Maximum number of iterations of the algorithm; the default value is 1000.
• out_se: Equal to TRUE to calculate the information matrix and the standard errors; FALSE is the default option.
For this example we also use function long2wide, which allows us to convert the data from the long to the wide format. It requires specifying the name of the data matrix, the column containing the identification number of the individuals, the column of the age band (time occasions), and the names of the columns of the covariates and of the responses, as follows:

R> out <- long2wide(data = data_criminal_sim, nameid = "id", namet = "time",
+    colx = "sex", coly = paste0("y", 1:10))
R> YY <- out$YY
R> XX <- out$XX
R> freq <- out$freq

Other options can be found in the help page of the function. The main objects in output are the array YY of response configurations, the array XX of covariate configurations, and the vector freq of the corresponding frequencies. For the data at hand, the array of responses, YY, contains 915 different response configurations, for T = 6 age bands and r = 10 response variables. Similarly, for the covariate gender, matrix XX contains 915 configurations for T = 6 age bands. In the following, we show two response patterns with the associated covariate configurations and the corresponding frequency in the sample:

[1] 2 2 2 2 2 2

R> freq[149]
[1] 113

From the above configurations, we observe that only 3 females have been convicted for violence against the person (first response variable) in the last age band, from 36 to 40 years, whereas 113 females committed a theft (fifth variable) during the first time window, related to ages 10-15.
To illustrate the use of function est_lm_mixed, we fit the model of Section 6 on these data with k1 = 2 latent classes and k2 = 2 latent states, restricting the analysis to females. We use the following commands in R. Using an Intel Core i7 processor, the above function takes around 46 seconds to converge. Then, we obtain the value of the log-likelihood at convergence by the print method:
The model formulation allows us to characterize the two clusters of individuals at the beginning of the period of observation and to follow their evolution over time. According to the estimated mass probabilities, the first cluster, which includes around 22% of females, is characterized by individuals having, at the beginning of the period of observation, a probability equal to 1 of being in the first latent state (corresponding to a null tendency to commit a crime). On the other hand, females included in the second cluster (78%) are characterized by an initial probability of being in the second latent state of around 0.09. Comparing the estimated transition probability matrices, we observe, within each cluster, a very high level of persistence in the first latent state. Moreover, females classified in the first cluster present a higher probability (around 0.64) of moving from the second to the first state than those assigned to the second cluster (0.34), revealing a more pronounced tendency to improve their behavior.

Conclusions
We have illustrated the R package LMest, which allows us to efficiently fit latent Markov (LM) models for categorical longitudinal data. For a comprehensive overview of these models, we refer the reader to Bartolucci et al. (2013) and Bartolucci et al. (2014b). Both the manifest and the latent distributions of the model can be parameterized so as to include the effect of individual covariates. The mixed formulation includes additional latent variables in these parameterizations. It should be noted that all functions above can be used with multivariate categorical outcomes, with the only exception of est_lm_cov_manifest, which is restricted to univariate categorical outcomes. Functions est_lm_basic and est_lm_cov_latent also allow us to deal with missing data, non-monotone missingness, and dropout, under the missing-at-random assumption.
Overall, we consider this package as a relevant advance for applied researchers interested in longitudinal data analyses in the presence of categorical response variables. In particular, we recall that in this context LM models are particularly useful at least from three different perspectives: (i) to represent and study the evolution of an individual characteristic (e.g., quality of life) that is not directly observable; (ii) to account for unobserved heterogeneity due to omitted covariates in a time-varying fashion; and (iii) to account for measurement errors in observing a sequence of categorical response variables. We recall that, when covariates are available, they are typically included in the measurement model for applications of type (ii), so that the response variables are affected by observed covariates and latent variables that are considered on the same footing, whereas the covariates are included in the latent models for applications of type (i) and (iii), so that they affect the distribution of the latent process.
Further updates of the package will include the possibility to use multivariate outcomes in function est_lm_cov_manifest, as well as new functions implementing different formulations of mixed LM models, also for sample units collected in clusters (Bartolucci et al. 2011). We also plan to include estimation methods alternative to pure maximum likelihood estimation, such as the three-step method proposed by Bartolucci, Montanari, and Pandolfi (2015), as well as the estimation procedure proposed by Bartolucci, Pennoni, and Vittadini (2016), which makes the model suitable in a potential outcome research framework (Rubin 1974).