JointAI: Joint Analysis and Imputation of Incomplete Data in R

Missing data occur in many types of studies and typically complicate the analysis. Multiple imputation, either using joint modelling or the more flexible fully conditional specification approach, is popular and works well in standard settings. In settings involving non-linear associations or interactions, however, incompatibility of the imputation model with the analysis model is an issue that often results in bias. Similarly, complex outcomes such as longitudinal or survival outcomes cannot be adequately handled by standard implementations. In this paper, we introduce the R package JointAI, which utilizes the Bayesian framework to perform simultaneous analysis and imputation in regression models with incomplete covariates. Using a fully Bayesian joint modelling approach, it overcomes the issue of uncongeniality while retaining the attractive flexibility of fully conditional specification multiple imputation by specifying the joint distribution of analysis and imputation models as a sequence of univariate models that can be adapted to the type of variable. JointAI provides functions for Bayesian inference with generalized linear and generalized linear mixed models as well as survival models that take arguments analogous to those of their well-known complete-data counterparts from base R and other packages. Usage and features of JointAI are described and illustrated using various examples, and the theoretical background is outlined.


Introduction
Missing data are a challenge common to the analysis of data from virtually all kinds of studies. Especially when many variables are measured, as in big cohort studies, or when data are obtained retrospectively, e.g., from registries, proportions of missing values up to 50% in some variables are not uncommon.
Multiple imputation, which is often considered the gold standard to handle incomplete data, has its origin in the 1970s and was primarily developed for survey data (Rubin 1987, 2004). One of its first implementations in R (R Core Team 2019) is the package norm (Novo and Schafer 2013), which performs multiple imputation under the joint modelling framework using a multivariate normal distribution (Schafer 1997). Nowadays, multiple imputation using a fully conditional specification (FCS), also called multiple imputation using chained equations (MICE), and its seminal implementation in the R package mice (Van Buuren and Groothuis-Oudshoorn 2011; Van Buuren 2012) are more frequently used.
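A defining feature of multiple imputation is its final pooling step, which combines the per-imputation estimates and their variances via Rubin's rules. As a minimal base-R sketch (toy numbers, independent of any imputation package; pool_rubin is our own helper, not part of mice or JointAI):

```r
# Rubin's rules: pool m point estimates and their within-imputation
# variances into one estimate with a total variance that reflects
# the added uncertainty due to the missing values. (Toy numbers.)
pool_rubin <- function(est, var_within) {
  m <- length(est)
  qbar <- mean(est)              # pooled point estimate
  w <- mean(var_within)          # average within-imputation variance
  b <- var(est)                  # between-imputation variance
  tot <- w + (1 + 1/m) * b       # total variance (Rubin's rules)
  list(estimate = qbar, variance = tot)
}

res <- pool_rubin(est = c(1.02, 0.98, 1.05),
                  var_within = c(0.04, 0.05, 0.045))
```

The between-imputation component `b` is what distinguishes this from simply averaging the analyses: it propagates the uncertainty about the imputed values into the pooled standard error.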
Since the introduction of multiple imputation, datasets have become more complex. More sophisticated methods are therefore required that can adequately handle the features of modern data and comply with the assumptions made in its analysis. Modern studies do not only record univariate outcomes measured in a cross-sectional setting, but also outcomes that consist of two or more measurements, such as repeatedly measured or survival outcomes. Furthermore, non-linear effects, introduced by functions of covariates such as transformations, polynomials or splines, or interactions between variables are considered in the analysis and, hence, need to be taken into account during imputation.
Standard multiple imputation, either using FCS or a joint modelling approach, e.g., under a multivariate normal distribution, assumes linear associations between all variables. Moreover, FCS requires the outcome to be explicitly specified in each of the linear predictors of the full conditional distributions. In settings where the outcome is more complex than just univariate, this is not straightforward and not generally possible without information loss, leading to misspecified imputation models and, likely, to bias. Some extensions of standard multiple imputation have been developed and are implemented in R packages and other software, but the larger part of the software for imputation is restricted to standard settings such as cross-sectional survey data. The Comprehensive R Archive Network (CRAN) task view on missing data (https://CRAN.R-project.org/view=MissingData) gives an overview of available R packages that deal with missing data in different contexts, using various approaches.
In our context, i.e., in settings where potentially complex models, such as models with non-linear associations, survival or multi-level outcomes, are estimated on data with missing values in covariates, the following R packages are, for example, relevant.
The package mice itself provides limited options to perform multi-level imputation, restricted to conditionally normal and binary level-1 covariates and the use of a linear model or predictive mean matching for level-2 covariates. The packages micemd (Audigier and Resche-Rigon 2019) and miceadds (Robitzsch, Grund, and Henke 2019) provide extensions to Poisson models and predictive mean matching for level-1 covariates.
smcfcs (Bartlett and Keogh 2019), short for "substantive model compatible fully conditional specification", uses Bayesian methodology to extend standard multiple imputation using FCS to ensure compatibility between analysis model and imputation models. It provides functionality to fit linear, logistic and Poisson models, as well as parametric (Weibull) and Cox proportional hazards survival models, and competing risk models. The model specification is similar to the mice package, however less automated.
The R package jomo (Quartagno and Carpenter 2019) performs joint model multiple imputation in the Bayesian framework using a multivariate normal distribution and includes an extension to the standard approach to assure compatibility between analysis model and imputation models. It can handle generalized linear (mixed) models, cumulative link mixed models, proportional odds probit regression and Cox proportional hazards models. Unfortunately, no functions are available to facilitate the evaluation of convergence of the Markov chain Monte Carlo (MCMC) algorithm. The R package mitml (Grund, Robitzsch, and Lüdtke 2019) provides an interface to pan (imputation of continuous level-1 covariates only) and jomo and includes functions that make the analysis and evaluation of the imputed data more convenient.
hmi (Speidel, Drechsler, and Jolani 2019) ("hierarchical multi-level imputation") combines functionality of the packages mice and MCMCglmm (Hadfield 2010) to perform multiple imputation in single- and multi-level models, but it assumes all incomplete covariates in multi-level models to be level-1 covariates. Similarly, the package mlmmm (Yucel 2010), which uses the EM-algorithm to perform multi-level imputation, does not consider incomplete level-2 variables.
mdmb (Robitzsch and Lüdtke 2019) implements model-based treatment of missing data using likelihood or Bayesian methods in linear and logistic regression and linear and ordinal multilevel models. Under the Bayesian framework, substantive model compatible imputation is available. A drawback is that the specification does not follow the specification of well-known R functions, which complicates usage especially for new users, and that the specification of more complex models can quickly become quite involved.
Depending on the type of outcome model (survival, multi-level or single-level), whether nonlinear effects are involved (which need substantive model compatible imputation), the measurement level of incomplete covariates and whether missingness occurs in level-1 as well as in level-2 covariates, the user has to work with different software packages. This requires users to be familiar with the usage and underlying statistical methodology of a number of packages and approaches. Since for several packages the documentation is rather inscrutable and vague, it is unclear what precisely these packages can and cannot do and what the underlying assumptions are. Choosing an appropriate software package and applying it correctly may, thus, become quite a daunting challenge.
The R package JointAI (Erler 2019), which is presented in this paper, aims to facilitate the correct analysis of incomplete data by providing a unified framework for both simple and more complex models, using a consistent specification that most users will be familiar with from commonly used (base) R functions.
Most of the packages named above perform multiple imputation, i.e., create multiple imputed datasets, which are then analysed in a second step, followed by pooling of the results. JointAI follows a different, fully Bayesian approach (used as well in mdmb). By modelling the analysis model of interest jointly with the incomplete covariates, analysis and imputation can be performed simultaneously while assuring compatibility between all sub-models (Erler, Rizopoulos, van Rosmalen, Jaddoe, Franco, and Lesaffre 2016; Erler, Rizopoulos, Jaddoe, Franco, and Lesaffre 2019). In this joint modelling approach, the added uncertainty due to the missing values is automatically taken into account in the posterior distribution of the parameters of interest, and no pooling of results from repeated analyses is necessary. The joint distribution is specified conveniently, using a sequence of conditional distributions that can be specified flexibly according to each type of variable. Since the analysis model of interest defines the first distribution in the sequence, the outcome is included in the joint distribution without the need for it to enter the linear predictor of any of the other models. Moreover, non-linear associations that are part of the analysis model are automatically taken into account for the imputation of missing values. This directly enables our approach to handle complicated models, with complex outcomes and flexible linear predictors.
In this paper, we introduce the R package JointAI, which performs joint analysis and imputation of regression models with incomplete covariates under the missing at random (MAR) assumption (Rubin 1976), and explain how data with incomplete covariate information can be analysed and imputed with it. The package is available for download at the Comprehensive R Archive Network (CRAN) under https://CRAN.R-project.org/package=JointAI. Section 2 briefly describes the theoretical background of the method. An outline of the general structure of JointAI is given in Section 3, followed by an introduction of the example datasets that are used throughout the paper in Section 4. Details about model specification, settings controlling the MCMC sampling, and summary, plotting and other functions that can be applied after fitting the model are given in Sections 5 through 7. We conclude the paper with an outlook of planned extensions and discuss the limitations that are introduced by the assumptions made in the fully Bayesian approach.

Theoretical background
Consider the general setting of a regression model where interest lies in a set of parameters θ that describe the association between a univariate outcome y and a set of covariates X = (x_1, …, x_p). In the Bayesian framework, inference over θ is obtained by estimation of the posterior distribution of θ, which is proportional to the product of the likelihood of the data (y, X) and the prior distribution of θ:

    p(θ | y, X) ∝ p(y, X | θ) p(θ).

When some of the covariates are incomplete, X consists of two parts, the completely observed variables X_obs and those variables that are incomplete, X_mis. If y had missing values (and this missingness was ignorable), the only necessary change in the formulas below would be to replace y by y_obs. Here, we will, therefore, consider y to be completely observed. In the implementation in the R package JointAI, however, missing values in the outcome are allowed and are imputed automatically.
The likelihood of the complete data, i.e., observed and unobserved data, can be factorized in the following convenient way:

    p(y, X_obs, X_mis | θ) = p(y | X_obs, X_mis, θ_{y|x}) p(X_obs, X_mis | θ_x),

where the first factor constitutes the analysis model of interest, described by a vector of parameters θ_{y|x}, and the second factor is the joint distribution of the incomplete variables, i.e., the imputation part of the model, described by parameters θ_x, with θ = (θ_{y|x}^⊤, θ_x^⊤)^⊤.
Explicitly specifying the joint distribution of all data is one of the major advantages of the Bayesian approach, since this facilitates the use of all available information of the outcome in the imputation of the incomplete covariates (Erler et al. 2016), which becomes especially relevant for more complex outcomes like repeatedly measured variables (see Section 2.2.1).
In complex models the posterior distribution can usually not be derived analytically but MCMC methods are used to obtain samples from the posterior distribution. The MCMC sampling in JointAI is done using Gibbs sampling, which iteratively samples from the full conditional distributions of the unknown parameters and missing values.
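To illustrate the idea of Gibbs sampling used here, the following toy example (ours, not JointAI code) alternates between drawing an unknown mean and a missing observation from their full conditional distributions, exactly the scheme that JAGS applies on a much larger scale:

```r
# Toy Gibbs sampler: unknown mean mu (data model y_i ~ N(mu, 1),
# vague prior mu ~ N(0, 100^2)) and one missing observation y_mis
# are sampled in turn from their full conditional distributions.
set.seed(1)
y_obs <- c(4.8, 5.1, 5.3, 4.9)     # observed data
n_iter <- 5000
mu <- 0; y_mis <- mean(y_obs)      # initial values
draws <- numeric(n_iter)
for (it in seq_len(n_iter)) {
  y_all <- c(y_obs, y_mis)
  # full conditional of mu: normal (known unit variance, normal prior)
  prec <- length(y_all) + 1 / 100^2
  mu <- rnorm(1, mean = sum(y_all) / prec, sd = sqrt(1 / prec))
  # full conditional of the missing value: its data model given mu
  y_mis <- rnorm(1, mean = mu, sd = 1)
  draws[it] <- mu
}
mean(draws)   # posterior mean of mu, close to mean(y_obs)
```

Note that the missing value is treated as just another unknown quantity: it is re-drawn in every iteration, so the posterior of mu automatically reflects the uncertainty about it.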
In the following sections we describe each of the three parts of the model, the analysis model, the imputation part and the prior distributions, in detail.

Analysis model
The analysis model of interest is described by the probability density function p(y | X, θ y|x ). The R package JointAI can currently handle analysis models that are either generalized linear regression models (GLM), generalized linear mixed models (GLMM), cumulative logit (mixed) models, parametric (Weibull) survival models or Cox proportional hazards models.

Generalized linear (mixed) models
For a GLM the probability density function is chosen from the exponential family and the linear predictor is

    g{E(y_i | X, θ_{y|x})} = x_i^⊤ β,

where g(·) is a link function, y_i the value of the outcome variable for subject i, and x_i is a column vector containing the row of X that contains the covariate information for i.
For a GLMM the linear predictor is of the form

    g{E(y_ij | X, Z, b_i, θ_{y|x})} = x_{ij}^⊤ β + z_{ij}^⊤ b_i,

where y_ij is the j-th outcome of subject i, x_{ij} is the corresponding vector of covariate values, b_i a vector of random effects pertaining to subject i, and z_{ij} a column vector containing the row of the design matrix of the random effects, Z, that corresponds to the j-th measurement of subject i. Z typically contains a subset of the variables in X, and b_i follows a normal distribution with mean zero and covariance matrix D.
In both cases the parameter vector θ y|x contains the regression coefficients β, and potentially additional variance parameters (e.g., for linear (mixed) models), for which prior distributions will be specified in Section 2.3.

Cumulative logit (mixed) models
Cumulative logit mixed models are of the form

    y_ij ∼ Mult(π_{ij,1}, …, π_{ij,K}),
    π_{ij,1} = P(y_ij ≤ 1),
    π_{ij,k} = P(y_ij ≤ k) − P(y_ij ≤ k−1),   k = 2, …, K−1,
    π_{ij,K} = 1 − Σ_{k=1}^{K−1} π_{ij,k},
    logit{P(y_ij ≤ k)} = γ_k + x_{ij}^⊤ β + z_{ij}^⊤ b_i,

where π_{ij,k} = P(y_ij = k) and logit(x) = log{x / (1 − x)}. A cumulative logit regression model for a univariate outcome y_i can be obtained by dropping the index j and omitting z_{ij}^⊤ b_i. In cumulative logit (mixed) models, the design matrix X does not contain an intercept, since outcome category specific intercepts γ_1, …, γ_K are specified. These intercepts are parametrized via the first intercept and positive increments, γ_k = γ_1 + Σ_{q=1}^{k−1} δ_q, which ensures their ordering. Here, the parameter vector θ_{y|x} includes the regression coefficients β, the first intercept γ_1 and the increments δ_1, …, δ_{K−1}.

Survival models
Survival data are typically characterized by the observed event or censoring times, T_i, and the event indicator, D_i, which is one if the event was observed and zero otherwise. JointAI provides two types of models to analyse right censored survival data: a parametric model which assumes a Weibull distribution for the true (but partially unobserved) survival times T*, and a semi-parametric Cox proportional hazards model. The parametric survival model is implemented as

    T*_i ∼ Weibull(s, λ_i), with λ_i depending on the linear predictor x_i^⊤ β,
    T_i = 1(T*_i ≥ C_i) C_i + {1 − 1(T*_i ≥ C_i)} T*_i,

where 1(·) is the indicator function, which is one if T*_i ≥ C_i, and zero otherwise. The Cox proportional hazards model can be written as

    h_i(t) = h_0(t) exp(x_i^⊤ β),

where h_0(t) is the baseline hazard function, which, in JointAI, we model using a B-spline approach with six degrees of freedom, i.e., h_0(t) = Σ_{q=1}^{6} γ_{B_q} B_q(t), where B_q denotes the q-th basis function.
The survival function of the Cox model is

    S(t | x_i, θ) = exp{ −∫_0^t h_0(s) exp(x_i^⊤ β) ds },

where θ includes the regression coefficients β (which do not contain an intercept) and the coefficients γ_B used in the specification of the baseline hazard. Since the integral over the baseline hazard does not have a closed-form solution, in JointAI it is approximated using Gauss-Kronrod quadrature with 15 evaluation points.
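This computation can be sketched in base R, with integrate() standing in for the 15-point Gauss-Kronrod rule (the baseline hazard h0 below is an arbitrary assumption for illustration; JointAI performs this step internally in JAGS):

```r
# Sketch: survival function implied by a proportional hazards model,
# S(t | x) = exp(-exp(x'beta) * integral_0^t h0(s) ds),
# evaluated by numerical quadrature.
h0 <- function(t) 0.1 * sqrt(t)   # assumed baseline hazard (illustrative)

surv_cox <- function(t, lp) {     # lp = x'beta, the linear predictor
  H0 <- integrate(h0, lower = 0, upper = t)$value  # cumulative baseline hazard
  exp(-exp(lp) * H0)
}

surv_cox(4, lp = 0)   # survival probability at t = 4 for lp = 0
```

Because the integrand involves the spline coefficients γ_B, which change in every MCMC iteration, the quadrature has to be repeated within each iteration of the sampler.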
Imputation part

The joint distribution of the incomplete covariates, given the completely observed ones, is specified as a sequence of univariate conditional distributions,

    p(x_{i,mis} | x_{i,obs}, θ_x) = p(x_{i,mis_1} | x_{i,obs}, θ_{x_1}) Π_{ℓ=2}^{q} p(x_{i,mis_ℓ} | x_{i,obs}, x_{i,mis_1}, …, x_{i,mis_{ℓ−1}}, θ_{x_ℓ}).   (1)

Each of the conditional distributions is a member of the exponential family, extended with distributions for ordinal categorical variables, and chosen according to the type of the respective variable. Its linear predictor is

    g_ℓ{E(x_{i,mis_ℓ} | x_{i,obs}, x_{i,mis_{<ℓ}}, θ_{x_ℓ})} = (x_{i,obs}^⊤, x_{i,mis_{<ℓ}}^⊤) α_ℓ,   ℓ = 1, …, q,

where x_{i,mis_{<ℓ}} = (x_{i,mis_1}, …, x_{i,mis_{ℓ−1}})^⊤ and x_{i,obs} is the vector of values for subject i of those covariates that are observed for all subjects.
Factorization of the joint distribution of the covariates in such a sequence yields a straightforward specification of the joint distribution, even when the covariates are of mixed type.
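As an illustration of such a sequence (our notation, not taken from the paper), an incomplete continuous covariate x_1 and an incomplete binary covariate x_2 could be modelled by a linear and a logistic regression:

```latex
p(x_{i1}, x_{i2} \mid \boldsymbol{\theta}_x)
  = p(x_{i1} \mid \boldsymbol{\theta}_{x_1})\,
    p(x_{i2} \mid x_{i1}, \boldsymbol{\theta}_{x_2}),
\quad\text{with}\quad
x_{i1} \sim N(\alpha_{1,0},\ \sigma_{x_1}^2),
\qquad
x_{i2} \sim \mathrm{Bernoulli}\!\left\{
  \operatorname{logit}^{-1}(\alpha_{2,0} + \alpha_{2,1}\, x_{i1})
\right\}.
```

Each factor only conditions on covariates earlier in the sequence, so no model ever needs the variables that come after it in its linear predictor.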
Missing values in the covariates are sampled from their full-conditional distribution that can be derived from the full joint distribution of outcome and covariates.
When, for instance, the analysis model is a GLM, the full conditional distribution of an incomplete covariate x_{i,mis_ℓ} can be written as

    p(x_{i,mis_ℓ} | y_i, x_{i,obs}, x_{i,mis_{−ℓ}}, θ) ∝ p(y_i | x_{i,obs}, x_{i,mis}, θ_{y|x}) p(x_{i,mis_ℓ} | x_{i,obs}, x_{i,mis_{<ℓ}}, θ_{x_ℓ}) { Π_{k=ℓ+1}^{q} p(x_{i,mis_k} | x_{i,obs}, x_{i,mis_{<k}}, θ_{x_k}) },   (2)

where θ_{x_ℓ} is the vector of parameters describing the model for the ℓ-th covariate, and contains the vector of regression coefficients α_ℓ and potentially additional (variance) parameters. The product of distributions enclosed by curly brackets represents the distributions of those covariates that have x_{mis_ℓ} as a predictive variable in the specification of the sequence in (1).
Even though (2) describes the actual imputation model, i.e., the distribution the imputed values for x_{i,mis_ℓ} are sampled from, we will use the term "imputation model" for the conditional distribution of x_{i,mis_ℓ} from (1), since the latter is the distribution that is explicitly specified by the user and, hence, of more relevance when using JointAI.

Imputation with longitudinal outcomes
Factorizing the joint distribution into analysis model and imputation part also facilitates extensions to settings with more complex outcomes, such as repeatedly measured outcomes.
In the case where the analysis model is a GLMM or ordinal mixed model, the conditional distribution of the outcome in (2), p(y_i | x_{i,obs}, x_{i,mis}, θ_{y|x}), has to be replaced by

    Π_{j=1}^{n_i} p(y_ij | x_{i,obs}, x_{i,mis}, b_i, θ_{y|x}).   (3)

Since y does not appear in any of the other terms in (2), and (3) can be chosen to be a model that is appropriate for the outcome at hand, the thereby specified full conditional distribution of x_{i,mis} allows us to draw valid imputations that use all available information on the outcome. This is an important difference to standard FCS, where the full conditional distributions used to impute missing values are specified directly, usually as regression models, and require the outcome to be explicitly included into the linear predictor of the imputation model. In settings with complex outcomes it is not clear how this should be done and simplifications may lead to biased results (Erler et al. 2016). The joint model specification utilized in JointAI overcomes this difficulty.
When some of the covariates are time-varying, it is convenient to specify models for these variables in the beginning of the sequence of covariate models, so that models for longitudinal variables have other longitudinal and baseline covariates in their linear predictor, but longitudinal covariates do not enter the predictors of baseline covariates.
Note that, whenever there are incomplete baseline covariates, it is necessary to specify models for all longitudinal variables, even completely observed ones, while models for completely observed baseline covariates can be omitted. This becomes clear when we extend the factorized joint distribution from above with completely and incompletely observed longitudinal (level-1) covariates s_obs and s_mis:

    Π_{j=1}^{n_i} p(y_ij | s_{ij,obs}, s_{ij,mis}, x_{i,obs}, x_{i,mis}, θ_{y|x}) p(s_{ij,mis} | s_{ij,obs}, x_{i,obs}, x_{i,mis}, θ_{s_mis}) p(s_{ij,obs} | x_{i,obs}, x_{i,mis}, θ_{s_obs}) × p(x_{i,mis} | x_{i,obs}, θ_{x_mis}) p(x_{i,obs} | θ_{x_obs}).

Given that the parameter vectors θ_{x_obs}, θ_{x_mis}, θ_{s_obs} and θ_{s_mis} are a priori independent, and p(x_{i,obs} | θ_{x_obs}) is independent of both x_mis and s_mis, it can be omitted.
Since p(s_{ij,obs} | x_{i,obs}, x_{i,mis}, θ_{s_obs}), however, has x_{i,mis} in its linear predictor and will, hence, be part of the full conditional distribution of x_{i,mis}, it cannot be omitted from the model.

Non-linear associations and interactions
Other settings in which the fully Bayesian approach employed in JointAI has an advantage over standard FCS are settings with interaction terms that involve incomplete covariates, or when the association of the outcome with an incomplete covariate is non-linear. In standard FCS, such settings lead to incompatible imputation models (White, Royston, and Wood 2011; Bartlett, Seaman, White, and Carpenter 2015). This becomes clear when considering the following simple example where the analysis model of interest is the linear regression y_i = β_0 + β_1 x_i + β_2 x_i^2 + ε_i, and x_i is imputed using x_i = α_0 + α_1 y_i + ε̃_i. While the analysis model assumes a quadratic relationship, the imputation model assumes a linear association between x and y, and there cannot be a joint distribution that has the imputation and analysis model as its full conditional distributions.
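A small simulation (ours, not from the paper) makes the resulting bias visible: imputing x from a linear model while the analysis model is quadratic attenuates the estimated quadratic effect, whereas a complete-case analysis under MCAR does not:

```r
# Simulation sketch: quadratic analysis model, naive linear imputation.
set.seed(2020)
n <- 2000
x <- rnorm(n)
y <- 1 + x + x^2 + rnorm(n)            # true quadratic effect = 1
miss <- rbinom(n, 1, 0.5) == 1         # 50% of x missing completely at random
x_obs <- ifelse(miss, NA, x)

# naive imputation model: x = alpha0 + alpha1 * y + noise
imp_fit <- lm(x_obs ~ y)
x_imp <- x_obs
x_imp[miss] <- predict(imp_fit, newdata = data.frame(y = y[miss])) +
  rnorm(sum(miss), sd = summary(imp_fit)$sigma)

b2_imputed  <- coef(lm(y ~ x_imp + I(x_imp^2)))["I(x_imp^2)"]
b2_complete <- coef(lm(y ~ x + I(x^2), subset = !miss))["I(x^2)"]
round(c(imputed = unname(b2_imputed), complete_case = unname(b2_complete)), 2)
# the imputed-data estimate of the quadratic effect is clearly attenuated
```

In JointAI the analysis model itself enters the full conditional used to impute x, so the quadratic term is respected during imputation.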
Because, in JointAI, the analysis model is a factor in the full conditional distribution that is used to impute x i , the non-linear association is taken into account. Furthermore, since it is the joint distribution that is specified, and the full conditional then derived from it, the joint distribution is ensured to exist.

Prior distributions
Prior distributions have to be specified for all (hyper)parameters. A common choice for the regression coefficients is a normal prior with mean zero and large variance. In JointAI, variance parameters in models for normally distributed variables are given inverse-gamma priors that are vague by default.
The covariance matrix of the random effects in a mixed model, D, is assumed to follow an inverse Wishart distribution where the degrees of freedom are usually chosen to be equal to the dimension of the random effects, and the scale matrix is diagonal. Since the magnitude of the diagonal elements relates to the variance of the random effects, the choice of suitable values depends on the scale of the variable the random effect is associated with. Therefore, JointAI uses independent gamma hyperpriors for each of the diagonal elements. More details about the default hyperparameters and how to change them are given in Section 5.7 and Appendix A.
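Sketched in formulas (our notation; a_y, b_y, a_R and b_R stand for hyperparameters whose default values are listed in Appendix A):

```latex
\beta_k \sim N(0, \sigma_\beta^2) \ \text{with large } \sigma_\beta^2,
\qquad
\sigma_y^{-2} \sim \Gamma(a_y, b_y),\\[4pt]
\mathbf{D} \sim \mathrm{inv\text{-}Wishart}(\nu, \boldsymbol{\Omega}),
\qquad
\nu = \dim(\mathbf{b}_i),
\qquad
\Omega_{qq} \sim \Gamma(a_R, b_R) \ \text{independently for each } q,
```

where the gamma hyperpriors on the diagonal elements of the scale matrix avoid having to fix values whose appropriate magnitude depends on the scale of the data.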

Package structure
The package JointAI has eight main functions, lm_imp(), glm_imp(), clm_imp(), lme_imp(), glme_imp(), clmm_imp(), survreg_imp() and coxph_imp(), that perform regression of continuous and categorical, univariate or multi-level data as well as right-censored survival data. Model specification is similar to the specification of standard regression models in R and is described in detail in Section 5.
Based on the specified model formula and other function arguments, JointAI does some preprocessing of the data. It checks which variables are incomplete and/or time-varying, and identifies their measurement level in order to specify appropriate (imputation) models. Interactions and functional forms of variables are detected in the model formula, and corresponding design matrices for the various parts of the model are created.
MCMC sampling is performed by the program JAGS (Plummer 2003). The JAGS model, data list, containing all necessary parts of the data, and user-specified settings for the MCMC sampling (described in Section 6) are passed to JAGS via the R package rjags (Plummer 2019).
The main functions *_imp() named above all return an object of class JointAI. Summary and plotting methods for JointAI objects, as well as functions to evaluate convergence and precision of the MCMC samples, to predict from JointAI objects and to export imputed values, are discussed in Section 7.
Currently, the package works under the assumption of a missing at random (MAR) missingness process (Rubin 1976, 1987). When this assumption holds, observations with missing outcome may be excluded from the analysis in the Bayesian framework. Hence, missing values in the outcome do not require special treatment in this setting, and, therefore, our focus here is on missing values in covariates. Nevertheless, JointAI can handle missing values in the outcome; they are automatically imputed using the specified analysis model.

Example data
To illustrate the functionality of JointAI, we use three datasets that are part of this package or the package survival (Therneau 2019; Therneau and Grambsch 2000). The NHANES data contain measurements from a cross-sectional cohort study, whereas the simLong data is a simulated dataset based on a longitudinal cohort study in toddlers. The third dataset (lung) contains data on survival of patients with advanced lung cancer.

The NHANES data
The NHANES data is a subset of observations from the 2011–2012 wave of the National Health and Nutrition Examination Survey (National Center for Health Statistics (NCHS)). Figure 1 shows histograms and barplots of all continuous and categorical variables, respectively, together with the proportion of missing values for incomplete variables. Such a plot can be obtained with the function plot_all(). Arguments fill and border allow the user to change colours, the number of rows and columns can be adapted using nrow and/or ncol, and additional arguments can be passed to hist() and barplot() via "...".
The pattern of missing values in the NHANES data is shown in Figure 2. This plot can be obtained using the function md_pattern(). Again, arguments color and border allow the user to change colours and arguments such as legend.position, print_xaxis and print_yaxis permit further customization.
Each row represents a pattern of missing values, where observed (missing) values are depicted with dark (light) colour. The frequency with which each of the patterns is observed is given on the right margin, the number of missing values in each variable is given underneath the plot. Rows and columns are ordered by number of cases per pattern (decreasing) and number of missing values (increasing). The first row, for instance, shows that there are 116 complete cases, the second row that there are 29 cases for which only alc is missing. Furthermore, it is apparent that creat, albu, uricacid and bili are always missing together. Since these variables are all measured in serum, this is not surprising.
The function md_pattern() also returns the missing data pattern in matrix representation when pattern = TRUE. There, missing and observed values are represented with a 0 and 1, respectively.
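This matrix representation can be mimicked in base R for a toy data frame (a sketch of the idea, not the md_pattern() implementation):

```r
# Sketch: derive a missing data pattern matrix (1 = observed,
# 0 = missing) and the number of cases per pattern in base R.
df <- data.frame(a = c(1, NA, 3, 4),
                 b = c(1, 2, NA, 4),
                 c = c(NA, NA, 3, 4))

pat <- (!is.na(df)) * 1                         # 0/1 indicator per case
patterns <- unique(as.data.frame(pat))          # distinct patterns
counts <- table(apply(pat, 1, paste, collapse = ""))  # cases per pattern

patterns
counts
```

Sorting the rows by frequency and the columns by number of missing values yields the ordering used in the md_pattern() plot.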

The simLong data
The simLong data is a simulated dataset mimicking a longitudinal cohort study of 200 mother-child pairs. It contains the following baseline (i.e., not time-varying) covariates:
• GESTBIR: gestational age at birth in weeks; complete
• ETHN: ethnicity; binary; 2.8% missing
• AGE_M: age of the mother at intake; complete
• HEIGHT_M: height of the mother in cm; 2.0% missing
• PARITY: number of times the mother has given birth; binary; 2.4% missing
• SMOKE: smoking status of the mother during pregnancy; 3 ordered categories; 12.2% missing
• EDUC: educational level of the mother; 3 ordered categories; 7.8% missing
• MARITAL: marital status; 3 unordered categories; 7.0% missing
• ID: subject identifier
and seven longitudinal variables:
• time: measurement occasion/visit (by design, children should have been measured at/around 1, 2, 3, 4, 7, 11, 15, 20, 26, 32, 40 and 50 months of age)
• age: child's age at measurement time in months
• hgt: child's height in cm; 20% missing
• wgt: child's weight in gram; 8.8% missing
• bmi: child's BMI (body mass index) in kg/m²; 21.6% missing
• hc: child's head circumference in cm; 23.6% missing
• sleep: child's sleeping behaviour; 3 ordered categories; 24.7% missing
Figure 3 shows the longitudinal profiles of hgt, wgt, bmi and hc over age. All four variables clearly have a non-linear pattern over time.
Histograms and barplots of all variables in the simLong data are displayed in Figure 4. Here, the arguments use_level and idvar of the function plot_all() are used to display baseline (level-2) covariates on the subject level instead of the observation level:

R> plot_all(simLong, use_level = TRUE, idvar = "ID", ncol = 5)

The missing data pattern of the simLong data is shown in Figure 5. For readability, the pattern is given separately for the level-1 (left) and level-2 (right) variables. It is clearly non-monotone, but does not have any distinctive features.

The lung data
For demonstration of the use of JointAI for the analysis of survival data we use the dataset lung that is included in the R package survival. It contains data of 228 patients with advanced lung cancer and includes the following variables:
• inst: institution code; complete
• time: survival time in days; complete
• status: event indicator (1: censored, 2: dead); complete
• age: patient's age in years; complete
• sex: male (1) vs female (2); complete
• ph.ecog: ECOG performance score (describes how the disease impacts the patient's daily life); scale from 0 (no impact) to 5 (dead); 0.4% missing
• ph.karno: Karnofsky performance score as rated by physician (describes the degree of a patient's impairment by the disease); scale from 0 (dead) to 100 (no impairment); 0.4% missing
• pat.karno: Karnofsky performance score as rated by patient; 1.3% missing
• meal.cal: kilocalories consumed at meals; 20.6% missing
• wt.loss: weight loss over the last six months in kg; 6.1% missing
The missing data pattern and the distribution of the observed values of the lung data are shown in Figures 6 and 7.

Missing data visualization and exploration
There are several R packages that provide functionality for a more in-depth exploration of incomplete data, see for example the ones listed in the CRAN task view on missing data (https://CRAN.R-project.org/view=MissingData). Particularly useful may be the packages naniar (Tierney, Cook, McBain, and Fay 2019) and VIM (Kowarik and Templ 2016).
For the description of the remaining arguments see below and Section 6.
The arguments formula (in lm_imp(), glm_imp() and clm_imp()) and fixed (in lme_imp(), glme_imp() and clmm_imp()) take a standard two-sided formula object, where an intercept is added automatically (except in ordinal models). For the use of the argument random see Section 5.2.
Survival models expect the left hand side of formula to be a survival object (created with the function Surv() from package survival; see Section 5.3).
The argument family enables the choice of a distribution and link function from a range of options when using glm_imp() or glme_imp(). The implemented options are given in Table 1.

distribution   link
gaussian       identity, log, inverse
binomial       logit, probit, log, cloglog
Gamma          inverse, identity, log
poisson        log, identity

Table 1: Distribution families and link functions available in glm_imp() and glme_imp().

Interactions
In JointAI, interactions between any type of variables (observed, incomplete, time-varying) can be handled. When an incomplete variable is involved, the interaction term is re-calculated within each iteration of the MCMC sampling, using the imputed values from the current iteration. Interaction terms involving incomplete variables should hence not be pre-calculated as an additional variable since this would lead to inconsistent imputed values of main effect and interaction term.
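A minimal numeric illustration (ours) of why a precalculated interaction column goes wrong:

```r
# If the product x1 * x2 is stored as its own column before imputation,
# an imputed main effect and the separately imputed product disagree.
x1 <- c(2, NA, 4)
x2 <- c(1, 3, 5)
inter_pre <- x1 * x2           # precomputed column: c(2, NA, 20)

x1[2] <- 3.5                   # imputed value for the missing x1
inter_pre[2] <- 7              # product imputed separately, by its own model

inter_recalc <- x1 * x2        # recomputed from the current values
inter_recalc[2] == inter_pre[2]   # FALSE: 10.5 vs 7 -> inconsistent
```

Recomputing the product from the current imputed values in every iteration, as JointAI does, keeps main effects and interaction consistent by construction.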

Non-linear functional forms
In practice, associations between outcome and covariates do not always meet the standard assumption of linearity. Often, assuming a logarithmic, quadratic or other non-linear effect is more appropriate.
For completely observed covariates, JointAI can handle any common type of function implemented in R, including splines, e.g., using ns() or bs() from the package splines. Since functions involving variables that have missing values need to be re-calculated in each iteration of the MCMC sampling, currently only functions that are available in JAGS can be used for incomplete variables. Those functions include:
• log(), exp()
• sqrt(), polynomials (using I())
• abs()
• sin(), cos()
• algebraic operations involving one or multiple (in)complete variables, as long as the formula can be interpreted by JAGS.
The list of functions implemented in JAGS can be found in the JAGS user manual (Plummer 2017) available at https://sourceforge.net/projects/mcmc-jags/files/Manuals/.

Functions with restricted support
When a function of an incomplete variable has restricted support, e.g., log(x) is only defined for x > 0, or, in mod2c from above, I(creat/albu^2) cannot be calculated for albu = 0, the model specified for that incomplete variable needs to comply with these restrictions. This can be achieved either by truncating the distribution, using the argument trunc, or by selecting a distribution that meets the restrictions.
Example: When using a log transformation for the covariate bili, we can use the default imputation method "norm" (a normal distribution) and truncate it by specifying trunc = list(bili = c(lower, upper)), where lower and upper are the smallest and largest values allowed:

R> mod3a <- lm_imp(SBP ~ age + gender + log(bili) + exp(creat),
+                  trunc = list(bili = c(1e-5, 1e10)), data = NHANES)

Truncation always requires specification of both limits. Since -Inf and Inf are not accepted, a value far enough outside the range of the variable (here: 1e10) can be selected for the second boundary of the truncation interval.
Alternatively, we may choose a model for the incomplete variable (using the argument models; for more details see Section 5.4) that only imputes positive values, such as a log-normal distribution ("lognorm") or a gamma distribution ("gamma").

Functions that are not available in R
It is possible to use functions that have different names in R and JAGS, or that exist in JAGS but not in R, by defining a new function in R that has the name of the function in JAGS.
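A sketch of this alternative, assuming the model names mod3b and mod3c referred to elsewhere in the text:

```r
R> mod3b <- lm_imp(SBP ~ age + gender + log(bili) + exp(creat),
+                  models = c(bili = 'lognorm'), data = NHANES)
R> mod3c <- lm_imp(SBP ~ age + gender + log(bili) + exp(creat),
+                  models = c(bili = 'gamma'), data = NHANES)
```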
Example: In JAGS the inverse logit transformation is defined in the function ilogit. In base R, there is no function ilogit, but the inverse logit is available as the distribution function of the logistic distribution plogis(). Thus, we can define the function ilogit() as
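Since plogis() is exactly the inverse logit, the definition reduces to a simple alias:

```r
R> ilogit <- plogis
```

With this definition, ilogit() can be used inside a model formula; for incomplete covariates, the corresponding JAGS function ilogit is evaluated during sampling.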

A note on what happens inside JointAI
When a function of a complete or incomplete variable is used in the model formula, the main effect of that variable is automatically added as an auxiliary variable (more on auxiliary variables in Section 5.5), and only the main effects are used as predictors in the imputation models.
In mod2b, for example, the spline of age is used as a predictor for SBP, but in the imputation model for bili, age enters with a linear effect. This can be checked with the function list_models().

Incomplete variables are always imputed on their original scale, i.e., in mod2b the variable bili is imputed and the quadratic and cubic versions are calculated from the imputed values. Likewise, creat and albu in mod2c are imputed separately, and I(creat/albu^2) is calculated from the imputed (and observed) values. To ensure consistency between variables, functions involving incomplete variables should be specified as part of the model formula and not be pre-calculated as separate variables.

Multi level structure and longitudinal covariates
In multi-level models, additional to the fixed effects structure specified by the argument fixed, a random effects structure needs to be provided via the argument random.
Analogously to the specification of the argument random in lme(), it takes a one-sided formula starting with ~, and the grouping variable is separated by |. A random intercept is added automatically; it only needs to be given explicitly in a random-intercept-only model.
A few examples:
• random = ~ 1 | id: random intercept only, with id as grouping variable
• random = ~ time | id: random intercept and slope for variable time
• random = ~ time + I(time^2) | id: random intercept, slope and quadratic random effect for time
• random = ~ time + x | id: random intercept, random slope for time and random effect for variable x

It is possible to use splines in the random effects structure if there are no missing values in the variables involved, e.g.:

R> mod5 <- lme_imp(bmi ~ GESTBIR + ETHN + HEIGHT_M + ns(age, df = 2),
+                  random = ~ ns(age, df = 2) | ID, data = simLong)

Longitudinal ("time-varying"; level-1) covariates can be used in the fixed or random effects and will be imputed if they contain any missing values. Currently, only one level of nesting is possible.

Survival models
JointAI provides two functions to analyse survival data with incomplete covariates: survreg_imp() and coxph_imp(). Analogously to the complete data versions of these functions from the package survival, the left hand side of the model formula has to be a survival object specified using the function Surv().
Example: To analyse the lung data, we can either use a parametric Weibull model or a Cox proportional hazards model. The survival package needs to be loaded for the function Surv().
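A sketch of both options (the model names mod6a and mod6b, the chosen covariates, and the number of iterations are illustrative):

```r
R> library("survival")
R> mod6a <- survreg_imp(Surv(time, status) ~ age + sex + wt.loss,
+                       data = lung, n.iter = 250)
R> mod6b <- coxph_imp(Surv(time, status) ~ age + sex + wt.loss,
+                     data = lung, n.iter = 250)
```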

Imputation / covariate model types
JointAI automatically selects an (imputation) model for each of the incomplete baseline (level-2) or longitudinal (level-1) covariates, based on the class of the variable and the number of levels. The automatically selected types for baseline covariates are:
• norm: linear model (for continuous variables)
• logit: binary logistic model (for factors with two levels)
• multilogit: multinomial logit model (for unordered factors with > 2 levels)
• cumlogit: cumulative logit model (for ordered factors with > 2 levels)

The default methods for longitudinal covariates are:
• lmm: linear mixed model (for continuous longitudinal variables)
• glmm_logit: logistic mixed model (for longitudinal factors with two levels)
• clmm: cumulative logit mixed model (for longitudinal ordered factors with > 2 levels)

When a continuous variable has only two different values, it is automatically converted to a factor and modelled using a logistic model, unless a different model type is specified by the user. Variables of type logical are also converted to binary factors.
The (imputation) models that are chosen by default may not necessarily be appropriate for the data at hand, especially for continuous variables, which often do not comply with the assumptions of (conditional) normality.
Therefore, the following alternative imputation methods are currently available for continuous baseline covariates:
• lognorm: normal model for the log-transformed variable (for right-skewed variables > 0)
• gamma: gamma model with log-link (for right-skewed variables > 0)
• beta: beta model with logit-link (for continuous variables with values in [0, 1])

lognorm assumes a (conditional) normal distribution for the natural logarithm of the variable, but the variable enters the linear predictor of the analysis model (and possibly other covariate models) on its original scale.
For longitudinal covariates the following alternative model types are available:
• glmm_gamma: gamma mixed model with log-link (for right-skewed variables > 0)
• glmm_poisson: Poisson mixed model with log-link (for count variables)
• glmm_lognorm: normal mixed model for the log-transformed variable (for right-skewed variables > 0)

Specification of imputation model types
In models mod3b and mod3c in Section 5.1.3 we have already seen two examples in which the imputation model type was changed using the argument models. This argument takes a named vector of (imputation) model types, where the names are the names of covariates. When the vector supplied to models only contains specifications for a subset of the incomplete and longitudinal covariates, default models are used for the remaining ones. As explained in Section 2.2.1, models for completely observed longitudinal covariates only need to be specified when any of the baseline covariates have missing values.
R> mod7a <- lm_imp(SBP ~ age + gender + WC + alc + bili + occup + smoke,
+                  models = c(WC = 'gamma', bili = 'lognorm'),
+                  data = NHANES, n.iter = 100)
R>
R> mod7a$models
      WC      smoke      bili        occup     alc
 "gamma" "cumlogit" "lognorm" "multilogit" "logit"

The function get_models(), which finds and assigns the default (imputation) methods, can be called directly. get_models() has the arguments
• fixed: the fixed effects formula
• random: the random effects formula (if necessary)
• data: the dataset
• auxvars: an optional one-sided formula specifying auxiliary variables
• no_model: an optional vector of names of covariates for which no model will be specified
• models: an optional named vector of model types chosen by the user

R> mod7b_models <- get_models(bmi ~ GESTBIR + ETHN + HEIGHT_M + SMOKE + hc +
+                             MARITAL + ns(age, df = 2),
+                             random = ~ ns(age, df = 2) | ID, data = simLong)
R> mod7b_models
$models
HEIGHT_M     ETHN      MARITAL      SMOKE   age    hc
  "norm"  "logit" "multilogit" "cumlogit" "lmm" "lmm"

$meth
HEIGHT_M     ETHN      MARITAL      SMOKE    hc
  "norm"  "logit" "multilogit" "cumlogit" "lmm"

get_models() returns a list of two vectors:
• models: containing all specified models
• meth: containing the models for the incomplete variables only

When there is a "time" variable in the model, such as age in our example (which is the age of the child at the time of the measurement), it may not be meaningful to specify a model for that variable. Especially when the "time" variable is pre-specified by the design of the study, it can usually be assumed to be independent of the covariates, and a model for it has no useful interpretation.
Order of the sequence of imputation models

By default, models for level-1 covariates are specified after models for level-2 covariates, so that the level-2 covariates are used as predictor variables in the models for level-1 covariates, but not vice versa. Within the two groups, models are ordered by the number of missing values (decreasing), so that the model for the variable with the most missing values has the most variables in its linear predictor.

Auxiliary variables
Auxiliary variables are variables that are not part of the analysis model but should be considered as predictor variables in the imputation models because they can inform the imputation of unobserved values.

Good auxiliary variables are (Van Buuren 2012)
• associated with an incomplete variable of interest, or
• associated with the missingness of that variable, and
• do not have too many missing values themselves.

Importantly, they should be observed for a large proportion of the cases that have a missing value in the variable to be imputed.
In *_imp(), auxiliary variables can be specified with the argument auxvars, which takes a one-sided formula.
Example: We might consider the variables educ and smoke as predictors for the imputation of occup:

R> mod8a <- lm_imp(SBP ~ gender + age + occup, auxvars = ~ educ + smoke,
+                  data = NHANES, n.iter = 100)

The variables educ and smoke are not included in the analysis model (as can be seen when printing the posterior means of the regression coefficients of the analysis model with coef()).

Functions of variables as auxiliary variables
As shown above in mod4b, it is possible to specify functions of auxiliary variables. In that case, the auxiliary variable is not considered as a linear effect but as specified by the function.
Note that omitting auxiliary variables from the analysis model implies that the outcome is independent of these variables, conditional on the other variables in the model. If this is not true, the model is misspecified which may lead to biased results (similar to leaving a confounding variable out of a model).

Reference values for categorical covariates
In JointAI, dummy coding is used when categorical variables enter a linear predictor, i.e., a binary variable is created for each category, except the reference category. These binary variables have value one when that category was observed and zero otherwise. Contrary to the behaviour in base R, this coding is also used for ordered factors. It is not possible to change this behaviour by changing the contrasts specification in the base R options().
By default, the first category of a categorical variable (ordered or unordered) is used as reference, however, this may not always allow the desired interpretation of the regression coefficients. Moreover, when categories are unbalanced, setting the largest group as reference may result in better mixing of the MCMC chains. Therefore, JointAI allows specification of the reference category separately for each variable, via the argument refcats. Changes in refcats will not impact the imputation of the respective variable, but change categories for which dummy variables are included in the linear predictor of the analysis model or other covariate models.

Setting reference categories for individual variables
Alternatively, refcats takes a named list or vector, in which the reference category for each variable can be specified either by its number, its name, or one of the three global types: "first", "last" or "largest". For variables for which no reference category is specified, the default is used.
R> mod9b <- lm_imp(SBP ~ gender + age + race + educ + occup + smoke,
+                  refcats = list(occup = "not working", race = 3,
+                                 educ = 'largest'),
+                  data = NHANES)

To facilitate specification of the reference categories, the function set_refcat() can be used. It prints the names of the categorical variables that are selected by
• a specified model formula (using the argument formula), and/or
• a one-sided formula specifying auxiliary variables (using the argument auxvars), or
• a vector naming covariates (using the argument covars),
or all categorical variables in the data if only the argument data is provided, and asks a number of questions which the user needs to reply to via input of a number.

Hyperparameters
In the Bayesian framework, parameters are random variables for which a distribution needs to be specified. These distributions depend on parameters themselves, i.e., on hyperparameters.
The function default_hyperpars() returns a list containing the default hyperparameters used in a JointAI model (see Appendix A). mu_reg_* and tau_reg_* refer to the mean and precision of the prior distribution for regression coefficients. shape_tau_* and rate_tau_* are the shape and rate parameters of a gamma distribution that is used as prior for precision parameters. RinvD is the scale matrix in the Wishart prior for the inverse of the random effects variance-covariance matrix D, and KinvD is the number of degrees of freedom in that distribution. shape_diag_RinvD and rate_diag_RinvD are the shape and rate parameters of the gamma prior for the diagonal elements of RinvD.
In random effects models with only one random effect, a gamma prior is used instead of the Wishart distribution for the inverse of D.
The hyperparameters mu_reg_surv and tau_reg_surv are used in survreg_imp() as well as coxph_imp().
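To change a hyperparameter, the returned list can be modified and passed back via the argument hyperpars. A sketch, assuming (based on the naming scheme described above) that the list contains an element norm with an entry tau_reg_norm; the model name mod_hp is illustrative:

```r
R> hp <- default_hyperpars()
R> hp$norm['tau_reg_norm'] <- 0.01
R> mod_hp <- lm_imp(SBP ~ gender + age + WC, data = NHANES,
+                   hyperpars = hp, n.iter = 100)
```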

Scaling
When variables are measured on very different scales this can result in slow convergence and bad mixing. Therefore, JointAI automatically scales continuous variables to have mean zero and standard deviation one. Results (parameters and imputed values) are transformed back to the original scale.
In some settings, however, it is not possible to scale continuous variables. This is the case for those incomplete variables that enter a linear predictor in a function and for incomplete variables that are imputed with models that are defined on a subset of the real line (i.e., with a gamma or a beta distribution). Variables that are imputed with a log-normal distribution are scaled, but not centred. To prevent scaling, the argument scale_vars in *_imp() can be set to FALSE. When a vector of variable names is supplied to scale_vars, only those variables are scaled.
By default, only the MCMC samples that are scaled back to the scale of the data are stored in a JointAI object. When the argument keep_scaled_mcmc = TRUE, the scaled sample is also kept.
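For example, to restrict scaling to the variable WC (a hypothetical sketch; the model name mod_sc is illustrative):

```r
R> mod_sc <- lm_imp(SBP ~ gender + age + WC + bili, data = NHANES,
+                   scale_vars = c('WC'), n.iter = 100)
```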

Ridge regression
Using the argument ridge = TRUE it is possible to impose a ridge penalty on the coefficients of the analysis model, shrinking these parameters towards zero. This is done by specification of a Gamma(0.01, 0.01) prior for the precision of the regression coefficients β instead of setting it to a fixed (small) value.
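For example (the model name mod_ridge is illustrative):

```r
R> mod_ridge <- lm_imp(SBP ~ gender + age + WC + alc + bili,
+                      data = NHANES, ridge = TRUE, n.iter = 100)
```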

JAGS model file
Using the user-specified or default settings described above, JointAI will write the JAGS model. By default, the model is written to a temporary file and deleted afterwards. When the argument keep_model is set to TRUE, the model file will be kept. The arguments modelname and modeldir allow the user to specify the name of the file (including the file extension, e.g., .R or .txt) and the file location. When a file with that same name already exists in the given location, a question is prompted giving the user the option to use the existing file or to overwrite it. To prevent the question, the argument overwrite can be set to TRUE or FALSE.
The functionality of using an existing JAGS model file enables the user to make changes to the JAGS model that is created automatically by JointAI, for example to change the type of prior distribution used for a particular parameter.

MCMC settings
The main functions *_imp() have a number of arguments that specify settings for the MCMC sampling:
• n.chains: number of MCMC chains
• n.adapt: number of iterations in the adaptive phase
• n.iter: number of iterations in the sampling phase
• thin: thinning degree
• monitor_params: parameters/nodes to be monitored
• seed: optional seed value for reproducibility
• inits: initial values
• parallel: whether MCMC chains should be sampled in parallel
• n.cores: how many cores are available for parallel sampling

The first four, as well as the following two parameters, are passed directly to functions from the R package rjags:
• quiet: should printing of information be suppressed?
• progress.bar: type of progress bar ("text", "gui" or "none")

In the following sections, the arguments listed above are explained in more detail and examples are given.

Number of chains
To evaluate convergence of MCMC chains it is helpful to create multiple chains that have different starting values. More information on how to evaluate convergence and the specification of initial values can be found in Sections 7.3.1 and 6.3, respectively.
The argument n.chains selects the number of chains (by default n.chains = 3). For calculating the model summary, multiple chains are merged.

Adaptive phase
JAGS has an adaptive mode, in which samplers are optimized (for example the step size is adjusted). Samples obtained during the adaptive mode do not form a Markov chain and are discarded. The argument n.adapt controls the length of this adaptive phase.
The default value for n.adapt is 100, which works well in many of the examples considered here. Complex models may require longer adaptive phases. If the adaptive phase is not sufficient for JAGS to optimize the samplers, a warning message will be printed (see example below).

Sampling iterations
n.iter specifies the number of iterations in the sampling phase, i.e., the length of the MCMC chain. How many samples are required to reach convergence and to have sufficient precision (see also Section 7.3.2) depends on the complexity of data and model, and may range from as few as 100 to several million.

Thinning
In settings with high autocorrelation, it may take many iterations before a sample is created that sufficiently represents the whole range of the posterior distribution. Processing of such long chains can be slow and cause memory issues. The parameter thin allows the user to specify if and how much the MCMC chains should be thinned out before storing them. By default thin = 1 is used, which corresponds to keeping all values. A value thin = 10 would result in keeping every 10th value and discarding all other values.
Example: default settings
Using the default settings n.adapt = 100 and thin = 1, and 100 sampling iterations, a simple model would be

R> mod10a <- lm_imp(SBP ~ alc, data = NHANES, n.iter = 100)

The relevant part of the model summary (obtained with summary()) shows that the first 100 iterations (adaptive phase) were discarded, the 100 iterations that follow form the posterior sample, thinning was set to 1, and there are three chains.

Example: thinning

R> mod10c <- lm_imp(SBP ~ alc, data = NHANES, n.iter = 500, thin = 10)

Here, iterations 110 until 600 are used in the output, but due to a thinning interval of ten, the resulting MCMC chains contain only 50 samples instead of 500, that is, the samples from iterations 110, 120, 130, . . .

Parameters to follow
JointAI uses JAGS (Plummer 2003) to perform the MCMC sampling, and JAGS only saves the values of the MCMC chains for those nodes which the user has specified should be monitored. The same therefore holds in JointAI.
For this purpose, the main functions *_imp() have an argument monitor_params, which takes a named list (or a named vector) with possible entries given in Table 6.2. This table contains a number of keywords that refer to (groups of) nodes. Each of the keywords works as a switch and can be specified as TRUE or FALSE (with the exception of other).

Parameters of the analysis model
The default setting is monitor_params = c(analysis_main = TRUE), i.e., only the main parameters of the analysis model are monitored, and monitoring is switched off for all other parameters. Main parameters are the regression coefficients of the analysis model (beta), depending on the model type, the residual standard deviation (sigma_y), and, for mixed models, the random effects variance-covariance matrix D.
The function parameters() returns the parameters specified to be followed (also for models where no MCMC sampling was performed, i.e., when n.iter = 0 and n.adapt = 0). We use it here to demonstrate the effect that different choices for monitor_params have. For example:

R> mod11a <- lm_imp(SBP ~ gender + WC + alc + creat, data = NHANES, n.adapt = 0)
R>
R> parameters(mod11a)
[1] "(Intercept)"  "genderfemale" "WC"           "alc>=1"
[5] "creat"        "sigma_SBP"

Note that the elements of beta are re-named after the names of the covariates they refer to, in order to facilitate interpretation of the results.

Parameters of covariate models and imputed values
The parameters of the models for covariates can be selected with monitor_params = c(imp_pars = TRUE). This will set monitors for the vector of regression coefficients (alpha) and other parameters, such as precision (tau_*) and intercepts and increments (gamma_* and delta_*) in cumulative logit models.

Random effects
For mixed models, analysis_main also includes the random effects variance-covariance matrix D. Setting analysis_random = TRUE will switch on monitoring for the random effects b (ranef), the random effects variance-covariance matrix (D), the inverse of the random effects variance-covariance matrix (invD) and the diagonal of the scale matrix of the Wishart prior of invD (RinvD).

Other parameters
The element other in monitor_params allows the specification of one or multiple additional parameters to be monitored. When other is used with more than one element, monitor_params has to be a list. Here, as an example, we monitor the probability of being in the alc>=1 group for subjects one through three and the expected value of the distribution of creat for the first subject.
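A sketch of such a specification; the node names p_alc and mu_creat are assumptions about the automatically generated JAGS code and may differ in a given model:

```r
R> mod11e <- lm_imp(SBP ~ gender + WC + alc + creat, data = NHANES,
+                   n.iter = 100,
+                   monitor_params = list(other = c('p_alc[1:3]',
+                                                   'mu_creat[1]')))
```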

Initial values
Initial values are the starting point for the MCMC sampler. Setting good initial values, i.e., values that are likely under the posterior distribution, can speed up convergence. By default, the argument inits = NULL, which means that initial values are generated automatically by JAGS. It is also possible to supply initial values directly as a list or as a function.
Initial values can be specified for every unobserved node, that is, parameters and missing values, and it is possible to specify initial values for only a subset of nodes.
When the initial values provided by the user do not have elements named ".RNG.name" or ".RNG.seed", JointAI will add those elements, which specify the name and seed value of the random number generator used for each chain. The argument seed allows the specification of a seed value with which the starting values of the random number generator, and, hence, the values of the MCMC sample, can be reproduced.

Initial values in a list of lists
A list containing initial values should have the same length as the number of chains, where each element is a named list of initial values. Moreover, initial values should differ between the chains.
For example, to create initial values for the parameter vector beta and the precision parameter tau_SBP for two chains, the following syntax could be used:

R> init_list <- lapply(1:2, function(i) {
+    list(beta = rnorm(4),
+         tau_SBP = rgamma(1, 1, 1))
+ })
R>
R> init_list

The user-provided lists of initial values (and starting values for the random number generator) are stored in the JointAI object and can be accessed via mod11a$mcmc_settings$inits.

Initial values as a function
Initial values can be specified as a function. The function should either take no arguments or a single argument called chain, and return a named list that supplies values for one chain.
For example, to create initial values for the parameter vectors beta and alpha:

R> inits_fun <- function() {
+    list(beta = rnorm(4),
+         alpha = rnorm(3))
+ }

When a function is supplied, it is evaluated by JointAI and the resulting list is stored in the JointAI object.

For which nodes can initial values be specified?
Initial values can be specified for all unobserved stochastic nodes, i.e., parameters or unobserved data for which a distribution is specified in the JAGS model. They have to be supplied in the format of the parameter or unobserved value in the JAGS model. To find out which nodes there are in a model and in which form they have to be specified, the function coef() from package rjags can be used to obtain a list with the current values of the MCMC chains (by default the first chain) from a JAGS model object. This object is contained in a JointAI object under the name model. Elements of the initial values should have the same structure as the elements in this list of current values.

Example:
We are using a longitudinal model and the simLong data in this example. Here we only show some relevant parts of the output of coef(mod12c$model).
R> mod12c <- lme_imp(bmi ~ time + HEIGHT_M + hc + SMOKE, random = ~ time | ID,
+                    no_model = 'time', data = simLong)

RinvD is the scale matrix in the Wishart prior for the inverse of the random effects variance-covariance matrix D. In the data that is passed to JAGS (which is stored in the element data_list in a JointAI object), this matrix is specified as a diagonal matrix with unknown diagonal elements. The element RinvD in the initial values therefore has to be a 2 × 2 matrix, with positive values on the diagonal and NA as off-diagonal elements, since these are fixed in the data.

There are no initial values specified for the third and fourth column, since these columns contain the dummy variables corresponding to the categorical variable SMOKE and are calculated from the corresponding column in the matrix Xcat, i.e., these are deterministic, not stochastic, nodes. Elements that are completely unobserved, like the parameter vectors alpha and beta, the random effects b, or the scalar parameters delta_SMOKE and gamma_SMOKE, are entirely specified in the initial values.

Parallel sampling
To reduce the computational time it is possible to perform sampling of multiple MCMC chains in parallel. This can be specified by setting the argument parallel = TRUE. The maximum number of cores to be used can be set with the argument n.cores. By default this is two less than the number of cores available on the machine, but never more than the number of MCMC chains.
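For example (a hypothetical sketch; the model name mod14a is illustrative, and with the default of three chains at most three cores would be used):

```r
R> mod14a <- lm_imp(SBP ~ gender + WC + alc + creat, data = NHANES,
+                   n.iter = 500, parallel = TRUE, n.cores = 2)
```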
Parallel computation is done using the packages foreach (Microsoft and Weston 2019) and doParallel (Microsoft Corporation and Weston 2019). Note that it is currently not possible to display a progress bar when using parallel computation and that (warning) messages produced during sampling (for instance about an insufficient adaptive phase) are not returned.

After fitting
Any of the main functions *_imp() will return an object of class JointAI. It contains the original data (data), information on the type of model (call, analysis_type, models, fixed, random, hyperpars, scale_pars) and MCMC sampling (mcmc_settings), the JAGS model (model) and MCMC sample (MCMC; if a sample was generated), the computational time (time) of the MCMC sampling, and some additional elements that are used by methods for objects of class JointAI but are typically not of interest to the user.
In this section, we describe how the results from a JointAI model can be visualized, summarized and evaluated. The functions described here use, by default, the full MCMC sample and show only the parameters of the analysis model. Arguments start, end, thin and exclude_chains are available to select a subset of the iterations of the MCMC sample that is used to calculate the summary. The argument subset allows the user to control for which nodes the summary or visualization is returned and follows the same logic as the argument monitor_params in *_imp(). The use of these arguments is further explained in Section 7.4.

Visualizing the posterior sample
The posterior sample can be visualized by two commonly used plots: a traceplot, showing samples across iterations, and a plot of the empirical density of the posterior sample.

Traceplot
A traceplot shows the sampled values per chain and node across iterations. It allows the visual evaluation of convergence and mixing of the chains and can be obtained with the function traceplot().
R> mod13a <- lm_imp(SBP ~ gender + WC + alc + creat, data = NHANES,
+                   n.iter = 500, seed = 2019)
R>
R> traceplot(mod13a)

When the sampler has converged, the chains show one horizontal band, as in Figure 8. Consequently, when traces show a trend, convergence has not been reached and more iterations are necessary (e.g., using add_samples()).

Graphical aspects of the traceplot can be controlled by specifying standard graphical arguments via the dots argument "...", which are passed to matplot() (part of base R). This allows the user to change colour, line type and width, limits, and so on. The arguments nrow and/or ncol can be supplied to set specific numbers of rows and columns for the layout of the grid of plots.
With the argument use_ggplot it is possible to get a ggplot2 (Wickham 2016) version of the traceplot. It can be extended using standard ggplot2 syntax. The output of the following syntax is shown in Figure 9.
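A sketch of such syntax (the added theme and label layers are illustrative, not taken from the original figure code):

```r
R> library("ggplot2")
R> traceplot(mod13a, use_ggplot = TRUE) +
+    theme(legend.position = 'bottom') +
+    xlab('iteration') + ylab('sampled value')
```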

Densityplot
The posterior distributions can also be visualized using the function densplot(), which plots the empirical density per node per chain, or combining chains (when joined = TRUE). The argument vlines takes a list of lists, containing specifications passed to the function abline() (part of base R), and allows the addition of (vertical) lines to the plot, e.g., marking zero, or marking the posterior mean and the 2.5% and 97.5% quantiles (Figure 10).
As with traceplot() it is possible to use the ggplot2 version of densplot() when setting use_ggplot = TRUE. Here, vertical lines can be added as additional layers. Figure 11 shows, as an example, the posterior density from mod13a to which vertical lines, representing the 95% credible interval and a 95% confidence interval from a complete case analysis, are added. The corresponding syntax is given in Appendix B.

Model Summary
A summary of the posterior distribution estimated in a JointAI model can be obtained using the function summary().
The posterior summary consists of the mean, standard deviation and quantiles (by default the 2.5% and 97.5% quantiles) of the MCMC samples from all chains combined, as well as the tail probability (see below) and Gelman-Rubin criterion (see Section 7.3.1).
Additionally, some important characteristics of the MCMC samples on which the summary is based, are given. This includes the range and number of iterations (Sample size per chain), thinning interval and number of chains. Furthermore, the number of observations (number of rows in the data) is printed.

Tail probability
The tail probability, calculated as 2 × min{Pr(θ > 0), Pr(θ < 0)}, where θ is the parameter of interest, is a measure of how likely the value 0 is under the estimated posterior distribution. Figure 12 visualizes three examples of posterior distributions and the corresponding minimum of Pr(θ > 0) and Pr(θ < 0) (shaded area).
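For a single parameter, this measure can be computed directly from a vector of posterior draws (a generic sketch of the formula, not JointAI's internal implementation):

```r
# theta: numeric vector of MCMC draws for one parameter.
# The tail probability 2 * min(Pr(theta > 0), Pr(theta < 0)) is
# estimated by the corresponding sample proportions.
tail_prob <- function(theta) {
  2 * min(mean(theta > 0), mean(theta < 0))
}
```

A distribution whose mass lies mostly away from zero yields a small tail probability, while a distribution centred at zero yields a value close to 1.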

Evaluation criteria
Convergence of the MCMC chains and precision of the posterior sample can also be evaluated in a more formal manner. Implemented in JointAI are the Gelman-Rubin criterion for convergence (Gelman and Rubin 1992; Brooks and Gelman 1998) and a comparison of the Monte Carlo error with the posterior standard deviation.

Gelman-Rubin criterion
The Gelman-Rubin criterion (Gelman and Rubin 1992; Brooks and Gelman 1998), also referred to as the "potential scale reduction factor", evaluates convergence by comparing within- and between-chain variability and thus requires at least two MCMC chains to be calculated. It is implemented for JointAI objects in the function GR_crit(), which is based on the function gelman.diag() from the package coda (Plummer, Best, Cowles, and Vines 2006). The upper limit of the confidence interval should not be much larger than 1.
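A minimal sketch, assuming mod13a from above (which was fitted with the default number of chains, so at least two are available):

```r
library(JointAI)

# Gelman-Rubin criterion for the monitored parameters of mod13a;
# the upper confidence limits should not be much larger than 1
GR_crit(mod13a)
```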

Monte Carlo Error
Precision of the MCMC sample can be checked with the function MC_error(). It uses the function mcmcse.mat() from the package mcmcse (Flegal, Hughes, Vats, and Dai 2017) to calculate the Monte Carlo error (the error made because the sample is finite) and compares it to the standard deviation of the posterior sample. A rule of thumb is that the Monte Carlo error should not be more than 5% of the standard deviation (Lesaffre and Lawson 2012). Besides the arguments explained in Section 7.4, MC_error() takes the arguments of mcmcse.mat(). MC_error() returns an object of class MCElist, which is a list of matrices containing the posterior mean, estimated Monte Carlo error, posterior standard deviation and the ratio of the Monte Carlo error to the posterior standard deviation, for the scaled (if this MCMC sample was included in the JointAI object) and unscaled (transformed back to the scale of the data) posterior samples. The associated print method prints only the latter.
To facilitate quick evaluation of the ratio of Monte Carlo error to posterior standard deviation, plotting an object of class MCElist using plot() shows this ratio for each (selected) node and automatically adds a vertical line at the desired cut-off (by default 5%; see Figure 13):

R> plot(MC_error(mod13a))
R> plot(MC_error(mod13a, end = 250))

Subset of the MCMC sample
By default, the functions traceplot(), densplot(), summary(), predict(), GR_crit() and MC_error() use all iterations of the MCMC sample and consider only the parameters of the analysis model (if they were monitored). In this section we describe how the set of iterations and parameters to display can be changed using the arguments subset, start, end and thin.

Subset of parameters
When the main parameters of the analysis model have been monitored in a JointAI object, i.e., when monitor_params was set to TRUE, only these parameters are returned in the model summary, plots and criteria shown above. If the main parameters of the analysis model were not monitored and the argument subset is not specified, all parameters that were monitored are displayed.
To display output for nodes other than the main parameters of the analysis model or for a subset of nodes, the argument subset needs to be specified. It follows the same logic as the argument monitor_params of *_imp explained in Section 6.2.
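A sketch of such a subset specification (assuming a model like mod13d below, in which imputed values were monitored via monitor_params = c(imps = TRUE); the keyword analysis_main is an assumption based on the monitor_params logic):

```r
library(JointAI)

# Show only the chains of the imputed values, suppressing the
# main parameters of the analysis model
traceplot(mod13d, subset = c(analysis_main = FALSE, imps = TRUE))
```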
Example: When the number of imputed values is large or in order to check convergence of random effects, it may not be feasible to plot and inspect all traceplots. In that case, a random subset of, for instance, the random effects, can be selected.
Re-fit the model, monitoring the random effects:

R> mod13e <- update(mod13b, monitor_params = c(ranef = TRUE))

Extract random intercepts and random slopes:

Plot the chains of 12 randomly selected random intercepts and slopes (output not shown):

Subset of MCMC samples
With the arguments start, end and thin it is possible to select which iterations from the MCMC sample are included in the summary. start and end specify the first and last iterations to be used, thin the thinning interval. Specification of start thus allows the user to discard a "burn-in", i.e., the iterations before the MCMC chain had converged.
If a particular chain has not converged, it can be excluded from the result summary or plot using the argument exclude_chains, which takes a numeric vector identifying the chains to be excluded, e.g., exclude_chains = c(1, 3).
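Combining these arguments, a sketch (assuming mod13a from above, which was run for 500 iterations):

```r
library(JointAI)

# Discard the first 250 iterations as burn-in, keep every 2nd
# remaining draw, and exclude chain 2 from the summary
summary(mod13a, start = 251, end = 500, thin = 2, exclude_chains = 2)
```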

Predicted values
Often, the aim of an analysis is not only to estimate the association between outcome and covariates but to predict future outcomes or outcomes for new subjects.
The function predict() allows us to obtain predicted values and corresponding credible intervals from JointAI objects. Note that for mixed models, currently only prediction for an "average" subject is implemented, not prediction conditional on the random effects. A dataset containing the data for which the prediction should be performed is specified via the argument newdata. If no newdata is given, the original data are used. The argument quantiles allows specification of the quantiles of the posterior sample that are used to obtain the credible interval (by default the 2.5% and 97.5% quantiles). The arguments start, end, thin and exclude_chains control the subset of MCMC samples that is used. predict() returns a list with elements dat (a dataset consisting of newdata with the predicted values and quantiles appended), fit (the predicted values) and quantiles (the quantiles that form the credible interval).
Via the argument type the user can specify the scale of the predicted values. For generalized (mixed) models, predicted values can be calculated on the scale of the linear predictor (type = "link") or the scale of the response (type = "response"). For ordinal (mixed) models it is possible to return the posterior probability of each of the outcome categories (type = "prob") or the class with the highest mean posterior probability (type = "class"). Note that, since quantiles is an array when type = "prob", it is not appended to dat in that case.
For survival models, type = "lp" is synonymous with type = "link", and type = "risk" returns exp(lp) in Cox models.
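A sketch for mod13a (the covariate values and factor level names in newDF are assumptions about the NHANES data used above):

```r
library(JointAI)

# Predicted SBP with 95% credible interval for one hypothetical subject
newDF <- data.frame(gender = "female", WC = 90, alc = "<=1", creat = 0.8)
pred  <- predict(mod13a, newdata = newDF, quantiles = c(0.025, 0.975))

pred$dat  # newDF with the predicted value and quantiles appended
```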

Prediction to visualize non-linear effects
Another reason to obtain predicted values is the visualization of non-linear effects (see Figure 14). To facilitate the generation of a dataset for such a prediction, the function predDF() can be used. It generates a data.frame that contains a sequence of values through the range of observed values for a covariate specified by the argument var, and the median or reference value for all other continuous and categorical variables.
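A sketch combining predDF() and predict() to visualize the effect of WC (assuming mod13a from above; the column names "fit", "2.5%" and "97.5%" in the returned dataset are assumptions):

```r
library(JointAI)

# Sequence through the observed range of WC, with all other
# covariates held at their median or reference value
newDF <- predDF(mod13a, var = "WC")
pred  <- predict(mod13a, newdata = newDF)

# Plot the fit and credible interval against WC
matplot(pred$dat$WC,
        pred$dat[, c("fit", "2.5%", "97.5%")],
        type = "l", lty = c(1, 2, 2), col = 1,
        xlab = "WC", ylab = "predicted SBP")
```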

Export of imputed values
Imputed datasets can be extracted from a JointAI object (in which a monitor for the imputed values has been set, i.e., monitor_params = c(imps = TRUE)) with the function get_MIdat().
A completed dataset is created by taking the imputed values from a randomly chosen iteration of the MCMC sample, transforming them back to the original scale (if scaling was performed before the MCMC sampling) and filling them into the original incomplete data.
The argument m specifies the number of imputed datasets to be created, include controls whether the original data are included in the long format data.frame (default is include = TRUE), start specifies the first iteration that may be used, and minspace is the minimum number of iterations between iterations eligible for selection. To make the selection of iterations reproducible, a seed value can be specified via the argument seed.
When export_to_SPSS = TRUE the imputed data is exported to SPSS (IBM SPSS Statistics, IBM Corp.), i.e., a .txt file containing the data and a .sps file containing SPSS syntax to convert the data into an SPSS data file (with ending .sav) are written. Arguments filename and resdir allow specification of the name of the .txt and .sps file and the directory they are written to.
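A sketch of the export (assuming mod13d below; the filename and directory values are hypothetical):

```r
library(JointAI)

# Write a .txt file with the imputed data and a .sps file with
# SPSS syntax to convert it into a .sav file
impDF <- get_MIdat(mod13d, m = 5, seed = 2019,
                   export_to_SPSS = TRUE,
                   filename = "NHANES_imps", resdir = tempdir())
```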
get_MIdat() returns a long-format data.frame containing the imputed datasets (and by default the original data) stacked on top of each other. The imputation number is given in the variable Imputation_, column .id contains a newly created id variable for each observation in cross-sectional data (multi-level data should already contain an id variable) and the column .rownr identifies rows of the original data (relevant in multi-level data).
R> impDF <- get_MIdat(mod13d, m = 10, seed = 2019)

The function plot_imp_distr() allows visual comparison of the distributions of the observed and imputed values (Figure 15):

R> plot_imp_distr(impDF, nrow = 1)

The distribution of the observed values is shown in dark blue, the distribution of the imputed values per dataset in light blue.

Assumptions and extensions
Like any statistical model, the approach to imputation followed in JointAI relies on assumptions that need to be satisfied in order to obtain valid results.
A commonly made assumption that is also required for JointAI is that the missing data mechanism is ignorable, i.e., that data are missing at random (MAR) or missing completely at random (MCAR) (Rubin 1976) and that the parameters in the model of the missingness mechanism are independent of the parameters in the data model (Schafer 1997). It is the task of the researcher to critically evaluate whether this assumption is satisfied for a given dataset and model.
Furthermore, all models involved in the imputation and analysis need to be correctly specified. In current implementations of imputation procedures in software, imputation models are typically automatically specified, using standard assumptions like linear associations and default model types. In JointAI, the arguments models and auxvar permit tailoring of the automatically chosen models to some extent, by allowing the user to chose non-normal imputation models for continuous variables and to include variables or functional forms of variables that are not used in the analysis model in the linear predictor of the imputation models.
When using auxiliary variables in JointAI, it should be noted that, due to the ordering of the conditional distributions in the sequence of models, the auxiliary variable is implicitly assumed to be independent of the outcome: neither does the model for the auxiliary variable (if it has missing values) have the outcome in its linear predictor, nor vice versa.
Moreover, in order to make software usable, default values have to be chosen for various parameters. These defaults are chosen to work well in certain settings, but cannot be guaranteed to be appropriate in general, and it is the task of the user to make appropriate changes. In JointAI this concerns, for example, the choice of hyperparameters and the automatically chosen types of imputation models.
To expand the range of settings in which JointAI provides a valid and user-friendly way to simultaneously analyse and impute data, several extensions are planned. These include:
• Implementation of (penalized) splines for incompletely observed covariates.
• Evaluation of model fit by providing functionality to perform posterior predictive checks.
• Implementation of subject-specific prediction from mixed models.
• Extension of the Cox model to handle time-varying covariates.
• Implementation of imputation for unordered level-1 factors.
• Implementation of additional choices of shrinkage priors (such as lasso and elastic net).