Generating Adaptive and Non-Adaptive Test Interfaces for Multidimensional Item Response Theory Applications

Computerized adaptive testing (CAT) is a powerful technique to help improve measurement precision and reduce the total number of items required in educational, psychological, and medical tests. In CATs, tailored test forms are progressively constructed by capitalizing on information available from responses to previous items. CAT applications primarily have relied on unidimensional item response theory (IRT) to help select which items should be administered during the session. However, multidimensional CATs may be constructed to improve measurement precision and further reduce the number of items required to measure multiple traits simultaneously. A small selection of CAT simulation packages exists for the R environment; namely, catR (Magis and Raîche 2012), catIrt (Nydick 2014), and MAT (Choi and King 2014). However, the ability to generate graphical user interfaces for administering CATs in real-time has not been implemented in R to date, support for multidimensional CATs has been limited to the multidimensional three-parameter logistic model, and CAT designs were required to contain IRT models from the same modeling family. This article describes a new R package for implementing unidimensional and multidimensional CATs using a wide variety of IRT models, which can be unique for each respective test item, and demonstrates how graphical user interfaces and Monte Carlo simulation designs can be constructed with the mirtCAT package.


Introduction
Computerized adaptive testing (CAT) is a methodology designed to reduce the length of educational, psychological, and medical tests. In contrast to fixed linear tests (e.g., paper-and-pencil forms, or digital surveys where questions are administered in sequence), CATs attempt to select optimal items based on selection rules that capitalize on pre-calibrated item information and the participants' provisional trait estimates (Weiss 1982). Throughout a CAT session, the trait estimates are updated as the responses to items are collected. The trait estimates serve as a basis in determining which items should be administered next, and the associated standard errors for the estimates help inform whether the CAT session should be terminated early. In common CAT designs, items are administered when they are believed to effectively reduce the expected standard error of measurement of the latent trait values. Administering items which optimally reduce the standard error of measurement helps to create efficient test forms that improve the measurement reliability for a given participant by using a smaller subset of test items (Wainer and Dorans 2000).
In order to implement CATs effectively, various item-level characteristics must be known a priori. Specifically, the parameters required to operationalize the item selection process, as well as to compute provisional latent trait estimates, are generally adopted from the item response theory paradigm (IRT; Lord 1980). IRT parameters can be estimated for tests containing unidimensional or multidimensional latent trait structures, and offer a parametric mechanism to model the interaction between participants and item characteristics (Reckase 2009). CATs based on a unidimensional latent trait assumption have been extensively studied in methodological literature; however, with the advent of modern computing power, multidimensional CATs are becoming more popular (Reckase 2009). Multidimensional CATs (MCATs) are a useful alternative to administering multiple unidimensional CATs in situations where the traits are correlated (Segall 1996) or when items simultaneously capture variation in multiple traits (i.e., have "cross-loadings" in the vernacular of linear factor analysis; Mulaik 2010). Correlations between latent traits provide additional information about the locations of auxiliary traits, and in turn help to improve the overall measurement precision between the trait estimates. Due to the increase in statistical information, MCATs will often require fewer items than independently administered unidimensional CATs to reach the same measurement precision (Mulder and Linden 2009).
Several important prerequisites are required before building interfaces to be used for MCATs. A cursory overview of these prerequisites includes:
• Obtaining a suitable item pool. The item pool (or bank) is a relatively large set of items that can be selected from during an MCAT application. The associated item parameters must have been calibrated for the population of interest beforehand using multidimensional IRT software (e.g., the mirt package in R; Chalmers 2012, or an equivalent). In situations where more than one population will be administered items from the pool, all items should exhibit little to no differential item functioning to ensure that the selection of items is unbiased (Chalmers, Counsell, and Flora 2016; Wainer and Dorans 2000).
• Initializing the MCAT session. Before an MCAT session can begin, initial latent trait estimates and hyper-parameter distribution definitions are generally required. The initial trait estimates often serve as the basis for selecting the initial item (if the initial item was not explicitly declared), while the hyper-parameter distributions are included as an added information component in the item selection process. The hyper-parameters are also included to add prior distributional information when updating the ability estimates throughout the MCAT session. When there is little to no prior information about the ability estimates, the starting values are generally selected to equal the mean of the latent trait distribution (often this is simply a vector of zeros).
• Selecting the next item to administer. Several criteria have been proposed for unidimensional CATs to select optimal items for ability and classification designs, many of which have been implemented in unidimensional CAT software in R (e.g., see Magis and Raîche 2012). Fewer MCAT criteria have been proposed in the literature, though a small number of criteria are available. MCAT selection methods include the determinant rule (D-rule), trace of the information or asymptotic covariance matrix (T-rule and A-rule, respectively), weighted composite rule (W-rule), eigenvalue rule (E-rule), and the Kullback-Leibler divergence criteria (Kullback and Leibler 1951). Due to their importance in MCAT applications, these criteria are explained in more detail below.
• (Optional) Selecting a pre-CAT design. Because MCAT estimation methods are based on responses to previous items, it can be desirable to run a "pre-CAT" stage before beginning the actual MCAT. In the pre-CAT stage, a small selection of items is administered under more controlled settings to ensure that, during the MCAT stage, the methods have enough information to be properly executed.
• Selecting the IRT scoring method. Multiple criteria have been defined for obtaining provisional trait estimates. These criteria include: maximum-likelihood (ML) estimation, evaluating the expected or maximum values of the a posteriori distribution, weighted likelihood estimation, and several others (Bock and Aitkin 1981; Warm 1989). However, ML estimation requires special care because it cannot be used if responses are at the extreme ends of the categories (i.e., all-correct, all-incorrect). One possible solution to this issue is to use Bayesian methods (such as maximum a posteriori estimation) until a sufficient amount of variability in the responses is available for proper ML estimation. Another potential solution when selecting the ML algorithm is to include a pre-MCAT stage to collect responses until suitable ML estimates can be obtained.
• Terminating the application. Deciding how to terminate an MCAT session is important for many practical reasons. MCATs may be terminated according to multiple criteria in a single session. For example, terminating a test based on the standard error of measurement is desirable if inferences about the precision of each latent trait estimate are required, though for multidimensional models the choice of whether this criterion should be applied globally or specifically for each latent trait must be specified. Tests may also be terminated after a specific number of items have been administered, when the time allotted for answering the test has expired, when the latent traits can be classified as above or below a set of predetermined latent cutoff values (Eggen 1999), and so on.
Much of the superficial information listed above is also important for unidimensional CAT applications. Conversely, literature relevant to unidimensional CATs will largely be relevant for MCATs because they share the same underlying methodology. Therefore, additional information regarding MCAT methodology can largely be obtained from previous CAT publications, such as Magis and Raîche (2012) and the references therein.
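The prerequisites listed above can be tied together in a short simulation loop. The following is only a minimal sketch, written in Python purely for illustration and restricted to a unidimensional two-parameter logistic model with a standard normal prior, EAP scoring on a fixed grid, maximum-information item selection, and a standard-error stopping rule. The item pool is randomly generated, and every name (`p2pl`, `eap`, `run_cat`, `se_stop`) is hypothetical; none of this is mirtCAT code.

```python
import math
import random

def p2pl(theta, a, d):
    """Probability of a correct response under a 2PL item (slope a, intercept d)."""
    return 1.0 / (1.0 + math.exp(-(a * theta + d)))

def eap(responses, grid):
    """EAP estimate and posterior SD under a standard normal prior."""
    weights = []
    for t in grid:
        log_post = -0.5 * t * t  # log of the N(0, 1) kernel
        for (a, d), y in responses:
            p = p2pl(t, a, d)
            log_post += math.log(p if y == 1 else 1.0 - p)
        weights.append(math.exp(log_post))
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

def run_cat(pool, true_theta, rng, se_stop=0.4, max_items=20):
    """Simulate one CAT session; returns (estimate, standard error, test length)."""
    grid = [-4 + 0.1 * i for i in range(81)]
    remaining = dict(pool)
    responses = []
    theta_hat, se = 0.0, 1.0  # initialize at the prior mean and prior SD
    while remaining and len(responses) < max_items and se > se_stop:
        def info(name):
            # Fisher information of a 2PL item at the provisional estimate
            a, d = remaining[name]
            p = p2pl(theta_hat, a, d)
            return a * a * p * (1.0 - p)
        name = max(remaining, key=info)
        a, d = remaining.pop(name)
        y = 1 if rng.random() < p2pl(true_theta, a, d) else 0  # simulated answer
        responses.append(((a, d), y))
        theta_hat, se = eap(responses, grid)  # update the provisional estimate
    return theta_hat, se, len(responses)

rng = random.Random(1)
pool = {f"item{i}": (rng.uniform(0.8, 2.0), rng.uniform(-2.0, 2.0)) for i in range(50)}
theta_hat, se, n_used = run_cat(pool, true_theta=0.5, rng=rng)
print(theta_hat, se, n_used)
```

The session ends as soon as the posterior standard deviation drops to `se_stop` or the maximum test length is reached, whichever comes first.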
A small number of R packages exist for studying CAT designs through Monte Carlo simulations, including catR (Magis and Raîche 2012) and catIrt (Nydick 2014), which exclusively focus on unidimensional IRT models, and MAT (Choi and King 2014), which exclusively investigates the properties of the multidimensional three-parameter logistic model (M3PL). Hitherto, these packages have provided useful simulation tools for Monte Carlo research of CAT design combinations with homogeneous IRT models; however, they have not been organized for real-time implementation of CATs, do not provide resources to build graphical user interfaces (GUIs), exclusively support either unidimensional or multidimensional CATs, and do not support mixing different classes of IRT models into CAT designs.
As more R packages are developed for studying unidimensional and multidimensional CATs, a number of pertinent features remain missing. The mirtCAT package described in this article has been designed to address many of these missing features in the R environment. Specifically, mirtCAT provides front-end users with functions for generating CAT GUIs to be used in their research applications, and includes several tools for investigating the statistical properties of heterogeneous CAT designs by way of Monte Carlo simulations. The remainder of this article describes the theory behind applying MCATs, provides examples of how Monte Carlo simulation studies can be organized with the code from the package, and demonstrates how real-time unidimensional and multidimensional CATs, as well as standard questionnaire designs, can be generated to collect item response data.

Multidimensional computerized adaptive testing
A number of multidimensional IRT models have been proposed for dichotomous and polytomous response data. For ease of presentation, we will only focus on the multidimensional four-parameter logistic model (M4PL) for dichotomous responses (coded as 0 and 1 for incorrect and correct answers, respectively), which is an extension of the multidimensional three-parameter logistic model (Reckase 2009), and the multidimensional nominal response model for polytomous items (Thissen, Cai, and Bock 2010). Multidimensional IRT models often contain a unidimensional counterpart as a special case when only one latent trait is modeled; therefore, the following theory relates to unidimensional IRT models as well.$^1$ The probability that a participant positively endorses the $j$-th dichotomous item with an M4PL structure is
$$P_j(y = 1|\theta) = g_j + \frac{u_j - g_j}{1 + \exp\left(-(a_j^\top \theta + d_j)\right)}, \tag{1}$$
where the complementary probability for answering the item incorrectly is $P_j(y = 0|\theta) = 1 - P_j(y = 1|\theta)$. The $g_j$ and $u_j$ parameters are restricted to be between 0 and 1 (where $g_j < u_j$), and serve to bound the probability space within $g_j \le P_j(y = 1|\theta) \le u_j$. The $g_j$ parameter is useful when there is a non-zero probability for participants to randomly guess an item correctly. The $u_j$ parameter, on the other hand, controls the probability that participants will carelessly answer an item incorrectly. The multidimensional two-parameter logistic model (M2PL) can be recovered from Equation 1 when $g_j$ and $u_j$ are fixed to 0 and 1, respectively, and the multidimensional three-parameter model (M3PL) is realized when fixing only $u_j$ to 1. Finally, $\theta$ is taken to be a $D$-dimensional vector of random ability or latent trait values, $d_j$ is the item intercept parameter representing the relative item "easiness", and $a_j$ is a vector of slope parameters that modulate how $\theta$ influences the probability function.
The multidimensional nominal response model (MRNM) can be used to model $K$ unordered polytomous response categories that are coded $k = 0, 1, \ldots, K - 1$. This model has the form
$$P_j(y = k|\theta) = \frac{\exp\left(\alpha_{jk}(a_j^\top \theta) + d_{jk}\right)}{\sum_{v=0}^{K-1} \exp\left(\alpha_{jv}(a_j^\top \theta) + d_{jv}\right)}, \tag{2}$$
where the $a_j$ and $\theta$ terms have the same interpretation as in Equation 1. Equation 2 contains unique intercept values ($d_{jk}$) and so-called "scoring" parameters ($\alpha_{jk}$) for each respective category. For identification purposes, the first elements of $d_{jk}$ and $\alpha_{jk}$ are often constrained to be equal to 0, while the last element of $\alpha_{jk}$ is constrained to be $K - 1$. The $\alpha_{jk}$ values represent the relative ordering of the categories; larger $\alpha_{jk}$ values indicate that the category has a stronger relationship with higher levels of $\theta$. When specific constraints are applied to Equation 2, various specialized IRT models can be recovered. For instance, when the scoring parameters are constrained to have equal interval spacing ($\alpha_j = 0, 1, 2, \ldots, K - 1$) the multidimensional generalized partial credit model (MGPCM) is realized, and when $K = 2$ the MRNM will become equivalent to the M2PL model.
$^1$ Although only two IRT models are presented below, in principle many other IRT models may be substituted in empirical applications.
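As a concrete illustration of the two response functions just described, the following minimal Python sketch evaluates an M4PL item and a nominal item at a two-dimensional trait vector. The function names and all parameter values are made up for the example; with $g = 0$ and $u = 1$ the first function reduces to the M2PL, and with $K = 2$, scoring values (0, 1), and a first intercept of 0, the nominal model returns the same probability.

```python
import math

def m4pl_prob(theta, a, d, g=0.0, u=1.0):
    """P(y = 1 | theta) under the M4PL: g + (u - g) / (1 + exp(-(a'theta + d)))."""
    z = sum(ak * tk for ak, tk in zip(a, theta)) + d
    return g + (u - g) / (1.0 + math.exp(-z))

def nominal_probs(theta, a, alpha, d):
    """Category probabilities under the nominal response model."""
    z = sum(ak * tk for ak, tk in zip(a, theta))
    logits = [al * z + dk for al, dk in zip(alpha, d)]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

theta = [0.5, -1.0]   # hypothetical trait values
a = [1.2, 0.8]        # hypothetical slopes

p_m2pl = m4pl_prob(theta, a, d=0.3)                   # g = 0, u = 1
p_m4pl = m4pl_prob(theta, a, d=0.3, g=0.2, u=0.95)    # bounded in [0.2, 0.95]
p_nom = nominal_probs(theta, a, alpha=[0, 1], d=[0.0, 0.3])  # K = 2

print(p_m2pl, p_m4pl, p_nom)
```

Subtracting the largest logit before exponentiating does not change the probabilities; it only avoids overflow for extreme trait values.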

Predicting latent trait scores
After item responses have been collected, various estimates for $\theta$ can be computed. The $\hat{\theta}$ estimates are obtained using the observed item responses, the item trace-line functions given their respective item parameters ($\psi_j$), and (potentially) prior distributional information about $\theta$. Multiple methods exist for obtaining $\hat{\theta}$ values, such as weighted and unweighted maximum-likelihood estimation (WLE and ML, respectively; Bock and Aitkin 1981; Warm 1989), Bayesian methods such as the expectation or maximum of the posterior distribution (EAP and MAP, respectively; Bock and Aitkin 1981), and several others which have seen less use in applied settings (e.g., EAP for sum scores; Thissen, Pommerich, Billeaud, and Williams 1995). ML estimation of $\theta$ for a given response pattern requires optimizing the likelihood function
$$L(y|\theta, \psi) = \prod_{j=1}^{J} \prod_{k=0}^{K_j - 1} P_j(y = k|\theta, \psi_j)^{\chi_{jk}}, \tag{3}$$
where $\chi_{jk}$ is a dichotomous indicator variable (coded as 0 or 1) used to select the probability terms corresponding to the endorsed categories. In practice, however, it is generally more effective to use the log of Equation 3,
$$\log L(y|\theta, \psi) = \sum_{j=1}^{J} \sum_{k=0}^{K_j - 1} \chi_{jk} \log P_j(y = k|\theta, \psi_j). \tag{4}$$
Optimizing the log-likelihood directly results in ML estimates; however, obtaining a possible maximum requires that $y$ contain a mix of 0 to $K_j - 1$ responses across the $J$ items. If there is no variability in the response vector, such that $SD(y) = 0$, then $\hat{\theta}$ will tend to $-\infty$ or $\infty$ during optimization. Bayesian methods generally do not suffer from this particular limitation because they include additional information about the distribution of $\theta$ through a prior density function, $\phi(\cdot)$, with hyper-parameters, $\eta$. The posterior function utilized in Bayesian prediction methods is
$$\pi(\theta|y, \psi, \eta) = \frac{L(y|\theta, \psi)\,\phi(\theta|\eta)}{\int L(y|\theta, \psi)\,\phi(\theta|\eta)\,d\theta}, \tag{5}$$
where Equation 5 is either integrated across $\theta$ to find the EAP estimates or maximized to find the MAP estimates. In multidimensional IRT applications, the prior density function is typically assumed to be from the multivariate Gaussian family with mean vector $\mu_\theta$ and covariance matrix $\Sigma_\theta$; however, other multivariate density functions are possible.
Following the computation of $\hat{\theta}$, a measure of variability is required to make inferences about the statistical precision of the estimates. As is the case with standard ML estimation theory, computing a quadratic approximation of the curvature in Equation 4 provides a suitable measure of the parameter variability (Fisher 1925). The computation of $VAR(\hat{\theta})$ is determined by inverting the $D \times D$ matrix of second derivatives with respect to $\theta$ (also known as the Hessian or negative of the observed information matrix),
$$\Sigma(\hat{\theta}|y, \psi) = -\left[\frac{\partial^2 \log L(y|\theta, \psi)}{\partial\theta\,\partial\theta^\top}\right]^{-1}. \tag{6}$$
Standard error estimates for each element in $\hat{\theta}$ are then obtained by taking the square-root of each diagonal element in the asymptotic covariance matrix, $\Sigma(\hat{\theta}|y, \psi)$. If the log of Equation 5 is used instead of Equation 4 then prior information about $\theta$ will also be included in the computation of the Hessian. Due to the added statistical information in Bayesian methods, $\Sigma(\hat{\theta}|y, \psi, \eta)$ will generally provide slightly smaller standard errors when an informative prior distribution is included in the computations.$^2$
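The contrast between ML and Bayesian scoring for response patterns without variability can be seen in a small numerical sketch. The Python code below is an illustration only: it assumes a unidimensional two-parameter logistic model, a normal prior, and a fixed quadrature grid, and uses the posterior standard deviation as the precision measure for the EAP estimate. The item parameters and function names are hypothetical.

```python
import math

def p2pl(theta, a, d):
    """Probability of a correct response under a 2PL item."""
    return 1.0 / (1.0 + math.exp(-(a * theta + d)))

def log_lik(theta, items, y):
    """Log-likelihood of a dichotomous response pattern at a given theta."""
    total = 0.0
    for (a, d), yj in zip(items, y):
        p = p2pl(theta, a, d)
        total += math.log(p) if yj == 1 else math.log(1.0 - p)
    return total

def eap(items, y, mu=0.0, sd=1.0, n=121, lo=-6.0, hi=6.0):
    """EAP estimate and posterior SD using a normal prior on an evenly spaced grid."""
    grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    log_post = [log_lik(t, items, y) - 0.5 * ((t - mu) / sd) ** 2 for t in grid]
    top = max(log_post)
    w = [math.exp(lp - top) for lp in log_post]
    total = sum(w)
    mean = sum(t * wi for t, wi in zip(grid, w)) / total
    var = sum((t - mean) ** 2 * wi for t, wi in zip(grid, w)) / total
    return mean, math.sqrt(var)

items = [(1.0, 0.0), (1.5, -0.5), (0.8, 0.5)]  # hypothetical (slope, intercept) pairs
y = [1, 1, 1]                                  # all-correct pattern, SD(y) = 0

# The log-likelihood keeps increasing with theta, so no finite ML estimate exists
print([round(log_lik(t, items, y), 4) for t in (0.0, 3.0, 6.0)])

# The prior keeps the EAP estimate finite
est, se = eap(items, y)
print(est, se)
```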

Item selection for MIRT models
Selecting optimal items in MCATs is generally more complicated than in unidimensional CATs because items should only be selected if they effectively improve the precision of multiple traits. In this section, we focus on item selection methods that are tailored towards obtaining the lowest $SE(\hat{\theta})$ for all individuals sampled. An interesting area that is not investigated in this section is when items are selected so as to optimally classify individuals above or below predefined cutoff values. Although various methods exist for unidimensional models, classification-based applications for MCATs have rarely been investigated in the literature and continue to be an important area for future research (Reckase 2009).
Selecting items according to the maximum information principle (Lord 1980) requires evaluating the Fisher-information matrix for each remaining item in the pool. The Fisher information is defined as
$$F(\theta) = -E\left[\frac{\partial^2 \log L(y|\theta, \psi)}{\partial\theta\,\partial\theta^\top}\right], \tag{7}$$
where the inverse of Equation 7 serves as another suitable measure to approximate the sampling variability of $\hat{\theta}$. For notational clarity, the vector of parameters $\psi$ is omitted from the following presentation because the item parameters are assumed to be constant. $F(\theta)$ is useful in MCAT applications because it contains no reference to the observed response patterns, and therefore can be used to predict the amount of expected information contributed by items that have not been administered. However, in MCAT applications $\theta$ is not known beforehand; therefore, provisional estimates ($\hat{\theta}$) are used instead as plausible stand-ins for $\theta$. The $\hat{\theta}$ values are continually updated throughout the MCAT session to provide better approximations to the unobserved $\theta$ values. Because the precision about $\theta$ improves as more items are administered, the Fisher information criteria will in turn progressively select more suitable items for the unobserved latent trait values.
As outlined in work by Segall (1996), selecting the most informative item requires evaluating $F(\theta)$ for each of the $M$ remaining items in the item pool. Due to the local independence assumption in IRT (Lord 1980), the information contributed by the addition of the $m$-th item is additive, such that
$$F_{J+m}(\theta) = F_J(\theta) + F_m(\theta), \tag{8}$$
where $F_J(\theta)$ is the sum of the information matrices for the previously answered items. The matrix in Equation 8 is evaluated for the $m = 1, 2, \ldots, M$ remaining items in the pool. The $M$ information matrices are then compared by reducing the multidimensional information to suitable scalar values according to how the joint variability should be quantified. If prior distribution information is included in the selection process then the following formula can be used to compute a Bayesian variant of the expected information (Segall 1996)
$$F_{J+m}(\theta|\eta) = F_J(\theta) + F_m(\theta) - \frac{\partial^2 \log \phi(\theta|\eta)}{\partial\theta\,\partial\theta^\top}. \tag{9}$$
In the situation where a multivariate normal prior distribution is included, Equation 9 can be expressed as
$$F_{J+m}(\theta|\mu_\theta, \Sigma_\theta) = F_J(\theta) + F_m(\theta) + \Sigma_\theta^{-1}. \tag{10}$$
One potentially optimal approach to quantify the amount of joint item information in $F_{J+m}(\theta)$ is to select the item which provides the largest matrix determinant. The item with the maximum determinant indicates which item provides the largest increase in the volume of $F_J(\theta)$; consequently, this selection property will maximally decrease the overall volume in $\Sigma_J(\theta)$ as well, where $\Sigma_J(\theta) = F_J^{-1}(\theta)$ (Segall 1996). This criterion is called the "D-rule", and is more formally expressed as
$$\text{D-rule} = \max\left(|F_{J+1}(\theta)|, |F_{J+2}(\theta)|, \ldots, |F_{J+M}(\theta)|\right). \tag{11}$$
Another potentially useful criterion for selecting items is the maximum trace of $F_{J+m}(\theta)$,
$$\text{T-rule} = \max\left(\mathrm{Tr}(F_{J+1}(\theta)), \mathrm{Tr}(F_{J+2}(\theta)), \ldots, \mathrm{Tr}(F_{J+M}(\theta))\right). \tag{12}$$
While the T-rule does not guarantee the largest reduction in volume for $\Sigma_J(\theta)$, it does select items which increase the average unweighted information about the latent traits, and also allows for unequal domain score weights to be applied if certain latent traits are deemed to be more important a priori. Applying weights to the T-rule helps to measure important traits more accurately because items will be selected with greater frequency if they measure the traits of interest. A closely related selection criterion to the T-rule is the asymptotic covariance rule, or A-rule (Mulder and Linden 2009), which selects items based on the minimum (potentially weighted) trace of $\Sigma_{J+m}(\theta)$,
$$\text{A-rule} = \min\left(\mathrm{Tr}(\Sigma_{J+1}(\theta)), \mathrm{Tr}(\Sigma_{J+2}(\theta)), \ldots, \mathrm{Tr}(\Sigma_{J+M}(\theta))\right). \tag{13}$$
Much like the T-rule, the A-rule does not guarantee the maximum increase in the volume of the information matrix. Instead, the A-rule attempts to reduce the marginal expected standard error for each $\hat{\theta}$ by ignoring the covariation between traits. Next, the eigenvalue rule (E-rule) has been proposed to select the item which minimizes the general variance of the ability estimates by selecting the smallest possible value from the set of eigenvalues in each $\Sigma_{J+m}(\theta)$. However, the E-rule may not optimally select items in a way that maximally reduces the standard error of measurement for all latent traits, and in general is not recommended for routine use (Mulder and Linden 2009). Finally, the W-rule can be used to select the maximum of the weighted information criteria, $W = w^\top F_{J+m}(\theta) w$, where $w$ is a weight vector subject to the constraint $1^\top w \equiv 1$ (Linden 1999). As with the optional weights for the T-rule and A-rule, the W-rule is an effective selection mechanism when specific latent traits should have lower measurement error than other traits.
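To make the D-, T-, and A-rules concrete, the sketch below compares them on a toy two-dimensional pool. It is written in Python for illustration and assumes M2PL items, for which the item information matrix at $\theta$ is $P(1 - P)\,a a^\top$; the accumulated matrix `F_J`, the three candidate items, and all function names are invented for the example.

```python
import math

def m2pl_info(theta, a, d):
    """2 x 2 Fisher information of an M2PL item: P(1 - P) * a a'."""
    z = a[0] * theta[0] + a[1] * theta[1] + d
    p = 1.0 / (1.0 + math.exp(-z))
    w = p * (1.0 - p)
    return [[w * a[0] * a[0], w * a[0] * a[1]],
            [w * a[1] * a[0], w * a[1] * a[1]]]

def mat_add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

def det(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inv(A):
    dt = det(A)
    return [[A[1][1] / dt, -A[0][1] / dt], [-A[1][0] / dt, A[0][0] / dt]]

def select_item(F_J, pool, theta, rule="D", weights=(1.0, 1.0)):
    """Return the name of the pool item preferred by the D-, T-, or A-rule."""
    scores = {}
    for name, (a, d) in pool.items():
        F = mat_add(F_J, m2pl_info(theta, a, d))  # information after adding the item
        if rule == "D":
            scores[name] = det(F)
        elif rule == "T":
            scores[name] = weights[0] * F[0][0] + weights[1] * F[1][1]
        elif rule == "A":
            S = inv(F)
            # negate the weighted trace so that "larger is better" holds for every rule
            scores[name] = -(weights[0] * S[0][0] + weights[1] * S[1][1])
        else:
            raise ValueError("rule must be 'D', 'T', or 'A'")
    return max(scores, key=scores.get)

theta_hat = (0.0, 0.0)
F_J = [[2.0, 0.2], [0.2, 0.5]]   # second trait is measured poorly so far
pool = {
    "item1": ((1.5, 0.0), 0.0),  # loads on the first trait only
    "item2": ((0.0, 1.2), 0.0),  # loads on the second trait only
    "item3": ((0.9, 0.9), 0.0),  # loads on both traits
}
picks = {rule: select_item(F_J, pool, theta_hat, rule) for rule in ("D", "T", "A")}
print(picks)
```

With these made-up values the D-rule and A-rule both favor the item that informs the poorly measured second trait, whereas the unweighted T-rule favors the item with the largest slope regardless of which trait it measures.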
An alternative approach to selecting items using the Fisher information given provisional $\hat{\theta}$ values is the Kullback-Leibler information (Chang and Ying 1996). This approach has the potential benefit over traditional information-based methods in that it can account for uncertainty of the $\hat{\theta}$ values when only a small number of items have been administered (Chang and Ying 1996). The Kullback-Leibler information is
$$KL(\theta\|\theta_0) = E_{\theta_0}\left[\log \frac{L(y|\theta_0, \psi)}{L(y|\theta, \psi)}\right], \tag{14}$$
where $\theta_0$ is the vector of true ability parameters, and the double bars in $KL(\theta\|\theta_0)$ signify that $\theta$ and $\theta_0$ should be treated distinctly. Chang and Ying (1996) suggested that Equation 14 should be evaluated over the range $\hat{\theta} \pm \Delta_n$, where $\Delta_n$ may decrease by a factor of $\sqrt{n}$ as the number of items administered increases. A numerical integration approach is also possible for the Kullback-Leibler information, though for multidimensional IRT models this may be less useful because of the (often cumbersome) numerical evaluation of the integration grid across all latent trait dimensions (see Reckase 2009, p. 335, for a similar observation about evaluating the KL information in MCATs).
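For a single dichotomous item the expectation in the Kullback-Leibler information has a closed form, which the following Python sketch evaluates and then integrates over $\hat{\theta} \pm \Delta_n$ with the trapezoid rule. The sketch is unidimensional and uses the provisional estimate in place of the unknown $\theta_0$; the constant `c` in $\Delta_n = c / \sqrt{n}$, the number of integration steps, and the function names are assumptions made for the example.

```python
import math

def p2pl(theta, a, d):
    """Probability of a correct response under a 2PL item."""
    return 1.0 / (1.0 + math.exp(-(a * theta + d)))

def kl_item(theta, theta0, a, d):
    """KL(theta || theta0) for one dichotomous item, taking the expectation at theta0."""
    p0 = p2pl(theta0, a, d)
    p = p2pl(theta, a, d)
    return p0 * math.log(p0 / p) + (1.0 - p0) * math.log((1.0 - p0) / (1.0 - p))

def kl_index(theta_hat, a, d, n_administered, c=3.0, steps=60):
    """Trapezoid integral of the item KL information over theta_hat +/- c / sqrt(n)."""
    delta = c / math.sqrt(max(1, n_administered))
    h = 2.0 * delta / steps
    xs = [theta_hat - delta + i * h for i in range(steps + 1)]
    vals = [kl_item(x, theta_hat, a, d) for x in xs]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

# Steeper items discriminate better between theta_hat and nearby trait values
print(kl_index(0.0, a=1.0, d=0.0, n_administered=1))
print(kl_index(0.0, a=2.0, d=0.0, n_administered=1))

# The integration window narrows as more items are administered
print(kl_index(0.0, a=1.0, d=0.0, n_administered=9))
```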

Exposure control and content balancing
An unfortunate consequence when maintaining item pools is that informative items are often selected with greater frequency than less informative items. Exposing a smaller selection of items too often may lead to item security issues, loss of investments for items that are rarely selected, or in some cases may cause a decrease in content validity coverage due to reduced item sampling variability. Selection methods can be adjusted by including "exposure control" methods to help avoid overusing highly informative items (Linden and Glas 2010). Several methods of exposure control exist, though perhaps the most intuitive approach is the method proposed by McBride and Martin (1983). In their method, McBride and Martin suggest sampling from the n most optimal items (given the selection criteria) rather than simply selecting the most optimal item, and further recommend gradually reducing n as the examinee progresses through the test (the so-called 5-4-3-2-1 approach). This helps generate item variability in earlier stages of the test where item overexposure is more likely to occur.
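The 5-4-3-2-1 idea can be sketched in a few lines of Python. The function below is an illustration under simple assumptions: the selection criterion has already been computed for every remaining item, larger values are better, and the candidate set shrinks by one item per position until only the best item is eligible. The function name and the criterion values are hypothetical.

```python
import random

def randomesque_pick(criteria, position, rng, start=5):
    """Sample one item from the n best candidates, with n = start, start - 1, ..., 1.

    criteria: dict mapping item name to its selection criterion (larger is better)
    position: 1-based position of the item about to be administered
    """
    n = max(1, start - (position - 1))
    ranked = sorted(criteria, key=criteria.get, reverse=True)
    return rng.choice(ranked[:n])

rng = random.Random(0)
criteria = {"i1": 0.90, "i2": 0.85, "i3": 0.70, "i4": 0.65, "i5": 0.40, "i6": 0.10}

first = randomesque_pick(criteria, position=1, rng=rng)  # drawn from the top five
fifth = randomesque_pick(criteria, position=5, rng=rng)  # only the best item remains eligible
print(first, fifth)
```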
Simulation-based item exposure methods, such as the Sympson-Hetter (SH) approach (e.g., Veldkamp and Linden 2008), provide a different approach to controlling item overexposure. The SH method requires items to be pre-assigned a fixed value ranging from 0 to 1, and during the CAT session a simulation experiment is performed to determine whether or not a selected item is to be administered. For instance, after an optimal item is determined from the item pool, a random uniform value (r) is then drawn and compared to the item's assigned SH value. If r is less than the assigned SH value then the item is administered, otherwise it is discarded from the item pool and the next most optimal item undergoes the same simulation experiment; this process continues until an item is selected and administered.
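The administration step of the Sympson-Hetter approach described above can be written as a short loop. This Python sketch assumes that the selection criterion and the SH values are already available for every remaining item; how the SH values are calibrated beforehand is not shown. The function name and the numbers are made up for illustration.

```python
import random

def sympson_hetter_pick(criteria, sh_values, rng):
    """Walk down the ranked items until one passes its exposure experiment.

    Returns the administered item (or None if every item fails) together with
    the items that were discarded along the way.
    """
    ranked = sorted(criteria, key=criteria.get, reverse=True)
    discarded = []
    for item in ranked:
        r = rng.random()            # uniform draw on [0, 1)
        if r < sh_values[item]:
            return item, discarded  # administer this item
        discarded.append(item)      # otherwise remove it from the pool
    return None, discarded

criteria = {"i1": 0.9, "i2": 0.7, "i3": 0.4}
sh_values = {"i1": 0.0, "i2": 1.0, "i3": 1.0}  # i1 is never administered, i2 always is

item, discarded = sympson_hetter_pick(criteria, sh_values, random.Random(42))
print(item, discarded)
```

Setting the SH value of the most informative item to 0 is an extreme case chosen so that the outcome does not depend on the random draws.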
Unfortunately, it is not clear whether more advanced exposure control methods outperform simple heuristic methods in empirical settings (e.g., see Revuelta 1998). An undesirable side-effect when implementing exposure control methods is that there is a loss of item selection efficiency, and in most applications this loss of efficiency will require more items to be administered before termination criteria based on the standard error of measurement can be met. Using exposure control early in the MCAT session, where it often is most important, also generates uncertainty about selection methods that theoretically should perform well earlier in the CAT session (such as the Kullback-Leibler criteria).
Another area of interest when administering test items is the use of "content balancing" to ensure that specific types of item content appear during the CAT session. Content balancing generally involves the classification of items to predetermined groups, and these groups are assigned a proportion or percentage value relating to how often the content groups should appear in the CAT session. A simple yet effective method for content balancing, proposed by Kingsbury and Zara (1991), involves comparing the empirically obtained content proportions to the desired content proportions. After computing the selection criteria for each remaining item in the item pool, the proportions of previously administered items in each content domain are subtracted from the desired proportions for the respective domains. The content domain that has the largest difference between the desired and observed proportions is then selected, and the item with the most optimal selection criteria within the selected domain is then administered. This approach ensures that content balancing is efficiently achieved throughout the session while quasi-maintaining the optimal item selection criteria.
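The comparison of desired and observed content proportions can be sketched as follows. This Python example is illustrative only: it assumes that every item belongs to exactly one content domain, that the selection criterion is already computed for the remaining items (larger is better), and that domains with no remaining items are skipped. All names and values are hypothetical.

```python
def content_balanced_pick(criteria, content, target, administered):
    """Pick the best remaining item from the most under-represented content domain.

    criteria:     dict of remaining item name -> selection criterion (larger is better)
    content:      dict of item name -> content domain
    target:       dict of content domain -> desired proportion
    administered: list of item names already given
    """
    n = len(administered)
    gaps = {}
    for domain in sorted({content[i] for i in criteria}):  # domains with items left
        observed = sum(1 for i in administered if content[i] == domain) / n if n else 0.0
        gaps[domain] = target[domain] - observed
    chosen = max(gaps, key=gaps.get)
    candidates = {i: c for i, c in criteria.items() if content[i] == chosen}
    return max(candidates, key=candidates.get)

content = {"a1": "algebra", "a2": "algebra", "a3": "algebra",
           "g1": "geometry", "g2": "geometry", "g3": "geometry"}
target = {"algebra": 0.5, "geometry": 0.5}
administered = ["a1", "a2", "g1"]             # algebra is currently over-represented
criteria = {"a3": 0.9, "g2": 0.4, "g3": 0.6}  # a3 is the most informative item overall

next_item = content_balanced_pick(criteria, content, target, administered)
print(next_item)
```

Although `a3` has the largest criterion value, the geometry domain is furthest below its target, so the most informative geometry item is returned instead.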
Content balancing methods mainly have been studied in unidimensional CAT applications; however, they can be applied equally well to MCATs. Fortunately, MCAT designs can implicitly offer a suitable approach to balancing content domains. As Segall (1996) explained, an MCAT session primarily intended to measure one trait could be organized to form a bi-factor structure (e.g., Gibbons and Hedeker 1992; Gibbons et al. 2007) to achieve a content balancing effect. Within the bi-factor model, each content grouping can be organized as a specific latent trait, where item slopes only relate to the respective subsets of homogeneous content groupings. Given the bi-factor structure, items could then be selected using criteria that weight the selection process in favor of selecting items which measure the general trait, while also including information about the specific traits; this would result in a probabilistic selection mechanism for sampling the content domains indirectly with the item selection criteria. Of course, practitioners may still wish to include more traditional content balancing methods if other properties of the test should be controlled. In this case, multiple content balancing methods could be combined to form a balanced sampling design. For instance, in addition to selecting specific contents with the bi-factor design, a CAT session may also be organized to contain 80% multiple-choice questions, 10% reading comprehension questions, and 10% fill-in-the-blank questions, and these proportions can be controlled using Kingsbury and Zara's (1991) method.

Termination criteria
In the interest of time, item bank security, and avoiding fatigue effects, one or more stopping criteria should be included in MCATs. One reasonable approach to terminating the MCAT application is to require all $SE(\hat{\theta}_k) \le \delta$, where $\delta$ is a maximally tolerable standard error of measurement for all the latent traits. However, if some elements in $\hat{\theta}$ should be measured with more precision then unique $\delta_k$ values for each $\hat{\theta}_k$ value should be defined. For example, if the MCAT is organized to contain a bi-factor structure then the test developer may wish to terminate the session when only the primary trait reaches a predefined $\delta_k$ value. Unequal $\delta_k$ values should be used in conjunction with the W-rule, weighted T-rule, or weighted A-rule, so that items which accurately measure the traits of interest are selected with greater frequency.
Classification-based criteria also exist for terminating MCATs when cutoff values are supplied for each trait. In classification-based MCATs, the session may be terminated when the confidence intervals (given $1 - \alpha$) for each $\hat{\theta}$ do not contain the pre-specified cutoff values. When the $CI(\hat{\theta})$s do not contain the cutoff values then the individuals may be classified as above or below the cutoff values for the respective traits; otherwise, the MCAT results will suggest that not enough information exists to reject the plausibility of the cutoffs. More specific methods for terminating CATs are also possible (e.g., use of loss functions or risk analyses); however, these are not explored in this article.
Finally, termination criteria can be based on other practical considerations as well, such as setting the maximum number of items that can be administered in a given session, stopping the CAT after a specific amount of time has elapsed, stopping when the $\hat{\theta}$ values are changing very little as new items are added, and so on.
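Several of the stopping rules above can be combined in one check that is evaluated after each response. The Python sketch below is a hypothetical illustration: `deltas` holds one tolerance per trait (with `None` for traits that should not influence termination, as in the bi-factor example), `cutoffs` holds one classification cutoff per trait, and `z = 1.96` corresponds to two-sided 95% confidence intervals under a normal approximation. None of the names are taken from mirtCAT.

```python
def should_stop(theta_hat, se, n_items, deltas=None, cutoffs=None, z=1.96, max_items=None):
    """Return (stop, reason) after evaluating the termination rules in order."""
    # Rule 1: maximum test length
    if max_items is not None and n_items >= max_items:
        return True, "max_items"
    # Rule 2: every monitored trait has SE(theta_k) <= delta_k
    if deltas is not None:
        checks = [s <= d for s, d in zip(se, deltas) if d is not None]
        if checks and all(checks):
            return True, "standard_error"
    # Rule 3: no confidence interval theta_k +/- z * SE_k contains its cutoff
    if cutoffs is not None:
        outside = [abs(t - c) > z * s for t, s, c in zip(theta_hat, se, cutoffs)]
        if all(outside):
            return True, "classification"
    return False, "continue"

# Only the first (primary) trait must reach SE <= 0.3
print(should_stop([0.2, -0.1], [0.25, 0.60], 12, deltas=[0.3, None]))

# Both traits must reach SE <= 0.3, so the session continues
print(should_stop([0.2, -0.1], [0.25, 0.60], 12, deltas=[0.3, 0.3]))

# Both confidence intervals exclude the cutoff of 0
print(should_stop([1.0, -1.5], [0.30, 0.40], 12, cutoffs=[0.0, 0.0]))
```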

The mirtCAT package
The mirtCAT package (available from the Comprehensive R Archive Network at https://CRAN.R-project.org/package=mirtCAT) provides tools for test developers to generate GUIs for CATs as well as functions for Monte Carlo simulation studies involving CAT designs. mirtCAT uses the HTML generating tools available in the shiny package (RStudio Inc. 2014) to generate real-time CATs for interactive sessions within standard web-browsing software. The mirtCAT package builds and extends upon the estimation tools available in the mirt package (Chalmers 2012), and provides a wide range of support for any mixture of unidimensional and multidimensional IRT models to be used for CATs.
Currently supported models in mirtCAT include unidimensional and multidimensional versions of the 4PL model (Lord 1980), the graded response model and its rating scale counterpart (Muraki 1992; Muraki and Carlson 1995; Samejima 1969), the generalized partial credit and nominal models (Thissen et al. 2010), the partially compensatory model (Chalmers and Flora 2014; Sympson 1977), the nested-logit model (Suh and Bolt 2010), the ideal-point model (Maydeu-Olivares, Hernández, and McDonald 2006), and polynomial or product constructed latent trait combinations (Bock and Aitkin 1981). Additionally, the mirt package supports the use of prior parameter distributions, linear and non-linear parameter constraints (e.g., see Chalmers 2015), and specification of fixed parameter values; hence, nested versions of the previously mentioned models can be estimated from empirical data. For instance, the 1PL model (Thissen 1982) can be formed in mirt because it is a highly nested version of the M4PL model with equality constraints for the slope parameters.

A simple non-adaptive GUI example
The mirtCAT package generates interactive GUIs to run CATs by providing inputs to the mirtCAT() function. When generating a GUI, the mirtCAT() function requires a data.frame object containing questions, response options, and output types. When only a data.frame is supplied, a non-adaptive test (i.e., a survey) will be initialized because no IRT parameters were defined for selecting the items adaptively.
Currently, the required input names in the data.frame object are:
• Question - A character vector containing all the question stems.
• Option.# - Possible response options for each item, where # corresponds to the specific category. For instance, a test with 4 unique response options for each item would contain the columns "Option.1", "Option.2", "Option.3", and "Option.4". If some items have fewer categories than others then NA placeholders must be used to omit the unused options.
• Type - Indicates the type of response input to use from the shiny package. The supported types are: "radio" for radio buttons, "select" for a pull-down box for selecting inputs, "text" for requiring typed user input, "checkbox" for collecting multiple checkboxes of responses for each item, "slider" for slider-style inputs, and "none" when only an item stem should be presented.
• Answer or Answer.# - (Optional) A character vector (or multiple character vectors) indicating the scoring key for items that have correct answers. If there is no correct answer for a question then NA values must be specified as placeholders. When a checkbox type is used with this input then responses are scored according to how many matches were selected.
• Stem - (Optional) A character vector of absolute or relative paths pointing to external markdown (.md) or HTML (.html) files, which can be used as item stems. NAs are used if the item has no corresponding file.
• ... - Additional optional argument inputs that are passed to the associated shiny construction functions. For the "slider" input, however, columns for the "min", "max", and "step" arguments must be defined.
When generating surveys with mirtCAT, only the Type, Question, and Option inputs are typically required. If questions are to be scored in real time (as they generally are in ability measuring CATs) then a suitable Answer vector must be supplied. Finally, if specific graphical stimuli should be included then the paths pointing to the item-stem files must be included in the Stem input.
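To make the column conventions concrete, the following base R sketch builds a small data.frame in which one item has fewer response options than the others and one item has no scoring key. The three items and their contents are invented for illustration; only the column names follow the list above.

```r
# Three hypothetical items; the second has only three response options,
# so its unused Option.4 slot is filled with an NA placeholder.
df <- data.frame(
  Question = c("What is 2 + 2?", "What is 3 x 3?", "I enjoy mathematics."),
  Option.1 = c("3", "6", "Disagree"),
  Option.2 = c("4", "9", "Neutral"),
  Option.3 = c("5", "12", "Agree"),
  Option.4 = c("6", NA, "Strongly Agree"),
  Type     = c("radio", "select", "radio"),
  Answer   = c("4", "9", NA),  # NA: the third item has no correct answer
  stringsAsFactors = FALSE
)
str(df)
```

The block only constructs the input object; it does not launch an interface. Whether a given mix of scored and unscored items is appropriate depends on the test design.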
Before generating a real-time CAT GUI, it is informative to first generate a simple survey to highlight the fundamental mirtCAT() inputs. For example, say that a researcher wishes to build a small survey with only three rating scale items, where each item contains five rating-scale options ranging from "Strongly Disagree" to "Strongly Agree". Initializing the HTML interface to collect responses can be accomplished with the following code:

R> library("mirtCAT")
R> options <- matrix(c("Strongly Disagree", "Disagree", "Neutral",
+    "Agree", "Strongly Agree"), nrow = 3, ncol = 5, byrow = TRUE)
R> questions <- c("Building CATs with mirtCAT is difficult.",
+    "mirtCAT requires a substantial amount of coding.",
+    "I would use mirtCAT in my research.")
R> df <- data.frame(Question = questions, Option = options, Type = "radio")
R> results <- mirtCAT(df = df)

When defining the Option input in a data.frame, unique names are automatically generated and create the suitable labels Option.1, Option.2, Option.3, Option.4, and Option.5; this is how the data.frame() function manages ambiguous labels by default. After calling the mirtCAT() function, an interface is generated and embedded within the operating system's default web browser. Figure 1 depicts screen-captures of the default GUI for the initial page and for the first prompted item page.
After the GUI has been generated, the survey will continue to administer each of the defined items until the item bank has been exhausted. Upon completion, the interface will disconnect from the web browser and the R session that was previously suspended for the duration of the GUI will become active again. If the session was assigned to an object, as it was above, the R workspace will contain the saved responses from the GUI. The next section elaborates on the structure of the mirtCAT package in much more detail, and demonstrates the flexibility of building tailored interfaces for various MCAT designs.

mirtCAT package details
The mirtCAT() function has inputs broken into design, GUI, item selection, and implementation criteria by supplying the following inputs:

mirtCAT(df, mo, method = "MAP", criteria = "seq", start_item = 1,
  local_pattern = NULL, design_elements = FALSE, cl = NULL,
  design = list(), shinyGUI = list(), preCAT = list(), ...)

A selection of important arguments for mirtCAT() is displayed in Table 1. Three list arguments (preCAT, design, and shinyGUI) are omitted from this table because they have special design and structural properties for the GUI and CAT design. These list objects define various characteristics about the test, persons, or CAT design, and modify elements of the GUI generated with the shiny package. Tables 2 and 3 describe the possible inputs for these lists at a superficial level, and more detailed descriptions are available in the package help files.

Table 1:
Argument | Description | Possible inputs
df | Named data.frame containing questions, options, answers, output types, and graphical stem locations. | data.frame or missing.
mo | Single group object defined by the mirt package. This is required if the test is to be scored. | Object of class SingleGroupClass or missing.
method | Character string for selecting the method of predicting θ̂ values. | Numeric scalar or character string.
local_pattern | If df was supplied, a character matrix (with one row) specifying how a participant responded to the test, otherwise an integer matrix (with potentially more than one row). If supplied, the GUI will not be generated and instead the MCAT results will be evaluated off-line. | Character or integer matrix.
design_elements | A logical value indicating whether a list containing the design objects should be returned. |
cl | An optional cluster object defined with the parallel package. Used for simulation designs that should be run in parallel. |
The mo input contains the IRT models with their associated parameters, where the required 'mo' object can be defined using multiple approaches. For instance, if test constructors have a suitable dataset that can be used to calibrate the IRT parameters directly then obtaining an estimated model from the mirt() function is one possible and straightforward approach. Following convergence of the item parameters estimated from mirt, the model object may be readily passed to the mo input to define the CAT parameters for real-time GUIs or Monte Carlo simulation work.
If, on the other hand, calibration data are not available, but population parameters are known a priori (as is the case for Monte Carlo simulation designs), then a suitable matrix or data.frame can be passed to the generate.mirt_object() function in mirtCAT to create a suitable 'mo' object. The column names in the matrix input correspond to the item parameter names used in mirt, which can be located in the package's help files (see help("mirt") for details).

List inputs (from Tables 2 and 3):
… | … Default is 0 (disabled).
criteria | Item selection criteria (see Table 1). Default is "random".
method | θ̂ prediction method (see Table 1 …
… | Integration range (if required). Default is c(-6,6).
weights | Weights to use with weighted selection criteria such as criteria = "Wrule" (also applies to the T-rule and A-rule). Default uses equal weights for each trait.
KL_delta | ∆ parameter required for specifying the evaluation range of the Kullback-Leibler criteria.
content | An optional character vector indicating the type of content measured by an item.
content_prop | An optional named numeric vector indicating the distribution of item content proportions. A content vector must also be supplied to indicate the item membership.
classify | A numeric vector indicating cut-off values for classification above or below some prior thresholds.
classify_CI | A numeric vector (between 0 and 1) indicating the confidence intervals used to classify individuals as above or below values in classify.
exposure | Can be two types of vectors. If a numeric vector between 0 and 1 then the Sympson-Hetter exposure control is used, otherwise if an integer vector where all values are greater than or equal to 1 the optimal selection criteria are sampled from given the supplied sequence.
constraints | A named list specifying various item selection constraints useful to control how items are administered or scored.
customNextItem | A function to be used when the selection of the items should be completely customized instead of using the fixed selection methods provided. For further details refer to the documentation for the findNextItem function.

The default hyper-parameters in generate.mirt_object() are assumed to be from a standard multivariate normal distribution; however, these defaults may be overwritten (as they were above).
Several methods for predicting θ̂ scores are available through the fscores() function in mirt by supplying an appropriate character vector to the method argument. Namely, the estimation method can be selected to fix estimates at their previous values, estimate traits using ML, MAP, EAP, EAP for sum scores, and WLE, or estimate values using imputation variants of these estimators if an asymptotic parameter covariance matrix was computed beforehand. The hyper-parameters for the prior distributions of θ are obtained from the internal 'GroupPars' element in mo. For all Bayesian prediction methods, fscores() contains a specialized custom_den argument for users to define a customized density function if they wish to supply their own prior distribution function. The aforementioned prediction methods are available in both the pre-CAT and CAT stages; however, users should bear in mind that methods which can handle extreme response patterns (such as MAP) may be beneficial in the pre-CAT stage.
There are multiple item selection criteria available in mirtCAT, some of which are only applicable to unidimensional or multidimensional models. Criteria applicable to both unidimensional and multidimensional adaptive tests are the "KL" and "KLn" methods for the point-wise Kullback-Leibler divergence and the point-wise Kullback-Leibler with a decreasing delta value (∆ · √n, where n is the number of items previously answered), respectively, "IKLP" and "IKL" for the integration based Kullback-Leibler criteria with and without the prior density weight, and "IKLn" and "IKLPn" for the √n sequentially weighted counterparts of the integration criteria (Chang and Ying 1996). Possible inputs for unidimensional adaptive tests include "MI" for the maximum information criterion, "MEPV" for minimum expected posterior variance, "MLWI" for maximum-likelihood with weighted information, "MPWI" for maximum posterior weighted information, and "MEI" for maximum expected information (see Magis and Raîche 2012, and the references therein for further elaboration of these methods).
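As an illustration of the maximum information idea, the following base R sketch computes the Fisher information of 3PL items and picks the most informative item not yet administered. It assumes the slope-intercept form P(θ) = g + (1 − g) · logistic(aθ + d); the function names and the five-item bank are hypothetical, and the sketch is not the package's internal implementation.

```r
# 3PL response probability in slope-intercept form
p_3pl <- function(theta, a, d, g) g + (1 - g) * plogis(a * theta + d)

# Fisher information of a 3PL item evaluated at theta
info_3pl <- function(theta, a, d, g) {
  P <- p_3pl(theta, a, d, g)
  a^2 * ((1 - P) / P) * ((P - g) / (1 - g))^2
}

# Hypothetical item bank with five items
bank <- data.frame(a = c(0.8, 1.2, 1.5, 1.0, 2.0),
                   d = c(0.0, -1.0, 0.5, 1.5, -2.0),
                   g = 0.2)

# Select the not-yet-administered item with the largest information
next_item_MI <- function(theta_hat, bank, administered = integer(0)) {
  info <- info_3pl(theta_hat, bank$a, bank$d, bank$g)
  info[administered] <- -Inf  # exclude items already given
  which.max(info)
}

next_item_MI(theta_hat = 0, bank)
next_item_MI(theta_hat = 0, bank, administered = 3)
```

When g = 0 the information expression reduces to the familiar 2PL form a² · P · (1 − P), which gives a quick sanity check of the formula.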
Possible inputs for multidimensional adaptive tests include the "Drule" for the maximum determinant of the information matrix, "Trule" for the maximum (weighted) trace of the information matrix, "Arule" for the minimum (weighted) trace of the asymptotic covariance matrix, "Erule" for the minimum eigenvalue of the information matrix, and "Wrule" for the weighted information criteria. The multivariate selection criteria have posterior weighted analogues for Bayesian selection, which are available by passing "DPrule", "TPrule", "EPrule", and "WPrule", where the "P" indicates the use of a prior distribution. Finally, non-adaptive selection methods include the sequential ("seq") and random ("random") criteria, which can be used in both adaptive and non-adaptive tests.
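The determinant-based rule can be sketched in base R for two-dimensional M2PL items, whose item information matrix at θ is P(1 − P) · a aᵀ. The function names and the four-item bank below are hypothetical. Adding a prior precision matrix to keep the determinant non-zero early in the test is an assumption of this sketch (similar in spirit to the posterior weighted variants), not a description of the package internals.

```r
# Information matrix of one M2PL item at theta: P * (1 - P) * a a'
item_info <- function(theta, a, d) {
  P <- plogis(sum(a * theta) + d)
  P * (1 - P) * tcrossprod(a)
}

# Hypothetical two-dimensional item bank (rows = items)
A <- rbind(c(1.5, 0.0), c(0.0, 1.5), c(1.0, 1.0), c(0.2, 1.8))
d <- c(0, 0, 0.5, -0.5)

# Choose the item maximizing det(accumulated information + item information).
# prior_prec is an assumed prior precision matrix (identity by default).
next_item_D <- function(theta_hat, A, d, administered,
                        prior_prec = diag(ncol(A))) {
  acc <- prior_prec
  for (i in administered) acc <- acc + item_info(theta_hat, A[i, ], d[i])
  crit <- vapply(seq_len(nrow(A)), function(i) {
    if (i %in% administered) return(-Inf)  # skip items already given
    det(acc + item_info(theta_hat, A[i, ], d[i]))
  }, numeric(1))
  which.max(crit)
}

next_item_D(theta_hat = c(0, 0), A, d, administered = 1)
```

Because the first item informs only the first trait, the rule favors an item loading mainly on the second trait, which is the balancing behavior that makes the determinant attractive when traits are of equal importance.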

Auxiliary functions
Upon completion of the mirtCAT() function, an S3 object of class 'mirtCAT' is returned and contains information about the raw and scored response pattern, person demographics supplied in the survey, order in which the items were administered, estimation history, and final trait estimates. Three generic S3 methods, print(), summary(), and plot(), have been defined to help summarize the returned object. The print() method will display the number of items administered and, if mirtCAT() was supplied suitable item parameters, the final θ̂ and SE(θ̂) estimates. summary() will return a list of more detailed information about the raw and scored response patterns, items administered, item response times (in seconds), history of the θ̂ and SE(θ̂) estimates, and so on. Finally, plot() will generate figures based on the estimation history of θ̂ and SE(θ̂) (or confidence intervals) to display how the test was progressively scored.
When constructing CATs, developers may wish to experiment with their CAT designs by supplying fixed response patterns. mirtCAT eases the construction of plausible response patterns through the generate_pattern() function, and allows the CAT interface to be run without generating a GUI by passing an object containing suitable response patterns to mirtCAT(..., local_pattern). For example, a participant with a latent ability score of θ = 1 could create the following response pattern.
R> set.seed(1)
R> pattern <- generate_pattern(mirt_object, Theta = 1)
R> pattern

The generate_pattern() function sequentially generates plausible responses for each item in the item pool, and stores these values into a matrix. If generate_pattern() were supplied the df object with the respective item options then the function would return a character matrix of plausible responses instead of an integer matrix. The Theta input to generate_pattern() can be either a numeric vector to generate a single response pattern or a matrix of latent trait values to generate a matrix of plausible patterns corresponding to the rows in Theta. Supplying a matrix of trait values is especially useful for generating plausible response patterns for Monte Carlo simulation work, as we will observe in Section 5.
After generating one or more response patterns, the matrix is then passed to the local_pattern argument to execute the CAT session(s) off-line. As a simple example of how one might use these response patterns, the following CAT was designed to select all items with the maximum information ("MI") selection criteria.

R> plot(result)
The plot is shown in Figure 2.
Administering an adaptive test using an item bank with only 10 items clearly is not optimal for accurately measuring θ = 1; however, this example is only intended to demonstrate how the output is summarized. After calling the summary(result) function, a list containing various CAT elements is returned. The $items_answered and $scored_responses elements together indicate that item five was answered first with scored response 1, followed by item six with a scored response of 0, then item ten with a scored response of 1, and so on until item nine was chosen last by default with a scored response of 1. The $raw_responses element from the summary() output indicates which category was selected in the GUI or simulated response pattern; for off-line analyses this element can largely be ignored and simply indicates placeholder categories. The corresponding θ̂ and SE(θ̂) estimates are shown in the $thetas_history and $thetas_SE_history elements, and demonstrate how the ability estimates, and their respective standard errors, successively change after each item is administered.
At this point it is useful to note the connection with mirt, which thus far has silently performed all the computations of the θ̂ and SE(θ̂) estimates. The list of outputs returned by summary(result, sort = FALSE) can readily be used by functions in the mirt package by selecting various elements from the list output. More specifically, the unsorted $scored_responses element can be added to a calibration dataset (if the parameters were previously estimated) by simple use of the rbind() function. Including new response data to the original calibration dataset can be useful when recalibrating the parameter estimates at a later time. Additionally, the unsorted response pattern could be passed into fscores(..., response.pattern = pattern) to further examine what the θ̂ would have been given alternative prediction methods. For instance, if the above response pattern were to be estimated with the ML prediction method then the estimates would be

R> responses <- summary(result, sort = FALSE)$scored_responses
R> fscores(mirt_object, response.pattern = responses, method = "ML")

Because all items were administered, the unsorted response pattern is identical to the pattern generated from generate_pattern(). If items were not responded to due to early termination of the CAT then NA values would be present in the items containing no observations.

Single case MCAT example
In this section, an MCAT design and graphical user interface are constructed using the code located in Appendix A. The questions and item parameters were arranged to emulate how a multidimensional mathematical achievement test with cross-factor loadings and correlated latent traits could be managed. The item bank consisted of 120 items in total, and a data.frame object with the questions and answers was included. The first 30 items were constructed to measure only a hypothetical "Addition" trait, while the last 30 items measured only a "Multiplication" trait. The middle 60 items were evenly split to contain a mix of the Addition and Multiplication slopes. However, the first half of these items were designed to relate more to the Addition trait (contained larger slopes), while the second half were designed to relate more to the Multiplication trait. The expected information and standard error plots below indicate that respondents with abilities closer to the center of the distributions will be measured with the most accuracy, while those in the more extreme ends of the ability range will be measured with much less precision.
Given the objects defined in Appendix A, we first generate a plausible observed response pattern for a participant with abilities θ = [−0.5, 0.5].
R> set.seed(1)
R> pat <- generate_pattern(mo = mod, Theta = c(-0.5, 0.5), df = df)
R> head(pat)
[1] "145" "195" "200" "232" "207" "175"

The character responses indicate that, among the options in df[1, ], the category pertaining to the option "145" was selected as the correct answer, "195" was selected for the second item among the possible options in df[2, ], and so on for the remaining 118 items. To determine how the MCAT session would behave if each item were administered in sequence, the min_items argument could be increased to ensure that all items are selected; alternatively, the min_SEM input could be decreased to a much smaller value to accomplish the same goal.
R> result <- mirtCAT(df = df, mo = mod, local_pattern = pat,
+    design = list(min_items = 120))
R> print(result)

The plot is shown in Figure 4. Some initial observations can be made from inspecting the graphical output generated by plot(result). By design, the test parameters were simulated to almost exclusively measure the Addition trait for the first 60 items, and the consequences of this are clearly seen in the ability estimates and their respective standard errors. The standard errors for Theta_1 rapidly decreased in the first half of the test, while the standard errors for Theta_2 stayed roughly the same until the second half of the test began. As well, the point estimates for Theta_1 were able to move closer towards the population value of −0.5 in the first half, but only after the second half of the test began does Theta_2 begin to move towards the population value of 0.5. Although not shown above, the results from summary(result) revealed that both traits were measured with a standard error less than 0.4 after the 73rd item was administered.

Administering all items in an item pool is generally not desirable in real testing situations when the item parameters are available. Therefore, we will instead implement a multidimensional adaptive test design to select items that are more suitable for the observed response pattern. First, we choose an MCAT item selection criterion to help maximally increase the information in both traits simultaneously. Because the latent traits are deemed to be of equal importance in this example, the use of the D-rule for selecting items is a reasonable choice. Next, we set the stopping criteria for the standard error of measurement to 0.4 for all traits. The response pattern previously simulated is then reanalyzed in mirtCAT() with

R> set.seed(1234)
R> MCATresult <- mirtCAT(df = df, mo = mod, criteria = "Drule",
+    local_pattern = pat, start_item = "Drule",
+    design = list(min_SEM = 0.4))
R> print( […]
$raw_responses
[1] "3" "2" "1" "3" "3" "5" "5" "3" "1" "3" "3" "1" "1" "3" "4" "1" "5" "4"

For this particular response pattern, the MCAT session was terminated after only 18 items were administered. As can be seen from the summary results above, and the generated plot below, the items were effectively selected so as to reduce the standard error estimates more rapidly. Improving the standard errors consequently improves the rate at which the point estimates converge to their population values. When compared to administering items in an ordered sequence, which required 73 items, the MCAT with the D-rule criterion was able to obtain the same degree of measurement precision with 55 fewer items. Clearly, even for a small test bank such as the one simulated here, the payoff to implementing MCATs can be quite meaningful compared to more traditional item selection methods (see Figure 5).

Customizing GUI elements
To demonstrate how the previous MCAT example can be transformed into a useful GUI, we will now focus on the related graphical inputs required for the mirtCAT() function. The code in Appendix A generated character vectors for the questions, options, and answers for 120 items, and placed these values in an object called df. The df object can be passed to the mirtCAT(df = ...) input, which will generate simple text-based question stems using suitable HTML code. However, in the code below we will include two additional items to demonstrate how more stimulating item stems can be presented. When defining items, test designers will often wish to present stimuli other than the default text output, and instead include materials in the form of images, tables, maps, and so on. In such cases, the df input may include a Stem character vector to point to previously defined markdown or HTML files.

The following code adds two items to the existing df object which point to external HTML files. The rbind.fill() function from the plyr package (Wickham 2011) is used below to quickly fill in missing values with NAs, which can be useful when combining two data.frame objects. Screen captures of the graphical item stems can be seen in Figure 6.

Monte Carlo simulations
In this section, functions in the mirtCAT package for generating Monte Carlo simulation studies are presented, and the results are compared to existing R packages capable of analyzing CAT designs. The first simulation generates a unidimensional CAT design, and compares the results from mirtCAT to the catIrt and catR packages (version 0.5-0 and 3.4, respectively). The second design generates a two-dimensional MCAT design, but now compares the results to the MAT package (version 2.2). Finally, a third simulation design was constructed using mirtCAT to generate an MCAT design with mixed-item types, exposure control, content balancing, and a weighted item selection criterion.

Unidimensional simulation design
A unidimensional item bank was constructed to contain 1000 3PL item response models. The item slope parameters were drawn from a log-normal distribution, a ∼ logN(0.2, 0.3), the intercept parameters were drawn from a standard normal distribution, d ∼ N(0, 1), the lower-bound parameters were all set to a constant, g = 0.2, the latent trait values were drawn from a standard normal distribution, and N = 5000 plausible response patterns were generated given the latent trait values and item parameters. Using the mirtCAT package, the plausible response patterns were generated using the following code:

R> nitems <- 1000
R> N <- 5000
R> Theta <- matrix(rnorm(N))
R> a <- matrix(rlnorm(nitems, 0.2, 0.3), nitems)
R> d <- rnorm(nitems)
R> pars <- data.frame(a1 = a, d = d, g = 0.2)
R> mirt_object <- generate.mirt_object(pars, "3PL")
R> responses <- generate_pattern(mirt_object, Theta = Theta)

Analyzing the matrix of response patterns requires passing the object to the local_pattern argument in mirtCAT(). To allow for comparisons between existing R packages, a compatible CAT design was constructed. The design was organized such that all items were selected using the maximum-information criterion (including the initial item), the θ̂ values were updated using EAP estimation given a standard normal prior, the number of items selected was required to be between 10 and 50, and the CAT was terminated early if SE(θ̂) < 0.25.
When performing Monte Carlo simulation studies, front-end users should consider using multicore architecture methods. Invoking more than one processor to perform the computations can potentially reduce the estimation times by a factor proportional to the number of cores available. mirtCAT() explicitly supports parallel computing by accepting a cl argument, where cl is a suitable socket-type object to be used by functions in the parallel package.
The unidimensional CAT design and generated response patterns were then analyzed using code from the catIrt and catR packages. mirtCAT was executed twice to compare the effect of single versus multi-core architecture (eight processors were selected for the multi-core execution using an Intel i7, 3.40GHz processor; Operating system: Ubuntu, version 16.04 LTS). All final ability estimates correlated equally well with the generated population values (r ≈ 0.9689), and returned nearly the exact same estimates (r > 0.9999). However, where these packages differed was in the estimation time required to complete the simulation. The catIrt package was the least efficient at estimating this design, requiring 5733 seconds to complete the simulation (approximately 95 minutes), while catR required 1703 seconds (approximately 28 minutes). The mirtCAT package, however, required 565 seconds for execution with single-core architecture (9 minutes and 25 seconds) and only 144 seconds (2 minutes and 24 seconds) when using the internally organized multi-core architecture. As is evident from this simulation, multi-core architecture can be highly effective when performing Monte Carlo studies for CATs.

Multidimensional simulation design
The second simulation compared numerical results to the MAT package for a simple MCAT design. A relatively large item bank was organized to consist of 1000 M2PL items with two latent traits. The slope and intercept parameters were drawn from the same distribution as in the previous simulation; however, the θ parameters were drawn from a standard multivariate normal distribution with an inter-factor correlation of r = 0.5. The MCAT design used the D-rule throughout the session (including the initial item), the test was terminated if more than 30 items were selected or all SE(θ̂) < 0.3, and the trait estimates were computed using MAP estimation with a multivariate normal prior distribution of MVN(0, Σ), where Σ = [1, 0.5; 0.5, 1]. The estimation algorithm was exclusively required to be the MAP algorithm because it is the only method supported in MAT.
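Correlated trait values such as those described above can be drawn in base R with a Cholesky factor of the covariance matrix. This is a minimal sketch of one common approach, not necessarily the code used for the reported simulation; the seed and object names are arbitrary.

```r
set.seed(1)
N <- 5000
Sigma <- matrix(c(1.0, 0.5,
                  0.5, 1.0), nrow = 2, byrow = TRUE)

# chol() returns an upper-triangular R with t(R) %*% R = Sigma, so the rows
# of Z %*% R have covariance Sigma when the rows of Z are standard normal.
Z <- matrix(rnorm(N * 2), nrow = N, ncol = 2)
Theta <- Z %*% chol(Sigma)

round(cor(Theta), 2)
```

The resulting N × 2 matrix has one row per simulated respondent, which matches the matrix form of the Theta input described for generate_pattern().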
The θ̂ estimates recovered by MAT and mirtCAT were essentially equivalent (r > 0.9999), and correlated equally well with the population θ values (r ≈ 0.9377 for the first trait, r ≈ 0.6093 for the second trait). Estimation efficiency slightly favored the MAT package, requiring approximately 259 versus 394 seconds to complete the simulation. Nevertheless, the MAT package contains several limitations; namely, the MAT package currently only supports MAP estimation of the latent traits with a multivariate normal prior (whereas mirtCAT can support any defined prior density function), only supports the M3PL model, has limited support for other CAT related properties (such as content balancing, exposure control, pre-CAT stages, terminating the MCAT according to classification rules, and so on), and contains no public functions to help build customized MCAT designs (the majority of the package is written in self-contained C++ code). Therefore, the package may be of limited use when researchers wish to study more realistic MCATs, when investigating IRT models other than the M3PL model (i.e., a mixture of IRT models), and when developers require tools to build MCATs for real-time applications.

MCAT simulation with item selection factors
A final simulation was organized to demonstrate how mirtCAT can be used for tests with more complex IRT model combinations. An item pool of size 360 was constructed to consist of an equal number of M4PL models and MGPCMs (with five response options). The test was organized to have a bi-factor structure with three specific-item traits. Sufficient measurement precision was only required for the general factor, where the specific traits were treated as nuisance factors that were only included to account for inter-item dependencies. The general factor slopes were drawn from a log-normal distribution, a_g ∼ logN(0.2, 0.3), while the specific factor slopes were drawn from a_s ∼ logN(−1, 0.2). Each item was structured such that only one specific trait could influence each item, and each specific trait loaded uniquely on 120 items. For the 180 multidimensional 4PL models, all g and u parameters were set to the constants 0.2 and 0.95, respectively, while the intercepts were drawn from a standard normal distribution, d ∼ N(0, 1). The MGPCM intercepts were all drawn from d_k ∼ N(0, 2) and sorted from lowest to highest for each item. A matrix of multivariate ability parameters was sampled from a standard multivariate normal distribution with uncorrelated traits, θ ∼ MVN(0, I). Finally, given the item and person parameters, a total of 1000 plausible response patterns were generated using the generate_pattern() function.
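The bi-factor slope structure described above can be sketched in base R as follows. This is an illustration under stated assumptions rather than the Appendix B code: the block-wise assignment of items to specific traits, the column names a1 to a4, the number of intercept draws per item, and the reading of the second argument of N(0, 2) as a standard deviation are all assumptions.

```r
set.seed(1)
nitems <- 360

# General-factor slopes and specific-factor slopes
a_g <- rlnorm(nitems, meanlog = 0.2, sdlog = 0.3)
a_s <- rlnorm(nitems, meanlog = -1, sdlog = 0.2)

# Assign each item to exactly one specific trait, 120 items per trait
# (block-wise assignment is an assumption of this sketch)
specific <- rep(1:3, each = 120)

# Column 1 holds the general slopes; columns 2-4 hold the specific slopes
slopes <- matrix(0, nrow = nitems, ncol = 4,
                 dimnames = list(NULL, c("a1", "a2", "a3", "a4")))
slopes[, 1] <- a_g
slopes[cbind(seq_len(nitems), specific + 1)] <- a_s

# Intercepts for one polytomous item, sorted from lowest to highest
# (four draws and sd = 2 are assumptions)
d_k <- sort(rnorm(4, mean = 0, sd = 2))

head(slopes, 3)
```

Each row of the slope matrix has exactly two non-zero entries, one for the general factor and one for the single specific trait that influences the item.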
The MCAT design was organized to select items that were more informative of the general factor by utilizing the W-rule with the weight vector w = (0.85, 0.05, 0.05, 0.05). The MCAT was terminated when either 50 items were administered or the general factor standard error was less than 0.2. In addition to these selection rules, a content balancing scheme was constructed to ensure that there were more M4PL items administered than MGPCMs (70% versus 30%, respectively). Finally, a Sympson-Hetter exposure control method was included such that if the general factor slopes were too large then the item would have a lower probability of being selected. Specifically, the SH exposure control value for the i-th item was defined as 0.3 if a_g > 2.5, 0.6 if 2 < a_g ≤ 2.5, 0.9 if 1 < a_g ≤ 2, and 1 otherwise.
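The piecewise exposure rule above maps each general-factor slope to a selection probability. A minimal base R sketch of that mapping, using six hypothetical slope values, is:

```r
# Hypothetical general-factor slopes for six items
a_g <- c(0.8, 1.0, 1.6, 2.0, 2.3, 2.7)

# Right-closed intervals: (-Inf, 1], (1, 2], (2, 2.5], (2.5, Inf)
bin <- cut(a_g, breaks = c(-Inf, 1, 2, 2.5, Inf),
           labels = FALSE, right = TRUE)

# Exposure value associated with each interval, in the same order
exposure <- c(1, 0.9, 0.6, 0.3)[bin]
exposure
```

A vector of this kind, with one value between 0 and 1 per item, is the form described earlier for the exposure input when Sympson-Hetter control is wanted.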
The code used to generate and execute this simulation can be located in Appendix B.
In spite of the content balancing and exposure control effects, the MCAT design appeared to effectively recover the generated population trait values for the general factor. The final trait estimates correlated well with the population generating values (r = 0.9751) with little bias and variability (bias = −0.0046; root mean-square deviation = 0.2213). The MCAT was terminated after an average of 33.022 items were administered, and the empirical proportions of M4PL and MGPCM items administered were 0.6863 and 0.3137, respectively. As is evident from the figure below, items were more effectively selected from the item bank when the general trait scores were within approximately ±0.5 of the population mean. When the population values were close to the population mean, the MCATs were able to achieve the SE(θ̂_g) < 0.2 stopping criterion; however, more extreme population values were not measured accurately by the pool of available items, and instead the MCAT sessions were terminated when the maximum number of items was administered (see Figure 7).
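Recovery statistics of the kind reported above can be computed with a few lines of base R. The sketch below assumes that bias is the mean signed error of the estimates and that the root mean-square deviation is the square root of the mean squared error; the helper name recovery() is hypothetical and the data are simulated for illustration only, not taken from the study.

```r
# Hypothetical helper summarizing how well estimates recover true values
recovery <- function(theta_hat, theta_true) {
  err <- theta_hat - theta_true
  c(r    = cor(theta_hat, theta_true),  # correlation with true values
    bias = mean(err),                   # mean signed error
    RMSD = sqrt(mean(err^2)))           # root mean-square deviation
}

# Toy illustration with simulated estimation error (not the article's data)
set.seed(1)
theta_true <- rnorm(1000)
theta_hat  <- theta_true + rnorm(1000, sd = 0.2)
round(recovery(theta_hat, theta_true), 3)
```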

Discussion
In this article, an R package was introduced for generating interactive graphical interfaces specifically for MCAT and non-MCAT designs. The package provided tools to generate and analyze plausible MCAT responses for use in Monte Carlo simulations, provided functions to summarize MCAT results, and included utilities to plot the estimation history for multiple latent traits. The estimation efficiency was contrasted with existing R packages, and various graphical user interfaces were constructed to demonstrate how real-time MCAT applications can be generated with the mirtCAT package. Using the wide selection of IRT models available in the mirt package, mirtCAT was able to support the construction of flexible unidimensional and multidimensional adaptive test designs containing a mixture of IRT models.
Test developers may choose to execute their MCAT GUIs locally for single-computer administrations or deploy their GUIs over the Internet for remote accessibility. Executing the GUIs locally may require password protecting the questions and answer keys, disabling keyboard shortcuts in the operating system (e.g., Ctrl + Alt + Delete, Alt + F4, Alt + Tab, and so on), and saving the relevant R objects and terminating the R session immediately after the testing interface is complete. Executing the MCAT GUIs locally will often offer a more controlled laboratory setting, and generally will help maintain the integrity of the item pool. Therefore, local execution of MCAT GUIs is the recommended approach. Remote deployment of MCAT GUIs, on the other hand, will require the configuration of a server capable of handling the computations, and may introduce other unwanted issues (such as Internet connection problems, or slower upload and download times).
To date, the only open-source GUI-generating system that has been developed for deploying CATs in real-time has been the Concerto project (Kosinski and Rust 2011). Concerto uses the catR package as the back-end to perform the computations for unidimensional IRT models. However, an unfortunate complication regarding the Concerto project is that it is not implemented in R, and instead is executed as a standalone web application where R is used as a computational back-end. This requires the user to learn how to set up personal servers to collect the response data and to learn additional web-interface tools outside of the R language; even after these skills have been acquired, the interface is still currently limited to unidimensional IRT models. With the help of the tools available in mirtCAT, projects such as Concerto may be further extended to include unidimensional and multidimensional tests, potentially with a mixture of IRT models, in a manner similar to how the catR package has been adopted to perform the back-end computations in R.
Future work on mirtCAT will largely be driven by users who are interested in utilizing the package tools in their applied research work. However, a number of potential avenues to explore may include: better support for classification CATs, more dynamic GUI elements, and the inclusion of interactive item responses as the shiny framework continues to mature. Currently, interactive items can be included by pointing to raw HTML stems, although these do not directly integrate with the responses to each item. Additionally, more holistic control over content constraints may be included by providing support for so-called shadow testing designs (e.g., Veldkamp and Linden 2002).
mirtCAT is written and manipulated entirely within R, thereby allowing seamless transitioning between data collection using GUIs and further item analysis work with packages such as mirt. Users only need to learn basic R code, understand how to define their item parameters with mirtCAT or estimate their item parameters with the mirt package, and learn how to manipulate the simple tools defined in mirtCAT to build a completely automated CAT application for their own research purposes. The package provides intuitive tools to generate interfaces for local or server-side use, uses mirt to perform many of the underlying organizational and computational components, includes powerful Monte Carlo simulation design support for CATs and MCATs, and helps to facilitate the collection of respondent data using adaptive and non-adaptive tests or surveys. When used in conjunction with mirt, mirtCAT provides a fluid and organized workflow for test developers to collect, as well as analyze, their important response data.

+     questions[i] <- paste0(m1, " * ", m2, " = ?")
+   }
+   answers[i] <- as.character(ans)
+   ch <- ans + sample(c(-5:-1, 1:5) * spacing[i, ], 5)
+   ch[sample(1:5, 1)] <- ans
+   options[i, ] <- as.character(ch)
+ }
R> df <- data.frame(Question = questions, Option = options, Answer = answers,
+    Type = "radio")