Random Generation of Response Patterns under Computerized Adaptive Testing with the R Package catR

This paper outlines a computerized adaptive testing (CAT) framework and presents an R package for the simulation of response patterns under CAT procedures. This package, called catR, requires a bank of items, previously calibrated according to the four-parameter logistic (4PL) model or any simpler logistic model. The package proposes several methods to select the early test items, several methods for next item selection, different estimators of ability (maximum likelihood, Bayes modal, expected a posteriori, weighted likelihood), and three stopping rules (based on the test length, the precision of ability estimates or the classification of the examinee). After a short description of the different steps of a CAT process, the commands and options of the catR package are presented and practically illustrated.


Introduction
Computerized adaptive testing (CAT) was developed as an alternative to sequential (or fixed-item) tests. In this CAT framework, the items of the test are assigned to each respondent in an optimal sequence, each item being selected as the most useful or informative at the current step of the test. Optimality is usually defined with regard to the previously administered items, the examinee's responses to these items, and the current or provisional ability estimate. CAT has several well-known advantages over sequential tests: it requires shorter tests to reach the same level of precision in the ability estimates, it reduces the risk of fraud or cheating (each examinee receiving a different sequence of test items), and it provides an immediate estimate of the ability level of the respondent (Wainer 2000). However, CAT has two main drawbacks. First, computer and software resources are necessary to administer adaptive tests: to store the item parameter values, to estimate the ability level of the subjects, and to select the next item(s) to be administered. Second, a large bank of precalibrated items must be available, which can be demanding and costly to build (see the discussion of item banks below).

IRT models
Although the usual item response model under consideration for CAT is the three-parameter logistic (3PL) model (Birnbaum 1968), a less popular but more general model will be considered throughout this paper. The four-parameter logistic (4PL) model was introduced by Barton and Lord (1981) and takes the following form:

P_j(θ) = Pr(X_j = 1 | θ) = c_j + (d_j − c_j) exp[D a_j (θ − b_j)] / {1 + exp[D a_j (θ − b_j)]}. (1)

In (1), X_j is the (binary) response of the examinee to item j; θ is the ability level of the examinee; D is the metric constant (equal to 1 for the logistic metric); and a_j, b_j, c_j and d_j are the four item parameters, respectively the discrimination parameter (the slope of the item characteristic curve), the difficulty level (the intercept), the pseudo-guessing parameter (the lower asymptote) and the inattention parameter (the upper asymptote). The 3PL and 4PL models differ only by the last parameter, the upper asymptote d_j, which is equal to one in the 3PL model but can be smaller than one in the 4PL model. In fact, 1 − d_j represents the probability that a high-ability examinee incorrectly answers the item.
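As an illustration of the 4PL response probability, it can be computed directly in base R. The function name P4PL and the parameter values below are illustrative and not part of catR; D denotes the usual scaling constant (1 for the logistic metric).

```r
# Probability of a correct response under the 4PL model.
P4PL <- function(theta, a, b, c = 0, d = 1, D = 1) {
  e <- exp(D * a * (theta - b))
  c + (d - c) * e / (1 + e)
}

# At theta = b the logistic kernel equals 0.5, so P = c + (d - c) / 2:
P4PL(theta = 0, a = 1.2, b = 0, c = 0.2, d = 0.95)  # 0.575
```

Setting d = 1 recovers the 3PL model; setting in addition c = 0 gives the 2PL model.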
The main reason for considering this 4PL model in a CAT environment is that it is more general than the usual logistic IRT models. As pointed out above, the 3PL model is a particular 4PL model in which all d_j parameters equal one. If the lower asymptotes c_j are additionally constrained to zero, then only item difficulties and discriminations remain, and one deals with the two-parameter logistic (2PL) model. Finally, constraining all discriminations a_j to a common value leads to the one-parameter logistic (1PL) model. The latter is sometimes referred to as the Rasch model, although strictly speaking the Rasch model is obtained when all discriminations are fixed to one (unlike the 1PL model, for which all a_j's are equal, but not necessarily to one). Also, the 4PL model has recently been introduced in the CAT framework as an appropriate model to avoid issues with early mistakes in the test (Rulison and Loken 2009). This problem is particularly crucial for high-ability examinees who fail the first items of the adaptive test, since the recovery of the true ability level by its estimate cannot be guaranteed. Note that this is not an intrinsic property of the CAT process but a weakness of IRT scoring (Green 2011). According to Rulison and Loken (2009), however, this issue can be mitigated by using a 4PL model with upper asymptotes close to, but smaller than, one.
The main issue with this 4PL model, however, is that its item parameters cannot be easily estimated. Like the estimation of the lower asymptote in the 3PL model, the estimation of the upper asymptote in the 4PL model requires a large number of subjects. Rulison and Loken (2009) avoided that issue by calibrating the item parameters under a 3PL model (for which convenient software routines exist) and artificially fixing the upper asymptotes to 0.99 or 0.98. Although the purpose of their study was not affected by this artefact, it is definitely not an appropriate method for item parameter estimation. Recently, Loken and Rulison (2010) proposed a Bayesian framework to estimate the item parameters of the 4PL model simultaneously. They found that the recovery of these four item parameters is quite good. They also showed the potential superiority of the 4PL model over the simplified 3PL, 2PL and 1PL models when the true model is the 4PL. In sum, the 4PL model is considered here as a general model, from which the usual logistic models can be recovered by constraining the parameters appropriately, leaving the door open for future developments in the estimation of the 4PL model.
Several methods are available for calibrating (i.e., estimating) the item parameters from a data set of response patterns. The most common methods are joint maximum likelihood (Lord 1980), conditional maximum likelihood (Andersen 1970, 1972), marginal maximum likelihood (Bock and Aitkin 1981) and Bayesian modal estimation (Swaminathan and Gifford 1985, 1986). However, in the context of adaptive testing, it is often assumed that the item bank has been calibrated in advance, either from previous administrations of the items or by applying item generation methods (Irvine and Kyllonen 2002). The issue of estimating item parameters is therefore skipped in this presentation, and it is assumed that one can provide a matrix of item parameters as a basis for constructing an item bank (see later).
It is also important to recall that the item responses are considered as binary, true-false responses. An obvious extension would consist in administering polytomous, multiple-choice items, or a mix of both types. Nevertheless, polytomous item responses can always be reduced to binary outcomes; this introduces some loss of information but allows for a complete application of the present CAT process. The estimation of ability levels, on the other hand, is an integral part of the CAT framework and is consequently discussed further below.

Ability estimation
There exist several methods to estimate the ability level of an examinee, given the fixed item parameters and the corresponding response pattern. The most popular methods are the maximum likelihood (ML) estimator, the Bayes modal (BM) or maximum a posteriori estimator, the expected a posteriori (EAP) estimator and the weighted likelihood (WL) estimator. These methods are briefly presented below; see e.g., van der Linden and Glas (2000) for further details. All four estimators are available in the catR package.
The maximum likelihood estimator (Lord 1980) is the ability value θ̂_ML that maximizes the likelihood function L(θ) or its logarithm

log L(θ) = Σ_{j=1}^{J} [ X_j log P_j(θ) + (1 − X_j) log Q_j(θ) ], (2)

where Q_j(θ) = 1 − P_j(θ) is the probability of an incorrect answer and J is the test length. The asymptotic standard error of the ML estimate can be approximated by

SE(θ̂_ML) = 1 / [ Σ_{j=1}^{J} I_j(θ̂_ML) ]^{1/2}, (3)

where I_j(θ) is the item information function

I_j(θ) = [P′_j(θ)]² / [ P_j(θ) Q_j(θ) ], (4)

and P′_j(θ) stands for the first derivative of P_j(θ) with respect to θ.
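The ML estimator and its standard error can be sketched numerically with base R's optimize. Everything below (function names, the three-item toy bank) is illustrative and not part of catR.

```r
# Response probability, its derivative, and the item information (D = 1).
P4PL <- function(theta, a, b, c = 0, d = 1) {
  e <- exp(a * (theta - b)); c + (d - c) * e / (1 + e)
}
dP4PL <- function(theta, a, b, c = 0, d = 1) {
  e <- exp(a * (theta - b)); (d - c) * a * e / (1 + e)^2
}
itemInfo <- function(theta, a, b, c = 0, d = 1) {
  P <- P4PL(theta, a, b, c, d)
  dP4PL(theta, a, b, c, d)^2 / (P * (1 - P))   # item information
}
# Log-likelihood for a response vector x and a parameter matrix par.
logLik <- function(theta, x, par) {
  P <- P4PL(theta, par[, 1], par[, 2], par[, 3], par[, 4])
  sum(x * log(P) + (1 - x) * log(1 - P))
}

par <- cbind(a = c(1, 1.4, 0.8), b = c(-1, 0, 1), c = 0, d = 1)
x <- c(1, 1, 0)                                # mixed response pattern
thML <- optimize(logLik, c(-4, 4), x = x, par = par, maximum = TRUE)$maximum
seML <- 1 / sqrt(sum(itemInfo(thML, par[, 1], par[, 2], par[, 3], par[, 4])))
```

A mixed pattern (at least one success and one failure) is used on purpose: for a fully correct or fully incorrect pattern, the likelihood has no interior maximum.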
The Bayes modal (BM) estimator (Birnbaum 1969) is similar to the ML estimator, except that the function to be maximized is the posterior distribution g(θ) of the ability level, which is obtained by combining the prior distribution f(θ) and the likelihood function L(θ): g(θ) = f(θ) L(θ). In other words, the ML and BM estimators are the modes of the likelihood function and the posterior distribution, respectively. Thus, the BM estimator is the ability value θ̂_BM that maximizes the posterior distribution g(θ) or its logarithm

log g(θ) = log f(θ) + log L(θ). (5)

The choice of a prior distribution is usually driven by some prior belief about the ability distribution in the population of examinees. The most common choice is the normal distribution with mean µ and variance σ². In this case, the standard error of θ̂_BM is obtained by

SE(θ̂_BM) = 1 / [ 1/σ² + Σ_{j=1}^{J} I_j(θ̂_BM) ]^{1/2}. (6)

Sometimes the prior mean and variance are fixed to zero and one, respectively, so that f(θ) reduces to the standard normal distribution. Another common choice is the uniform distribution on a fixed ability interval. In this case, the ML and BM estimators are equivalent when the ability interval of the prior uniform density is sufficiently wide.
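A minimal sketch of the BM estimator with a normal prior, under the 2PL model for simplicity (function names and item values are illustrative, not part of catR). Note that the fully correct pattern used below would drive the ML estimate to infinity, while the prior keeps the BM estimate finite.

```r
# Log-posterior with a N(mu, sigma^2) prior; the BM estimate is its mode.
P2PL <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))
logPost <- function(theta, x, a, b, mu = 0, sigma = 1) {
  P <- P2PL(theta, a, b)
  dnorm(theta, mu, sigma, log = TRUE) + sum(x * log(P) + (1 - x) * log(1 - P))
}

a <- c(1, 1.2); b <- c(0, 0.5)
x <- c(1, 1)   # fully correct pattern: ML diverges, BM stays finite
thBM <- optimize(logPost, c(-4, 4), x = x, a = a, b = b, maximum = TRUE)$maximum
# Standard error with sigma = 1, using the 2PL information a^2 * P * Q:
seBM <- 1 / sqrt(1 + sum(a^2 * P2PL(thBM, a, b) * (1 - P2PL(thBM, a, b))))
```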
Although less frequently considered in CAT, a third prior distribution can be used: Jeffreys' noninformative prior density (Jeffreys 1939, 1946). This prior distribution is proportional to the square root of the test information function,

f(θ) ∝ [I(θ)]^{1/2}, (7)

where the test information function I(θ) is the sum of the item information functions:

I(θ) = Σ_{j=1}^{J} I_j(θ). (8)

With Jeffreys' prior distribution, the standard error of θ̂_BM is approximated by

SE(θ̂_BM) = 1 / { [I′(θ̂_BM)² − I″(θ̂_BM) I(θ̂_BM)] / [2 I(θ̂_BM)²] + I(θ̂_BM) }^{1/2}, (9)

where I′(θ) and I″(θ) are respectively the first and second derivatives of I(θ) with respect to θ. Jeffreys' prior is said to be noninformative because it relies on the item parameters of the test, and not on some prior belief about the ability distribution, as modelled by the normal distribution for instance. It has the advantage of being less affected by misspecifications of the prior distribution. Its practical usefulness in the framework of CAT, however, still has to be validated.
The third estimator is the expected a posteriori (EAP) estimator (Bock and Mislevy 1982). While the BM estimator computes the mode of the posterior distribution, the EAP estimator computes its posterior mean:

θ̂_EAP = ∫ θ f(θ) L(θ) dθ / ∫ f(θ) L(θ) dθ, (10)

with the same notations as for the BM estimator. In practice, the integrals in (10) are approximated, for instance by adaptive quadrature or numerical integration. The standard error of θ̂_EAP is given by

SE(θ̂_EAP) = [ ∫ (θ − θ̂_EAP)² f(θ) L(θ) dθ / ∫ f(θ) L(θ) dθ ]^{1/2}. (11)

Because the distribution of the ability levels is usually symmetric around the average ability level, the BM and EAP estimators often return similar estimates and standard errors.
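The two ratios of integrals above can be approximated on a fixed grid in a few lines of base R; this is a toy sketch with a standard normal prior and a 2PL toy bank, not the catR implementation.

```r
# EAP estimate and SE as ratios of integrals, approximated on a grid.
P2PL <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))
eap <- function(x, a, b, grid = seq(-6, 6, by = 0.01)) {
  L <- sapply(grid, function(th)
    prod(P2PL(th, a, b)^x * (1 - P2PL(th, a, b))^(1 - x)))
  w <- dnorm(grid) * L                 # prior times likelihood
  est <- sum(grid * w) / sum(w)        # posterior mean
  se  <- sqrt(sum((grid - est)^2 * w) / sum(w))  # posterior SD
  c(thEAP = est, se = se)
}

res <- eap(x = c(1, 0, 1), a = c(1, 1.2, 0.9), b = c(-0.5, 0, 0.5))
```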
The last estimator presented here is the weighted likelihood (WL) estimator (Warm 1989). It was introduced to reduce, and almost cancel, the bias of the ML estimator by an appropriate weighting of the likelihood function. Although the ML estimator is asymptotically unbiased, Lord (1983, 1984) noticed that for short tests its bias is proportional to the inverse of the test length. Warm (1989) established that the WL estimator θ̂_WL must satisfy the relationship

Σ_{j=1}^{J} [X_j − P_j(θ)] P′_j(θ) / [P_j(θ) Q_j(θ)] + J(θ) / [2 I(θ)] = 0, (12)

where

J(θ) = Σ_{j=1}^{J} P′_j(θ) P″_j(θ) / [P_j(θ) Q_j(θ)], (13)

and P″_j(θ) is the second derivative of P_j(θ) with respect to θ (Warm 1989). The standard error of θ̂_WL is approximated by

SE(θ̂_WL) = 1 / { I(θ̂_WL) + J′(θ̂_WL) / [2 I(θ̂_WL)] }^{1/2}, (14)

where J′(θ) is the first derivative of J(θ) with respect to θ.
In fact, J(θ) is the first derivative of the weight function with respect to θ, but the latter has no algebraic expression under the general 3PL model. Interestingly, Hoijtink and Boomsma (1995) and Warm (1989) noticed that under the 1PL and the 2PL models, the WL estimator is completely equivalent to the BM estimator with Jeffreys' prior distribution; see also Meijer and Nering (1999).
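The WL estimating equation can be solved numerically with uniroot. The sketch below specializes to the 2PL model, where the derivatives have simple closed forms (P′ = a P Q and P″ = a² P Q (1 − 2P)); the function name and the three-item bank are illustrative, not part of catR.

```r
# WL score: d log L / d theta + J(theta) / (2 I(theta)), 2PL case.
wlScore <- function(theta, x, a, b) {
  P <- 1 / (1 + exp(-a * (theta - b))); Q <- 1 - P
  I <- sum(a^2 * P * Q)                 # test information
  J <- sum(a^3 * P * Q * (1 - 2 * P))   # sum of P' P'' / (P Q) under the 2PL
  sum((x - P) * a) + J / (2 * I)        # score of the weighted likelihood
}

# The WL estimate is the root of the score on a wide ability interval.
thWL <- uniroot(wlScore, c(-6, 6), x = c(1, 0, 1),
                a = c(1, 1.3, 0.8), b = c(-1, 0, 1))$root
```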

Principles of CAT
Any CAT process requires a calibrated item bank and can be split into four steps. The first step is the initial step and consists in selecting one or several appropriate items as the first test items. The second step is the test step, in which the items are successively chosen from the bank and the ability level is re-estimated after each item administration. The third step is the stopping step and sets the parameters for the stopping rule of the CAT. The final step yields the final estimation of the ability level and possibly other additional information. Figure 1 is a schematic representation of the full CAT process, including the four steps. These are further presented in the next sections. Two additional CAT-related topics are also briefly outlined: item exposure and content balancing.

Figure 1: Schematic representation of a CAT process.

Item bank
The central tool for adaptive testing is the item bank, a collection of items that can be administered to the examinees. In order to generate response patterns, it is sufficient to have access to the item parameter values. In this framework, the item bank is assumed to be calibrated prior to the start of the CAT process. That is, a large collection of items, administered and calibrated in previous studies on the same topic, is available for adaptive testing. This may be a strong practical assumption with a heavy financial impact, but it remains a realistic one.
Very little is known about how large an item bank should be for optimal adaptive testing. The larger the item bank, the better for the CAT process, but it is not always possible to construct and calibrate a large number of items. A balanced item bank should also contain items across the whole range of difficulty levels, from very easy to very difficult, which enables the accurate estimation of extreme ability levels: easy items are most informative for low ability levels, while difficult items are most informative for high ability levels. The absence of difficult items therefore prevents accurate estimation of very high ability levels, while a bank with too many difficult items is inadequate for estimating low ability levels.

Initial step
In order to start any CAT process, one has to select at least one item in the item bank and to administer it to the examinee. Most often, a single item is selected at this step, and in the absence of any prior information about the examinee's ability level, one fixes this ability level equal to the prior mean ability value, usually zero. With this prior belief, the initial item is selected as the most informative in the item bank for this ability value.
Although this is the standard approach, it can be improved in several ways. First, if some prior information about the examinee is available, it can be incorporated into the initial step. For instance, knowing from previous tests that the examinee has a rather high or low ability level, one can set the initial ability level to a value larger or smaller than the average prior ability level, respectively. Another slight modification concerns the criterion for selecting the first item. As pointed out above, the most informative item is the standard choice, but one could instead select the item whose difficulty level is closest to the prior ability value. This reflects the intuitive reasoning that the most adequate item has a difficulty level very close to the ability level of the examinee. This is known as Urry's rule for selecting the next item (Urry 1970), but it is rarely applied at the initial step of a CAT.
A final improvement is to select more than one item at the initial step. This possibility is not clearly stated in the literature, and most CAT software permits selecting only a single item in this initial step. However, another approach could be to select two or three items, each referring to a different prior ability level in order to cover some range of abilities. For instance, fixing two prior ability levels to -1 and 1, and selecting the corresponding items (one per ability level), might reduce the issue of inadequate early item administration due to a lack of information about the subject's ability.

Test step
Once the initial items have been administered, one can get a first provisional estimate of ability by using the current set of responses. The second part of CAT, the test step, can be sketched as follows.
(a) Estimate the ability level by using the currently available information (previous response pattern to which is added the latest administered items). Set this as the provisional ability estimate.
(b) Select the next item among the bank items that have not been administered yet, according to the provisional ability estimate and the method for the next item selection.
(c) Administer the selected item to the examinee. Update the response pattern.
(d) Repeat steps (a) to (c) until the stopping criterion is satisfied (see later).
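The four steps above can be sketched as a short base-R simulation. This is a toy illustration only (a simulated 2PL bank, MFI selection, BM scoring with a standard normal prior, and a fixed-length stopping rule); it is not the catR implementation, and all function names are made up for the example.

```r
set.seed(1)
P2PL <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))
info <- function(theta, a, b) { P <- P2PL(theta, a, b); a^2 * P * (1 - P) }
bmEstimate <- function(x, a, b)          # BM estimate with a N(0, 1) prior
  optimize(function(th) dnorm(th, log = TRUE) +
             sum(x * log(P2PL(th, a, b)) + (1 - x) * log(1 - P2PL(th, a, b))),
           c(-4, 4), maximum = TRUE)$maximum

bank <- data.frame(a = runif(100, 0.8, 2), b = rnorm(100))
trueTheta <- 1
administered <- integer(0); x <- numeric(0); th <- 0   # initial estimate
for (k in 1:20) {                                      # length stopping rule
  avail <- setdiff(seq_len(nrow(bank)), administered)
  nxt <- avail[which.max(info(th, bank$a[avail], bank$b[avail]))]     # (b) MFI
  x <- c(x, rbinom(1, 1, P2PL(trueTheta, bank$a[nxt], bank$b[nxt])))  # (c)
  administered <- c(administered, nxt)
  th <- bmEstimate(x, bank$a[administered], bank$b[administered])     # (a)
}
```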
Any kind of ability estimator can be considered for the test step, but the ML estimator is often avoided because it returns infinite estimates for fully correct or fully incorrect response patterns, which are common in the early steps of the adaptive test. The BM estimator with a normal prior distribution is a common choice for the test step.
In general, the same estimator is used throughout the test step. However, it is possible to build a hybrid rule, starting with one ability estimator and switching to another one at some stage of the process. For instance, one could start with a Bayesian estimator and switch to the ML estimator when the response pattern contains at least one success and one failure. The first estimator avoids infinite ability estimates through the inclusion of a prior density, while the ML estimator works independently of any prior distribution. Similarly, the WL estimator could be considered, as it is less biased than the ML estimator. Other hybrid rules could be constructed similarly, by taking the advantages and drawbacks of each method into account.

Table 1: Criteria for next item selection and related objective functions. The "Optimization" column indicates whether the objective function must be maximized or minimized over all available items.
There are several methods for next item selection. The best known are the maximum Fisher information (MFI) criterion, the minimum expected posterior variance (MEPV) criterion, Urry's criterion (Urry 1970), the maximum likelihood weighted information (MLWI) criterion (Veerkamp and Berger 1997), the maximum posterior weighted information (MPWI) criterion (van der Linden 1998), the maximum expected information (MEI) criterion (van der Linden 1998), and completely random selection. A brief overview of most of these techniques is given in Choi and Swartz (2009) and van der Linden and Pashley (2000).
In order to provide a detailed description of these criteria for next item selection, we set the following notations. Assume that k − 1 items have been administered. Let X be the current response pattern, made of the k − 1 responses to the first administered items, and let θ̂_{k−1}(X) be the provisional ability estimate after these first k − 1 items. If item j has not yet been administered, let θ̂_k(X, X_j) be the provisional ability estimate when item j is administered and its response X_j is included in the current response pattern. We further refer to the items not yet administered as the set of available items, denoted S_{k−1}. Note that the subscripts k − 1 and k refer to the length of the current response pattern and not to any item number. Moreover, the notations L(θ|X) and I_j(θ|X) refer respectively to the likelihood function (2) and the item information function (4), evaluated at θ and given the response pattern X. We also denote by P(X_j = t|θ) the probability (1) that the response X_j to item j equals t, where t is either zero or one for incorrect and correct responses, respectively. Finally, f(θ) stands for the prior ability distribution, and Var_j(θ | f(θ), X, t) is the posterior variance of θ, given its prior distribution and the current response pattern X augmented by item j whose response value X_j equals t.

Table 1 summarizes the objective functions of five of the criteria listed above. They correspond to the functions to be maximized or minimized in order to determine the best item to administer next. The maximization (or minimization) is taken over all items in S_{k−1}, that is, over all available items. Whether maximization or minimization applies is listed in the "Optimization" column of Table 1.
With the MFI criterion, the next item is selected as the available item with maximum Fisher information at the current ability estimate. If the information functions were previously computed and stored in an information matrix, the selection of the next item by MFI is very fast. However, this criterion relies on the provisional ability estimate, which might be severely biased, especially in the first steps of the test. The MLWI and MPWI criteria overcome this problem through a marginalization process, by selecting the available item that maximizes a weighted form of the information function. For the MLWI criterion, the information is weighted by the likelihood function of the items already administered, while for the MPWI criterion, it is the posterior distribution of θ that acts as the weighting function.
Instead of maximizing the information function, or some weighted form of it, over the currently available items, another approach is to optimize some expected function of the possible responses to the next administered item. Two criteria of that kind are the MEPV and MEI criteria. In both cases, the objective function is obtained by computing the probabilities of answering the next item correctly or incorrectly, and by updating the response pattern conditionally upon these two possible responses. With the MEI criterion, the objective function is the expected information function, and with the MEPV criterion, it is the expected posterior variance of the ability level. Both methods require the computation of the ability estimate, or its posterior variance, when the response pattern is updated either by a zero (incorrect response) or a one (correct response). The next item to be administered is the one among all available items that maximizes the expected information function (for the MEI criterion) or minimizes the expected posterior variance (for the MEPV criterion).

Another method was proposed by van der Linden and Pashley (2000): the so-called maximum expected posterior weighted information (MEPWI) criterion, a combination of the MEI and MPWI criteria. However, Choi and Swartz (2009) demonstrated that this method is completely equivalent to the MPWI approach. For this reason, the MEPWI criterion is not considered in this paper.
Finally, the last two methods are Urry's criterion and completely random selection. Urry's criterion consists in selecting the available item whose difficulty level is closest to the provisional ability estimate (Urry 1970). This is a straightforward method, and under the 1PL model it is completely equivalent to the MFI criterion. With other models, however, some slight differences can occur. Completely random selection of the next item consists in a random draw from the set of available items. This is not an optimal method for yielding maximally informative tests, but its simplicity justifies its presence in the catR package. Moreover, performing some random item selection during the test might reduce the risk of item overexposure (Georgiadou, Triantafillou, and Economides 2007), at the price of selecting less informative items. It mostly serves as a baseline against which other, more efficient methods can be compared.
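The difference between the MFI and Urry criteria can be made concrete with a three-item 2PL toy bank (illustrative values, not part of catR): Urry's rule picks the item with difficulty closest to the provisional estimate, while MFI may prefer a slightly harder but much more discriminating item.

```r
P2PL <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))
info <- function(theta, a, b) { P <- P2PL(theta, a, b); a^2 * P * (1 - P) }

bank <- data.frame(a = c(0.6, 1.8, 1.0), b = c(0.0, 0.4, 2.0))
th <- 0                                       # provisional ability estimate
mfi  <- which.max(info(th, bank$a, bank$b))   # most informative item
urry <- which.min(abs(bank$b - th))           # difficulty closest to th
c(MFI = mfi, Urry = urry)                     # MFI picks item 2, Urry item 1
```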

Stopping step
Any CAT process stops when the stopping criterion is fulfilled. Three main stopping rules are considered: the length criterion, the precision criterion and the classification criterion.
The length criterion imposes a maximum number of items to be administered. The CAT process stops when this maximal test length is attained. Longer tests obviously increase the precision of the estimates of ability, but shorter tests might be considered for investigating some issues in the early steps of a CAT, e.g., the effects of early mistakes on the item selection process (Rulison and Loken 2009).
The precision criterion forces the CAT process to stop as soon as the precision of the provisional ability estimate reaches a pre-specified level. Typically, the precision is measured by the standard error of the ability estimate: the lower the standard error, the better the precision. Thus, one usually fixes a standard error as threshold value, and items are iteratively administered until the standard error of the provisional ability estimate becomes smaller than or equal to that threshold.
Finally, the classification criterion is often used when the goal of the CAT is to classify examinees with respect to some ability threshold, rather than to estimate their ability level. For instance, one may be interested in classifying students according to whether their ability level is larger or smaller than 0.5. An examinee will then be flagged as having ability higher than 0.5 if there is enough confidence to assert this classification, according to the CAT response pattern. In practice, a confidence interval for the ability is built at each step of the test, and the CAT process goes on until the ability threshold is no longer included in the confidence interval. This implies that the subject's ability is either larger or smaller than the threshold, and the test stops. In sum, two parameters are needed to set up the classification criterion: the ability threshold and the confidence level of the interval. The larger the confidence level, the longer the test needed to reach a final classification. Also, very large or very small thresholds often lead to shorter tests, because it is always easier to discriminate examinees with extreme abilities (high or low) from middle-level ability examinees.
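The classification rule amounts to a simple interval check at each step; here is a minimal sketch with a hypothetical helper name, using the usual normal-theory interval θ̂ ± z × SE.

```r
# Stop once the threshold falls outside the interval thetaHat +/- z * SE.
classifStop <- function(thetaHat, se, threshold, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  threshold < thetaHat - z * se || threshold > thetaHat + z * se
}

classifStop(1.2, 0.3, 0.5)  # TRUE: 0.5 lies below the interval, classify
classifStop(0.7, 0.3, 0.5)  # FALSE: 0.5 still inside, keep testing
```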

Final step
The final step of a CAT process provides the final estimate of the examinee's ability level, using the full response pattern of the adaptive test. The standard error of the estimate can also be displayed, and in the case of the classification stopping rule, the final classification of the examinee is also available. Any of the four estimators (ML, BM, EAP or WL) can be considered in this final step, and it need not be the same as the one used during the test step. Indeed, it is possible to combine different estimators in the test and final steps. For instance, a Bayesian estimator (EAP or BM) or the weighted likelihood method can be used throughout the test, and especially in its first steps, to avoid the issue of infinite estimates with fully correct or fully incorrect patterns. At the final stage, however, the simple ML estimator can be used in order to get a final estimate that is free of any prior or weighting scheme.

Item exposure and content balancing
Apart from the four aforementioned steps, two additional issues are often controlled during a CAT process.
The first issue is called item exposure, and refers to the problem that some items might be administered far more often than others. One reason might be that these items are often selected as initial items, or that they are very informative at average ability levels (Davey and Parshall 1995; van der Linden 1998). However, allowing such items to be administered too often creates a security problem for the item bank: pre-knowledge about overexposed items can become available to examinees who gather related information from previous test takers. Another related problem is the increased cost of developing and calibrating new items to be introduced into the item bank. For these reasons, it is important to ensure that items are not administered too frequently (Chang and Ying 1999; Stocking and Lewis 1998).
Several methods have been suggested to control item exposure. Some rely on the selection of more than one optimal item in the neighbourhood of the current ability estimate, followed by a random draw among these items to determine the next one administered. Such methods include the so-called randomesque method (Kingsbury and Zara 1989, 1991) and the 5-4-3-2-1 technique (McBride and Martin 1983; Chang and Ansley 2003). More sophisticated methods, based on maximum allowed item exposure rates that are determined by means of prior simulations, include the Sympson and Hetter method (Hetter and Sympson 1997), the Davey and Parshall approach (Davey and Parshall 1995; Parshall, Davey, and Nering 1998) and the Stocking and Lewis conditional and unconditional multinomial methods (Stocking and Lewis 1995, 1998). The recent method of maximum priority index (Cheng and Chang 2009) is also worth mentioning.
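As an illustration of exposure control, here is a minimal sketch of the randomesque idea: instead of deterministically taking the single most informative available item, draw at random among the n best. The function names and the simulated 2PL bank are illustrative, not part of catR.

```r
set.seed(42)
info2PL <- function(theta, a, b) {
  P <- 1 / (1 + exp(-a * (theta - b))); a^2 * P * (1 - P)
}
randomesque <- function(theta, bank, avail, n = 5) {
  inf <- info2PL(theta, bank$a[avail], bank$b[avail])
  top <- avail[order(inf, decreasing = TRUE)][seq_len(min(n, length(avail)))]
  top[sample.int(length(top), 1)]   # random draw among the n best items
}

bank <- data.frame(a = runif(50, 0.8, 2), b = rnorm(50))
item <- randomesque(theta = 0, bank = bank, avail = 1:50)
```

Over many administrations, the exposure of the single best item is spread across its n closest competitors, at a small cost in information.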
The second issue is often referred to as content balancing, and consists in selecting items from the various subgroups of an existing structure of the item bank. The selection must be balanced with respect to some predefined percentages of items coming from each subgroup (Kingsbury and Zara 1989). This is particularly useful when the item bank has a natural structure of subgroups of targeted goals or competencies, for instance an item bank of mathematics items with a natural classification into addition, subtraction, multiplication and division problems. Forcing the CAT content to be balanced ensures that at least some percentage of the test items comes from each subgroup of items.
In sum, controlling for content balancing requires: (a) an intrinsic classification of items into subgroups of targeted areas; and (b) a set of relative proportions of items to be administered from each subgroup. With these elements, Kingsbury and Zara (1989) proposed a simple method, sometimes referred to as the constrained content balancing method (Leung, Chang, and Hau 2003):

1. At each step of the CAT process, compute the current empirical relative proportions for each subgroup.
2. Determine the subgroup with largest difference between the theoretical relative proportion and its empirical value.
3. Select the next item from this subgroup and return to step 1.
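The subgroup choice in the rule above can be sketched in a few lines of R; the function name and the target proportions are illustrative.

```r
# Steps 1-2 of the constrained content balancing method: find the subgroup
# whose empirical proportion lags most behind its target proportion.
nextSubgroup <- function(administered, target) {
  emp <- table(factor(administered, levels = names(target))) /
    max(1, length(administered))
  names(target)[which.max(target - emp)]
}

target <- c(addition = 0.4, subtraction = 0.3,
            multiplication = 0.2, division = 0.1)
nextSubgroup(c("addition", "addition", "subtraction"), target)
# "multiplication": it is the most under-represented relative to its target
```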
More sophisticated content balancing methods are available; see for instance Leung et al. (2003) and Riley, Dennis, and Conrad (2010). Interestingly, the method of maximum priority index (Cheng and Chang 2009) can also be considered for content balancing.

The catR package
The main steps of a CAT process have been sketched in the previous section. In this section we provide details about the functionalities of the catR package, in relationship with the different CAT steps.

Structure
The catR package is organized around a single function called randomCAT. It generates a response pattern given the true ability level of the examinee, an item bank, the maximal number of items to be administered, and four lists of input parameters. These lists correspond to the different steps described in Section 2 and are detailed below. The basic R code for randomCAT is as follows:

randomCAT(theta, bank, maxItems, cbControl, start, test, stop, final)

In this code, theta is the true ability level; bank is an item bank with the appropriate format (see later); maxItems is the maximal test length (set to 50 items by default); cbControl is an input parameter that controls for content balancing (whenever required); and start, test, stop and final are four lists of input parameters. These lists are fully described in the forthcoming sections.
Note that to avoid misspecifications of the input lists, catR has an integrated function called testList, which determines whether the lists have the appropriate format. If at least one of the lists is not correctly specified, testList returns an error message with indications on the detected misspecifications.

Item bank generation
To create an item bank in an appropriate format for the randomCAT function, catR makes use of the createItemBank function. This command has twelve input parameters, some of which are mandatory while others are optional with default values. These parameters are listed in Table 2, together with their role, possible values, default values, and the cases where they are ignored by the function.
The starting point to generate an item bank is the matrix of item parameters. This matrix must have one row per item, and at least four columns. The columns hold respectively the values of the item discriminations, difficulties, pseudo-guessing and inattention parameters, in this order. This matrix can be passed through the items argument. A fifth column can be added to the matrix, holding the names of the subgroups of items for content balancing. In this case, the full matrix can be passed as a data frame. If this fifth column is absent then content balancing cannot be controlled. Note that to allow for content balancing control, the argument cb must be additionally set to TRUE (default value is FALSE), otherwise the fifth column is discarded for creating the item bank.
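For instance, a small bank with content balancing groups could be sketched as follows (the parameter values and group labels are purely illustrative):

```r
# Hypothetical 4PL parameters: discrimination (a), difficulty (b),
# pseudo-guessing (c) and inattention (d), plus a fifth column of
# subgroup labels for content balancing
itPar <- data.frame(a = c(1.0, 1.2, 0.8, 1.1),
                    b = c(-0.5, 0.3, 1.1, -1.2),
                    c = c(0.20, 0.15, 0.25, 0.10),
                    d = rep(1, 4),
                    Group = c("Audio", "Audio", "Written", "Written"))

# cb = TRUE is required to keep the fifth column for content balancing
Bank <- createItemBank(items = itPar, cb = TRUE)
```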
Alternatively, the user can let the software generate the item parameters. In this case, the items argument contains the number of items to be included in the bank. Furthermore, the argument model must be set to determine which IRT model is used to generate item parameters. Possible values of model are "1PL", "2PL", "3PL" and "4PL", with reference to each of the four logistic models. By default, the "4PL" model is considered, and the following distributions are used: N (0, 1) for item difficulties, N (1, 0.04) for item discriminations, U (0, 0.25) for item pseudo-guessing parameters, and U (0.75, 1) for item inattention parameters.
These prior distributions can be modified by setting the optional arguments aPrior, bPrior, cPrior and dPrior accordingly. Each of these arguments takes a vector of three components, the first one coding for the distribution name and the last two holding the distribution parameters. For item discriminations, available distributions are the normal, the log-normal, and the uniform densities. For item difficulties, both the normal and the uniform distributions are currently available. Finally, for both pseudo-guessing and inattention parameters, the Beta and the uniform distributions are possible choices (see the help file of createItemBank for further details).

Argument  Role                                      Value                     Default             Ignored if
model     specifies the IRT model for item          "1PL", "2PL", "3PL"       "4PL"               items is a matrix
          parameter generation                      or "4PL"
aPrior    specifies the prior distribution of       a vector with             c("norm", 1, 0.2)   items is a matrix
          item discriminations                      distribution components
bPrior    specifies the prior distribution of       a vector with             c("norm", 0, 1)     items is a matrix
          item difficulties                         distribution components
cPrior    specifies the prior distribution of       a vector with             c("unif", 0, 0.25)  items is a matrix
          item pseudo-guessing parameters           distribution components
dPrior    specifies the prior distribution of       a vector with             c("unif", 0.75, 1)  items is a matrix
          item inattention parameters               distribution components
seed      fixes the seed for random generations     a real value              1                   items is a matrix
thMin     fixes the minimum ability value for       a real value              -4                  ///
          the information grid
thMax     fixes the maximum ability value for       a real value              4                   ///
          the information grid
step      fixes the step between ability values     a positive real value     0.01                ///
          for the information grid
D         fixes the metric constant                 a positive real value     1                   items is a matrix

Table 2: Arguments of the createItemBank function.

For simpler models, the corresponding parameters are constrained accordingly; for instance, under the 2PL model, the third and fourth columns will be all zeros and ones, respectively. The argument D sets the metric constant, by default 1 for the logistic metric.
The other common choice is to set D to 1.7 for the normal metric (Haley 1952). Arguments model, aPrior, bPrior, cPrior, dPrior, seed and D are ignored if the user provides a matrix (or data frame) through the items argument.
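As an illustration, the following sketch lets catR generate a 2PL bank with modified priors (the exact values are arbitrary; the distribution codes follow the examples of Table 2):

```r
# 200 items under the 2PL model: N(1.2, 0.3) discriminations and
# U(-3, 3) difficulties, with a fixed seed for reproducibility
Bank2PL <- createItemBank(items = 200, model = "2PL",
                          aPrior = c("norm", 1.2, 0.3),
                          bPrior = c("unif", -3, 3),
                          seed = 1)
```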
Once the matrix of item parameters is available, an item information matrix is created. This matrix contains the values of the item information function for each item in the bank, and for a set of predefined ability levels. These abilities are chosen from a sequence of values entirely defined by the arguments thMin, thMax and step, which set respectively the minimum ability level, the maximum ability level, and the step between two successive abilities of the sequence. The default values are -4, 4 and 0.01, which means that the default information matrix refers to ability levels from −4 to 4 by steps of 0.01. Each row of the information matrix refers to one ability level, and each column to one of the bank items.

Argument     Role                       Value                  Default  Ignored if
halfRange    fixes the bandwidth of     a positive real value  4        fixItems or seed
             the range of abilities                                     is not NULL
startSelect  specifies the method for   "bOpt" or "MFI"        "bOpt"   fixItems or seed
             item selection                                             is not NULL

Table 3: Arguments of the start list for the initial step.
Note that for a correct application of randomCAT, the item bank must be generated prior to starting the CAT process. It is therefore recommended to generate the item bank with the createItemBank function and to save it into an R object. This object can then be used in any further application of randomCAT.

Initial step
The initial step, that is, the selection of the first item(s) to be administered, is coded by the start list in the randomCAT function. Table 3 summarizes the six possible arguments of this list, together with additional information such as their default values and the situations where they are ignored.
The three methods for selecting the first item(s) are: by user specification, by random selection, or by optimal selection according to some fixed constraints. The three methods are set up as follows.
First, the user specification of the first items is done through the fixItems argument. This is a vector of integer values coding for the item numbers, in the order they are listed in the item bank. By default, fixItems takes the NULL value, which means that the user does not select the first items. Then, one can choose these first items randomly. Two arguments must be specified: seed takes a numeric value that fixes the random seed for item selection, and nrItems indicates how many items must be randomly chosen. If seed is NULL (the default value), random selection is ignored and the third method applies. This third method, the optimal selection of the starting items, relies on two aspects. The first one is the subset of ability levels that will be used for optimal item selection; each value in this subset corresponds to one selected item. The subset is defined by fixing an interval of ability levels and splitting this interval into a set of equidistant values. The interval is set up by its centre, through the argument theta, and by its half-range (i.e., half the range of the interval), through the halfRange argument. The number of values is again given by the nrItems argument. For instance, the subset (−1, 0, 1) of ability values is set up by the triplet of values (3, 0, 1) for the triplet of arguments (nrItems, theta, halfRange). As another example, the subset (−1.5, 0, 1.5, 3) is given by (4, 0.75, 2.25). A single item can also be selected: the triplet of values (1, th, hr) reduces the subset to the single value th, and the value hr of the half-range is irrelevant.
When this subset of ability values is fixed, the second aspect of the optimal item selection method is the criterion used for selecting these items. Two possible methods are implemented, through the startSelect argument: Urry's rule and the information function rule. Urry's rule, coded by startSelect = "bOpt", selects the items whose difficulty levels are as close as possible to the ability values in the subset. The information function rule, coded by startSelect = "MFI", selects the items that maximize the item information function at the given ability levels. By default, items are selected according to their difficulty level.
It is important to notice that the three methods are hierarchically ordered. If fixItems is not NULL, items will be chosen according to the user's pre-specifications, and all other arguments of the start list will be ignored. If fixItems is NULL and seed is not NULL, then nrItems will be randomly selected according to the seed value, and all other arguments are again ignored. Finally, if both fixItems and seed are NULL, then the optimal selection of the first items is achieved, according to the values of the nrItems, theta, halfRange and startSelect arguments. Note also that not all the arguments need be specified. For instance, if only seed is specified, then a single item will be randomly chosen in the item bank, as the default value of nrItems is one.
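The hierarchy of the three methods can be illustrated with the following alternative start lists (item numbers and ability values are arbitrary):

```r
# (a) User-specified first items: all other start arguments are ignored
startFixed  <- list(fixItems = c(12, 5, 38))

# (b) Random selection of three items, driven by the seed
startRandom <- list(seed = 1, nrItems = 3)

# (c) Optimal selection: three items targeted at abilities -1, 0 and 1,
#     chosen by maximum Fisher information
startOpt    <- list(nrItems = 3, theta = 0, halfRange = 1,
                    startSelect = "MFI")
```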

Test step
The test step is specified in randomCAT by means of the test list, which contains at most nine arguments as listed in Table 4. These arguments refer to the selected method of provisional ability estimation, the rule for next item selection, and the control of item exposure.
First, the method argument fixes the estimation method. The possible values are "ML", "BM", "EAP" and "WL" for the respective methods, and the default choice is "BM". If the method is either Bayesian modal ("BM") or expected a posteriori ("EAP"), the priorDist and priorPar arguments determine the prior distribution. The argument priorDist fixes the distribution itself, while priorPar specifies the related prior parameters.
Three cases can arise. First, priorDist takes the (default) value "norm", corresponding to the normal distribution, and priorPar is a vector of two numeric components, the prior mean and the prior standard deviation. Second, priorDist takes the value "unif" for the uniform distribution, and priorPar is a vector of two numeric components holding the range of the uniform distribution. Third, priorDist takes the value "Jeffreys" for Jeffreys' prior distribution; in this case, the priorPar argument is ignored.
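The three cases translate into the following test lists (the parameter values are illustrative):

```r
# BM estimation with a N(0, 1.5) prior
testNorm <- list(method = "BM", priorDist = "norm", priorPar = c(0, 1.5))

# EAP estimation with a uniform prior on [-3, 3]
testUnif <- list(method = "EAP", priorDist = "unif", priorPar = c(-3, 3))

# BM estimation with Jeffreys' prior (priorPar is ignored)
testJeff <- list(method = "BM", priorDist = "Jeffreys")
```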
Moreover, for EAP estimation, it is possible to set some parameters for the numerical approximation of the integrals. These parameters are the limits for adaptive quadrature and the number of quadrature points, and are provided altogether through the parInt argument. The default value of parInt is the vector c(-4, 4, 33), specifying 33 quadrature points on the range [−4, 4], that is, the sequence from −4 to 4 by steps of 0.25.
The next two arguments are D and range. The argument D fixes the metric constant (see Table 2), and the argument range sets the range of allowable ability estimates; its primary use is to avoid infinite estimates. The default range is [−4, 4] and can be changed by providing the bounds of the interval through the range argument.

Argument     Role                              Value                      Default       Ignored if
method       specifies the method for          "BM", "ML", "EAP"          "BM"          ///
             ability estimation                or "WL"
priorDist    specifies the prior               "norm", "unif"             "norm"        method is neither
             distribution                      or "Jeffreys"                            "BM" nor "EAP"
priorPar     specifies the parameters of       a vector of two            c(0, 1)       method is neither
             the prior distribution            real values                              "BM" nor "EAP", or
                                                                                        priorDist is "Jeffreys"
range        fixes the maximal range of        a vector of two            c(-4, 4)      method is "EAP"
             ability values                    real values
D            fixes the value of the            a positive real value      1             ///
             metric constant
parInt       fixes the parameters for          a vector of three          c(-4, 4, 33)  method is not "EAP"
             numerical integration             numeric values
itemSelect   specifies the method for          "MFI", "MEPV", "MEI",      "MFI"         ///
             next item selection               "MLWI", "MPWI", "Urry"
                                               or "random"
infoType     specifies the type of             "observed" or "Fisher"     "observed"    itemSelect is
             information function                                                       not "MEI"
randomesque  specifies the number of optimal   a positive integer         1             ///
             items for 'randomesque'
             item exposure control

Table 4: Arguments of the test list for the test step.
The itemSelect argument specifies the method for selecting the next item. Currently seven methods are available: MFI, MEI, MEPV, MLWI, MPWI, Urry's rule and random selection. The first five methods are set up by their acronym (i.e., "MFI" for the MFI method, etc.), Urry's rule by the value "Urry", and random selection by the value "random". The default method is the MFI criterion. In addition, the type of information function, either observed or Fisher, can be defined by the infoType argument. The two possible values are "observed" and "Fisher", and the default value is "observed". Note that this argument is only useful for the MEI criterion, and is ignored with all other methods.
Finally, item exposure can be controlled with the randomesque approach (Kingsbury and Zara 1989). It consists of selecting not one but several optimal items, according to the specified criterion, and randomly picking one of them for the next step of the CAT process. This is controlled by the argument randomesque, taking a positive integer value with default value 1 (that is, the usual selection of the single optimal item). This is currently the only available method for item exposure control within catR.
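For instance, the following test list (with illustrative values) selects the five most informative items at each step and administers one of them at random:

```r
# MFI selection with 'randomesque' exposure control among 5 items
Test <- list(method = "BM", itemSelect = "MFI", randomesque = 5)
```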

Argument  Role                             Value                   Default   Ignored if
rule      specifies the stopping rule      "length", "precision"   "length"  ///
                                           or "classification"
thr       specifies the threshold          a real value            20        ///
          related to the stopping rule
alpha     specifies the alpha level for    a real value            0.05      rule is not
          the provisory confidence                                           "classification"
          intervals

Table 5: Arguments of the stop list for the stopping step.

Stopping step
The stopping step is defined by the stop list. This list has at most three arguments, which are listed in Table 5.
The first argument, rule, specifies the type of stopping rule. Three values are possible: "length", "precision" and "classification", each referring to a particular rule as described in Section 3.3. The default rule is the "length" criterion. The second argument is called thr and holds the specific threshold of the stopping rule. The values of thr are either: an integer fixing the test length (for the length criterion), a real positive value giving the maximum allowable standard error of the ability estimate (for the precision criterion), or the value of the ability level to be considered for subject classification (for the classification criterion). Because the default stopping rule is the length criterion, the default thr value is 20, that is, the CAT process will stop by default after 20 items are administered.
The last argument, alpha, sets the confidence level of the intervals for the classification criterion. More precisely, the confidence level is one minus alpha, and the default alpha level is 0.05. This argument is obviously ignored if the rule is either "length" or "precision".
It is important to notice that the stopping rule might not be satisfied if the stop list is badly specified. For instance, setting a very small standard error for the precision criterion may require so many items that the CAT process becomes impractical. A similar problem arises with the classification criterion if either the threshold or the significance level is misspecified, so that too many items are required to classify the subject. To avoid such issues, randomCAT has a "meta-argument", called maxItems, which imposes a maximal test length and forces the test to stop when this length is reached, even if the stopping rule is not yet satisfied. In this case, the output displays a warning message indicating that the stopping rule was not satisfied. Note that if the length criterion is chosen, the test length is taken as the minimum of the maxItems meta-argument and the thr argument of the stop list. The default value of maxItems is 50.
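For instance, a precision-based rule combined with the maxItems safeguard could be sketched as follows (the threshold is illustrative, and the objects Bank, Start, Test and Final are assumed to be defined beforehand):

```r
# Stop once the standard error of the ability estimate drops below 0.3,
# but never administer more than 30 items
Stop <- list(rule = "precision", thr = 0.3)
res  <- randomCAT(theta = 0, bank = Bank, maxItems = 30,
                  start = Start, test = Test, stop = Stop, final = Final)
```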

Final step
The final list sets the ability estimation method for the final step of the CAT process. Its arguments are identical to those of the test list, with two exceptions. First, the itemSelect and infoType arguments are useless for the final step and are therefore not allowed. Second, the final list has an additional argument, alpha, which fixes the significance level for the final confidence interval of the ability level. This alpha argument behaves like that of the stop list (see Table 5) and takes the default value 0.05. Note also that the final confidence interval is always computed, even if the stopping rule is not the classification criterion. Although the remaining arguments are identical to those of the test list, they may take different values, so that the provisional and final ability estimators can differ.
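For example, one might estimate ability provisionally with BM but report a final WL estimate with a 90% confidence interval (an illustrative combination):

```r
Test  <- list(method = "BM", itemSelect = "MFI")
Final <- list(method = "WL", alpha = 0.10)  # 90% final confidence interval
```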

Content balancing
If the item bank was specifically created for content balancing, the latter can be controlled through the argument cbControl of the randomCAT function. The only content balancing method currently available in catR is the constrained content balancing method (Kingsbury and Zara 1989, 1991). By default, cbControl takes the NULL value, and no content balancing control is performed. Otherwise, cbControl must be set as a list of two elements called names and props. The element cbControl$names contains the names of the subgroups of items, while cbControl$props holds the expected relative proportions of items per subgroup. For compatibility, the names of the subgroups must exactly match the subgroup names provided in the item bank, and the relative proportions must be non-negative (they need not sum to one; in that case they are internally normalized to sum to one). Note that catR has an internal function called test.cbList that tests the format of the cbControl list, and returns an error message if the list is not correctly specified for content balancing.
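A valid cbControl specification could thus be sketched as follows (the group labels and proportions are illustrative and must match the fifth column of the item bank; the other objects are assumed to be defined as before):

```r
# Expect roughly one third 'Audio' and two thirds 'Written' items;
# the proportions are normalized internally to sum to one
cbList <- list(names = c("Audio", "Written"), props = c(1, 2))
res <- randomCAT(theta = 0, bank = Bank, cbControl = cbList,
                 start = Start, test = Test, stop = Stop, final = Final)
```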

Output
The structure of an output from randomCAT is a list of class "cat" with many arguments. Basically, the function returns the final results: final ability estimation, standard error and confidence interval. Moreover, the complete response pattern and the matrix of item parameters of selected test items are also returned. For completeness, the vectors of provisional ability estimates and standard errors are displayed, so that one can plot the curve of ability estimation through the CAT process. Finally, all input arguments of the start, test, stop and final lists are returned within the randomCAT output, as well as the value of the cbControl argument. The complete list of argument names is described in the catR help manual.
Two additional functions are useful for a simpler and more suitable display of the CAT results. First, the print function for "cat" objects organizes the output and displays it in a visually attractive way (see the next section for an example). The printed results include a summary of the basic options: the selection of early test items, the provisional ability estimator, the stopping rule, the control for item exposure, etc. Then, the different steps of the CAT process are displayed in a table format, with the administered item, the response to this item, and the updated provisional ability estimate and standard error. Next, the final results are displayed: final ability estimator, final estimate, standard error and confidence interval. If content balancing was controlled, additional output is also displayed: a summary table of both expected and observed relative proportions of items administered per subgroup, and the list of items administered per subgroup.
The second utility function is the plot command for "cat" objects. It provides a visual representation of the CAT process by plotting the provisional ability estimates against the test length. The function can be used as follows:

plot(x, ci = FALSE, alpha = 0.05, trueTh = TRUE, classThr = NULL)

In this code, x is the output of the randomCAT function; ci is a logical argument that specifies whether confidence intervals must be drawn around each provisional ability estimate; and alpha is the significance level for building the confidence intervals (0.05 by default). In addition, trueTh is a logical value specifying whether the true ability level must be added to the plot (as a horizontal line), and classThr is the value of the classification threshold to be drawn in addition (the value NULL disables this argument). Only the x argument is mandatory; by default, the set of provisional and final ability estimates is drawn without confidence intervals, and the true ability level is superimposed on the plot.

Examples
We illustrate the package by creating two item banks and by generating response patterns under pre-specified CAT processes. In the first example, a large item bank is considered and the main goal of the test is to discriminate between very good examinees and the others. In the second example, a real item bank is taken into account and both item exposure and content balancing are under control. Because the response patterns are drawn by a random process, the output may differ between two successive runs of the same R code.

Example 1
The item bank is made of 500 items whose parameters are randomly generated under a 3PL model. The information matrix uses the sequence of ability levels from −4 to 4 by steps of 0.05. The R code for creating this item bank is given below (both the random seed and the metric constant D are kept at their default values of 1, and the resulting item bank is stored into the Bank object):

R> Bank <- createItemBank(items = 500, model = "3PL", thMin = -4, thMax = 4,
+    step = 0.05)

Now, the four lists of tuning parameters are defined as follows. To start the test, a single item will be chosen such that it maximizes the Fisher information function at the ability level 0. This amounts to starting the test by administering one average-difficulty item to the examinee, which is a common starting situation in CAT. The corresponding list is stored into the Start object:

R> Start <- list(nrItems = 1, theta = 0, startSelect = "MFI")

Note that neither fixItems nor seed is specified, as they take the value NULL by default.
For the test step, the WL estimator is chosen, and the usual MFI criterion is considered for next item selection. This is summarized by the following R code, and the corresponding list is stored into the Test object:

R> Test <- list(method = "WL", itemSelect = "MFI")

The selected stopping rule is the classification rule, and it is decided that the CAT process should stop when the ability level is significantly different from the value 2 with a confidence level of 95%. This situation depicts a potential CAT process wherein one wishes to discriminate between very good examinees, with ability levels larger than 2, and the other examinees.
The following R code implements this rule, storing the corresponding list into the Stop object:

R> Stop <- list(rule = "classification", thr = 2, alpha = 0.05)

When the adaptive test ends, the final ability estimate is obtained by weighted likelihood estimation, as during the adaptive test. Though it is not mandatory to keep the same ability estimator for both the test and final steps, this is often the case in practice. Moreover, the 95% confidence interval for ability is built up using the final standard error of the WL estimator. This is summarized by the following code:

R> Final <- list(method = "WL", alpha = 0.05)

The four steps of the CAT process are now plugged into the randomCAT function as given below.
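Given the objects above, the call can be sketched as follows (the generated pattern, and hence the output, varies across runs):

```r
res <- randomCAT(theta = 1, bank = Bank, start = Start,
                 test = Test, stop = Stop, final = Final)
res
```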
For the particular simulation discussed later on, one examinee with true ability level equal to 1 is under examination. It is expected that the examinee will be classified as having an ability level lower than 2 with a rather short adaptive test. The maximal number of items to be administered is kept at 50, the default value. The result of this random CAT generation is stored into the R object res and displayed below. The first part of the output is a summary of the different settings of the test. The true ability level is mentioned first, followed by the parameters for the initial step, the test step, and the stopping rule. Item exposure control is then described; currently only the "randomesque" method is available, and in this example only one optimal item is selected at each step, that is, there is no control of item exposure. Also, no control for content balancing was performed, which is also indicated in the output.
The details of the adaptive test are then returned. They consist of a matrix with one column per administered item and five rows holding, respectively, the administration rank, the item identification number (as listed in the item bank), the generated response to this item, the provisional ability estimate, and its standard error. This output ends with the results of the final step: final ability estimator, final estimate and standard error, final confidence interval, and the final conclusion with respect to the classification rule.
In the simulated example, 16 items were administered before the conditions of the stopping rule were met. As the number of items increases, the standard error of the ability estimate decreases, since the amount of available information becomes larger as the test goes on. After 16 items, the ability estimate is close to 1, the true ability level, and the standard error is sufficiently small to classify the examinee as having an ability smaller than 2, with 95% confidence. Indeed, the final 95% confidence interval is [0.340; 1.933] and does not contain the classification threshold 2. Note that, on the other hand, this interval covers the true ability level.
The provisional ability estimates can be graphically displayed by using the plot.cat function. In addition to each point estimate, provisional 95% confidence intervals are also drawn, together with the classification threshold of 2 (dashed line) and the true ability level of 1 (solid line). This is obtained with the code:

R> plot(res, ci = TRUE, trueTh = TRUE, classThr = 2)

and the output plot is given in Figure 2. As the test goes on, the available information increases and the provisional standard errors of the ability estimates decrease accordingly. Consequently, narrower confidence intervals are obtained. After the 16th item administration, the confidence interval no longer contains the threshold value 2, which forces the CAT process to stop. This is obviously in agreement with the R output above, but is visually displayed in this plot. Note also that the ability estimates get closer to the true level 1 as the items are being administered.

Example 2
In the second example, the TCALS item bank is considered. The TCALS is an English skill assessment test administered to students entering college studies in the Canadian province of Quebec (Laurier, Froio, Pearo, and Fournier 1998). The test is made of 85 items, calibrated under the 3PL model (Raiche 2002), which can be grouped into five categories: two categories of oral items (Audio1 and Audio2) and three categories of written items (Written1, Written2 and Written3). This item bank is available from catR directly; see the corresponding help file for further details.
The generated CAT process has the following characteristics: 1. Five initial items are selected as the most informative at target ability levels from −2 to 2 by steps of 1.
2. Ability level is estimated with the EAP estimator and prior standard normal distribution, both during the test and at the final step.
3. The criterion for next item selection is Urry's rule.
4. The CAT process stops when 12 items are administered.
5. Item exposure is controlled by selecting, at each step of the CAT process, the four most optimal items (i.e., the "randomesque" number of items is four).

Random generation of a CAT response pattern

 True ability level: 0

 Starting parameters:
   Number of early items: 5
   Early items selection: matching item difficulties to starting abilities
   Starting abilities: -2, -1, 0, 1 and 2
   Order of starting abilities administration: 2, 1, -1, 0 and -2

 Adaptive test parameters:
   Next item selection method: Urry's procedure
   Provisional ability estimator: Expected a posteriori (EAP) estimator
   Provisional prior ability distribution: N(0,1) prior

The output is similar to that of the previous example; up to the modifications of the CAT design (listed at the top of the output) and the results of this particular simulated run, the two main differences concern the item exposure and content balancing output. First, item exposure is controlled by selecting the four optimal items at each step, which is reported under the item exposure control section. Second, the expected relative proportions of items per subgroup are printed, as input information for the CAT process, since content balancing control was requested.
Finally, at the end of the output, two pieces of additional information are printed. The first is a two-row table containing both the expected and the observed (empirical) relative proportions of items per subgroup; this is an easy way to check the discrepancy between the expected content balancing and the one actually obtained after the test. The second is the grouping of the administered items into the five subgroups, with their names, which permits extracting straightforward information about which items of which type were administered.

Discussion
This paper describes an R package as a technical tool for generating CAT response patterns. Dichotomous logistic item response models are used to generate such response patterns. Several criteria for selecting the first items of the test, for selecting the next item, for stopping the adaptive test, and for final ability estimation are available. Input options reflect a wide variety of real situations, and the output information is complete and easy to extract. For those reasons, catR can be used as is for simulation studies within the R environment, or as part of more sophisticated software with a user interface (similarly to the Firestar software).
The main assets of catR are its flexibility with respect to the logistic IRT model (to our knowledge, it is the first CAT software incorporating the 4PL model), the selection of the first items of the test, and its efficiency in generating large numbers of CAT response patterns. It is comparable to Firestar with respect to the methods for next item selection and item exposure, to CATSim with respect to the stopping rules, and to both programs with respect to the selection of ability estimators. In addition, control for content balancing seems to be an asset of catR with respect to other software. Its main drawback, however, is that the current version of catR is limited to dichotomous items, whereas both Firestar and CATSim can handle polytomously scored items. The polytomous version of catR, with Bock's nominal response model (Bock 1972) as underlying IRT model, is currently under development. In addition, other methods for item exposure (Chang and Ansley 2003) and content balancing (Leung et al. 2003; Riley et al. 2010) could be implemented.
In addition to the presentation of the package itself, the main objective of this paper is to provide a clear and simple overview of the CAT framework. The four steps of this method (initial step, test step, stopping rule, final step) are discussed according to the current standards and with technical accuracy. The primary intention is to provide the reader with a short but straightforward introduction to the CAT environment, with suggested references for further reading. Wainer (2010) claims that one should focus further on the possibilities that CAT can offer with respect to standard, non-adaptive assessment tests. We believe this paper is one small step in this direction, on both the methodological and the practical aspects.
Interestingly, catR was already combined with the web-based platform Concerto (Kosinski and Rust 2011a,b) to create web-based adaptive tests. Although still in development, Concerto would eventually permit any interested researcher or CAT user to build specific CAT processes, using the web platform for item bank creation and catR as underlying package for routine calculations. See http://code.google.com/p/concerto-platform/ for further information.

Computational details
The currently available version of catR is 2.3. Version 2.12.0 (or later) of the R software should be installed for optimal working of catR.