Fitting diffusion item response theory models for responses and response times using the R package diffIRT

In the psychometric literature, item response theory models have been proposed that explicitly take the decision process underlying the responses of subjects to psychometric test items into account. Application of these models is however hampered by the absence of general and ﬂexible software to ﬁt these models. In this paper, we present diﬀIRT , an R package that can be used to ﬁt item response theory models that are based on a diﬀusion process. We discuss parameter estimation and model ﬁt assessment, show the viability of the package in a simulation study, and illustrate the use of the package with two datasets pertaining to extraversion and mental rotation. In addition, we illustrate how the package can be used to ﬁt the traditional diﬀusion model (as it has been originally developed in experimental psychology) to data.


Introduction
In the behavioral sciences, inferences about traits such as motivation, extraversion, arithmetic ability, and attitudes require measurable indicators for these traits. Commonly, such indicators are responses to tests (e.g., intelligence tests like the Wechsler Adult Intelligence Scales; Wechsler 1997) and questionnaires (e.g., personality questionnaires like the Big-5 Personality Inventory ;Digman 1990). To enable inferences about the traits underlying these observed data, the indicators are linked to the trait by specifying a measurement model. The exact mathematical form of the measurement model depends on the distribution of the observed data and the assumed distribution of the trait.
The family of item response theory (IRT) models includes popular measurement models like the Rasch model (Rasch 1960), the 2-parameter logistic model (2PL; Birnbaum 1968), the graded response model (GRM; Samejima 1969), and the common factor model (Spearman 1904;Thurstone 1931). Traditionally, these measurement models have been formulated purely on basis of desirable statistical properties like sufficiency of the total score for the trait (in the case of the Rasch model, see Fischer 1995), model flexibility (in case of the 2PL and the GRM; see Glas 1999), and linearity between the indicators and the trait (in case of the common factor model, see Bollen 1989). These properties make these measurement models extremely useful for a wide range of applications, e.g., for investigating differential item functioning (Mellenbergh 1989;Meredith 1993), for testing group differences (Jöreskog 1971), for the construction of tests and questionnaires (Kline 1986), for investigating the structure of theoretical constructs (i.e., structural equation modeling; Bollen 1998), and for testing the effects of experimental manipulations (Wicherts, Dolan, and Hessen 2005). Despite these advantages, these measurement models are purely statistical. That is, the traits in these models have no direct connection to the psychological process that allows the respondent to make a certain response to the test or questionnaire items (Borsboom, Mellenbergh, and van Heerden 2004). As argued by van der Maas, Molenaar, Maris, Kievit, and Borsboom (2011) this is problematic for a number of reasons. Specifically, (1) it hampers the investigation of test validity; (2) it obscures the connection between intra-individual differences and inter-individual differences; and (3) it makes the interpretation of the data in terms of substantive processes more ambiguous.
Effort has been devoted to formulate more substantively informed measurement models that explicitly incorporate the underlying psychological process that elicited the response to a given test item. These process IRT models mainly draw from process models that already exist in the field of mathematical psychology. For instance, Ranger and Kuhn (2014) proposed a model based on the proportional hazard model (see, e.g., Luce 1986), Tuerlinckx and De Boeck (2005) and Rouder, Province, Morey, Gomez, and Heathcote (2015) proposed extensions of the Race model (Audley and Pike 1965), and Tuerlinckx, Molenaar, and van der Maas (2016), Tuerlinckx and De Boeck (2005), and van der Maas et al. (2011) proposed extensions of the so called diffusion model (Ratcliff 1978).
The diffusion model is arguably the most popular model for decision making used in experimental psychology. It is a model appropriate for trials in which subjects need to decide between two answer options (e.g., "yes/no", "left/right", etc.). A typical example of a psychological experiment that is suitable for diffusion modeling is the lexical decision task. In this task, subjects have to decide as quickly as possible whether a sequence of letters, shown on a computer screen, form a word (e.g., in the case of "table") or a non-word (e.g., in the case of "telab"). In the diffusion model, it is assumed that once a subject has encoded the letters presented on the screen, information starts accumulating over time in favor of either a "word" response or a "non-word" response. Each response option is characterized by a boundary which quantifies the amount of information that is needed for that option to elicit a response by the subject in favor of that option. If the amount of information for a given option reaches the boundary of that option, the subject starts making a response. The response time is then a function of the average rate with which the information accumulated (drift rate, µ), the distance between the two boundaries (boundary separation, α), and the time needed for encoding of the letters on the screen and the physical responding (non-decision time, Ter ), see and McKoon 2004) and that instructions that emphasize speed lead to closer boundaries α (Wagenmakers, Ratcliff, Gomez, and McKoon 2008). More extended versions of the diffusion model also include a bias parameter which models the tendency to favor one option over the other irrespective of the stimulus content. However, here we focus on the more basic model as described above.
On its own, the diffusion model is not a measurement model as it does not disentangle person and item characteristics which is the key property of a measurement model (Borsboom and Molenaar 2015). Van der Maas et al. (2011) however proposed such a decomposition for personality questionnaires and ability tests separately (see also Tuerlinckx et al. 2016). The resulting models are referred to as diffusion IRT models. At present, no general model fit routines are available to fit diffusion IRT models to real data. Van der Maas et al. (2011) used a Bayesian model implementation within the WinBUGS program (Lunn, Thomas, Best, and Spiegelhalter 2000). However this implementation is specific to the data in the van der Maas et al. (2011) study and could thus not straightforwardly be used for different datasets. In addition, it requires advanced programming knowledge to adapt the scripts as they make use of the WinBUGS add-on package wbDEV (Lunn 2003). Tuerlinckx and De Boeck (2005) and Tuerlinckx et al. (2016) used SAS (SAS Institute Inc. 2013) code to fit a specific instance of the diffusion IRT model (for personality data only, see below), which is inflexible as it needs adaptation when there are a different number of items or when there are parameter constraints. In addition, the procedure is slow and the assessment of absolute goodness of fit of the model is not readily possible. Finally, none of the above studies focused on parameter recovery of the different models to establish the feasibility of the estimation procedures.
In the present paper we present a general approach to fitting diffusion IRT models to real data using the package diffIRT (Molenaar 2014) within the statistical software environment R (R Core Team 2015). Package diffIRT is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=diffIRT. As compared to the approaches above, the package is easy to use (no programming knowledge is needed besides some basic R understanding like reading in data), the package is reasonably fast, it contains tools to assess absolute goodness of fit, and it is flexible in specifying parameter constraints. In addition, we show in a simulation study that true model parameters are well recovered and that the proposed statistics to assess model fit follow their theoretical distribution. Finally, the package can also be used to fit the traditional diffusion model to data.
The outline of this paper is as follows. First, we introduce the general family of diffusion IRT models together with instances for personality questionnaires and ability tests. Next, we discuss how parameters are estimated and how model fit is assessed in the diffIRT package. Next, we present a simulation study to show the viability of the estimation procedure. We then illustrate the use of the package with a single simulated dataset, followed by two applications to real data pertaining to extraversion and mental rotation. In a third application, we illustrate how the diffIRT package can be used to fit the traditional diffusion model to data. We end with a discussion on limitations and future possibilities.

Diffusion IRT models
In the traditional diffusion model, a given response of subject p, x pi , with response time t pi , is assumed to be originating from the following joint distribution function: As can be seen, no separation between person and item parameters is involved. That is, in the majority of applications of the diffusion model, subjects are given a large number of items and boundary separation, α, drift rate, µ, and non-decision time, Ter , are estimated for each subject separately assuming that all items are interchangeable.
To make the diffusion model more appropriate as a measurement model, Tuerlinckx and De Boeck (2005) and van der Maas et al. (2011) introduced person and item characteristics by separating µ, and α, into a person specific part and an item specific part, i.e., where v i and a i are the item drift and item boundary parameter of item i, and θ p and γ p are the person drift and person boundary parameter of subject p. Choices for the functions u(.) and w(.) could be made on statistical grounds. For instance, Vandekerckhove, Tuerlinckx, and Lee (2011) proposed a multilevel version of the model above, implying additive functions for u(.) and w(.). However, as pointed out by van der Maas et al., substantive considerations lead to different forms of u(.) and w(.) which are not necessarily linear. We will discuss these next. Tuerlinckx and De Boeck (2005) proposed the following functions to decompose drift rate and boundary separation into a person and item contribution:

D-diffusion
Substituting this in Equation 1 and integrating out t pi , the probability of a correct response is given by which is referred to as the D-diffusion IRT model (see also Tuerlinckx et al. 2016). In this model, the item boundary, a i , is interpreted as time pressure, the item drift, v i , as item difficulty, person boundary, γ p , as response caution, and person drift, θ p , is interpreted as the actual construct being measured. Application of the model requires both the response time data, t pi , and the response data, x pi . In the case of personality data (e.g., extraversion items "Do you like to meet new people?" with answer options 0: no and 1: yes), x pi = 0 corresponds commonly to a response indicative for the lower end of the underlying personality construct, and x pi = 1 is indicative for the upper end of the underlying personality dimension. That is a "yes" response, x pi = 1, indicates that the subject is more extraverted and a "no" response, x pi = 0, indicates that the subject is less extraverted. In the case of ability items however (e.g., an arithmetic problem) x pi = 0 indicates an incorrect answer and x pi = 1 indicates a correct answer.
Assuming the coding of the observed responses for x pi as discussed above, van der Maas et al. (2011) discuss why the D-diffusion model is appropriate for personality data but not for ability data. Specifically the D-diffusion model in Equation 4 predicts: 1. Response times are slowest for µ = θ p − v i = 0 (i.e., no evidence accumulation) and get faster when θ p − v i deviates more from 0. That is, subjects with a low position on θ p are as fast in responding as subjects high on θ p .
2. When the item time pressure, a i , decreases, α increases, which causes subjects on the lower end of the θ-range to have a smaller probability obtaining x pi = 1. The subjects on the higher end of the θ-range have a larger probability of obtaining x pi = 1.
Note that prediction 1 holds only for personality data: When θ p ≈ v i the subject is near the middle of the extraversion continuum of that item. That is, on the question "Do you like to meet new people?", the subject tends a bit toward answering "no" or the subject tends a bit towards answering "yes". A response will take considerably longer as compared to extreme subjects as the subject in the middle needs to carefully consider the item and decide whether it is a "yes" or a "no". Extreme subjects, that is θ p v i and θ p v i , are extremely extraverted and introverted respectively, so they will immediately know to respond "yes" or "no" respectively. Note that this is in line with prediction 1. For ability items, subjects with θ p v i (the high ability subjects), will also respond fast as these subjects have no difficulty solving the arithmetic problem. However, subjects with θ p v i (the low ability subjects) will take considerably longer to answer the item as for them the item is challenging. Note that this is not in line with prediction 1.
In addition, prediction 2 also only holds for personality data: When the time limit of a test decreases, subjects have more time to think about their answer. This will result in more extraverted subjects choosing "yes" (i.e., x pi = 1) and more introverted subjects choosing "no" (i.e., x pi = 0). Note that this is in line with prediction 2. For ability items, decreasing the time limit will commonly result in a larger probability of correct (i.e., x pi = 1) for all subjects. Note that this is not in line with prediction 2.

Q-diffusion
As the D-diffusion model is only suitable for personality data, a different parameterization of the diffusion IRT model is needed for ability tests. Taking into account the above, for ability data, the resulting model should predict: (1) increasing response times for decreasing θ p (i.e., low ability subjects have more difficulty solving the items), and (2) increasing probability corrects for the whole θ p -range for decreasing time pressure, a i . As discussed by van der Maas et al. (2011) the following parameterization conforms these predictions: Substituting in Equation 1 and integrating out t pi , the so-called Q-diffusion model is obtained, i.e., in which the interpretation of the parameters is the same as in the D-diffusion model, that is, a i , is the time pressure, v i is the item difficulty, γ p is the response caution, and θ p is the actual ability being measured by the test. As a i , v i , θ p , and γ p are strictly positive, µ has a lower bound of 0 which ensures that response times are slowest for µ p = 0 and get faster for increasing µ, because of more difficult items (larger v i ) or higher ability (larger θ p ).

Relation to other models
The diffusion IRT model family has some interesting relations to existing models. Most notably, van der Maas et al. (2011) discuss how the parameters from the Q-diffusion model have a relation with the parameters from van der Linden's hierarchical model for responses and response times (van der Linden 2007). Specifically, the time intensity and speed parameter from the hierarchical model are related linearly to log θ p /γ p and log v i /a i respectively for large µ × α. In addition, the D-diffusion model can be applied to investigate person fit (Meijer and Sijtsma 2001). That is, the D-diffusion model in Equation 4 is equivalent to a two-parameter IRT model with a random discrimination parameter, a p . Such models are commonly used in assessing person fit (e.g., Strandmark and Linn 1987). Next, a similar model as the Q-diffusion model in Equation 7 was proposed by Ramsay (1989). Specifically, Ramsay proposed the so-called quotient IRT model which -for two-choice data -is given by: logit[P(x pi = 1|θ)] = θ p /λ i , where λ i is an item parameter (see also van der Maas et al. 2011). Finally, Tuerlinckx et al. (2016) discuss how the D-diffusion model can be seen as a generalized linear latent variable model with two interacting latent variables. Specifically, if γ p from Equation 4 is rewritten as a normally distributed zero-centered latent variable plus a mean, γ p + κ, then it follows that which is a latent variable model with two latent variables, γ p and θ p , and their interaction.

The diffusion IRT model package
Using the package diffIRT in the R statistical programming environment, the general model given by Equation 1 subject to Equations 2 and 3 for the D-diffusion model and subject to Equations 5 and 6 for the Q-diffusion model can be fitted to responses and response time data.
In this section, we discuss the main modeling tools, i.e., the parameter estimation procedure and the assessment of model fit.

Parameter estimation
The likelihood function Parameters of the diffusion IRT model are estimated using marginal maximum likelihood (MML; Bock and Aitkin 1981). In this likelihood-based procedure, the person parameters from a given statistical model (in this case θ p and γ p ) are treated as nuisance parameters. That is, by assuming a distribution for these parameters, they can be integrated out of the likelihood equation. The resulting marginal likelihood is only a function of the item parameters and the population parameters that describe the distribution of the person parameters in the population. In present case, we choose a normal distribution for the person parameters that are in R (i.e., θ p in the D-diffusion model, see Equation 2) and a log-normal distribution for the person parameters that are in R + (i.e., γ p and θ p in case of the Q-diffusion model, see Equations 5 and 6; and γ p in case of the D-diffusion model, see Equation 3). The log-normal distribution is chosen mainly for reasons of convenience, that is, the log-normal distribution is appropriate for the parameters at hand as this distribution is bounded by zero. In addition, a logarithmic transformation of a log-normal variate has a normal distribution which numerically facilitates the use of Gauss-Hermite quadrature approximation of the integrals in the likelihood function (see below). Note that van der Maas et al. Given the above, the marginal log likelihood of the N × k matrix with responses X and response times T with elements x pi and t pi respectively, is given for the Q-diffusion model by In this equation, h(.) is given by Equation 1 with µ and α given by Equation 5 and Equation 6 respectively. Functions f (γ * p ; ω γ ) and g(θ * p ; ω θ ) are normal density functions with mean 0 and standard deviation ω γ and ω θ respectively. Note that in the above we used θ p = exp(θ * p ) and γ p = exp(γ * p ) such that θ * p and γ * p are normal variates and θ p and γ p are log-normal variates. τ is a vector of free parameters in the model, in which the star denotes that the corresponding parameter is log-transformed to parameter space (−∞, ∞) for numerical convenience.
The likelihood function in Equation 9 is given for the Q-diffusion model. However, the likelihood function for the D-diffusion model is straightforwardly obtained by first using Equation 2 for α and Equation 3 for µ in function h(.), and then by dropping the transformation of θ p , i.e., θ p = θ * p , and by dropping the transformation of v i , i.e., v i = v * i . That is the likelihood function for the D-diffusion model is given by with h(.) given by Equation 1 with µ and α given by Equation 2 and Equation 3 respectively.

Evaluation of function h(.) in Equation 9
and Equation 10 can be numerically demanding due to the presence of the infinite sum (see Equation 1). Therefore, in the diffIRT package, we evaluate this function using the procedure outlined in Navarro and Fuss (2009). In short, this procedure uses two different infinite series expansions of h(.), one which converges quickly for small response times, and one which converges quickly for larger response times. We refer to Navarro and Fuss (2009) for more details.

Approximation of the integrals
The marginal likelihood function above contains two integrals that do not have a closed form expressions. To enable optimization of this function, we approximate the integrals using Gauss-Hermite quadratures. Within IRT modeling, this is a commonly used approach, e.g., it is used in R package ltm (Rizopoulos 2006) and in the software packages Mplus (Muthén and Muthén 2007), BILOG (Mislevy and Bock 1990), and MULTILOG (Thissen 1991). Using Gauss-Hermite quadratures, the log likelihood function above can be written as That is, the integrals are replaced by weighted sums. Within these summations, N 1r and N 2s are the so-called "nodes" which are positions on the dimensions of integration and W 1r and W 2s are the corresponding weights. In addition, R and S denote the number of nodes that are being used for the θ * p and γ * p dimension, respectively. The nodes and weights can be found in standard tables (Stroud and Secrest 1966). In diffIRT, we used the function gauss.quad from the R package statmod (Smyth, Hu, Dunn, Phipson, and Chen 2015) to obtain the weights and nodes. Generally, the more nodes are used, the better the approximation of the likelihood will be. However, computational demands increase rapidly as, for instance, with 10 nodes for each dimension, 10 × 10 = 100 evaluations of h(.) are necessary for each response of a single subject. In the package diffIRT, the number of nodes can be set by the user, with default K = S = 7. This number of quadrature points showed satisfactory results during the package development. This is also demonstrated below in the simulation study. However, in practice it is advisable to try different numbers of nodes to check the stability of the results. In Equation 11, missing values (NA) are allowed in X and T . If a given element in X is missing, the corresponding element in T is also assumed treated as missing and vice versa. For missing elements, calculation of the likelihood is skipped.

Optimization and starting values
As the likelihood function is now numerically tractable and efficiently formulated, we move on to discuss optimization of the function. Within the diffIRT package, we implemented −2 × (τ ; X, T ) for the D-and Q-diffusion model in C code to increase speed of computation. The C code is subsequently linked to R in which we minimize this function using the builtin R function optim. We use the Broyden-Fletcher-Goldfarb-Shanno algorithm (BFGS; see, e.g., Nocedal and Wright 2006, p. 194) which is a quasi-Newton algorithm that uses function evaluations and first-order derivatives. We approximated the first-order derivatives of −2 × (τ ; X, T ) with respect to τ by a finite difference approximation. The spacing of the finite difference between two function evaluations equals 1e-7 by default as this value turned out to be sufficiently small during the package development.
As the procedure might be sensitive to the starting values for the parameter vector τ , these values should be carefully chosen. In the diffIRT package, starting values can be chosen by the user. However, by default, starting values are calculated on basis of the EZ-diffusion moment estimators of the diffusion model (Wagenmakers, van der Maas, and Grasman 2007). In the EZ-diffusion approach, α, µ, and Ter from the traditional diffusion model in Equation 1 are calculated using closed form expressions for these parameters. The expressions are derived by equating the expected mean response time, the expected variance of the response times, and the expected proportion correct of the responses to the corresponding observed mean, variance and proportion correct. Here, these expressions are used in diffIRT to obtain starting values for a * i , v * i , Ter * i , ω * θ , and ω * γ by adapting the R code provided by Wagenmakers et al. (2007). That is, item parameters are calculated by applying the EZ-diffusion model on the responses and response times of each item separately. Starting values for ω * θ and ω * γ are obtained by applying the EZ-diffusion model on the item responses and response times for each subject separately and calculating ω * θ and ω * γ using the formula for the log-normal distribution (in case of the D-diffusion model, this is not necessary for ω * θ as in this model θ p follows a normal distribution). The resulting values are biased by definition as for boundary separation, α, for instance, it holds that α = γ p /a i (Equations 3 and 6), i.e., the starting values for a i are conflated by γ p . We reduce this bias by multiplying the initial value for a i by E(γ p ) and the initial value for v i by E(θ p ) for the Q-diffusion model. In case of the Q-diffusion model, both E(θ p ) and E(γ p ) can be calculated using the initial value for ω γ . In case of the D-diffusion model, only E(γ p ) needs to be calculated from the initial value of ω γ as E(θ p ) = 0. This method of correcting the starting values reduces bias but does not eliminate it totally. This is plausible as item and person effects cannot be fully disentangled using the EZ-diffusion model as it assumes interchangeable items and subjects. However, most importantly, the starting values obtained in this way showed good performance. This will be illustrated in the simulation study.

Assessment of model fit
Assessment of model fit is challenging in models such as the present as a general statistical method to assess the absolute goodness of fit on the responses and response times simultaneously does not exist. We therefore follow Tuerlinckx et al. (2016) and assess absolute goodness of fit on the responses and response times separately. In addition, we propose some model fit indices to assess comparative goodness of fit.

Absolute goodness of fit
Responses. For the responses we use the limited-information test for multivariate contingency tables proposed by Joe (2005, 2006). This test consists of comparing the observed and expected joint moments of the binary data up to order r. As advocated in the papers by Maydeu-Olivares and Joe, the traditional full-information Pearson χ2 test statistic (which arises when r equals the total number of joint moments, i.e., when r is equal to k, the number of items), does not follow its hypothetical χ 2 -distribution in the case of small cell values. In addition, as the χ 2 -statistic needs all k joint moments, this involves calculations of 2 k − 1 predicted score patterns, which quickly becomes numerically demanding. As Maydeu-Olivares and Joe show, in case of both limited sample sizes (100 subjects) and large sample sizes (2500 subjects), their test -using r = 2 or r = 3 -outperforms the traditional χ 2statistic in terms of type I error rate for a two parameter IRT model. We therefore adopt this statistic in the diffIRT package. The statistic, M r , is given by where p r is the vector of observed joint moments up to order r, e.g., for 2 items and r = 2, this vector contains [p x1=1 , p x2=1 , p x1=1,x2=1 ], π r is a vector containing the corresponding joint moments predicted by the statistical model given the parameter estimates (obtained with maximum likelihood estimation or another minimum variance estimator), and C r is given by where Ω r is the covariance matrix of the residuals (i.e., p r −π r from Equation 12), and ∆ r is a matrix containing the first-order derivatives of π r with respect to the model parameters. The test statistic M r in Equation 12 has an asymptotic χ 2 -distribution with degrees of freedom equal to where the summation gives the number of observed proportions (i.e., the observed joint moments up to order r) on which the statistic is based, and s will generally be equal to the number of free parameters in the statistical model. See for more details Maydeu-Olivares and Joe (2005). In the diffIRT package, the procedure as outlined above is implemented using for π r the proportions of correct responses as predicted by the D-or Q-diffusion model given the MML parameter estimates. It is important to note that the statistic M r is based on the response data only which implies that s above will not be equal to 3 × k + 2 (which is the total number of parameters in the full diffusion IRT model). At the level of the response data, for the Q-diffusion model only, s = k + r + 2 parameters are identified, and for the D-diffusion s = 2 × k + r + 2 parameters are identified. Note that this implies that the M r test is not possible for r = 1 as the degrees of freedom will be smaller than 0 (i.e., in that case, we have more parameters than observed statistics). In the simulation study below we show that using the correction above, the M r statistic follows its theoretical distribution. In calculating M r within the diffIRT package, the order r can be inputted by the user with default r = 2 as this number showed satisfactory results in Maydeu-Olivares and Joe (2005).
Response times. For the response times, we use QQ-plots to investigate goodness of fit. That is, the quantiles of the observed response time distribution for a given item, q 1i , . . . , q zi , are plotted against the predicted quantiles given the parameter estimates,q 1i , . . . ,q zi . If the model fits the data, the observed and predicted quantiles are on a straight line. The probability between successive quantiles is given by whereh(.) is the joint density of the diffusion model (Equation 1) evaluated at the estimated parameters. By approximating the integral in Equation 14 using the R function integrate, q li can be solved for all l and all i by using the R function uniroot. This procedure requires an approximate interval which covers the predicted quantiles (0, c i ), where c i is the upper bound of this interval for item i. As the range of the predicted quantiles it commonly not exactly known, c i could be loosely chosen. However, choosing a very large value for c i will generally result in a successful solution by uniroot, but will be computationally demanding. Choosing a very small value for c i will be less computationally demanding, but if c i is too small, (0, c i ) will not contain the solution and uniroot will fail. In diffIRT, the upper bound of the interval c i equals 2 × MAX(t pi ) by default. This default was chosen more or less intuitively (i.e., the upper bound should be large enough to cover the solution, but not too large to avoid heavy computational burden) and worked well in the testing stage of the package. If the value is not large enough, an error message will be produced. In this case c i could be manually raised. indices that a lower value indicates a better model fit. Indices above can be used to compare the diffusion IRT model to different non-nested models. In addition, standard output of diffIRT contains −2 × (τ ; X, T ) evaluated at the estimated parameters,τ . This value can be used to conduct likelihood ratio tests between two nested diffusion IRT models. For instance, a full model with estimated parameter vector,τ A , could be compared to a model in which the a i parameters are constrained to be equal with estimated parameter vectorτ 0 . To do so, −2 × ( (τ A ; X, T ) − (τ 0 ; X, T )) needs to be calculated which is asymptotically χ 2 -distributed with degrees of freedom that are equal to the difference in the number of free parameters between both models. A significant likelihood ratio test indicates that the parameter constraints in the restricted model are not tenable. As the likelihood ratio test is known to be sensitive to large sample size (see Schermelleh-Engel, Moosbrugger, and Müller 2003, p. 34), it is advisable to consider other fit indices as well.

Simulation study
To show that the models discussed in this paper are feasible, we conducted a simulation study to demonstrate (1) that true parameter values are adequately recovered; and (2) that the test statistics proposed above follow their theoretical distributions. To do so, we simulated data according to the D-diffusion and the Q-diffusion IRT model with the function simdiff from the diffIRT package. This function uses the rejection algorithm described in Tuerlinckx, Maris, Ratcliff, and De Boeck (2001) with the appropriate Q-and D-diffusion decompositions for boundary, α, and drift, µ.

Design
We used the item parameter setup in Table 1. In addition, the population parameters were chosen to equal ω γ = 0.3 and ω θ = 0.3 for the Q-diffusion model, and ω γ = 0.3 and ω θ = 1 for the D-diffusion model. As can be seen in the table, we systematically varied the item parameters across items resulting in different expected values for the responses, x pi , and the response times, t pi . These expected values are calculated using the expression for the mean, variance, and probability correct in Wagenmakers et al. (2007), together with the appropriate decomposition of α and µ (Equations 2, 3, 5, and 6) and integrating out the person variables γ p and θ p .
For both the Q-and D-diffusion model we simulated 100 datasets for N = 100 and N = 200. Note that we chose not to study the asymptotic behavior of the model (i.e., by taking a sample size of for instance 10000) as we are mainly interested in establishing whether the parameter recovery is acceptable in more realistic sample sizes. However, given the results as presented below, we have no reason to doubt the asymptotic properties of the model. To the data we fitted four models: (1) the full model; (2) a model with a 1 to a 4 equal, a 5 to a 8 equal, and a 9 to a 12 equal; 3) a model with v 1 , v 5 , v 9 equal, v 2 , v 6 , v 10 equal, v 3 , v 7 , v 11 equal, and v 4 , v 8 , v 12 equal; 4) a model with Ter 1 to Ter 6 equal and Ter 7 to Ter 12 equal. Note that all these equality constraints hold (see Table 1). In each replication we conducted a likelihood ratio test between a model with and without the above constraints. In addition, we conducted the M r test on the responses using r = 2. All settings not discussed (e.g., number of quadrature points, etc.) equaled the default values of the diffIRT package.      Table 3: Mean (and standard deviation) of the item parameter estimates over the replications in the simulation study for the D-diffusion model. True value for ω γ equaled 0.3 which was estimated to be 0.30 (0.02) for N = 100 and 0.29 (0.02) for N = 200. True value for ω θ equaled 1.0 which was estimated to be 0.87 (0.08) for N = 100 and 0.85 (0.05) for N = 200.

Results
In Figure 2, the true parameter values are plotted against the starting values for a i and v i for N = 200 for both the D-diffusion and Q-diffusion model. As can be seen, the starting values are well correlated to the true values, but biased. As discussed above, this bias is due to the fact that γ p cannot be disentangled from a i , and θ p cannot be disentangled from v i using the EZ-diffusion model. However, as starting values, these values have utility as will appear below.
Results concerning the item parameter recovery are displayed in Table 2 for the Q-diffusion model and in Table 3 for the D-diffusion model. For both models, all parameters are recovered well with the v i estimates having somewhat more variability as compared to the a i and Ter i estimates. As can be seen in the footnotes of the tables, the population parameters, ω γ and ω θ , are also recovered well, with ω θ slightly underestimated in case of the D-diffusion model  In Figure 3, the theoretical and observed distributions of the M r statistic are depicted for the M 2 test on the responses in the case of N = 200. As argued above, the theoretical distribution is expected to be χ 2 with degrees of freedom equal to df = r q=1 n q − s, which gives 62 for the Q-diffusion model and 50 for the D-diffusion model. Note that s needs to be adjusted as described above. As can be seen in the figure, there is no reason to suspect systematic departures of M 2 from these theoretical distributions. Figures 4 and 5 contain similar graphs for the likelihood ratio test statistics under H 0 in the case that respectively a i and v i are constrained to the equality pattern of their true values (graphs for Ter i are not given to save space, but results for this parameter are the same). In case of a i , the full model containing 12 a i parameters, is compared to a model with only 3 a i parameters (i.e., a 1 = a 1 = a 2 = a 3 = a 4 ; a 2 = a 5 = a 6 = a 7 = a 8 ; and a 3 = a 9 = a 10 = a 11 = a 12 ). Thus, a likelihood ratio test between these models will result in a χ 2 (9) distribution under H 0 . In case of v i , the constrained model contains only 4 v i parameters (i.e., v . Thus, a likelihood ratio test between these models will result in a χ 2 (8) test. As can be seen in the figure, there are no systematic departures from these theoretical distributions.  comparing the full model with a constrained model subject to a 1 = a 1 = a 2 = a 3 = a 4 ; a 2 = a 5 = a 6 = a 7 = a 8 ; and a 3 = a 9 = a 10 = a 11 = a 12 in the simulation study for N = 200. The theoretical distributions are χ 2 (9) for both the Q-diffusion model and for the D-diffusion model.

Package description
In this section we describe and illustrate the basic functions of the diffIRT package. For an overview of all the functions, see Table 4.
To start, the function simdiff can be used to simulate data according to either the D-diffusion model or the Q-diffusion model. Here we simulate data for 100 subjects and 10 items that follow a Q-diffusion model: R> set.seed(1310) R> data <-simdiff(100, 10, model = "Q") True values are randomly chosen by the function. It is also possible to provide user-specified true values to the function. We fit the Q-diffusion model to these simulated data using: R> out <-diffIRT(data$rt, data$x, model = "Q", se = TRUE) As can be seen, we explicitly requested standard errors of the parameter estimates. By default these are not calculated as it will increase estimation time. We can now check the results:   Fits the Q-diffusion model (model = "Q") or the D-diffusion model (model = "D", default) to the response time data in matrix T and response matrix X.

RespFit
RespFit(out, 2) Calculates the M 2 statistic from the modeling results in object out. QQdiff QQdiff(out, item = 1:4) Plots a histogram and corresponding QQ-plot of the predicted and observed response time distribution for items 1 to 4. simdiff data <-simdiff(100, 10, model = "D") Simulates data according to the Q-diffusion model (model = "Q") or the D-diffusion model (model = "D", default) for 100 subjects and 10 items. factest factest(out) Estimates the factor scores of γ p and θ p for all subjects. anova anova(out1, out2) Conducts a likelihood ratio test between two nested models. coef coef(out) Returns the estimated parameters in object out. summary summary(out) Returns a summary of the model fitting results in object out including parameter estimates and fit statistics. simdiffT data <-simdiff(1000, 2, 1, .3, 3) Simulates data according to the traditional diffusion model for a single subject and 1000 trials.  values (which are stored in object data) against the estimated parameters (that are in object out): R> plot(data$ai, coef(out)$item[, "ai"]) R> abline(0, 1) The resulting plot is in Figure 6. As can be seen, the a i parameters are adequately recovered. Next, we study the model fit. First, we use the M r -test of order 2 on the responses using: R> resp_out <-RespFit(out, 2) R> resp_out The M r statistic is non-significant indicating that the model fits. In case of a significant M r statistics, residuals could be examined using resp_out$Z, which gives the standardized residual proportions for each score pattern of the first r moments. Next we examine the model fit on the response times using the function QQdiff:

Q-diffusion Model Fit of Responses
R> QQdiff(out, item = 1:3) which produces the plot in Figure 7. It turns out that for the first three items, the predicted and observed distributions appear to coincide, with some minor misfit in the tails as is common in QQ-plots due to the relatively few observations in this region of the distribution.
Next, we conduct a likelihood ratio test to see whether the v i parameters are equal across items. To do so we fit a Q-diffusion model subject to the constraint that v 1 = v 2 = . . . = v 10 : R> out_vi <-diffIRT(data$rt, data$x, model = "Q", + constrain = c(1:10, rep(11,10), 12:21, 22, 23)) As can be seen, the constraint is introduced by providing a vector of parameter numbers. Each parameter should have a unique number. The order in which the parameters should be labeled is a 1 , . . . , a 10 , v 1 , . . . , v 10 , Ter 1 , . . . , Ter 10 , ω γ , and ω θ , i.e., in the same order as the parameter vector τ in Equations 9 and 11. Since we only have one v i parameter in the present case (as we want to constrain all v i to be equal across items), we label all v i parameters using the same number, that is 11 in this case. The following code will do exactly the same: R> out_vi_alt <-diffIRT(data$rt, data$x, model = "Q", + constrain = "vi.equal") This code makes use of the pre-programmed constraint option, which is more easy to use but less flexible. Other pre-programmed constraint arguments are "ai.equal", "ter.equal", and "scale.equal". In which the latter means that ω γ and ω θ are fixed to be equal in the model. For the model with equal v i in object out_vi, the output (not printed here) displays fit indices AIC, BIC, sBIC, and DIC which are all larger for this constraint model as compared to the full model in out except for BIC. To conduct a likelihood ratio test between the two models we use: R> anova(out_vi, out) Likelihood Ratio i.e., at a 0.05 level of significance, we reach the same conclusion as compared to the AIC, sBIC, and DIC, that is, the restrictions in the model in object out_vi are not tenable.
We now illustrate the use of fixed parameter constraints. As an example, we fix all a i parameters in the simulated dataset to be equal to 0.5 as follows: R> out_fix <= diffIRT(data$rt, data$x, model = "Q", constrain = c(rep(0, + nit), 1:10, 11:20, 21, 22), start = c(rep(.5, nit), rep(NA, 22))) As can be seen, in the constrain argument we assigned all a i parameters a number of 0 denoting that these parameters are fixed. In addition, we assigned the value 0.5 to the corresponding elements in the start argument, leaving the other elements NA. Requesting output using the print comment gives: All a i are fixed to 0.5. We could again conduct a likelihood ratio test, which has 10 degrees of freedom this time: R> anova(out_fix, out) Likelihood Ratio Not surprisingly, the test is significant. Finally, we estimate the factor scores, i.e., we obtain estimates for γ p and θ p for each subject in the sample. To do so we use:

R> fs <-factest(out)
Note that we use the full model object, as this was the best fitting model. Matrix fs now contains estimates of γ p and θ p in the first and second columns, respectively. We plot these estimates against the true values in Figure 8. As can be seen, parameters are adequately recovered, where the estimates of γ p are subject to more variability as compared to estimates of θ p .

Applications
In this section we present three applications of the models presented in this paper to real data.
In the first applications we fit the D-diffusion IRT model to an extraversion dataset. In the second application we fit the Q-diffusion IRT model to data pertaining to mental rotation. In the third application we illustrate how the package can be used to fit the traditional diffusion model to experimental data.

Application I: D-diffusion modeling of extraversion
The first application concerns an analysis of unpublished data comprising scores of 146 subjects on 10 items purported to measure extraversion. Each item consists of a particular description related to extraverted behavior, e.g., active or noisy. Subjects were asked to indicate whether (yes/no) these descriptions are applicable to their personalities. Both responses and response times were recorded. The data is available within the diffIRT package. The first 10 columns contain the responses, the next 10 columns contain the response times in seconds. Below we analyze the extraversion data using a D-diffusion model. First, we open the data and fit a D-diffusion model: R> data("extraversion", package = "diffIRT") R> x <-extraversion[, 1:10] R> rt <-extraversion [, 11:20] R> res1 <-diffIRT(rt, x, "D", se = TRUE) In Table 5, the results from object res1 are depicted (i.e., parameter estimates and standard errors) together with item content and results from a traditional two parameter model (obtained using R package ltm; Rizopoulos 2006). As can be seen some differences are present in the ordering of the items on basis of the difficulty parameters from the two models, that is, v i in the D-diffusion model and β i in the two parameter model. For instance, in the D-diffusion model eupeptic is the least difficult item while in the two parameter model the least difficult item is jovial. In addition, in the D-diffusion model noisy is the most difficult item, while in the two parameter model impulsive is the most difficult item. Standard errors of the difficulty parameter in the D-diffusion model, v i are smaller as compared to those from the two parameter model, β i . This is due to the additional information in the response times that is used in fitting the D-diffusion model. Note that the standard errors of the other parameters cannot be straightforwardly compared as these are on a different scale. Next an M 2 test is conducted:  Table 5: Parameter estimates (standard errors) and item content of the extraversion data in application I for the D-diffusion model and the traditional two parameter model. λ i and β i are respectively estimates of the discrimination parameter and difficulty parameter from a traditional two parameter model. As can be seen, the M 2 test indicates that the model fits to the responses. QQ-plots for the first 6 items are obtained by using: R> QQdiff(res1, item = 1:6, plot = 1) which produces the plots in Figure 9. As can be seen, the plots also indicate that the model fits the observed response times.

Application II: Q-diffusion modeling of mental rotation data
The data comprises scores of 121 subjects on 10 items purported to measure mental rotation. These data are taken from a larger database published in Kievit (2010); see also Borst, Kievit, Thompson, and Kosslyn (2011 (2011). Each item consists of a graphical display of two 3-dimensional objects. The second object was either a rotated version of the first object, or a rotated version of a different object. Subjects were asked whether the second object was the same as the first object (yes/no). The degree of rotation of the second object was either 50, 100, or 150 degrees. Answers are coded to be correct (1) or false (0). Response times were recorded in seconds. The data are available in the package diffIRT. The first 10 columns contain the responses, the next columns contain the response times in seconds.
From the original data we omitted four response times (i.e., we replaced these by NA) that were smaller than 0.3s, as these suspiciously fast response times are likely to be invalid and could cause problems in estimating Ter i . Below we report a Q-diffusion model analysis of the resulting dataset. To start, we open the data and fit a Q-diffusion IRT model to it using: R> data("rotation", package = "diffIRT") R> x <-rotation[, 1:10] R> rt <-rotation[, 11:20] R> res <-diffIRT(rt, x, "Q", se = TRUE) Table 6 contains the results in object res (i.e., parameter estimates and standard errors) together with the degree of rotation of the object in each item. As appears from the estimates, the effect rotation degree is most notable in the Ter i parameters. That is, the items with a more rotated object are associated with a higher Ter i estimate. This indicates that items with a higher degree of rotation need more stimulus encoding time than items with a smaller degree of rotation. To investigate model fit, we conduct an M 2 test using R> RespFit(res, 2)   Table 7: Model fit indices of various Q-diffusion models on the mental rotation data in application II. 'rotate' denotes a model in which the corresponding item parameters are constrained equal for the items that have the same amount of rotation, see Table 6. All LRTs are against the full model.

Q-diffusion Model Fit of Responses
Thus, according to the M 2 test, the model did not fit well. QQ-plots (not displayed), however, looked reasonable. Next we fitted four constraint models: (1) a model with equal a i ; (2) a model with equal v i ; (3) a model with equal v i for items that have the same amount of rotation; and (4) a model with equal Ter i . This required the following R code: R> res_ai_equal <-diffIRT(rt, x, model = "Q", constrain = "ai.equal") R> res_vi_equal <-diffIRT(rt, x, model = "Q", constrain = "vi.equal") R> res_vi_rotation <-diffIRT(rt, x, model = "Q", constrain = + c(1:10, c (11,12,13,11,12,13,11,12,11,13), 14:23, 24, 25)) R> res_ter_equal <-diffIRT(rt, x, model = "Q", constrain = "ter.equal") In Table 7, output is summarized for the full model (i.e., the model we fit above, see Table 6) and the four constraint models. As we are interested in making an inference about the tenability of the equality constraints introduced in the constraint models, the full modelwith all parameter unconstrained -, serves as a baseline model here. As can be seen from the table, the model with equal a i parameters fits best according to the fit indices, i.e., AIC, BIC, sBIC, and DIC are smallest for this model. In addition, the likelihood ratio test between the full model and this model is insignificant as appears from We fit the traditional diffusion model by using R> res <-diffIRT(data$rt, data$x, model = "D", + constrain = c(1, 2, 3, 0, 4), start = c(rep (NA, 3 From the output one can determine 1/a i which gives 2.008 and −1 × v i which gives 0.997. Both are close to the true values of respectively α trad and µ trad . In addition, estimates for ω θ and Ter i are close to the true values of σ trad and Ter trad . Note that the diffusion model fitted in this way does not include a random Ter or a (random) starting point as is the case in the software package fast-dm for instance (Voss and Voss 2007).

Multiple experimental conditions
Here we illustrate the application of the traditional diffusion model on a simulated dataset with multiple experimental conditions. The data are simulated according to a design similar to that of the real brightness discrimination experiment by Ratcliff and Rouder (1998). In this experiment, a subject had to decide for a number of trials whether the brightness of a stimulus (a randomly generated array of pixels displayed on a computer screen) was either "high" or "low". The true brightness of the stimuli were manipulated into a number of levels and administered with a speed instruction ("respond as fast as possible") and with an accuracy instruction ("respond as accurate as possible"). Here we simulated data for a single subject using function simdiff for a design with 6 different brightness levels and 2 speed instructions resulting in 6 × 2 = 12 conditions. The data can be obtained using data("brightness", package = "diffIRT") where the first 12 columns are the responses and the next 12 columns are the response times.
The brightness data are prepared such that the 12 conditions are in the columns and the 800 trials are in the rows. The data is arranged in such a way that the first 6 conditions are the speed instructed stimuli and the next 6 conditions are the corresponding accuracy instructed versions of these stimuli. As the trials are random, each trial is assigned to a separate row with the response time of that trial in the corresponding column and NA's on the remaining columns. Similarly for the responses. Note that the response and the response time matrices thus have 800 × 12 = 9600 rows and 12 columns.
We now fit the hypothesized traditional diffusion model to these data using: R> data("brightness", package = "diffIRT") R> x <-brightness[, 1:12] R> rt <-brightness[, 13:24] R> res <-diffIRT(rt, x, model = "D", constrain = c(rep(1, 6), rep(2, 6), + 3:8, 3:8, rep(9, 12), 0, 10), start = c(rep(NA, 36), 0, NA)) R> res In the constrain argument we fixed the a i parameters for the first 6 stimuli to be equal, and the a i parameters for the next 6 stimuli to be equal. In addition, we fixed the v i for the same stimuli (i.e., v i of the tasks in column 1 of the data is fixed to equal v i of column 7, etc.), and we fixed all Ter i parameters to be equal. In addition, ω γ is fixed to equal 0 to omit this dimension from the model. As discussed above, from the output, traditional diffusion model estimates can be obtained by 1/a i and −1 × v i . Ter i and ω θ are already equal to the traditional parameters.

Discussion
Process IRT models incorporating separate person and item parameters have the potential to bridge the gap between the statistically oriented psychometric measurement models and the theoretically oriented mathematical process models. Model estimation of such process IRT models is however challenging due to the presence of the random person effects and due to the relatively complex forms of the process models like the diffusion model. With the diffIRT package we provide a useful tool to fit a diffusion IRT model to data obtained from multiple subjects answering multiple items. Challenges for future developments within this line of research remain. For instance, at present, as the diffusion model is a model for two-choice data, the package can only handle binary item data. It would be interesting to consider possibilities for multiple choice data, including extensions of the diffusion models discussed in this paper and other process models, e.g., the linear ballistic accumulator model (Brown and Heathcote 2005) and the race model (Audley and Pike 1965; see also Tuerlinckx and De Boeck 2005). Other challenges are: development of models for multiple processes (e.g., an arithmetic process and a reading process for worded arithmetic items), multi-group approaches (e.g., to test gender differences in person boundary) and latent regression of the person drift and boundary parameters (e.g., to test for the effects of age on these parameters). These options might be considered in future developments of the diffIRT package.