DFIT: An R Package for Raju's Differential Functioning of Items and Tests Framework

This paper presents DFIT, an R package that implements the differential functioning of items and tests framework as well as the Monte Carlo item parameter replication approach for producing cut-off points for differential item functioning indices. Furthermore, it illustrates how to use the package to calculate power for the NCDIF index, both post hoc, as has regularly been the case in differential item functioning empirical and simulation studies, and a priori given certain item parameters. The version reviewed here implements all DFIT indices and Raju's area measures for tests composed of items modeled with the same unidimensional parametric item response model (the 1-, 2-, and 3-parameter models, the generalized partial credit model, or the graded response model), the Mantel-Haenszel statistic with an underlying dichotomous item response model, and the item parameter replication method for any of the estimated indices with dichotomous item response models.


Introduction
Differential item functioning (DIF) has long been recognized as a threat to the validity of test scores and is an especially important issue to consider for fair comparisons between groups (Holland and Thayer 2009). In the past decades many procedures for identifying items with differential functioning have been proposed, and several of these methods have been implemented in R (R Core Team 2016): the package difR (Magis, Beland, Tuerlinckx, and De Boeck 2010) includes eleven different indices for dichotomous items; the package lordif (Choi, Gibbons, and Crane 2011) implements a modified logistic regression algorithm that allows for both dichotomous and polytomous items using estimated sum scores or item response theory (IRT) abilities as the matching criterion; packages like mirt (Chalmers 2012), ltm (Rizopoulos 2006), lme4 (Bates, Mächler, Bolker, and Walker 2015), and eRm (Mair, Hatzinger, and Maier 2016; Mair and Hatzinger 2007) allow testing for DIF with the likelihood ratio test through comparisons of multiple models; while packages like psychomix (Frick, Strobl, Leisch, and Zeileis 2012; Frick, Strobl, and Zeileis 2015), mRm (Preinerstorfer 2016), and psychotree (Komboz, Strobl, and Zeileis 2017; Strobl, Kopf, and Zeileis 2015) enable modeling and testing for test invariance using mixture Rasch models or trees. A more recent IRT parametric framework for DIF detection, which handles both dichotomous and polytomous items with unidimensional or multidimensional IRT models and is known as the differential functioning of items and tests (DFIT) framework (Raju, Van der Linden, and Fleer 1995), was yet to be implemented.
This paper presents the R package DFIT (Cervantes 2017), which implements the framework indices for dichotomous and polytomous items with several unidimensional models. It also implements the Monte Carlo item parameter replication (IPR) approach for hypothesis testing for the noncompensatory differential item functioning (NCDIF) index with dichotomous unidimensional IRT models. Section 2 presents a short overview of the DFIT framework and of the IPR approach, including an improvement of this approach that yields correct NCDIF sampling distributions in the presence of group sample size differences. Section 3 illustrates the main functions in the package, including functions for calculating Raju's area measures (Raju 1988) and the Mantel-Haenszel DIF statistic under IRT model assumptions (Roussos, Schnipke, and Pashley 1999). Section 4 shows how to use the package to calculate power for the NCDIF index. Finally, Section 5 summarizes the capabilities of the package and presents future directions for its development.

The DFIT framework
The DFIT framework was proposed by Raju et al. (1995) as an improvement to the internal measures developed by Raju (1988) to detect items for which examinees from different groups responding to a test or a scale do not perform the same way (Holland and Thayer 2009). Originally proposed to identify DIF in dichotomous items and to be able to analyze differential functioning at the test level, it has subsequently been expanded to be used with bundles of items (Oshima, Raju, Flowers, and Slinde 1998), polytomous data (Raju, Fortmann-Johnson, Kim, Morris, Nering, and Oshima 2009), multidimensional models (Oshima, Raju, and Flowers 1997), and calculation of conditional DIF statistics (Oshima and Morris 2008). Within the DFIT framework differential functioning is analyzed between two groups of respondents: the reference group, and the focal group. It is generally considered that the reference group is the majority group and the focal group is the minority group or the group that might be at a disadvantage.
This framework proposes three indices for analyzing the differential functioning of items and tests: DTF, the differential test functioning index; CDIF, the compensatory differential item functioning index; and NCDIF, the noncompensatory differential item functioning index.
In order to define these indices for arbitrary IRT models, regardless of item responses being dichotomous or polytomous, let S_i(θ_j) be the expected score for examinee j with trait vector θ_j on item i. The function S_i can stand for the expected score under either a one-, two-, or three-parameter model for dichotomous items (i.e., the probability of a correct response), or a rating scale, partial credit (PCM), or graded response model (GRM) for polytomous items, in their unidimensional or multidimensional expressions (Raju et al. 1995; Oshima et al. 1997; Oshima and Morris 2008). For a given examinee, the expected score on the test, T(θ_j) (known as the "true" score in classical test theory), is equal to the sum of the expected scores for that examinee on the n items in the test. Note that S_i and S_{i′} need not have the same functional form for different items i and i′. Also, let S_{i,g} and T_g represent these functions based on the true item parameters for group g.
The three indices are defined on the differences in the values of S and of T between the focal (F) and the reference (R) groups. Thus,

d_i(θ_j) = S_{i,F}(θ_j) − S_{i,R}(θ_j)

is the difference in expected item scores and

D(θ_j) = T_F(θ_j) − T_R(θ_j)

is the difference in expected test scores. The square of these differences is taken as a measure of differential item or test functioning at the examinee level (Raju et al. 1995). The test level statistic (DTF) is, then, defined as the expected value, over the focal group, of the squared differences of expected test scores. That is:

DTF = E_F[D(θ)^2].     (1)

And the basic item level statistic (NCDIF) is the expected value over the focal group of the squared difference of expected item scores, i.e.,

NCDIF_i = E_F[d_i(θ)^2].     (2)

This index, as most DIF statistics, does not consider DIF from other items (or assumes that the other items are DIF free; Raju et al. 1995) and does not sum to the test level statistic. The third index within this framework seeks to define a DIF index such that its sum over items is equal to the test level statistic. To do so, DTF is decomposed in the following manner:

DTF = E_F[D(θ)^2] = Σ_i E_F[d_i(θ) D(θ)],     (3)

and, thus, the item index is defined as

CDIF_i = E_F[d_i(θ) D(θ)].     (4)

It should be noted that the CDIF index is additive by definition and includes information from DIF on the other items.
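Under a 1PL model these definitions can be sketched in a few lines of base R. The following is an illustrative computation by numerical quadrature over a standard normal focal ability distribution, not the package implementation; all parameter values are made up:

```r
# Expected item score (ICC) under the 1PL in the normal-ogive metric
icc <- function(theta, b) plogis(1.702 * (theta - b))

bFocal     <- c(0.0, 0.5)   # item 2 shows uniform DIF
bReference <- c(0.0, 0.0)

theta  <- seq(-6, 6, length.out = 601)        # quadrature grid
weight <- dnorm(theta) / sum(dnorm(theta))    # focal ability density

# d_i(theta): difference in expected item scores; D(theta): test level
dItem <- sapply(seq_along(bFocal), function(i)
  icc(theta, bFocal[i]) - icc(theta, bReference[i]))
dTest <- rowSums(dItem)

ncdif <- colSums(weight * dItem^2)        # E_F[d_i(theta)^2]
cdif  <- colSums(weight * dItem * dTest)  # E_F[d_i(theta) D(theta)]
dtf   <- sum(weight * dTest^2)            # E_F[D(theta)^2]
```

By construction the CDIF values sum to DTF, while the NCDIF values in general do not; the DIF-free first item contributes a zero NCDIF value.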
An important characteristic of the indices in the DFIT framework is that their definition is based directly on the IRT interpretation of DIF rather than, for instance, on differences in item parameters. Although differences in item parameters imply differences in the expected scores both at the item and the test levels, different sets of item parameters may produce very similar item characteristic curves (ICC) and, thus, very similar expected scores. Also, what best distinguishes this framework from other approaches is that it explicitly states that the focus of bias and fairness research is how the minority group's results are affected, and it incorporates that into the procedure. This latter characteristic also enables DIF to be calculated for the three-parameter IRT model from the perspective of differences in expected scores, whereas the original area measures were infinite when pseudo-guessing parameters differed between the groups, as demonstrated by Raju (1988).

The IPR approach
Currently, hypothesis testing for NCDIF, the main index within the framework, is done using the item parameter replication (IPR) approach, which was proposed by Oshima, Raju, and Nanda (2006) to overcome the limitations of the χ2 tests originally developed by Raju et al. (1995). Algorithmically, the procedure may be described as follows:
1. Define the item parameter vector for the null hypothesis (µ) to be equal to the item parameter estimates from the focal group.
2. Take the covariance matrix (Σ) of the item parameter estimates, also from the focal group.
3. Use these item parameters and covariance matrix to simulate as many item parameter vectors as desired. They are obtained from the multivariate normal distribution with mean vector µ from Step 1 and covariance matrix Σ from Step 2.
4. Obtain the NCDIF values for each pair of item parameter replications.
5. Calculate the cut-off point as the (1 − α) percentile of the NCDIF values.
6. Repeat Steps 1 to 5 for each item.
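For a single Rasch item the procedure above reduces to a few lines. The following base R sketch mirrors Steps 1 to 5 with hypothetical estimates and quadrature in place of an examinee sample; under the 1PL the covariance matrix Σ reduces to the sampling variance of the difficulty estimate:

```r
set.seed(1)
bHat <- 0.3    # Step 1: focal-group difficulty estimate (mu)
seB  <- 0.08   # Step 2: its standard error (Sigma reduces to a variance)
nRep <- 1000

theta  <- seq(-6, 6, length.out = 601)
weight <- dnorm(theta) / sum(dnorm(theta))
icc    <- function(theta, b) plogis(1.702 * (theta - b))

# Steps 3 and 4: replicate parameter pairs and compute NCDIF for each
ncdifNull <- replicate(nRep, {
  bF <- rnorm(1, bHat, seB)   # both draws use the focal-group sampling
  bR <- rnorm(1, bHat, seB)   # distribution, as in the current algorithm
  sum(weight * (icc(theta, bF) - icc(theta, bR))^2)
})

cutoff <- quantile(ncdifNull, 0.95)   # Step 5: (1 - alpha) percentile
```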
This approach basically implements a Monte Carlo algorithm to generate chains of parameter vectors from their sampling distribution. However, as specified in Oshima et al. (2006) and implemented by Oshima, Kushubar, Scott, and Raju (2009), the sampling distribution used to generate these chains relies only on the item parameter estimates and the covariance matrices of these estimates obtained with data from the focal group. As it stands, the procedure is analogous to finding the degrees of freedom for a two sample t-test by always choosing twice the number of observations of one of the groups minus two, regardless of the effective sample sizes in each of them. Recently, Clark and LaHuis (2012) examined the effect on type I error of the current algorithm under unequal and small sample sizes (500 or 250 examinees in the focal group and 500 examinees in the reference group) and found that it increased when the sample size for the focal group was reduced from 500 to 250 (from 0.05 to 0.15). This effect is due to the different sampling distributions for the parameter estimates of each group. These effects on type I error and power will be shown in detail in Section 4, where full power calculations for the NCDIF index will be presented.
Taking these considerations into account, Cervantes (2012) proposed a modification to the algorithm presented by Oshima et al. (2006) for obtaining cut-off points based only on sample estimates. The modified algorithm comprises the following steps:
1. Define the item parameter vector for the null hypothesis to be equal to the item parameter estimates from the focal group.
2. Take the variance-covariance matrices of those estimates given the respective sample sizes and ability distributions of the focal and the reference groups.
3. Obtain as many pairs of item parameter vectors as desired from the multivariate normal distribution for each group.
4. Obtain the NCDIF values for each pair of item parameter replications.
5. Calculate the cut-off point as the (1 − α) percentile of the NCDIF values.
6. Repeat Steps 1 to 5 for each item.
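The only change with respect to the current algorithm is that each group's replicate is drawn with its own sampling variance. A base R sketch for one Rasch item, with hypothetical standard errors reflecting a smaller focal group, would be:

```r
set.seed(1)
bHat        <- 0.3    # focal-group difficulty estimate (null hypothesis)
seFocal     <- 0.12   # standard error given the focal sample size
seReference <- 0.05   # standard error given the (larger) reference sample

theta  <- seq(-6, 6, length.out = 601)
weight <- dnorm(theta) / sum(dnorm(theta))
icc    <- function(theta, b) plogis(1.702 * (theta - b))

ncdifNull <- replicate(1000, {
  bF <- rnorm(1, bHat, seFocal)       # each group keeps its own
  bR <- rnorm(1, bHat, seReference)   # sampling distribution
  sum(weight * (icc(theta, bF) - icc(theta, bR))^2)
})
cutoff <- quantile(ncdifNull, 0.95)
```

With unequal sample sizes this null distribution, and hence the cut-off point, differs from the one produced by the current algorithm, which uses the focal-group variance for both draws.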
The previous algorithm is easily adapted to be used with theoretical values for the item parameters and variance-covariance matrices. The effects of the change in the algorithm will be illustrated in Section 4, in particular with respect to expected differences in type I error and power.

The DFIT package
The DFIT package is able to calculate all indices from the DFIT framework for logistic and normal ogive unidimensional one-, two-, and three-parameter IRT models for dichotomous items, as well as for the PCM and GRM for polytomous items. It is freely available for download via the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=DFIT. The main functions in package DFIT are the following: (a) Cdif(), Ncdif(), and Dtf() for calculating the statistics from the DFIT framework; (b) PlotNcdif() for illustrating NCDIF; (c) SignedArea(), UnsignedArea(), and IrtMh() for calculating Raju's area measures and the Mantel-Haenszel DIF statistic; and (d) Ipr() and CutoffIpr() for obtaining the cut-off points for any of the DIF statistics by using the IPR approach. This section shows how to use the functions implemented in the DFIT package. Section 3.1 presents the functions devoted to calculating DIF and DTF, while Section 3.2 presents the functions to obtain cut-off values by means of the IPR approach.
The following code loads the package and the data used in the examples.
R> raschParameters <- lapply(dichotomousItemParameters, function(x)
+    x[, 2, drop = FALSE])
R> raschParameters <- as.list(unique(as.data.frame(raschParameters)))
R> raschParameters <- lapply(raschParameters, function(x)
+    matrix(x, ncol = 1))
R> items3Pl <- c(2, 20, 22, 8, 10, 28, 46, 32)
R> threePlParameters <- lapply(dichotomousItemParameters, function(x)
+    x

The format expected by all functions that use item parameters is a list with two named elements, focal and reference, each a matrix with a row for each item and columns according to the IRT model. Discrimination parameters (except for one-parameter IRT models for dichotomous items) must be included in the first column, and pseudo-guessing parameters for three-parameter IRT models in the third column. In the case of the polytomous models currently supported, the generalized partial credit model (GPCM) and the graded response model (GRM), the first column includes the discrimination parameters and the remaining columns the item category difficulty parameters; that is, the models should be parametrized in terms of a common discrimination parameter and category difficulty parameters.
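As a concrete illustration of this input format, the following sketch builds such a list by hand for a hypothetical two-item test under a 3PL model (all values are invented):

```r
# focal and reference: one row per item; columns are discrimination (a),
# difficulty (b), and pseudo-guessing (c), as required for a 3PL model
itemParameters <- list(
  focal     = cbind(a = c(1.2, 0.8), b = c(-0.5, 0.7), c = c(0.20, 0.15)),
  reference = cbind(a = c(1.1, 0.8), b = c(-0.5, 0.4), c = c(0.20, 0.15)))
```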

DIF functions
Using this package, the DFIT statistics may be estimated either as means, given a vector of abilities from a sample of examinees from the focal group, of the values within the expectations in expressions (1), (2), and (4), which is the estimation method used by Raju et al. (1995) and Oshima et al. (2009), or by integrating over a specified ability distribution (standard normal by default). The following code shows the use of the Ncdif() and Cdif() functions to obtain DFIT's item indices for the one-parameter logistic IRT model using the normal ogive metric (i.e., D = 1.702). It also shows how to use the Dtf() function to estimate the differential test functioning index.
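The listing referred to above does not survive in this version of the text. A sketch of what such calls might look like, based on the argument names used elsewhere in the paper, follows; the cdif argument of Dtf() and the toy difficulty values are assumptions:

```r
library(DFIT)  # assumes the DFIT package is installed

# Toy Rasch difficulties; item 2 shows uniform DIF (values are invented)
raschToy <- list(focal     = matrix(c(-0.5, 0.25), ncol = 1),
                 reference = matrix(c(-0.5, 0.00), ncol = 1))

ncdifToy <- Ncdif(itemParameters = raschToy, irtModel = "1pl",
                  logistic = FALSE)   # normal ogive metric, D = 1.702
cdifToy  <- Cdif(itemParameters = raschToy, irtModel = "1pl",
                 logistic = FALSE)
dtfToy   <- Dtf(cdif = cdifToy)       # DTF as the sum of the CDIF values
```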
The use of the PlotNcdif() function is illustrated below. Figure 1 presents an item with no DIF (Item 5) for which the normal density is plotted. Figure 2 presents the plots for two items with DIF: one with item difficulties for both groups close to the mean of the focal group ability, and one for which they are far from it. This figure also shows an alternative representation of the weighting given by the density of the abilities of examinees from the focal group, as obtained by setting plotDensity = FALSE.
R> it5PlotD <- PlotNcdif(iiItem = 5, itemParameters = raschParameters,
+    irtModel = "1pl", plotDensity = TRUE, logistic = FALSE,
+    focalDensityText = "Focal group density (theoretical)",
+    main = "Item 5. NO DIF. 1PL model")
R> it7PlotS <- PlotNcdif(iiItem = 7, itemParameters = raschParameters,
+    irtModel = "1pl", plotDensity = FALSE, logistic = FALSE,
+    main = "Item 7. Uniform DIF. 1PL model")

All these measures may also be obtained for items under the two- and three-parameter IRT models; the argument irtModel should be, respectively, "2pl" and "3pl". Table 2 presents the statistics for the items selected above under a three-parameter model. For a test composed of this set of items, the DTF statistic would be 0.3151. Also, Figure 3 shows the plot for an item with nonuniform DIF where the guessing parameters for the focal and the reference groups are also different. For polytomous IRT models, the DIF statistics (except for the Mantel-Haenszel statistic and the associated delta measure) may be calculated by setting irtModel = "grm" or irtModel = "pcm". For a test composed of such polytomous items, the DTF statistic would be 18.7411. Figure 4 shows the plot for one of these items, which exhibits mixed DIF.

IPR cut-off points
The DFIT package is able to perform the IPR Monte Carlo procedure as described in Section 2.1 under any unidimensional IRT model if the item parameters and their variance-covariance matrices are provided; it is also able to calculate the NCDIF index, the SA and UA measures, and the Mantel-Haenszel DIF statistic with the generated item parameter pairs for the same models as shown in Section 3.1. Currently (version 1.0-3), the DFIT package is able to calculate the asymptotic variance-covariance matrices of the item parameter estimates under the three logistic IRT models for dichotomous responses (1PL, 2PL, and 3PL) via the function AseIrt(). The main function to perform the IPR algorithm in order to obtain cut-off points is CutoffIpr(); however, the function Ipr() will be presented too, since it allows greater flexibility. The following code illustrates their use to obtain cut-off values under the Rasch model (i.e., D = 1.0 in the one-parameter model) and the different ways the function CutoffIpr() may be used.
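A sketch of obtaining such asymptotic covariance matrices with AseIrt() is shown below, with the argument names taken from the call that appears later in the paper; the group sample sizes, the toy difficulties, and the construction of a focal/reference list are assumptions:

```r
library(DFIT)  # assumes the DFIT package is installed

# Toy Rasch difficulties for two items (values are invented)
raschToy <- list(focal     = matrix(c(-0.5, 0.25), ncol = 1),
                 reference = matrix(c(-0.5, 0.00), ncol = 1))
distriParam <- list(mean = 0, sd = 1)   # common ability distribution

# One set of asymptotic covariance matrices per group, given its sample size
raschAse <- list(
  focal = AseIrt(itemParameters = raschToy[["focal"]],
                 distribution = "norm", sampleSize = 500,
                 distributionParameters = distriParam,
                 irtModel = "1pl", logistic = TRUE),
  reference = AseIrt(itemParameters = raschToy[["reference"]],
                     distribution = "norm", sampleSize = 1500,
                     distributionParameters = distriParam,
                     irtModel = "1pl", logistic = TRUE))
```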
First, it is shown how to directly obtain the cut-off values from only the information on the estimated item parameters and the sample sizes for each group. This is done by letting CutoffIpr() calculate the asymptotic variance-covariance matrices for the models. For this and the next examples, it is assumed that the mean ability of the reference group is 0.5 logits greater than that of the focal group.
R> set.seed(29834328)
R> cutoffRaschNcdif2 <- CutoffIpr(quantiles = 0.95,
+    itemParameters = nullParameters, itemCovariances = raschAse,
+    irtModel = "1pl", logistic = TRUE, statistic = "ncdif",
+    nReplicates = 1000)

Additionally, obtaining the variance-covariance matrices independently from the cut-off function (whether calculating the asymptotic ones, as presented, or the estimated ones, if available) allows further flexibility in the algorithm, as well as reduced simulation time when the researcher is interested in calculating several indices and cut-off values. The following code shows how to apply the IPR algorithm to the other indices.

Power calculation
This section illustrates how to use the functions in package DFIT to calculate power for the NCDIF index; the procedure may be used to obtain power for the other DIF measures. However, due to the interaction of the CDIF index with the parameters of other items and with DIF presence at the test level, the procedure is not recommended for that index. Section 4.1 presents how to obtain power curves, given the cut-off points obtained both by the current IPR procedure (as implemented in Oshima et al. 2009 and presented by Oshima et al. 2006) and by the modification presented in Section 2.1. These curves are useful during the planning of DIF analyses to avoid underpowered studies, and may help in defining effect sizes for the DFIT statistics. Section 4.2 shows how to calculate power for a given set of item parameters using the items presented in Table 1 under the 1PL model. The procedure shown may be used to perform post-hoc power analyses, or a priori power calculations against a given difference of item parameters, such as those presented by Wright and Oshima (2015). For all these examples, it is assumed that the ability for both the focal and the reference groups is distributed as a standard normal (i.e., there is no impact).
The examples in this section will further illustrate the effects on power of the selected algorithm to obtain the cut-off points for the NCDIF statistic. Thus, in each case, both the results using the current algorithm as presented by Oshima et al. (2006) and the modified proposal are calculated and plotted.

Power curves
The power curves for uniform DIF and nonuniform DIF for an item under the two-parameter logistic IRT model with difficulty 0 and discrimination 1 are presented; these curves were originally presented by Cervantes (2012). In order to show how the proposed modified IPR algorithm compares to the current one, different sample sizes are used for each group. The general conditions for both the uniform and nonuniform DIF power curve examples are set as follows:

R> nReplicates <- 3000
R> nFocal <- 800
R> nReference <- 2500
R> kRatio <- nReference / nFocal
R> focalParam <- list(mean = 0, sd = 1)
R> referenceParam <- list(mean = 0, sd = 1)

First, the code to obtain power curves for uniform DIF is presented. The item parameters for the null and alternative hypotheses are generated for a number of equally spaced item difficulties below and above 0 for the focal group, while the difficulty remains constant for the reference group.
R> twoPlUniNcdifTrue <- Ncdif(itemParameters, irtModel = "2pl",
+    focalDistribution = "norm", focalDistrExtra = focalParam,
+    logistic = FALSE)

Next, we obtain the asymptotic variance-covariance matrices of the parameter estimates given the known item parameters for each group. Using the IPR Monte Carlo approach, the simulated item parameters are obtained for the null and the alternative hypotheses with the asymptotic covariance matrices calculated in the previous step. Then, the NCDIF statistic is calculated on each pair of parameter vectors.
R> cutoffPointEachSZUni <- CutoffIpr(quantiles = 0.95,
+    iprStatistics = twoPlUniNcdif[nullFocal, , drop = FALSE])

We will also obtain the cut-off point under the current algorithm. For that, we follow the same steps, changing the covariance matrices to be equal for both groups. Since both groups were assumed to have the same ability distribution, the required asymptotic variance-covariance matrix can be obtained directly from the group-specific ones, because they are proportional in this case. The following code obtains the matrices, the replicated parameters, the NCDIF statistics, and the cut-off point under the current approach.
R> set.seed(29834328)
R> twoPlUniIprCurrent <- Ipr(itemParameters = itemParametersNull,
+    itemCovariances = twoPlUniAseCurrent, nReplicates = nReplicates)
R> twoPlUniNcdifCurrent <- IprNcdif(itemParameterList = twoPlUniIprCurrent,
+    irtModel = "2pl", logistic = FALSE, subdivisions = 1000)
R> cutoffPointUni <- CutoffIpr(quantiles = 0.95,
+    iprStatistics = matrix(twoPlUniNcdifCurrent, nrow = length(nullFocal)))

Power for each alternative is estimated as the proportion of the replicated NCDIF indices that is greater than the cut-off value. Figures 5 and 6 show the resulting power curves; they indicate the value for which power with the proposed algorithm is approximately 0.8 (0.831). It may be seen that low type I error rates, such as those reported by Oshima et al. (2006), are expected for the current IPR algorithm because the actual type I error is well below the nominal value. It may also be seen that power is affected by this: with the current algorithm, power only reaches 0.6663 for the same item difficulty, about 0.16 less than with the proposed algorithm.
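This power computation can be made concrete with a self-contained base R sketch; a toy 1PL item with hypothetical standard errors stands in for the 2PL replicates generated above:

```r
set.seed(2)
theta  <- seq(-6, 6, length.out = 601)
weight <- dnorm(theta) / sum(dnorm(theta))
icc    <- function(theta, b) plogis(1.702 * (theta - b))

# NCDIF replicates for difficulties bF, bR with standard errors seF, seR
ncdifRep <- function(bF, seF, bR, seR, nRep = 2000) replicate(nRep, {
  d <- icc(theta, rnorm(1, bF, seF)) - icc(theta, rnorm(1, bR, seR))
  sum(weight * d^2)
})

seF <- 0.10; seR <- 0.06
cutoff   <- quantile(ncdifRep(0.4, seF, 0.4, seR), 0.95)  # null: no DIF
powerEst <- mean(ncdifRep(0.4, seF, 0.0, seR) > cutoff)   # uniform DIF
```

The estimated power is simply the proportion of alternative-hypothesis replicates exceeding the null cut-off, exactly as described in the text.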
The code to obtain the power curves for nonuniform DIF is similar to the one presented for uniform DIF; thus, it is not shown. Figures 7 and 8 show how power varies as a function of item discrimination and of the NCDIF index, respectively. The figures indicate the value for which power is 0.8177 with the proposed modification; power with the current algorithm only reaches 0.633 at that point. Furthermore, although the power curves are not symmetrical with respect to the discrimination value, they are symmetrical as a function of the NCDIF index value. It is also apparent that power differs between uniform and nonuniform DIF: power is about 0.8 for an NCDIF value of 0.0031 where DIF is uniform, while to achieve the same power for nonuniform DIF (with equal sample sizes, no impact, and item difficulty equal to both groups' ability mean) the NCDIF index needs to reach only 0.0025.

Power calculation
This section presents how to calculate power for particular item difficulties for both the reference and the focal group using the 1PL model. The difficulties shown in Table 4 are used, and the null hypothesis assumed is that item difficulties are the ones from the focal group; additionally, equal standard normal distributions for the abilities from both groups with unequal sample sizes are assumed. A similar procedure may be used for other IRT models.
The following code sets the sample size conditions for both groups, the distribution parameters, and the number of IPR replications that will be used. First, we set the global variables:

R> nFocal <- 500
R> nReference <- 1500

The variances of the item difficulties are obtained for each group according to the item parameters for the focal group, the respective sample sizes, and the common ability distribution. With these variances for the parameter estimates, the cut-off point for each item is obtained using the IPR approach under the null hypothesis for that item.

R> altAse <- AseIrt(itemParameters = raschParameters[["reference"]],
+    distribution = "norm", sampleSize = nReference,
+    distributionParameters = distriParam, irtModel = "1pl",
+    logistic = FALSE)
R> altIPR <- Ipr(itemParameters = raschParameters, itemCovariances = altAse,
+    nReplicates = nReplicates)
R> altNcdif <- IprNcdif(itemParameterList = altIPR, irtModel = "1pl",
+    logistic = FALSE)

Table 6 shows the calculated power for each item given the sample sizes and ability distributions. The column "Cut-off" presents the IPR cut-off points with the modified algorithm and 3000 replications. The column "Power" contains the calculated power to detect the true NCDIF with sample sizes of 500 and 1500 for the focal and reference groups, respectively.

Final remarks
The differential functioning of items and tests (DFIT) framework has been proposed as a parametric IRT alternative for DIF detection (Raju et al. 1995). Recent work has focused on the development of a Monte Carlo approach to obtain appropriate cut-off points for the NCDIF index in this framework (Oshima et al. 2006; Raju et al. 2009), on the type I error and power of this approach (Clark and LaHuis 2012), and on the establishment of effect sizes for that index (Wright 2011; Wright and Oshima 2015).
This paper presented the package DFIT, which implements the framework for R. The package is capable of obtaining the DIF and DTF indices from the framework for the main unidimensional IRT models. It is able to calculate Monte Carlo cut-off points based on the IPR approach for dichotomous and polytomous models, as does currently available software, given the same input from the user; that is, item parameters for both groups and variance-covariance matrices of the item estimates for one or both groups.
It should be noted that the package is recommended based on its capabilities and on its accuracy with regard to the theoretical framework. Comparisons with the program DFIT8 (Oshima et al. 2009) and with the SAS (SAS Institute Inc. 2013) macro DIFCUT (Nanda, Oshima, and Gagné 2006), which implement the current approach, have been conducted by taking published item parameters (Raju et al. 1995; Oshima et al. 1998, 2009; Wright and Oshima 2015) and verifying against the results for all three DFIT indices. The same testing procedure was employed for Raju's area measures (Raju 1988), the Mantel-Haenszel statistic (Wright 2011), and the implementation of the asymptotic variance-covariance matrix calculations (Li and Lissitz 2004). Additionally, the simulation procedure relies on the package mvtnorm (Genz, Bretz, Miwa, Mi, and Hothorn 2016), whose accuracy was studied by Mi, Miwa, and Hothorn (2009). Although the accuracy of the IPR algorithms follows from the previous checks, cut-off points for NCDIF were also compared with published results; differences were consistent with differences between repeated runs of the algorithm, such as those shown in Table 4.
The package DFIT is also able to use either version of the IPR approach without empirical estimates of the variance-covariance matrices by using the asymptotic ones for dichotomous models. Additionally, the package is flexible and allows obtaining the cut-off points under the null hypothesis and sampling conditions specified by the user. Finally, the package complements the DFIT framework by making it the first DIF approach for which power may be calculated a priori, and thus be used in the planning of DIF studies, rather than post-hoc under limited simulation conditions as has been the case until now.
There are several features from the eponymous framework yet to be implemented in the package DFIT. Future work on the package will focus on acquiring item parameter estimates and their covariances from different estimation packages to be used in the calculations; calculating the asymptotic variance-covariance matrices for polytomous models; implementing differential bundle functioning statistics; calculating cut-off points at the test level (DTF); allowing each item in a set to be modeled by a different IRT model; and implementing the indices for multidimensional IRT models.