PerFit : An R Package for Person-Fit Analysis in IRT

Checking the validity of test scores is important in both educational and psychological measurement. Person-ﬁt analysis provides several statistics that help practitioners assessing whether individual item score vectors conform to a prespeciﬁed item response theory model or, alternatively, to a group of test takers. Software enabling easy access to most person-ﬁt statistics was lacking up to now. The PerFit R package was written in order to ﬁll in this void. A theoretical overview of relatively simple person-ﬁt statistics is provided. A practical guide showing how the main functions of PerFit can be used is also given. Both numerical and graphical tools are described and illustrated using examples. The goal is to show how person-ﬁt statistics can be easily applied to testing of questionnaire data.


Introduction
It is well known that total scores or estimated trait values do not always reflect the trait or proficiency level that a questionnaire or test intends to measure. The answers provided by test takers or respondents to questionnaires may be biased due to factors unrelated to the trait of interest. For example, cheating or item-preknowledge may inflate test scores whereas inattention or guessing may deflate test scores. It is important to be able to detect whether test scores are invalid, that is, whether test scores are biased and therefore not indicative of the true latent trait being measured. This fact is acknowledged in several guidelines for the reporting of test scores (e.g., International Test Commission 2014, Olson and Fremer 2013), and it applies to both educational and psychological measurement. Several person-fit statistics (PFSs) have been proposed to detect inconsistent, aberrant, or misfitting patterns of item scores. There are some empirical applications of person-fit analyses in the literature; see for example Meijer, Egberink, Emons, and Sijtsma (2008), Meijer and Tendeiro (2014), and Conijn (2013) for applications in primary education, high-stakes testing, and clinical testing, respectively. However, some researchers and practitioners find it difficult to implement personfit analyses in practice. This might be explained by the mathematical complexity of (some of the) PFSs and by the lack of software that helps practitioners conducting this type of analysis. It is our hope that the R (R Core Team 2016) PerFit package (Tendeiro 2016) discussed in this paper can be helpful to a wide range of practitioners (also, see Meijer, Niessen, andTendeiro 2016 andTendeiro and for accessible person-fit overviews). This paper is divided into three main parts. First, we start with a theoretical overview of person-fit analysis in item response theory (IRT; Embretson and Reise 2000). The exposition will focus on PFSs that are currently implemented in PerFit. Second, we illustrate how the main functions of PerFit can be used in practice. Both numerical and graphical tools are described using empirical data that are also included in the R package. We explain how inconsistent item score patterns can be detected and we present a statistical tool that may be used in order to help finding an explanation for the cause of the misfit. Finally, we will summarize the main points of the paper and discuss further planned extensions for the package in the future.

Person-fit analysis in IRT
In the sequel we refer to persons as test takers, although all procedures also apply to questionnaires data. Moreover, we use θ to denote the trait or proficiency level that the questionnaire or test is measuring.
Let X i be the random variable denoting the score on item i (i = 1, . . . , I). The observed score of test taker n (n = 1, . . . , N ) on item i, that is, a realization of random variable X i , will be denoted by x ni . Items are dichotomous when there are m = 2 possible score categories or polytomous when there are three or more score categories (m ≥ 3). In IRT it is assumed that θ induces the item responses. It is also assumed that the ordering of the item score categories reflects the hypothesized ordering on θ. The following item scores coding is used: 0, 1 for dichotomous items (e.g., 0 = incorrect, 1 = correct) and 0, 1, . . . , (m − 1) for polytomous items (e.g., 0 = disagree, 1 = neutral, 2 = agree). It is important that all items are keyed in the same direction, especially if the total score (or some measure based on the total score) is used.
An item score vector (or pattern) and its associated total score will be denoted by x n = (x n1 , x n2 , . . . , x nI ) and s n = I i=1 x ni , respectively. The goal of person-fit analysis is to detect whether an individual item score pattern x n (n = 1, . . . , N ) is unusual or unexpected given the item score patterns in a group of test takers (group-based analysis) or, alternatively, given a prespecified IRT model (model-based analysis). For comprehensive overview studies for both types of approaches we refer to Meijer and Sijtsma (2001) and Karabatsos (2003). Most of the PFSs currently available in PerFit are group-based. The exceptions are the loglikelihood statistics introduced by Drasgow, Levine, and Williams (1985) and improved by Snijders (2001). In what follows we will focus on the PFSs and related functions that are available in PerFit (also, see Tendeiro and Meijer 2014).

IRT models
It is important to clarify what type of IRT model assumptions should be verified before person-fit analyses can be conducted. Typically, a functional relationship between θ and the probability of answering at or above a specific score category is conceptualized in IRT. This function is known as the item response function (IRF) in the dichotomous case or item step response function (ISRF) in the polytomous case and it is denoted as P ih (θ) = P(X i ≥ h|θ) for i = 1, . . . , I and h = 1, . . . , m − 1. We will use the acronym 'IRF' to refer to function P ih (θ) in either the dichotomous or the polytomous case. For dichotomous items there is only one IRF (P i (θ) = P(X i = 1|θ)) whereas for polytomous items there are (m − 1) IRFs (observe that P i0 (θ) ≡ 1).
Strictly speaking, group-based PFSs do not require closed-form mathematical IRFs to be estimated. Hence, nonparametric IRT models (NIRT; Sijtsma and Molenaar 2002) can be used.
In NIRT the following model assumptions are typically imposed: (1) Unidimensionality: All items in the test predominantly measure the same latent trait, θ.
(2) Local independence: Answers to different items are statistically independent conditional on θ.
(3) Latent monotonicity of the IRFs: Each IRF is monotonically nondecreasing in θ. The NIRT model that satisfies assumptions 1 through 3 is known as the monotone homogeneity model (MHM; Mokken 1971), which is the most general nonparametric model. The double monotonicity model (DMM; Mokken 1971) further imposes the assumption that the IRFs of different items do not intersect. This assumption is known as the invariant item ordering (IIO) assumption. There are several tutorials available to aid practitioners interested in fitting these types of models to empirical data; see for example Sijtsma and Molenaar (2002) and Meijer, Tendeiro, and Wanders (2015). The R mokken package ( Van der Ark 2007 can be used to check these assumptions in practice. Both the MHM and the DMM are parameter-free, both in terms of person and of item parameters. Instead of estimating θs, the MHM (and therefore the DMM) relies on the total score statistic and on the property of stochastic ordering of the latent trait (SOL; Hemker, Sijtsma, Molenaar, and Junker 1997). The SOL property states that test takers are stochastically ordered on θ by means of the total score statistic. This ordering holds for dichotomous items and, for most practical purposes, also for polytomous items (however, see Van der Ark 2005). Also, instead of estimating item difficulty parameters, one uses the socalled item-step difficulties. An item-step difficulty π ih is the population proportion of persons who answer at or above score category h for item i. This parameter is usually estimated by means of the corresponding sample proportion in the dataset. The item-step difficulty of a dichotomous item is the proportion of persons with score 1 and is also referred to as the item's proportion-correct score (denoted as π i ). Without loss of generality, and unless stated otherwise, in what follows it is assumed that the item steps are ordered in increasing order of difficulty, that is, in decreasing order of item-step difficulty. In particular for dichotomous items, this means that items are labeled in increasing order of difficulty (i.e., item 1 being the easiest and item I the most difficult). This property is especially useful when IIO is met (thus for the DMM) because it implies that the same ordering of item-step difficulties holds at any θ level. Nearly all group-based PFSs that are available in PerFit involve computations based on item/total scores and item-step difficulties (proportion-correct scores for dichotomous items).
Practitioners are advised to analyze the fit of the IRT model to real data before attempting to interpret the validity of individual item response patterns.

Rationale behind person-fit
A typical test taker, when given a set of dichotomous items, is expected to answer very easy items correctly and very difficult items incorrectly. The more difficult the item, the lower the probability of answering the item correctly. This is the general principle behind the so-called person response function (PRF), which relates item difficulty with the probability of correctly answering an item. PerFit allows computing PRF plots; we will discuss this function at more length in Section 3.2. As an example, Figure 1 shows nonparametric PRFs (Emons, Sijtsma, and Meijer 2004;Sijtsma and Meijer 2001) for four different test takers that were estimated based on the answers of 1000 test takers on 26 dichotomous items (dataset IntelligenceData in PerFit). The x-axes display the item difficulty in terms of the values (1 − π i ), thus values close to zero concern easier items and values close to one concern more difficult items.
The top-left panel shows the PRF of a typical test taker: The probability of answering an item correctly decreases as the item difficulty increases. The plots in the remaining panels of Figure 1 display deviations to the expected behavior. For example, the probability that test taker B answers an item correctly seems to increase with item difficulty in the interval (0, .5). The PRF for test taker C increases in the interval (.6, .8). Test taker D seems to be the most atypical of the four shown here: The PRF has mostly an increasing trend. Table 1 shows the item scores of test takers A, B, C, and D, where the items (columns) are rearranged in increasing order of difficulty. Note that test takers B and D incorrectly answered several easy items. This fact helps explaining the increasing trend in the corresponding PRFs for low item difficulty values. Test taker C, on the other hand, answered 6 out of the 7 most difficult items correctly whereas several easier items were incorrectly answered. The corresponding PRF in Figure 1 does reflect this in terms of a local increasing trend for relatively difficult items.
Person-fit analysis can be used to detect test takers in a sample that display some kind of atypical response behavior as shown in the examples above. It should be observed that there are many psychological processes that may trigger atypical responses to a test or questionnaire. Some examples include cheating, guessing, plodding, alignment errors, extremely creative interpretation of items, and deficient preparation in specific subparts of the contents being evaluated (Meijer 1996). It is possible that different psychological processes lead to similar atypical item score patterns, thus the interpretation of person-fit results is often not easy. As a general rule, person-fit analyses typically do not provide conclusive evidence as to exactly which abnormal phenomenon caused a strange item score pattern. Practitioners should complement and possibly follow up the analysis using other available tools, such as interviews with test takers (see Meijer et al. 2008) or relating the values of PFSs with other variables (Conijn 2013).

Person-fit statistics
Many PFSs (also referred to as appropriateness indices; Levine and Drasgow 1983) have been proposed in the literature. Karabatsos (2003) and Meijer and Sijtsma (2001) (see also Tendeiro and Meijer 2014 for an overview focusing on group-based statistics) discuss the most relevant statistics. In this paper we briefly describe each PFS that is implemented in PerFit.

Likelihood-based PFSs
In the dichotomous case the log-likelihood of item score pattern x n is given by where P i is typically the 1-, 2-, or the 3-parameter logistic model (1PLM, 2PLM, or 3PLM, respectively; Embretson and Reise 2000). Equation 1 is readibly extendable to the polytomous case by using the score category probabilities P ih (θ) = P ih − P i(h+1) for h = 0, . . . , m − 1 with an associated indicator variable δ ih (1 if the corresponding score category h was chosen, 0 otherwise): Levine and Rubin (1979) proposed to use Equation 1 as a PFS denoted by l 0 . Low l 0 values indicate misfitting item score patterns. However, the l 0 statistic has some limitations because it is not standardized nor is its distribution under a fitting IRT model known. Drasgow et al. (1985) proposed a standardized normal version of l 0 denoted as l z (l p z in the polytomous case). Values of l z are interpreted similarly as l 0 , that is, the smaller the (negative) l z values the stronger the indication of misfit. Statistic l z is asymptotically (in terms of the number of items I) standard normally distributed. However, this approximation only holds when true θ values are used (Molenaar and Hoijtink 1990), which is an unrealistic requirement in practice. When estimated θ parameters replace θ in Equation 1 then it has been shown that the variance of l z is reduced (Meijer and Tendeiro 2012, Molenaar and Hoijtink 1990, Nering 1995, 1997, Reise 1995. Snijders (2001) proposed a corrected statistic known as l * z which takes the sampling variability of the estimated θ parameters into account. Magis, Raîche, and Béland (2012) thoroughly explained the computational formulas of l * z .

Group-based PFSs
Group-based or nonparametric PFSs are not constrained to fitting closed-form mathematical models to the data (although the unidimensionality, local independence, and monotonicity assumptions previously mentioned need still be checked; see Sijtsma and Molenaar 2002). More specifically, an item score pattern is flagged as misfitting in case it deviates to a large extent from the most typical response behavior in the sample. Group-based PFSs are usually computed using the item-step difficulties in the polytomous case (which are simply equal to the proportion-correct scores in the dichotomous case). Donlon and Fischer (1968) considered the personal point-biserial correlations. r.pbis is the correlation between the test taker's item score pattern x n and the vector of item proportioncorrect scores in the sample p = (p 1 , . . . , p I ). The p i score, also known as the difficulty or p-value of item i (i = 1, . . . , I), is the proportion of persons who answered item i correctly. Low values of r.pbis may be indicative of misfit of the response vector. Sato (1975) proposed the PFS C, which is a covariance ratio measuring the extent to which an item score pattern deviates from the perfect pattern (Guttman pattern). Assume that the items are ordered in increasing order of difficulty, that is, p 1 ≥ p 2 ≥ · · · ≥ p I , without loss of generality. The formula for C is where x * n is the so-called Guttman vector containing correct answers for the s n easiest items (i.e., with the largest p-values) only. C is zero for Guttman vectors and its value increases (without theoretical bound) for item score patterns that deviate more from the perfect Guttman pattern, thus higher values indicate more deviant response behavior. Harnisch and Linn (1981) further adapted C by norming it between 0 and 1: where x n is the reversed Guttman vector containing correct answers for the s n hardest items (i.e., with the smallest p-values) only. C * is sensitive to the so-called Guttman errors. A Guttman error is a pair of scores (0, 1), where the 0-score pertains to the easiest item and the 1-score pertains to the hardest item. C * ranges between 0 (perfect Guttman vector) and 1 (reversed Guttman vector).
The U 3 statistic (Van der Flier 1982) is similar to C * except that it is based on the log-odds of the item proportion-correct scores. It is given by where A standardized normal version (statistic ZU 3) was also developed (see Van der Flier 1982). However, the adequacy of this asymptotic approximation has been questioned by some researchers (e.g., Emons, Meijer, and Sijtsma 2002). Kane and Brennan (1980) introduced the agreement (A), disagreement (D), and dependability (E) statistics, which assess the similarity (A and E) and dissimilarity (D) between the vectors x n and p. The formulas are where max(A|s n ) equals the sum of the s n largest p i scores. Small values of A and E (i.e., in the left tail of the sampling distribution) are potentially indicative of aberrant response behavior, whereas large values of D (i.e., in the right tail of the sampling distribution) are potentially indicative of aberrant response behavior.
Sijtsma (1986; see also Sijtsma and Meijer 1992) proposed the H t statistic, which is based on an idea presented in Mokken (1971). Sijtsma observed that Mokken had already introduced an index H i that allowed assessing the scalability of an item to the Guttman model (Guttman 1944(Guttman , 1950. Sijtsma (1986) used the same index applied to the transposed data in order to detect test takers that would not comply with the Guttman model. Assume, without loss of generality, that the rows of the data matrix are ordered to increasing order of total score s n (n = 1, . . . , N ). The H t statistic is given by with t n = s n /I, t m = s m /I, and t nm is the proportion of items answered correctly by both test takers n and m (Sijtsma and Molenaar 2002, p. 57). H t takes its maximum value 1 when t nm = t n (n < m) and t nm = t m (n > m). This means that no test taker with a total score smaller/larger than t n can answer an item correctly/incorrectly that test taker n has answered incorrectly/correctly, respectively. H t equals zero when the average covariance of the response pattern of test taker n with the other response patterns equals zero. Van der Flier (1980;see also Meijer 1994;Tatsuoka and Tatsuoka 1982) proposed the number of Guttman errors as a PFS, here denoted G. Thus, G is the number of (0, 1) pairs of item scores in the item score vector with the incorrect answer given to the easiest item and the correct answer given to the most difficult item. G is an integer and its upper bound is a function of the test length, which is inconvenient. As a result, this statistic has been normalized, leading to statistic G n . The normalization is done against the maximum number of possible Guttman errors given the test taker's total score s n (which is s n (I − s n ) in the dichotomous case). The norm conformity index (NCI ) defined by Tatsuoka and Tatsuoka (1983) is perfectly linearly related to G (NCI = 1 − 2G n ).
A version of the U 3 statistic adapted to polytomous items was proposed by Emons (2008). This statistic relies on the concept of item steps (Molenaar 1982). Assume that all items have the same number m of ordered score categories. A test taker may take step h (h = 1, . . . , m − 1), from score category (h − 1) to h, only if all previous steps were taken. The goal is to associate the test taker's latent ability with the probability of taking each item step. Define π ih as the difficulty of item step h on item i, consisting of the population proportion of test takers with a score on item i at least equal to h. The item step difficulties of all items can be estimated in a sample and ranked in decreasing order (i.e., in increasing order of item-step difficulty), so that π 1 ≥ π 2 ≥ · · · ≥ π I(m−1) . Item-step scores of a test taker (1 if a step is taken, 0 otherwise) are collected in an I(m − 1) vector y = (y 1 , y 2 , . . . , y I(m−1) ). The U 3 p statistic (Emons 2008) is defined by with W (y) = I(m−1) k=1 y k logit( π k ) and X + = I(m−1) k=1 y k . The value min(W |X + ) cannot be written in closed form; its value can be found by means of a recursive algorithm (Emons 2008).
Extensions of the Guttman-based dichotomous statistics to polytomous items are also available in the package. Using the same notation introduced in the previous, statistic G p is defined by and the normed polytomous version is In this case also the value max(G p |X + ) needs to be computed by means of a recursive algorithm (Emons 2008).

Comparative performance of PFSs
Given the wealth of PFSs available to practitioners, it is important to know which statistics outperform others, that is, which statistics have the highest power to detect aberrant response behavior while keeping the false positive rates under control. Assessing the performance of competing statistical strategies is traditionally done by means of simulation studies, and in rare cases using empirical data. Results from several studies are indeed available (Birenbaum 1985(Birenbaum , 1986Drasgow, Levine, and McLaughlin 1987;Harnisch and Linn 1981;Harnisch and Tatsuoka 1983;Karabatsos 2003;Kogut 1987;Li and Olejnik 1997;Meijer 1994;Meijer, Muijtjens, and van der Vleuten 1996;Nering and Meijer 1998;Noonan, Boss, and Gessaroli 1992;Rogers and Hattie 1987;Rudner 1983;Tendeiro and Meijer 2014). Findings may vary according to the design of the simulation study (see Rupp 2013) and which PFSs were used in the study. However, a tentative general conclusion is that simple group-based PFSs like H t and U 3 do not perform worse, and in some cases perform better, than most model-based PFSs across different types of datasets (Karabatsos 2003;Tendeiro and Meijer 2014). This is good news for practitioners in the sense that (1) group-based PFSs are typically easier to compute, and (2) nonparametric measurement models are less restrictive to empirical data than parametric logistic IRT models (Sijtsma and Molenaar 2002).

Current software available
From the overview given above it is clear that there are many PFSs available that can be used to assess the validity of a test taker's test score. One of the most serious limitations of PFSs, however, is their implementation in practice. The relatively advanced statistical knowledge required to apply person-fit techniques in educational and psychological assessment and the lack of accessible software withholds substantive researchers and practitioners from using person-fit analyses with their own data. The few existing options available (WPERFIT, Ferrando and Lorenzo 2000; PERSONz, Choi 2010) do not offer group-based PFSs which were shown to perform very well in several simulation studies. Some PFSs are already available in R, such as l z (see packages irtoys; ltm; mirt, Chalmers 2012; and sirt, Robitzsch 2016) and the Rasch model-based outfit and infit statistics (Wright and Masters 1990; see packages eRm, mirt, and sirt). Only the sirt package currently offers some group-based PFSs (C * , D, r.pbis, and U 3). PerFit offers a broader set of PFSs, both model-and group-based. In particular, PerFit includes H t which has been shown to outperform most PFSs in several simulation studies (Karabatsos 2003;Tendeiro and Meijer 2014), l * z as an updated version of the commonly used l z statistic, and several PFSs suited to polytomously scored data.
Moreover, the ability to estimate PRFs that reflect observed deviations of individual response vector patterns with respect to the test takers group trend is of great value, as discussed in Section 2.2. Package irtProb (Raîche 2014) already allows plotting person characteristic curves, which are expected curves under parametric IRT models. We propose a graphical tool (function PRFplot()) that relies on observed item scores to better interpret person misfit, based on a nonparametric approach.
In combination with the other R packages already mentioned, PerFit is deemed as a major contribution to users relying on only one computing platform in order to conduct their data analyses.

Computing PFSs
Package PerFit is available from the Comprehensive R Archive Network (CRAN) at https: //CRAN.R-project.org/package=PerFit and can be installed from there using:

R> install.packages("PerFit")
The PFSs available in PerFit are listed in Table 2. Depending on the PFS, N × I matrices with scores 0, 1 (in the dichotomous case) or scores 0, 1, . . . , m − 1 (in the polytomous case) need to be loaded in R. The number of score options is required to be the same for all items. Missing values are allowed. The package includes three datasets by default. The InadequacyData (N = 806, I = 28) and the IntelligenceData (N = 1000, I = 26) datasets consist of dichotomously scored items whereas PhysFuncData (N = 714, I = 10) consists of polytomously scored items (each with three item categories).
Typically, to use a PFS function one may only provide the data matrix (argument matrix). For example, to compute the H t values for dataset IntelligenceData, run the following code: R> library("PerFit") R> data("IntelligenceData", package = "PerFit") R> Ht("IntelligenceData") Function Ht has more arguments but all have predefined defaults. In general, the arguments for dichotomous PFS functions are The default approach to dealing with missing values is pairwise deletion (i.e., NA.method = "Pairwise"). This procedure is adequate under the missing completely at random (MCAR) mechanism, and Zhang and Walker (2008) showed that the pairwise deletion method performed well under MCAR. Alternatively, three imputation methods are also available (see Zhang and Walker 2008 for an overview of these and related methods): • NA.method = "Hotdeck": Hotdeck imputation replaces missing responses of a test taker ('recipient') by item scores from the test taker which is closest to the recipient ('donor'), based on the recipient's nonmissing item scores. The similarity between nonmissing item scores of recipients and donors is based on the sum of absolute differences between the corresponding item scores. The donor's response pattern is deemed to be the most similar to the recipient's response pattern in the group, so item scores of the former are used to replace the corresponding missing values of the latter. When multiple donors are equidistant to a recipient, one donor is randomly drawn from the set of all donors.
• NA.method = "NPModel" (default): The nonparametric model imputation method is similar to the hotdeck imputation, but item scores are generated from Bernoulli distributions (for dichotomous items) or multinomial distributions (for polytomous items) with probabilities defined by donors with similar total score than the recipient (based on all items except the NAs).
• NA.method = "PModel": The parametric model imputation method is similar to the hotdeck imputation, but item scores are generated from Bernoulli distributions (for dichotomous items) or multinomial distributions (for polytomous items) with probabilities estimated by means of parametric IRT models (see below).
The user may choose to save the imputed data matrix to a file (Save.MatImp = TRUE).
A word of caution concerning missing values is in order here. When the proportion of missing values is small and follows the MCAR mechanism, the missing values procedures available within PerFit are probably well suited for most practical purposes. But in case more elaborate schemes to tackle missing values are required (e.g., by means of multiple imputation or maximum likelihood estimation) then users are advised to resort to specialized R packages (e.g., Amelia, Honaker, King, and Blackwell 2011;mi, Su, Gelman, Hill, and Yajima 2011;Miss-ingDataGUI, Cheng, Cook, and Hofmann 2016). However, when the proportion of missing values is large, both strategies (pairwise deletion and imputation) are not to be recommended because they may lead to biased person-fit results. For large proportions of missing values it is probably better not to calculate a person-fit statistic. Therefore, we strongly advice the user to first inspect the proportion of missing values for each test taker and then to decide, on the basis of these results, what the best strategy may be. One may even conceive considering both options and then to compare results.
The estimation of parametric item and ability parameters for group-based PFSs is only required if the parametric model imputation method is chosen. For all likelihood-based PFSs in PerFit, the model parameters need to be either provided (arguments IP and Ability) or internally estimated by the function. As an example, the following code shows how item parameters estimated using package mirt can be provided to compute the l z statistic: R> library("mirt") R> mod <-mirt(IntelligenceData, 1) R> ip.mirt <-coef(mod, IRTpars = TRUE, simplify = TRUE, digits = Inf)$ + items[, c("a", "b", "g")] R> lz.out <-lz(IntelligenceData, IP = ip.mirt) The default for dichotomous data is based on fitting the 2PLM to the data using functions available from the irtoys package (IRT.PModel = "2PL"). Ability parameters are by default estimated via maximum likelihood (Ability.PModel = "ML"). Alternatively, other dichotomous IRT models that can be used to describe the data are the 1PLM (IRT.PModel = "1PL") and the 3PLM (IRT.PModel = "3PL"), whereas the ability parameters may also be estimated by means of Bayes modal estimation (Ability.PModel = "BM") or weighted likelihood estimation (method = "WL"). For polytomous data the defaults are the graded response model (IRT.PModel = "GRM") with expected a posteriori estimates of ability parameters (Ability.PModel = "EAP"). Other model options for polytomous data include the partial credit model (IRT.PModel = "PCM") and the generalized partial credit model (IRT.PModel = "GPCM"), whereas estimation of ability parameters may be performed by means of empirical Bayes estimation (Ability.PModel = "EB") or multiple imputation estimation (Ability.PModel = "MI").
The output from a PFS function is an object of class 'PerFit', which consists of a list with 12 elements: 9. IP: The matrix of item parameters in case NA.method = "PModel", otherwise NULL.
11. Ability: The vector of N ability parameters in case NA.method = "PModel", otherwise NULL.

12.
NAs.method: The method used to deal with missing values.

Extra functions
Aside from the PFSs currently included in the package there are more functions available, namely: PerFit.PFS, PerFit.SE, cutoff, flagged.resp, the print, summary, and plot methods in PerFit, and PRFplot. The output is a matrix with two columns: PFscores shows the values of the person-fit statistic and PFscores.SE shows the estimated standard errors. For example, the standard errors of the H t values computed from the intelligence data can be found as follows:  This function allows estimating a cutoff value at a prespecified level. PFS values in the sample at or more extreme than the cutoff value lead to flagging the corresponding item score vectors as misfitting. The direction according to which a value may be considered extreme (i.e., larger than/smaller than the cutoff) varies. The last column in Table 2 shows which tail of the distribution indicates misfit. The cutoff function routinely reports which tail of the PFS distribution is used (Tail = "lower" or Tail = "upper").
The procedure consists of generating Nreps model-fitting item score vectors based on parametric models (ModelFit = "Parametric") or on sample proportions of test takers per answer category (ModelFit = "NonParametric"). The item parameters (either parametric or nonparametric) are estimated from the original data matrix. The Nreps ability parameters (under the parametric approach) or total scores (under the nonparametric approach) are randomly sampled with replacement from the set of N ability parameters or total scores in the sample, respectively. The probability of each simulee endorsing each answer option is then estimated under each modeling framework. Next, item scores are randomly generated by means of the multinomial distribution. Finally, Nreps values of the PFS corresponding to model-fitting item response patterns are computed.
A bootstrap procedure is then used to approximate the sampling distribution of the quantile of level Blvl (resp., 1 -Blvl) for "lower" (resp. "upper") types of PFS, based on Breps resamples. The cutoff (and its standard error) is given by the median (standard deviation) of this bootstrap distribution. A bootstrap percentile confidence interval of level CIlvl is also reported.
The cutoff may alternatively be manually entered by the user (e.g., when it is available from prior data calibration) by means of UDlvl. The goal is then to compute the proportion of test takers in the sample that are flagged by means of this cutoff value. In this case the bootstrap standard error and confidence intervals are not reported.
The cutoff function should be used on objects of class 'PerFit'. For example, the cutoff of the H t values computed from the intelligence data based on a nominal 10% error rate is found as follows: R> set.seed(123) R> cutoff(Ht.out, Blvl = 0.10) The seed was fixed in the command above for replicability, since the estimation of the cutoff value may vary between runs due to the resampling procedure.
The output is an object of class 'PerFit.cutoff', consisting of a list with five elements: The cutoff value ($Cutoff), the associated standard error ($Cutoff.SE), the proportion of item score vectors in the sample that will be flagged by means of this cutoff ($Prop.flagged), which tail of the distribution is relevant ($Tail), and the CIlvl% confidence interval ($Cutoff.CI).
The flagged.resp function flagged.resp(x, cutoff.obj = NULL, scores = TRUE, ord = TRUE, ModelFit = "NonParametric", Nreps = 1000, IP = x$IP, IRT.PModel = x$IRT.PModel, Ability = x$Ability, Ability.PModel = x$Ability.PModel, mu = 0, sigma = 1, Blvl = 0.05, Breps = 1000, CIlvl = 0.95, UDlvl = NA) This function finds which test takers in the sample are flagged by a PFS, based on a cutoff value. The PFS is specified by means of the 'PerFit' object x. The cutoff (i.e., an object of class 'PerFit.cutoff') may be provided using the argument cutoff.obj. If no cutoff is provided then it will be internally computed, for which the function parameters ModelFit through UDlvl are required (see function cutoff above). If scores = TRUE then the test takers' item scores will be shown in the output, either in the original item order (ord = FALSE) or in increasing difficulty order determined by decreasing proportion-correct values (ord = TRUE). The latter option is useful for interpretation as we explained in Section 2.2.  I8  I10  I7  I14  I4  I9  I11  I16  I6  I1  I20  I23 I17  I13  I3  I12  I2  I18  I25  I5  I15  I22  I21  I24  The output is a list with three elements. The first element is a matrix, $Scores, with the row rank score identifying the flagged test takers (first column), the item scores, and the PFS values (last column). The second element is a vector with the proportion-correct values ($MeanItemValue). The third element is the 'PerFit.cutoff' object. When the option scores = FALSE is used then the output omits the item scores and the items' proportioncorrect values.

The print and summary methods in PerFit
A summary for objects of class 'PerFit' is also available. To make the summary more practically useful to the user we combined the information provided by the object with the outcome from both the cutoff and the flagged.resp functions. Continuing the example above: The summary lists the flagged respondents as well as the details of the cutoff criterion on which this decision was based. We find this type of information more useful than merely reporting summaries of the scores of the PFSs. Function PRFplot plots nonparametric PRFs with optional variability bands (see argument VarBands) for all test takers identified in vector respID. It applies to dichotomous data only. The PRF relates item difficulty (i.e., the values (1 − π i )) on the x-axis with the associated probability of correct response on the y-axis. The PRF is typically nonincreasing; misfitting item score vectors will display deviations from this expected pattern. Missing values are addressed using one of the single imputation methods previously explained (for which arguments NA.method through sigma apply). The current implementation of the PRF is based on nonparametric kernel smoothing (Emons et al. 2004). The value of the PRF at each focal point (representing a nonparametric difficulty parameter between 0 and 1) is estimated as a weighted sum score, where scores pertaining to items with difficulty close to the focal point are given the largest weights. The total number of focal points is defined by N.Fpts (by default N.Fpts = 101). The weights are functions of the Gaussian kernel function. It is necessary to specify a bandwidth value (argument h) in order to compute the weights. The h value controls the trade-off between bias and sampling variation (Emons et al. 2004). Small h values reduce bias but increase variance, leading to PRFs that capture too much measurement error. Large h values, on the other hand, increase bias which renders PRFs that are often too flat, thus missing potentially relevant misfitting response behavior. Therefore, it is important to carefully specify the h value. Emons et al. (2004, pp. 10-13), after a simulation study, advised that "h values between 0.07 and 0.11 are reasonable choices". The function's default is h = 0.09.
Variability bands of level alpha (alpha = 0.05 by default) can also be added to the plot. These bands are computed following the jackknife procedure explained in Emons et al. (2004).
The PRFs and variability bands for each test taker are approximated by means of functional data objects (e.g., Ramsay, Hooker, and Graves 2009), with the help of the fda package (Ramsay, Wickham, Graves, and Hooker 2014). This procedure follows two steps: • Compute a B-splines basis system. This basis consists of a set of piecewise polynomials, all of degree three/order four (i.e., cubic polynomial segments), with one knot per break point. Any two consecutive splines, sp 1 and sp 2 , with common break point BP, verify sp 1 (BP) = sp 2 (BP), sp 1 (BP) = sp 2 (BP), and sp 1 (BP) = sp 2 (BP). At 0 and 1 (extremes of the x-range) four knots are used.
• Specify coefficients for the B-splines basis system computed above and then create functional data objects. The procedure is based on smoothing using regression analysis (Ramsay et al. 2009, Section 4.3). Function PRFplot outputs a list with three functional data objects of class 'fd' (see package fda). PRF.FDO is the functional data object of the PRFs for all test takers; VarBandsLow.FDO (resp. VarBandsHigh.FDO) is the functional data object of the lower-bound (resp. upperbound) of the variability bands, for the entire sample. Figure 1 (page 5) shows the PRFs for test takers 6, 584, 832, and 772 from the intelligence data. Figure 2 reproduces the third plot with variability bands included, and includes the R code in the caption that produces this plot disregarding the axes labels and the title formatting.
The cutoff score can be added to the plot (Cutoff = TRUE). Either the user provides a previously computed 'PerFit.cutoff' object or the function internally computes it (for which arguments ModelFit through UDlvl are required). The adequate "flagging region" is colored, thus indicating the range of values for which the PFS flags test takers as potentially displaying aberrant behavior. It is possible to display the CIvlv% confidence interval of the cutoff value on the x-axis (Cutoff.int = TRUE). The width of this interval may give the user a better idea of the error associated to the estimation of the cutoff. It is also possible to add ticks to the upper-side of the plot marking the positions of the flagged item score vectors. This information may be useful to visually distinguish between flagged test takers which are very close to the cutoff versus those that are further away in the rejection region direction. Finally, the option both.scale was included to better tune the scale of the y-axis.
As an example, Figure 3 shows the sample distribution of the U 3 p scores from the physical functioning data with a bootstrapped 1% cutoff value. Item score vectors with U 3 p values larger than 0.29 are flagged as potentially displaying aberrant answering behavior.

Discussion
PerFit is a new R package which offers the most widely used PFSs currently available. Developments in person-fit are ongoing and our goal is to keep extending PerFit. Some of the methods that are likely to be included in future versions of the package include a wider variety of model-based PFSs (see Karabatsos 2003 and, methods suited to computerized adaptive tests data (the CUSUMs approach; see for example Van Krimpen-Stoop and Meijer 2001), multitest person-fit approaches, and related methods based on the concept of person reliability (e.g., Ferrando 2015).