KernSmoothIRT: An R Package for Kernel Smoothing in Item Response Theory

Item response theory (IRT) models are a class of statistical models used to describe the response behaviors of individuals to a set of items having a certain number of options. They are adopted by researchers in social science, particularly in the analysis of performance or attitudinal data, in psychology, education, medicine, marketing and other fields where the aim is to measure latent constructs. Most IRT analyses use parametric models that rely on assumptions that often are not satisfied. In such cases, a nonparametric approach might be preferable; nevertheless, few software applications implement one. To address this gap, this paper presents the R package KernSmoothIRT. It implements kernel smoothing for the estimation of option characteristic curves, and adds several plotting and analytical tools to evaluate the whole test/questionnaire, the items, and the subjects. To show the package's capabilities, two real datasets are used, one with multiple-choice responses and the other with scaled responses.


Introduction
In psychometrics, the analysis of the relation between latent continuous variables and observed dichotomous/polytomous variables is known as item response theory (IRT). Observed variables arise from items of one of the following formats: multiple-choice, in which only one alternative is designed to be correct; multiple-response, in which more than one answer may be keyed as correct; rating scale, in which the phrasing of the response categories must reflect a scaling of the responses; partial credit, in which a partial credit is given in accordance with an examinee's degree of attainment in solving a problem; and nominal, in which there is neither a correct option nor an option ordering. Naturally, a set of items can be a mixture of these item formats. Hereafter, for consistency's sake, the term "option" will be used as the single term for several commonly used synonyms such as (response) category, alternative, and answer; likewise, the term "test" will refer to a set of items comprising any psychometric test or questionnaire.

This paper focuses on nonparametric IRT (NIRT). Its origins, prior to interest in parametric IRT (PIRT), are found in the scalogram analysis of Guttman (1947, 1950a). Nevertheless, the work by Mokken (1971) is recognized as the first important contribution to this paradigm; he not only gave a nonparametric representation of the item characteristic curves in the form of a basic set of formal properties they should satisfy, but also provided the statistical theory needed to check whether these properties hold in empirical data. Among these properties, monotonicity with respect to the latent trait ϑ was required. The R package mokken (van der Ark 2007) provides tools to perform a Mokken scale analysis. Several other NIRT approaches have been proposed (see van der Ark 2001). Among them, kernel smoothing (Ramsay 1991) is a promising option, due to its conceptual simplicity as well as its advantageous practical and theoretical properties.
The computer software TestGraf (Ramsay 2000) performs kernel smoothing estimation of option characteristic curves (OCCs) and related graphical analyses. In this paper we present the R (R Core Team 2013) package KernSmoothIRT, available from CRAN (http://CRAN.R-project.org/), which offers most of the TestGraf features and adds some related functionalities. Note that, although R is well provided with PIRT techniques (see de Leeuw and Mair 2007 and Wickelmaier, Strobl, and Zeileis 2012), it does not offer nonparametric IRT analyses of the type described above. Nonparametric smoothing techniques of the kind found in KernSmoothIRT are commonly used and often cited exploratory statistical tools; as evidence, consider how often classical statistical studies use the functions density() and ksmooth(), both in the stats package, for kernel smoothing estimation of a density or a regression function, respectively. Consistent with its exploratory nature, KernSmoothIRT can be used as a complementary tool to other IRT packages; for example, a mokken package user may use it to evaluate monotonicity. Owing to their statistical properties (see Douglas 1997 and Douglas and Cohen 2001), OCCs smoothed by kernel techniques have also been used in PIRT analysis as a benchmark to estimate the best OCCs in a pre-specified parametric family (Punzo 2009).
The paper is organized as follows. Section 2 retraces kernel smoothing estimation of the OCCs and Section 3 illustrates other useful IRT functions based on these estimates. The relevance of the package is shown, via two real datasets, in Section 4, and conclusions are finally given in Section 5.

Kernel smoothing of OCCs
Ramsay (1991, 1997) popularized nonparametric estimation of OCCs by proposing regression methods, based on kernel smoothing approaches, which are implemented in the TestGraf program (Ramsay 2000). The basic idea of kernel smoothing is to obtain a nonparametric estimate of the OCC by taking a (local) weighted average (see Altman 1992, Härdle 1990, and Simonoff 1996) of the form

$$\hat p_{jl}(\vartheta) = \sum_{i=1}^{n} w_{ij}(\vartheta)\, y_{ijl}, \tag{1}$$

$j = 1, \ldots, k$ and $l = 1, \ldots, m_j$, where $y_{ijl}$ is the binary indicator equal to 1 if subject $S_i$ selected option $l$ of item $I_j$ and 0 otherwise, and where the weights $w_{ij}(\vartheta)$ are defined so as to be maximal when $\vartheta = \vartheta_i$ and to be smoothly non-increasing as $|\vartheta - \vartheta_i|$ increases, with $\vartheta_i$ being the value of $\vartheta$ for $S_i \in \mathcal{S}$. The need to keep $\hat p_{jl}(\vartheta) \in [0, 1]$, for each $\vartheta \in \mathbb{R}$, requires the additional constraints $w_{ij}(\vartheta) \ge 0$ and $\sum_{i=1}^{n} w_{ij}(\vartheta) = 1$; as a consequence, it is preferable to use Nadaraya-Watson weights (Nadaraya 1964 and Watson 1964) of the form

$$w_{ij}(\vartheta) = \frac{K\!\left(\frac{\vartheta - \vartheta_i}{h_j}\right)}{\sum_{r=1}^{n} K\!\left(\frac{\vartheta - \vartheta_r}{h_j}\right)}, \tag{2}$$

where $h_j > 0$ is the smoothing parameter (also known as bandwidth) controlling the amount of smoothness (in terms of the bias-variance trade-off), while $K$ is the kernel function, a nonnegative, continuous ($\hat p_{jl}$ inherits its continuity from $K$) and usually symmetric function that is nonincreasing as its argument moves further from zero.
Since the performance of (1) depends much more on the choice of $h_j$ than on the kernel function (see, e.g., Marron and Nolan 1988), a simple Gaussian kernel $K(u) = \exp(-u^2/2)$ is often preferred (this is the only setting available in TestGraf). Nevertheless, KernSmoothIRT allows for other common choices such as the uniform kernel $K(u) = I_{[-1,1]}(u)$ and the quadratic kernel $K(u) = (1 - u^2)\, I_{[-1,1]}(u)$, where $I_A(u)$ represents the indicator function, assuming value 1 on $A$ and 0 otherwise. In addition to the functionalities implemented in TestGraf, KernSmoothIRT allows the bandwidth $h_j$ to vary from item to item (as highlighted by the subscript $j$). This is an important aspect, since different items may not require the same amount of smoothing to obtain smooth curves (Lei et al. 2004, p. 8).
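The weighted average (1) with Nadaraya-Watson weights (2) can be sketched in a few lines. This is a minimal Python illustration, assuming the abilities are already on the $\vartheta$ scale and that `selected` holds the 0/1 indicators $y_{ijl}$ for one option; the function names are hypothetical and this is not KernSmoothIRT's own code. The uniform and quadratic kernels omit their normalizing constants, which cancel in (2).

```python
import math

# Kernels from the text (normalising constants cancel in the weights).
def gaussian(u):
    return math.exp(-u * u / 2.0)

def uniform(u):
    return 1.0 if -1.0 <= u <= 1.0 else 0.0

def quadratic(u):
    return (1.0 - u * u) if -1.0 <= u <= 1.0 else 0.0

def nw_weights(theta, abilities, h, kernel=gaussian):
    """Nadaraya-Watson weights w_ij(theta) for one item with bandwidth h."""
    raw = [kernel((theta - t) / h) for t in abilities]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("no subject inside the kernel support at theta")
    return [r / total for r in raw]  # nonnegative, summing to one

def occ_estimate(theta, abilities, selected, h, kernel=gaussian):
    """Kernel estimate of p_jl(theta): weighted average of the 0/1 indicators."""
    w = nw_weights(theta, abilities, h, kernel)
    return sum(wi * yi for wi, yi in zip(w, selected))
```

Because the weights are nonnegative and sum to one, the estimate always stays in $[0, 1]$, which is exactly the constraint that motivates (2).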

Estimating abilities
Unlike standard kernel regression methods, in (1) the dependent variable $Y_{jl}$ is binary and the independent variable is the latent variable $\vartheta$. Although $\vartheta$ cannot be directly observed, kernel smoothing can still be used, but each $\vartheta_i$ in (2) must be replaced with a reasonable estimate $\hat\vartheta_i$ (Ramsay 1991), leading to

$$\hat p_{jl}(\vartheta) = \sum_{i=1}^{n} w_{ij}(\vartheta)\, y_{ijl}, \tag{3}$$

where

$$w_{ij}(\vartheta) = \frac{K\!\left(\frac{\vartheta - \hat\vartheta_i}{h_j}\right)}{\sum_{r=1}^{n} K\!\left(\frac{\vartheta - \hat\vartheta_r}{h_j}\right)}.$$

The choice of the scale of $\hat\vartheta_i$ is arbitrary, since in this context only rank order considerations make sense (Bartholomew 1983 and Ramsay 1991, p. 614). Therefore, as most IRT models do, the estimation process begins (Ramsay 1991, p. 615 and Ramsay 2000, pp. 25-26) with:
1. computation of the transformed rank $r_i = \mathrm{rank}(S_i)/(n + 1)$, with $\mathrm{rank}(S_i) \in \{1, \ldots, n\}$, induced by some suitable statistic $t_i$, the total score $t_i = \sum_{j=1}^{k} \sum_{l=1}^{m_j} y_{ijl}\, x_{jl}$ being the most obvious choice, where $x_{jl}$ is the score assigned to option $l$ of item $I_j$. KernSmoothIRT also allows, through the argument RankFun of the ksIRT() function, the use of common summary statistics available in R, such as mean() and median(), or of a custom user-defined function. Alternatively, the user may specify the rank of each subject explicitly through the argument SubRank, allowing subject ranks to come from a source other than the test being studied.
2. replacement of $r_i$ by the quantile $\hat\vartheta_i$ of some distribution function $F$. The estimated ability value for $S_i$ then becomes $\hat\vartheta_i = F^{-1}(r_i)$. In these terms, the denominator $n + 1$ keeps every $r_i < 1$ and thus avoids an infinite value for the largest $\hat\vartheta_i$ when $F(\vartheta) < 1$ for every finite $\vartheta$, that is, when $F$ reaches 1 only as $\vartheta \to +\infty$. Note that the choice of $F$ is equivalent to the choice of the $\vartheta$-metric. Historically, the standard Gaussian distribution $F = \Phi$ has been heavily used (see Bartholomew 1988). However, KernSmoothIRT allows the user to specify $F$ through any of the classical continuous distributions available in R.
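The two steps above can be sketched as follows: rank the summary statistics, divide by $n + 1$, and apply $F^{-1}$. This minimal Python version assumes $F = \Phi$ via the standard library's `statistics.NormalDist` (any object with an `inv_cdf` method would do), and gives tied scores their average rank, as R's rank() does by default; whether ksIRT() treats ties the same way is an assumption. The function names are hypothetical.

```python
from statistics import NormalDist

def average_ranks(values):
    """Ranks 1..n; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0  # mean of 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def ordinal_abilities(scores, dist=NormalDist()):
    """theta_hat_i = F^{-1}(rank_i / (n + 1)); F defaults to the standard normal."""
    n = len(scores)
    return [dist.inv_cdf(r / (n + 1)) for r in average_ranks(scores)]
```

Dividing by $n + 1$ rather than $n$ is what keeps the top-ranked subject away from $F^{-1}(1) = +\infty$.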
Since these preliminary ability estimates are rank-based, they are usually referred to as ordinal ability estimates. Note that even a substantial amount of error in the ranks has only a small impact on the estimated curve values. This can be demonstrated both by mathematical analysis and through simulated data (see Ramsay 1991, 2000 and Douglas 1997). Further theoretical results can be found in Douglas (2001) and Douglas and Cohen (2001). The latter also assert that, if nonparametric estimated curves are meaningfully different from parametric ones, the parametric model, defined on the particular scale determined by $F$, is an incorrect model for the data. For this comparison to be valid, it is fundamental that the same $F$ be used for both nonparametric and parametric curves. Thus, visual inspection of the estimated kernel curves can be useful when choosing a parametric family (Punzo 2009).

Operational aspects
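Operationally, the kernel OCC is evaluated on a finite grid, $\vartheta_1, \ldots, \vartheta_s, \ldots, \vartheta_q$, of $q$ equally spaced values spanning the range of the ordinal ability estimates, so that the distance between two consecutive points is $\delta$. Thus, starting from the values of $y_{ijl}$ and $\hat\vartheta_i$, by grouping we can define the two sequences of $q$ values

$$\tilde y_{sjl} = \sum_{i=1}^{n} y_{ijl}\, I_{[\vartheta_s - \delta/2,\ \vartheta_s + \delta/2)}(\hat\vartheta_i) \quad\text{and}\quad v_s = \sum_{i=1}^{n} I_{[\vartheta_s - \delta/2,\ \vartheta_s + \delta/2)}(\hat\vartheta_i), \qquad s = 1, \ldots, q. \tag{4}$$

Up to a scale factor, the sequence $\tilde y_{sjl}$ is a grouped version of $y_{ijl}$, while $v_s$ is the corresponding number of subjects in that group. It follows that, up to the grouping error, (3) can be computed as

$$\hat p_{jl}(\vartheta) \approx \frac{\sum_{s=1}^{q} K\!\left(\frac{\vartheta - \vartheta_s}{h_j}\right) \tilde y_{sjl}}{\sum_{s=1}^{q} K\!\left(\frac{\vartheta - \vartheta_s}{h_j}\right) v_s}.$$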
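A minimal sketch of the grouped evaluation: each subject is assigned to the nearest grid point (the half-open bin $[\vartheta_s - \delta/2, \vartheta_s + \delta/2)$), selections and subjects are counted per bin, and the kernel is then applied to the $q$ bins instead of the $n$ subjects. This is an illustration under the assumption that the grid spans exactly the range of the ability estimates; the function names are hypothetical and KernSmoothIRT's internals may differ.

```python
import math

def gaussian(u):
    return math.exp(-u * u / 2.0)

def binned_occ(abilities, selected, q, h, kernel=gaussian):
    """Grouped kernel OCC on q equally spaced grid points.

    ybar[s] counts selections of the option among subjects in bin s;
    v[s] counts all subjects in bin s.
    """
    lo, hi = min(abilities), max(abilities)
    if q < 2 or hi == lo:
        raise ValueError("need q >= 2 and a non-degenerate ability range")
    delta = (hi - lo) / (q - 1)
    grid = [lo + s * delta for s in range(q)]
    ybar = [0.0] * q
    v = [0] * q
    for t, y in zip(abilities, selected):
        # floor(x + 0.5) maps [s - 1/2, s + 1/2) to bin s
        s = min(q - 1, math.floor((t - lo) / delta + 0.5))
        ybar[s] += y
        v[s] += 1
    curve = []
    for g in grid:
        k = [kernel((g - gs) / h) for gs in grid]
        num = sum(ks * ys for ks, ys in zip(k, ybar))
        den = sum(ks * vs for ks, vs in zip(k, v))
        curve.append(num / den)
    return grid, curve
```

The cost per evaluation point drops from $n$ kernel evaluations to $q$, which matters when the same grid is reused for every option of every item.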

Cross-validation selection for the bandwidth
Two of the most frequently used methods of bandwidth selection are the plug-in method and cross-validation (for a more complete treatment of these methods see, e.g., Härdle 1990).
The former approach, widely used in kernel density estimation, often leads to rules of thumb. Motivated by the need for fast, automatically generated kernel estimates, the function ksIRT() of KernSmoothIRT adopts, as default, the common rule of thumb of Silverman (1986, p. 45) for the Gaussian kernel density estimator. In our context it is formulated as

$$h_j = h = 1.06\, \sigma_{\vartheta}\, n^{-1/5}, \qquad j = 1, \ldots, k, \tag{5}$$

where $\sigma_\vartheta$, which in the original framework is a sample estimate, simply represents the standard deviation of $\vartheta$ induced by $F$. Note that (5), with $\sigma_\vartheta = 1$, is the only approach considered in TestGraf.
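As a quick numerical illustration of the rule of thumb, the sketch below assumes the Silverman constant 1.06 and $\sigma_\vartheta = 1$ (i.e., $F = \Phi$); the function name is hypothetical.

```python
def rule_of_thumb_bandwidth(n, sigma=1.0):
    """Silverman's Gaussian rule of thumb h = 1.06 * sigma * n**(-1/5).

    sigma is the standard deviation of theta induced by F (1 for F = Phi);
    the same h is used for every item.
    """
    if n < 1:
        raise ValueError("n must be positive")
    return 1.06 * sigma * n ** (-0.2)
```

For instance, with the n = 379 subjects of the Psych 101 data, rule_of_thumb_bandwidth(379) gives h ≈ 0.32. Because of the $n^{-1/5}$ factor, the bandwidth shrinks only slowly as the sample grows.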
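The second approach, cross-validation, requires a considerably higher computational effort; nevertheless, it is simple to understand and widely applied in nonparametric kernel regression (see, e.g., Wong 1983, Rice 1984, and Mazza and Punzo 2011, 2013a,b, 2014). Its description, in our context, is as follows. Let $\mathbf{y}_j = (\mathbf{y}_{1j}, \ldots, \mathbf{y}_{ij}, \ldots, \mathbf{y}_{nj})$ be the $m_j \times n$ selection matrix referred to $I_j$, whose $i$th column $\mathbf{y}_{ij} = (y_{ij1}, \ldots, y_{ijm_j})^\top$ is the selection vector of $S_i$. Moreover, let

$$\hat{\mathbf{p}}_j(\vartheta) = \left(\hat p_{j1}(\vartheta), \ldots, \hat p_{jm_j}(\vartheta)\right)^\top$$

be the $m_j$-dimensional vector of kernel-estimated probabilities, for $I_j$, at the evaluation point $\vartheta$. The probability kernel estimate evaluated at $\vartheta$, for $I_j$, can thus be written as

$$\hat{\mathbf{p}}_j(\vartheta) = \mathbf{y}_j\, \mathbf{w}_j(\vartheta),$$

where $\mathbf{w}_j(\vartheta) = \left(w_{1j}(\vartheta), \ldots, w_{ij}(\vartheta), \ldots, w_{nj}(\vartheta)\right)^\top$ denotes the vector of weights.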
In detail, cross-validation simultaneously fits and smooths the data contained in $\mathbf{y}_j$ by removing one "data point" $\mathbf{y}_{ij}$ at a time, estimating the value of $\mathbf{p}_j$ at the corresponding ordinal ability estimate $\hat\vartheta_i$, and then comparing the estimate to the omitted, observed value. So the cross-validation statistic is

$$CV(h_j) = \sum_{i=1}^{n} \left[\mathbf{y}_{ij} - \hat{\mathbf{p}}_j^{(-i)}(\hat\vartheta_i)\right]^\top \left[\mathbf{y}_{ij} - \hat{\mathbf{p}}_j^{(-i)}(\hat\vartheta_i)\right],$$

where $\hat{\mathbf{p}}_j^{(-i)}(\hat\vartheta_i)$ is the estimated vector of probabilities at $\hat\vartheta_i$ computed by removing the observed selection vector $\mathbf{y}_{ij}$, as denoted by the superscript $(-i)$. The value of $h_j$ that minimizes $CV(h_j)$ is referred to as the cross-validation smoothing parameter, $h_j^{CV}$, and it can be found by systematically searching across a suitable region of smoothing parameters.
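The leave-one-out search can be sketched directly from its description: drop subject $i$, smooth the remaining selection vectors at $\hat\vartheta_i$, accumulate the squared Euclidean error, and keep the candidate bandwidth with the smallest total. This minimal Python version assumes a Gaussian kernel and a user-supplied grid of candidate bandwidths; the names are hypothetical, and returning infinity when every weight underflows to zero is a design choice of this sketch.

```python
import math

def gaussian(u):
    return math.exp(-u * u / 2.0)

def loo_cv(h, abilities, y, kernel=gaussian):
    """CV(h) = sum_i ||y_i - p_hat^(-i)(theta_i)||^2 for one item.

    y[i] is the 0/1 selection vector of subject i (length m_j).
    """
    n, m = len(abilities), len(y[0])
    total = 0.0
    for i in range(n):
        # weight 0 for the left-out subject itself
        w = [kernel((abilities[i] - abilities[r]) / h) if r != i else 0.0
             for r in range(n)]
        s = sum(w)
        if s == 0.0:
            return math.inf  # bandwidth too small: no neighbours left
        p = [sum(w[r] * y[r][l] for r in range(n)) / s for l in range(m)]
        total += sum((y[i][l] - p[l]) ** 2 for l in range(m))
    return total

def cv_bandwidth(abilities, y, candidates):
    """Grid search for the candidate bandwidth minimising CV(h)."""
    return min(candidates, key=lambda h: loo_cv(h, abilities, y))
```

Each evaluation of CV costs $O(n^2)$ kernel evaluations per item, which is why cross-validation is noticeably slower than the rule of thumb (5).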

Approximate pointwise confidence intervals
In visual inspection and graphical interpretation of the estimated kernel curves, pointwise confidence intervals at the evaluation points provide relevant information, because they indicate the extent to which the kernel OCCs are well defined across the range of ϑ considered. Moreover, they are useful when nonparametric and parametric models are compared.
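Since $\hat p_{jl}(\vartheta)$ is a linear function of the data, as can easily be seen from (3), and since $Y_{ijl} \sim \mathrm{Bernoulli}\left(p_{jl}(\hat\vartheta_i)\right)$, it follows that

$$\mathrm{Var}\left[\hat p_{jl}(\vartheta)\right] = \sum_{i=1}^{n} w_{ij}^2(\vartheta)\, p_{jl}(\hat\vartheta_i)\left[1 - p_{jl}(\hat\vartheta_i)\right].$$

The above formula holds if independence of the $Y_{ijl}$s, with respect to the subjects, is assumed and possible error variation in the arguments $\hat\vartheta_i$ is ignored (Ramsay 1991). Substituting $\hat p_{jl}$ for $p_{jl}$ yields the $(1 - \alpha)100\%$ approximate pointwise confidence intervals

$$\hat p_{jl}(\vartheta) \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}\left[\hat p_{jl}(\vartheta)\right]}, \tag{6}$$

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of the standard normal distribution. Other, more complicated approaches to interval estimation for kernel-based nonparametric regression functions are described in Azzalini, Bowman, and Härdle (1989) and Härdle (1990, Section 4.2).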
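The plug-in variance and the resulting approximate interval can be sketched as follows, assuming a Gaussian kernel and independent Bernoulli responses, and plugging the kernel estimate itself into $p(1-p)$. This is a hypothetical helper for illustration only; it does not clip the band to $[0, 1]$, and whether the package does so is not stated in the text.

```python
import math
from statistics import NormalDist

def gaussian(u):
    return math.exp(-u * u / 2.0)

def occ_with_ci(theta, abilities, selected, h, alpha=0.05, kernel=gaussian):
    """Kernel OCC estimate at theta with its approximate (1 - alpha) band."""
    def weights(x):
        k = [kernel((x - t) / h) for t in abilities]
        s = sum(k)
        return [ki / s for ki in k]

    def p_hat(x):
        return sum(wi * yi for wi, yi in zip(weights(x), selected))

    w = weights(theta)
    p_center = sum(wi * yi for wi, yi in zip(w, selected))
    # Var = sum_i w_i^2 p(theta_i)(1 - p(theta_i)), with p_hat plugged in
    var = 0.0
    for wi, t in zip(w, abilities):
        pi = p_hat(t)
        var += wi * wi * pi * (1.0 - pi)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = z * math.sqrt(var)
    return p_center, p_center - half, p_center + half
```

A larger alpha gives a smaller normal quantile and hence a narrower band, mirroring how the alpha argument is described for the EIS plots.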

Functions related to the OCCs
Once the kernel estimates of the OCCs are obtained, several other quantities can be computed based on them. In what follows we will give a concise list of the most important ones.

Expected item score
In order to obtain a single function for each item in $\mathcal{I}$, it is possible to define the expected value of the score $X_j = \sum_{l=1}^{m_j} x_{jl} Y_{jl}$, conditional on a given value of $\vartheta$ (see, e.g., Chang and Mazzeo 1994), as follows

$$e_j(\vartheta) = E(X_j \mid \vartheta) = \sum_{l=1}^{m_j} x_{jl}\, p_{jl}(\vartheta), \tag{7}$$

$j = 1, \ldots, k$, which takes values in $[x_{j\min}, x_{j\max}]$, where $x_{j\min} = \min\{x_{j1}, \ldots, x_{jm_j}\}$ and $x_{j\max} = \max\{x_{j1}, \ldots, x_{jm_j}\}$. The function $e_j(\vartheta)$ is commonly known as the expected item score (EIS) and can be viewed (Lord 1980) as a regression of the item score $X_j$ onto the $\vartheta$ scale. Naturally, for dichotomous and multiple-choice IRT models with 0/1 scoring, the EIS coincides with the OCC of the correct option.
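Starting from (7), it is straightforward to define the kernel EIS estimate as follows

$$\hat e_j(\vartheta) = \sum_{l=1}^{m_j} x_{jl}\, \hat p_{jl}(\vartheta) = \sum_{i=1}^{n} w_{ij}(\vartheta) \sum_{l=1}^{m_j} x_{jl}\, y_{ijl}.$$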
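For the EIS, in analogy with Section 2.4, the $(1 - \alpha)100\%$ approximate pointwise confidence interval is given by

$$\hat e_j(\vartheta) \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}\left[\hat e_j(\vartheta)\right]}, \tag{8}$$

where $\mathrm{Var}\left[\hat e_j(\vartheta)\right] = \sum_{i=1}^{n} w_{ij}^2(\vartheta)\, \mathrm{Var}\left(\sum_{l=1}^{m_j} x_{jl} Y_{ijl}\right)$ and, since $Y_{ijl} Y_{ijt} \equiv 0$ for $l \neq t$, one has

$$\mathrm{Var}\left(\sum_{l=1}^{m_j} x_{jl} Y_{ijl}\right) = \sum_{l=1}^{m_j} x_{jl}^2\, p_{jl}(\hat\vartheta_i) - \left[\sum_{l=1}^{m_j} x_{jl}\, p_{jl}(\hat\vartheta_i)\right]^2,$$

a quantity that, with $\hat p_{jl}$ substituted for $p_{jl}$, has to be inserted in (8).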
Strictly speaking, the intervals in (6) and (8) are intervals for $E[\hat p_{jl}(\vartheta)]$ and $E[\hat e_j(\vartheta)]$, respectively, rather than for $p_{jl}(\vartheta)$ and $e_j(\vartheta)$; thus, they share the bias present in $\hat p_{jl}$ and $\hat e_j$ (for the OCC case, see Ramsay 1991, p. 619).
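The kernel EIS and its interval can be sketched in the same way as the OCC band. The sketch assumes a Gaussian kernel, options indexed from 0, and option scores supplied as a list; the helper name is hypothetical. The per-subject variance uses the fact that at most one option indicator is 1, so the second moment is simply $\sum_l x_{jl}^2 p_{jl}$.

```python
import math
from statistics import NormalDist

def gaussian(u):
    return math.exp(-u * u / 2.0)

def eis_with_ci(theta, abilities, choices, scores, h, alpha=0.05, kernel=gaussian):
    """Kernel EIS e_hat_j(theta) with its approximate (1 - alpha) band.

    choices[i] is the (0-based) option chosen by subject i; scores[l] is x_jl.
    """
    m = len(scores)

    def weights(x):
        k = [kernel((x - t) / h) for t in abilities]
        s = sum(k)
        return [ki / s for ki in k]

    def probs(x):
        w = weights(x)
        return [sum(wi for wi, c in zip(w, choices) if c == l) for l in range(m)]

    w = weights(theta)
    e_hat = sum(wi * scores[c] for wi, c in zip(w, choices))
    var = 0.0
    for wi, t in zip(w, abilities):
        p = probs(t)
        mean = sum(x * pl for x, pl in zip(scores, p))
        second = sum(x * x * pl for x, pl in zip(scores, p))  # cross terms vanish
        var += wi * wi * (second - mean * mean)
    half = NormalDist().inv_cdf(1.0 - alpha / 2.0) * math.sqrt(max(var, 0.0))
    return e_hat, e_hat - half, e_hat + half
```

With 0/1 scores this reduces to the OCC band for the correct option, consistent with the remark that the EIS then coincides with that OCC.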

Expected test score
In analogy to Section 3.1, a single function for the whole test can be obtained as follows:

$$e(\vartheta) = \sum_{j=1}^{k} e_j(\vartheta).$$

It is called the expected test score (ETS). Its kernel-smoothed counterpart can be specified as

$$\hat e(\vartheta) = \sum_{j=1}^{k} \hat e_j(\vartheta) \tag{9}$$

and, for people who are not used to IRT, may be preferred to $\vartheta$ as the display variable on the x-axis, to facilitate the interpretation of the OCCs as well as of the other output plots of KernSmoothIRT. This possibility is offered through the argument axistype="scores" of the plot() method. Note that, although (9) can fail to be monotonically increasing in $\vartheta$, this event is rare and tends to affect the plots only at extreme trait levels.
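On a grid of evaluation points, (9) is just a pointwise sum of the item EIS curves. The short sketch below (hypothetical helper names) also flags the grid steps where the ETS fails to increase, which is where a score-based x-axis would stop being a one-to-one relabelling of $\vartheta$.

```python
def expected_test_score(eis_curves):
    """ETS on a grid: pointwise sum of the k items' EIS curves."""
    return [sum(vals) for vals in zip(*eis_curves)]

def non_increasing_steps(ets):
    """Grid indices s where the ETS does not increase from s to s + 1."""
    return [s for s in range(len(ets) - 1) if ets[s + 1] <= ets[s]]
```

For example, three items whose EIS values at three grid points are (0.1, 0.4, 0.8), (0.2, 0.3, 0.9) and (0.5, 0.4, 0.6) give an ETS of (0.8, 1.1, 2.3), which is increasing even though the third item alone is not.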

Relative credibility curve
For a generic subject $S_i \in \mathcal{S}$, we can compute the relative likelihood of the various values of $\vartheta$ given his pattern of responses and given the kernel-estimated OCCs:

$$L_i(\vartheta) = \frac{\prod_{j=1}^{k} \prod_{l=1}^{m_j} \left[\hat p_{jl}(\vartheta)\right]^{y_{ijl}}}{\max_{\vartheta} \prod_{j=1}^{k} \prod_{l=1}^{m_j} \left[\hat p_{jl}(\vartheta)\right]^{y_{ijl}}}. \tag{10}$$

The function in (10) is also known as the relative credibility curve (RCC; see, e.g., Lindsey 1973). The $\vartheta$-value, say $\hat\vartheta_i^{ML}$, such that $L_i(\vartheta) = 1$, is called the maximum likelihood (ML) estimate of the ability of $S_i$ (see also Kutylowski 1997). Unlike simple summary statistics such as the total score, $\hat\vartheta^{ML}$ considers, in addition to the whole pattern of responses, the characteristics of the items as described by their OCCs; thus, it will tend to be a more accurate estimate of the ability.
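The relative likelihood can be computed on the evaluation grid from already estimated OCCs. This minimal sketch works on the log scale to avoid underflow when many items are multiplied, then rescales so that the maximum equals 1; the grid maximum stands in for the maximum over all $\vartheta$. Skipping items coded `None` is an assumption of the sketch (by default ksIRT() treats a missing answer as an extra option instead), and the names are hypothetical.

```python
import math

def relative_credibility(occ_grid, responses):
    """Relative likelihood L_i(theta) over a grid of evaluation points.

    occ_grid[j][l][s] is p_hat_jl at grid point s; responses[j] is the
    0-based option chosen on item j (None = item skipped in this sketch).
    """
    q = len(occ_grid[0][0])
    loglik = [0.0] * q
    for j, l in enumerate(responses):
        if l is None:
            continue
        for s in range(q):
            p = occ_grid[j][l][s]
            loglik[s] += math.log(p) if p > 0 else -math.inf
    top = max(loglik)
    if top == -math.inf:
        raise ValueError("zero likelihood at every grid point")
    return [math.exp(v - top) for v in loglik]  # rescaled so max == 1

def ml_index(rcc):
    """Grid index of theta_ML, where the relative likelihood equals 1."""
    return max(range(len(rcc)), key=rcc.__getitem__)
```

A flat RCC signals an imprecise ability estimate, while a second local peak, like the bump discussed for $S_{33}$ in Section 4, hints at an inconsistent response pattern.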
Finally, as Kutylowski (1997) and Ramsay (2000) do, the obtained values of $\hat\vartheta^{ML}$ may be used as a basis for a second step of kernel smoothing estimation of the OCCs. This iterative process, which consists of feeding the values of $\hat\vartheta^{ML}$ back into the estimation, can clearly be repeated any number of times, in the hope that each step refines the estimates of $\vartheta$. However, as Ramsay (2000) states, for the vast majority of applications no iterative refinement is really necessary, and using either $\hat\vartheta_i$ or $\hat\vartheta_i^{ML}$ for ranking examinees works fine. This is why we have not implemented the iterative process in the package.

Probability simplex
With reference to a generic item $I_j \in \mathcal{I}$, the vector of probabilities $\mathbf{p}_j(\vartheta)$ can be seen as a point in the probability simplex $S^{m_j}$, defined as the $(m_j - 1)$-dimensional subset of the $m_j$-dimensional space containing vectors with nonnegative coordinates summing to one. As $\vartheta$ varies, because of the assumptions of smoothness and unidimensionality of the latent trait, $\mathbf{p}_j(\vartheta)$ moves along a curve; the item analysis problem is to locate the curve properly within the simplex. The estimation problem for $S_i$, on the other hand, is to locate its position along this curve.
As illustrated in Aitchison (2003, pp. 5-9), a convenient way of displaying points in $S^{m_j}$, when $m_j = 3$ or $m_j = 4$, is given, respectively, by the reference triangle in Figure 1(a), an equilateral triangle having unit altitude, and by the regular tetrahedron of unit altitude in Figure 1(b). Here, for any point $\mathbf{p}$, the lengths of the perpendiculars $p_1, \ldots, p_{m_j}$ from $\mathbf{p}$ to the sides (faces, for the tetrahedron) opposite the vertices $1, \ldots, m_j$ are all greater than or equal to zero and sum to one. Since there is a unique point with these perpendicular values, there is a one-to-one correspondence between $S^3$ and points in the reference triangle, and between $S^4$ and points in the regular tetrahedron. Thus, we have a simple means of representing the vector of probabilities $\mathbf{p}_j(\vartheta)$ when $m_j = 3$ and $m_j = 4$. Note that for items with more than four options there is no satisfactory way of obtaining a visual representation of the corresponding probability simplex; nevertheless, with KernSmoothIRT we can perform a partial analysis which focuses on only three or four of the options.
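The triangle representation amounts to barycentric coordinates. In an equilateral triangle of unit altitude (side $2/\sqrt{3}$), the point $p_1 V_1 + p_2 V_2 + p_3 V_3$ lies at perpendicular distance $p_l$ from the side opposite vertex $V_l$. The sketch below places the vertices at arbitrary but fixed coordinates of our choosing; the package's own plotting layout may differ.

```python
import math

# Equilateral triangle with unit altitude: side length 2 / sqrt(3).
SIDE = 2.0 / math.sqrt(3.0)
VERTICES = [(0.0, 0.0), (SIDE, 0.0), (SIDE / 2.0, 1.0)]

def simplex_to_triangle(p):
    """Map a probability vector (p1, p2, p3) to a point in the triangle.

    The barycentric point p1*V1 + p2*V2 + p3*V3 lies at perpendicular
    distance p_l from the side opposite vertex l.
    """
    if len(p) != 3 or min(p) < 0 or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be a probability vector of length 3")
    x = sum(pl * v[0] for pl, v in zip(p, VERTICES))
    y = sum(pl * v[1] for pl, v in zip(p, VERTICES))
    return x, y
```

Applying this map to $\hat{\mathbf{p}}_j(\vartheta_s)$ for each evaluation point $s$ traces the curve that the triangle plots display.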

Package description and illustrative examples
The main function of the package is ksIRT(); it creates an S3 object of class ksIRT, which provides a plot() method as well as a suite of functions that allow the user to analyze the subjects, the options, the items, and the overall test. What follows is an illustration of the main capabilities of the KernSmoothIRT package.

Kernel smoothing with the ksIRT() function
The ksIRT() function performs the kernel smoothing. It requires responses, an (n × k) matrix with a row for each subject in S and a column for each item in I, containing the selected option numbers. Alternatively, responses may be a data frame or a list object.
Arguments for setting the item format: format, key, weights
To use the basic weighting schemes associated with each item format, the following combinations of arguments should be used.
• For multiple-choice items, use format=1 and provide in key, for each item, the option number that corresponds to the correct option. For multiple-response items, one way to score them is simply to count the correctly classified options; to do this, a preliminary conversion of every option into a separate true/false item is necessary.
• For rating scale and partial credit items, use format=2 and provide in key a vector with the number of options of every item. If all the items have the same number of options, then key may be a scalar.
• For nominal items, use format=3; key is omitted. Note that to analyze items or options, subjects have to be ranked; this can only be done if the test also contains non-nominal items or if a prior ranking of subjects is provided with SubRank.
If the test is made of a mixture of different item formats, then format must be a numeric vector of length equal to the number of items. More complicated weighting schemes may be specified using weights in lieu of both format and key (see the help for details).
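For multiple-choice items (format=1), the basic scheme gives $x_{jl} = 1$ to the keyed option and 0 to the others, so the total score of Section 2.1 is just the number of answers matching the key. The sketch below is a hypothetical illustration of that arithmetic, not the package's scoring routine; a missing answer (`None`) scores 0 here, matching the default NAweight of 0.

```python
def score_multiple_choice(responses, key):
    """0/1 scoring: x_jl = 1 if option l is the key of item j, else 0.

    responses[i][j] is the option number chosen by subject i on item j
    (None for a missing answer, which scores 0).
    Returns the total scores t_i = sum_j sum_l y_ijl * x_jl.
    """
    totals = []
    for row in responses:
        if len(row) != len(key):
            raise ValueError("each row needs one answer per item")
        totals.append(sum(1 for ans, k in zip(row, key) if ans == k))
    return totals
```

These totals are the default statistic $t_i$ from which the ordinal ability estimates are ranked.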
Arguments for smoothing: evalpoints, nevalpoints, thetadist, kernel, bandwidth
The user can select the q evaluation points of Section 2.2, the ranking distribution F of Section 2.1, the type of kernel function K, and the bandwidth(s). The number q of evaluation points of the OCCs may be specified in nevalpoints; by default q = 51 and their range is data dependent. Alternatively, the user may directly provide the evaluation points using evalpoints.
As to F, it is Φ by default; any other distribution, with its parameter values, may be provided in thetadist. The default kernel function is the Gaussian; uniform or quadratic kernels may be selected with kernel. The global bandwidth is computed by default according to the rule of thumb in equation (5). Otherwise, the user may either input a numerical vector of bandwidths, one for each item, or opt for cross-validation estimation, as described in Section 2.3, by specifying bandwidth="CV".

Arguments to handle missing values: miss, NAweight
Several approaches are implemented for handling missing answers. The default, miss="option", treats them as further options, with the weight specified in NAweight, which is 0 by default. When OCCs are plotted, the new options are added to the corresponding items. Other choices impute the missing values according to some discrete probability distribution taking values in $\{1, \ldots, m_j\}$, $j = 1, \ldots, k$: the uniform distribution is specified by miss="random.unif", while the multinomial distribution, with probabilities equal to the frequencies of the non-missing options of that item, is specified by miss="random.multinom". Finally, miss="omit" deletes from the data all the subjects with at least one omitted answer.
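The two imputation rules can be sketched for a single item as follows. This Python illustration assumes options are numbered 1 to $m_j$ and missing answers are coded `None`; the function and its `method` values are hypothetical stand-ins for miss="random.unif" and miss="random.multinom", not the package's code. Passing a seeded `random.Random` makes the imputation reproducible.

```python
import random
from collections import Counter

def impute_item(column, n_options, method, rng=random):
    """Fill the None entries of one item's responses (options 1..n_options).

    method="unif": draw uniformly from {1, ..., n_options};
    method="multinom": draw with probabilities equal to the observed
    frequencies of the non-missing options of this item.
    """
    if method == "unif":
        def draw():
            return rng.randint(1, n_options)
    elif method == "multinom":
        counts = Counter(a for a in column if a is not None)
        if not counts:
            raise ValueError("no observed answers to base frequencies on")
        options = sorted(counts)
        freqs = [counts[o] for o in options]
        def draw():
            return rng.choices(options, weights=freqs)[0]
    else:
        raise ValueError("method must be 'unif' or 'multinom'")
    return [a if a is not None else draw() for a in column]
```

Note that the multinomial rule can never impute an option nobody chose, whereas the uniform rule can.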

The ksIRT class
The ksIRT() function returns an S3 object of class ksIRT; its main components, along with their brief descriptions, can be found in Table 1. Methods implemented for this class are illustrated in Table 2. The plot() method allows for a variety of exploratory plots, which are selected with the argument plottype; its main options are described in Table 3.

Psych 101
The first tutorial uses the Psych 101 dataset included in the KernSmoothIRT package. This dataset contains the responses of n = 379 students, in an introductory psychology course, to k = 100 multiple-choice items, each with m_j = 4 options, together with the answer key. These data were also analyzed in Ramsay and Abrahamowicz (1989) and in Ramsay (1991).
To begin the analysis, create a ksIRT object. This step performs the kernel smoothing and prepares the object for analysis using the many types of plots available.

Option characteristic curves
Once the ksIRT object Psych1 is created, several plots become available for analyzing each item, each subject, and the overall test. They are displayed through the plot() method, as described below. In the OCC plots of Figure 2, for items 24, 25, 92, and 96, the correct options are displayed in blue and the incorrect options in red. The default specification axistype="scores" uses the expected total score (9) as the display variable on the x-axis. The vertical dashed lines indicate the scores (or quantiles if axistype="distribution") below which 5%, 25%, 50%, 75%, and 95% of subjects fall. Since the argument miss has not been specified, by default (miss="option") an additional OCC is plotted for items receiving nonresponses, as can be seen in Figure 2(b) and Figure 2(d).

The OCC plots in Figure 2 show four very different items. Apart from item 96 in Figure 2(d), the items appear to be reasonably monotone. Item 96 is problematic for the Psych 101 instructor, as subjects with lower trait levels are more likely to select the correct option than examinees with higher trait levels. In fact, examinees with an expected total score of 90 are the least likely to select the correct option. Perhaps the question is misworded or it measures a different trait. By contrast, items 24, 25, and 92 do a good job of differentiating between subjects with low and high trait levels. In particular, item 24, in Figure 2(a), displays a high discriminating power for subjects with expected total scores near 40, and a low discriminating power for subjects with expected total scores greater than 50; above 50, subjects have roughly the same probability of selecting the correct option regardless of their expected total score. Item 25 in Figure 2(b) is also effective, since only the top students are able to recognize option 3 as incorrect; option 3 was selected by about 30.9% of the test takers, that is, by 72.7% of those who answered incorrectly. Note also that, for subjects with expected total scores below about 58, option 3 is the most probable choice. Finally, item 92 in Figure 2(c), besides being approximately monotone, is also easy, since a subject with an expected total score of about 30 already has a 70% chance of selecting the correct option; consequently, only a few examinees are attracted to the incorrect options 1, 3, and 4.

Expected item scores
Through the code

R> plot(Psych1, plottype="EIS", item=c(24,25,92,96))

we obtain, for the same set of items, the EISs displayed in Figure 3. Due to the 0/1 weighting scheme, the EIS is the same as the OCC of the correct option (shown in blue in Figure 2). By default, EISs are shown with the 95% approximate pointwise confidence intervals (dashed red lines) illustrated in Section 2.4. Via the argument alpha, these confidence intervals can be removed entirely (alpha=FALSE) or changed by specifying a different value. In this example, relatively wide confidence intervals are obtained for expected total scores at extremely high or low levels. This is because fewer data are available for estimating the curve in these regions, and thus the estimates are less precise. Finally, the points on the EIS plots show the observed average score for the subjects grouped as in (4).

Probability simplex plots
To complement the OCCs, the package includes triangle and tetrahedron (simplex) plots that, as illustrated in Section 3.4, synthesize the OCCs. When these plots are used on items with more than 3 or 4 options (including the missing value category), only the options corresponding to the 3 or 4 highest probabilities are shown; naturally, these probabilities are normalized in order to allow the simplex representation. This seldom loses any real information, since experience shows that in a very wide range of situations people tend to eliminate all but a few options. The tetrahedron is the natural choice for items 24 and 92, which have four options and no "observed" missing responses; for these items the code

R> plot(Psych1, plottype="tetrahedron", items=c(24,92))

generates the tetrahedron plots displayed in Figure 4. These plots may be manipulated with the mouse or keyboard. Inside the tetrahedron there is a curve constructed on the q (nevalpoints) evaluation points. In particular, low, medium, and high trait levels are identified by red, green, and blue points, respectively, where the levels are simply the values of evalpoints broken into three equal groups. Considering this ordering in the trait level, the following observations can be made.
• A basic requirement of a reasonable item, of this format, is that the sequence of points terminates at or near the correct answer. In these terms, as can be noted in Figure 4(a) and Figure 4(b), items 24 and 92 satisfy this requirement since the sequence of points moves toward the correct option.
• The length of the curve is very important. The individuals with the lowest trait levels should be far from those with the highest. Item 24, in Figure 4(a), is a fairly good example. By contrast very easy items, such as item 92 in Figure 4(b), have very short curves concentrated close to the correct answer, with only the worst students showing a slight tendency to choose a wrong answer.
• The spacing of the points is related to the speed at which probabilities of choice change; compare the worst students of Figure 4(a) with those in Figure 4(b) and also the corresponding results in Figure 2(a) and Figure 2(c), respectively.
For the same items, the code

R> plot(Psych1, plottype="triangle", items=c(24,92))

produces the triangle plots displayed in Figure 5. From Figure 5(a) we can see that, among the three most chosen options, the second one has a much higher probability of being selected while the other two share almost the same probability; hence the sequence of points lies approximately on the bisector of the angle associated with the second option.
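The reduction to three or four options can be sketched as follows. The text speaks both of the options with the highest probabilities and of the most chosen options; this sketch assumes the latter, selecting options once per item by their selection counts and then renormalizing their kernel probabilities at every evaluation point so that each point lies in the simplex. The names are hypothetical and the package's exact selection rule may differ.

```python
def simplex_curve(occ, counts, how_many=3):
    """Restrict an item's OCCs to its how_many most chosen options.

    occ maps option label -> list of p_hat values over the evaluation grid;
    counts maps option label -> number of subjects who chose it.
    Returns the kept labels and, per grid point, their renormalised
    probabilities (summing to 1, as the simplex requires).
    """
    if how_many not in (3, 4):
        raise ValueError("simplex plots use 3 (triangle) or 4 (tetrahedron) options")
    keep = sorted(counts, key=counts.get, reverse=True)[:how_many]
    q = len(next(iter(occ.values())))
    points = []
    for s in range(q):
        raw = [occ[label][s] for label in keep]
        total = sum(raw)
        points.append([r / total for r in raw])
    return keep, points
```

Each renormalized triple can then be placed in the reference triangle with barycentric coordinates, as described in Section 3.4.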

Principal component analysis
By performing a principal component analysis (PCA) of the EISs at each evaluation point, the KernSmoothIRT package provides a way to simultaneously compare items and to show the relationships among them. Since EISs may be defined on different ranges $[x_{j\min}, x_{j\max}]$, the transformation $(\hat e_j(\vartheta) - x_{j\min})/(x_{j\max} - x_{j\min})$, $j = 1, \ldots, k$, is applied first. Furthermore, as stated in Section 2.1, in this paradigm only rank order considerations make sense, so the zero-centered ranks of $\hat e_1(\vartheta_s), \ldots, \hat e_k(\vartheta_s)$, for each $s = 1, \ldots, q$, are computed and the PCA is carried out on the resulting $(q \times k)$ matrix. In particular, the code

R> plot(Psych1, plottype="PCA")

produces the graphical representation in Figure 6. A first glance at this plot shows that:
• the first principal component, on the horizontal axis, represents item difficulty, since the most difficult items are placed on the right and the easiest ones on the left. The small plots on the left and on the right show the EISs for the two extreme items with respect to this component and help the user identify the direction of the axis with respect to difficulty (from low to high or from high to low). Here, $I_7$ shows high difficulty, as test takers of all ability levels receive a low score, while $I_{11}$ is extremely easy;
• the second principal component, on the vertical axis, corresponds to item discrimination, since items placed low tend to have a high positive slope while items placed high tend to have a high negative slope. Also in this case, the small plots on the bottom and on the top show the EISs for the two extreme items with respect to this component and help the user identify the direction of the axis with respect to discrimination (from low to high or vice versa). Here, while both $I_{96}$ and $I_{29}$ possess a very strong discrimination, $I_{96}$ is clearly ill-posed, since it discriminates negatively.

In conclusion, the principal components plot tends to be a useful overall summary of the composition of the test. Figure 6 is fairly typical of most academic tests, and it is also usual to have only two dominant principal components, reflecting item difficulty and discrimination.
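The preprocessing that precedes the PCA can be sketched on its own: rescale each EIS to $[0, 1]$ using its item's score range, then replace each row (evaluation point) by the zero-centered ranks of the $k$ rescaled values. This minimal Python sketch averages tied ranks, which is an assumption; the PCA itself could then be run with any standard routine (e.g., prcomp() in R), and nothing here claims to reproduce the package's internals.

```python
def rank_centered(values):
    """Ranks 1..k of the values (ties averaged), centred to mean zero."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2.0 + 1.0
        i = j + 1
    mean = (len(values) + 1) / 2.0  # mean of ranks 1..k, ties included
    return [r - mean for r in ranks]

def pca_input(eis, x_min, x_max):
    """Build the (q x k) matrix analysed by the PCA of the EISs.

    eis[j][s] is e_hat_j at grid point s; x_min[j], x_max[j] bound item j.
    """
    k, q = len(eis), len(eis[0])
    scaled = [[(e - x_min[j]) / (x_max[j] - x_min[j]) for e in eis[j]]
              for j in range(k)]
    return [rank_centered([scaled[j][s] for j in range(k)]) for s in range(q)]
```

Rescaling before ranking matters: without it, an item scored on a wider range would rank above the others at every evaluation point simply because of its score units.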

Relative credibility curves
The RCCs shown in Figure 7 are obtained by the command

R> plot(Psych1, plottype="RCC", subjects=c(13,92,111,33))

In each plot, the red line shows the subject's actual score $t$.
For both subjects considered in Figure 7(a) and Figure 7(b), there is substantial agreement between the location of the RCC maximum, $\hat e(\hat\vartheta^{ML})$, and $t$. Nevertheless, there is a difference in terms of the precision of the ML estimates: for $S_{13}$ the RCC is spikier, denoting a higher precision. In Figure 7(c) there is a substantial difference between $\hat e(\hat\vartheta^{ML}_{111})$ and $t_{111}$. This indicates that the correct and incorrect answers of this subject are more consistent with a lower score than with the actual score received. Finally, in Figure 7(d), although there is substantial agreement between $\hat e(\hat\vartheta^{ML}_{33})$ and $t_{33}$, a small but prominent bump is present in the right part of the plot. Although $S_{33}$ is well represented by his total score, he passed a few difficult items, and this may suggest that he is more able than $t_{33}$ indicates.

Test summary plots
The KernSmoothIRT package also contains many analytical tools to assess the test overall. Figure 8 shows a few of these, obtained via the code

R> plot(Psych1, plottype="expected")
R> plot(Psych1, plottype="sd")
R> plot(Psych1, plottype="density")

Figure 8(a) shows the ETS as a function of the quantiles of the standard normal distribution Φ; it is nearly linear for the Psych 101 dataset. Note that, in the nonparametric context, the ETS may be non-monotone due to either ill-posed items or random variation. In the latter case, a slight increase of the bandwidth may be advisable.
The total score for subjects having a particular value of ϑ is a random variable, in part because different examinees, or even the same examinee on different occasions, cannot be expected to make exactly the same choices. The standard deviation of these values, graphically represented in Figure 8(b), is therefore also a function of ϑ. Figure 8(b) indicates that the standard deviation reaches its maximum for examinees with a total score of around 50, where it is about 4.5 items out of 100. This translates into approximate 95% confidence limits of about 41 and 59 (that is, 50 ± 1.96 × 4.5) for a subject getting 50 items correct.
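The confidence limits quoted above follow from a normal approximation. A minimal sketch of the arithmetic, assuming the total score is roughly normal around 50 with the standard deviation of 4.5 read off Figure 8(b):

```r
# Approximate 95% limits for a total score, assuming approximate normality
score    <- 50    # total score
sd_score <- 4.5   # standard deviation at that score (from Figure 8(b))
limits <- score + c(-1, 1) * qnorm(0.975) * sd_score
limits         # about 41.18 and 58.82
round(limits)  # 41 59
```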
Figure 8(c) shows a kernel density estimate of the distribution of the total score. Although this distribution is commonly assumed to be "bell-shaped", the plot shows that this assumption is questionable for these data. In particular, there is a negative skewness, which is a consequence of the test having relatively more easy items than hard ones. Moreover, bimodality is evident.
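The two features noted above, skewness and multimodality, can be checked numerically as well as visually. The sketch below uses base R's `density()` with a Gaussian kernel and a moment-based skewness helper on hypothetical simulated scores; the data and the name `skewness` are our own illustration, not the Psych 101 data or a KernSmoothIRT function.

```r
# Moment-based sample skewness: negative values indicate a longer left tail
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(42)
# Hypothetical total scores on a 100-item test: a large group of strong
# examinees and a smaller group of weaker ones (left-skewed, bimodal mix)
scores <- c(rbinom(300, 100, 0.75), rbinom(100, 100, 0.45))

# Kernel density estimate; the bandwidth defaults to bw.nrd0
dens <- density(scores, kernel = "gaussian")

skewness(scores)             # negative for this mixture
dens$x[which.max(dens$y)]    # location of the main mode
```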

Voluntary HIV-1 counseling and testing efficacy study group
It is often useful to explore whether the expected score of a specific item differs when estimated on two or more groups of subjects, commonly formed by gender or ethnicity. This is called Differential Item Functioning (DIF) analysis in the psychometric literature. In particular, DIF occurs when subjects with the same ability but belonging to different groups have a different probability of choosing a certain option. DIF can properly be called item bias because the curves of an item should depend only on ϑ, and not directly on other person factors (see Zumbo 2007).

For the Voluntary HIV-1 Counseling and Testing Efficacy Study, n = 4292 persons were enrolled. The whole dataset, downloadable from http://caps.ucsf.edu/research/datasets/ (which also contains other useful survey details), reports 1571 variables for each participant. As part of this study, respondents were surveyed about their attitude toward condom use via a bank of k = 15 items. Respondents were asked how much they agreed with each statement on a 4-point response scale: 1 = "strongly disagree", 2 = "disagree more than I agree", 3 = "agree more than I disagree", 4 = "strongly agree". Since 10 individuals omitted all 15 questions, they were removed from the data beforehand. Moreover, given the ("negative") wording of items $I_2$, $I_3$, $I_5$, $I_7$, $I_8$, $I_{11}$, and $I_{14}$, a respondent who strongly agreed with such statements was indicating a less favorable attitude toward condom use. To make the coding uniform, the scores of these seven items were reversed beforehand. The modified dataset can be loaded directly from the KernSmoothIRT package with the code

R> data("HIV")
R> HIV
R> attach(HIV)
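The data preparation described above (dropping respondents who omitted every item and reversing the negatively worded items) can be sketched in base R as follows. This is only an illustration on a small hypothetical response matrix: the HIV data shipped with KernSmoothIRT are already prepared, and the names `reverse_score`, `resp` and `neg_items` are ours. On a 1 to 4 scale, reversing maps a response x to 5 - x.

```r
# Reverse-score responses on a min_opt..max_opt scale (1..4 -> 4..1);
# missing values (NA) stay missing.
reverse_score <- function(x, min_opt = 1, max_opt = 4) (min_opt + max_opt) - x

set.seed(7)
k <- 15
# Hypothetical responses of 6 subjects to 15 items, coded 1..4
resp <- matrix(sample(1:4, 6 * k, replace = TRUE), nrow = 6)
resp[2, ] <- NA         # a subject who omitted every item
resp[4, c(1, 9)] <- NA  # a subject with a few omissions

# Drop subjects who omitted all items
resp <- resp[rowSums(!is.na(resp)) > 0, , drop = FALSE]

# Reverse the negatively worded items
neg_items <- c(2, 3, 5, 7, 8, 11, 14)
resp[, neg_items] <- reverse_score(resp[, neg_items])
```

Subjects with only some omissions (like subject 4 here) are kept at this stage; how they are handled later is governed by the `miss` argument of ksIRT.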
As can be seen, the data frame above contains the following person factors:

• SITE: site of the study (Ken = Kenya, Tan = Tanzania, Tri = Trinidad);
• GENDER: subject's gender (M = male, F = female);
• AGE: subject's age (age at last birthday).

Each of these factors can potentially be used for a DIF analysis. These data have also been analyzed, through some well-known parametric models, by Bertoli-Barsotti, Muschitiello, and Punzo (2010), who also performed a DIF analysis. Part of this sub-questionnaire has also been considered by De Ayala (2003, 2009) with a Rasch analysis.
The code below

R> HIVres <- ksIRT(HIV[,-(1:3)], key=HIVkey, format=2, miss="omit")
R> plot(HIVres, plottype="OCC", item=9)
R> plot(HIVres, plottype="EIS", item=9)
R> plot(HIVres, plottype="tetrahedron", item=9)

produces the plots for $I_9$ displayed in Figure 9. The option miss="omit" excludes from the nonparametric analysis all subjects with at least one omitted answer, leading to a sample of 3473 respondents; the option format=2 specifies that the data contain rating scale items. Figure 9(a) displays the OCCs for the considered item. As expected, subjects with the lowest scores choose the first option, while those with the highest scores select the fourth option. In general, as the total score increases, respondents are estimated to be more likely to choose a higher option, which reflects the typical behavior of a rating scale item. Figure 9(b) shows the EIS for $I_9$. Note how the expected item score climbs consistently as the total test score increases. Moreover, the EIS is fairly monotone and covers the entire range [1,4]. Finally, Figure 9(c) shows the tetrahedron for $I_9$. It corroborates the good behavior of $I_9$ already seen in Figure 9(a) and Figure 9(b): as expected, the sequence of points starts from the vertex for option 1 and moves smoothly toward option 4, passing by options 2 and 3.
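Assuming the options of a rating scale item are scored 1 to 4, the expected item score at ϑ is the probability-weighted average of the option scores, $\mathrm{EIS}(\vartheta) = \sum_{k=1}^{4} k\, P_k(\vartheta)$, which is why it necessarily lies in [1,4]. The sketch below computes it from a matrix of option curves; the curves are hypothetical (a softmax of linear predictors standing in for kernel-smoothed OCCs), and `expected_item_score` is our own helper, not a KernSmoothIRT function.

```r
# Expected item score: sum over options of (option score x probability).
# P has one row per option and one column per evaluation point;
# each column sums to 1.
expected_item_score <- function(P, weights = seq_len(nrow(P))) {
  colSums(weights * P)  # weights[i] multiplies row i of P
}

theta <- seq(-3, 3, length.out = 61)
# Hypothetical option curves for a 4-option rating scale item
eta <- outer(1:4, theta) - c(0, 0.5, 1, 1.5)
P <- exp(eta)
P <- sweep(P, 2, colSums(P), "/")  # normalize each column to sum to 1

eis <- expected_item_score(P)
range(eis)  # within [1, 4]
```

With these particular curves the EIS increases strictly with ϑ, mirroring the well-behaved item in Figure 9(b); an ill-posed item would show flat or decreasing stretches instead.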
The following example demonstrates DIF analysis using the person factor GENDER. To perform this analysis, a new ksIRT object must be created with the additional groups argument, by which the different subgroups are specified. In particular, the code

R> gr1 <- as.character(HIV$GENDER)
R> DIF1 <- ksIRT(res=HIV[,-(1:3)], key=HIVkey, format=2, groups=gr1,
+   miss="omit")
R> plot(DIF1, plottype="expectedDIF", lwd=2)
R> plot(DIF1, plottype="densityDIF", lwd=2)

produces the plots in Figure 10. Figure 10(a) is a QQ-plot of the expected score distributions of the two groups: if the two groups performed the same, the relationship would appear as a nearly diagonal line (a diagonal reference line is plotted). Figure 10(b) shows the density functions for the two groups. Both plots confirm that there is strong agreement in behavior between males and females with respect to the test.
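A QQ-plot compares two score distributions quantile by quantile; when the groups do not differ, the matched quantiles lie close to the diagonal. The base-R sketch below computes the matched quantiles at the 5%, 25%, 50%, 75%, and 95% levels marked in the figure, plus the full QQ coordinates via `qqplot(..., plot.it = FALSE)`, using hypothetical scores for two groups; the names `qq_points`, `males` and `females` are ours, and this is not how KernSmoothIRT draws its plot.

```r
# Matched quantiles of two samples; points near the diagonal y = x
# indicate similar distributions.
qq_points <- function(x, y, probs = c(0.05, 0.25, 0.50, 0.75, 0.95)) {
  cbind(x = quantile(x, probs), y = quantile(y, probs))
}

set.seed(11)
# Hypothetical total scores (range 15-60) for two groups drawn
# from the same distribution, i.e., no difference in performance
males   <- 15 + rbinom(400, 45, 0.7)
females <- 15 + rbinom(400, 45, 0.7)

q <- qq_points(males, females)
q

# Full QQ-plot coordinates without drawing
full <- qqplot(males, females, plot.it = FALSE)
max(abs(full$x - full$y))  # largest vertical distance from the diagonal
```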
Figure 10: Behavior of males (M) and females (F) on the test; panel (b): Density. In the QQ-plot on the left, the dashed diagonal line indicates the reference situation of no difference in performance for the two groups; the horizontal and vertical dashed blue lines indicate the 5%, 25%, 50%, 75%, and 95% quantiles for the two groups.

After this preliminary phase, the DIF analysis proceeds with item-by-item group comparisons. Figure 11, obtained via the command

R> plot(DIF1, plottype="OCCDIF", cex=0.5, item=3)

displays the OCCs for the (rating scale) item $I_3$. These plots allow the user to compare the two groups at the item level. Lack of DIF is evident from the nearly overlapping OCCs for all four options. DIF may also be evaluated in terms of the expected score of the groups, as displayed in Figure 12. This plot is obtained with the code

R> plot(DIF1, plottype="EISDIF", cex=0.5, item=3)

The points of different colors on the plot represent how individuals from the groups actually scored on the item. Although we have focused only on $I_3$, similar results are obtained for all the other items in the test, which confirms that GENDER does not produce DIF in this study. This result agrees with Bertoli-Barsotti et al. (2010). Note that, for both OCCs and EISs, confidence intervals can be added through the alpha argument.
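Conceptually, item-level DIF amounts to comparing curves estimated separately within each group at the same ability values. The base-R sketch below smooths a hypothetical option indicator within two groups using a Gaussian-kernel Nadaraya-Watson estimator and reports the largest vertical gap between the two curves as an informal descriptive summary. The data, the helper `nw_smooth` and the gap summary are our own illustration; this is neither a formal DIF test nor how KernSmoothIRT computes its DIF plots.

```r
# Gaussian-kernel Nadaraya-Watson smoother of 0/1 indicators
nw_smooth <- function(ability, y, grid, h) {
  sapply(grid, function(q) {
    w <- dnorm((q - ability) / h)
    sum(w * y) / sum(w)
  })
}

set.seed(2024)
n <- 600
group <- rep(c("F", "M"), each = n / 2)
ability <- rnorm(n)
# Hypothetical item without DIF: the same curve in both groups
y <- rbinom(n, 1, plogis(1.2 * ability))

grid <- seq(-2, 2, length.out = 41)
# One smoothed curve per group, evaluated on the common grid (41 x 2)
curve_by_group <- sapply(c("F", "M"), function(g) {
  idx <- group == g
  nw_smooth(ability[idx], y[idx], grid, h = 0.4)
})

# Largest vertical gap between the two group curves over the grid
max_gap <- max(abs(curve_by_group[, "F"] - curve_by_group[, "M"]))
max_gap
```

Without DIF, any gap reflects sampling noise only; a large and systematic gap over a range of ability values is the kind of pattern that Figure 14 displays for SITE.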
these groups, and Figure 13 shows this. The three pairwise QQ-plots of the expected score distributions show that there is a slight dominance of people from Kenya over people from Trinidad (in the sense that people from Kenya have, in distribution, a slightly more positive attitude toward condom use than people from Trinidad), and a large discrepancy between the performance of people from Tanzania and that of the other two groups, as shown in Figure 13(a) and Figure 13 from Tanzania compared with the other countries, can also be noted by looking at the observed total score densities in Figure 13(d). Here, there is higher variability in the total score for people from Tanzania. But what about DIF? The command

R> plot(DIF2, plottype="EISDIF", item=c(6,11))

produces, for $I_6$ and $I_{11}$, the EISs in Figure 14. Both plots give a graphical indication of the presence of DIF, confirming the results of Bertoli-Barsotti et al. (2010), who detected SITE-based DIF for these and other items in the test.

Figure 13: Behavior of people from Kenya (Ken), Tanzania (Tan), and Trinidad (Tri) on the test; panel (d): Total score density. In all the pairwise QQ-plots, the dashed diagonal line indicates the reference situation of no difference in performance for the two groups; the horizontal and vertical dashed blue lines indicate the 5%, 25%, 50%, 75%, and 95% quantiles for the two groups.

Conclusions
In this paper we have introduced the R package KernSmoothIRT, which implements kernel smoothing within the IRT context. Two applications have been discussed, along with some theoretical and practical issues.
The advantages of nonparametric IRT modeling are well known. Ramsay (2000) recommends its application, at least as an exploratory tool, to guide users in the choice of an appropriate parametric model. Moreover, while most IRT analyses are currently conducted with parametric models, the assumptions underlying parametric IRT modeling are quite often not checked beforehand. One reason for this may be the lack of available software, apart from TestGraf. TestGraf set a milestone in this field as the first computer program to implement a kernel smoothing approach to IRT, and has been the prominent software for years.
Compared with TestGraf, KernSmoothIRT has the major advantage of running within the R environment. Users do not have to export their results into another piece of software in order to perform non-standard data analysis, to produce customized plots, or to perform parametric IRT using any of the several packages available in R. Furthermore, KernSmoothIRT allows more flexibility in bandwidth and kernel selection, as well as in handling missing values. We believe that KernSmoothIRT may prove useful to educators, psychologists, and other researchers developing questionnaires, enabling them to spot ill-posed questions and to formulate more plausible wrong options. Future work will consider extending the package to allow kernel smoothing estimation of test and item information functions. Although well established in parametric IRT, information functions present serious statistical problems in the NIRT context, as underlined by Ramsay (2000, p. 66). Currently available nonparametric IRT programs, such as TestGraf, estimate test and item information functions based on parametric OCCs.