Performing the Kernel Method of Test Equating with the Package kequate

In standardized testing it is important to equate tests in order to ensure that the test takers, regardless of the test version given, obtain a fair test. Recently, the kernel method of test equating, which is a conjoint framework of test equating, has gained popularity. The kernel method of test equating includes five steps: (1) pre-smoothing, (2) estimation of the score probabilities, (3) continuization, (4) equating, and (5) computing the standard error of equating and the standard error of equating difference. Here, an implementation has been made for six different equating designs: equivalent groups, single group, counterbalanced, non-equivalent groups with anchor test using either chain equating or post-stratification equating, and non-equivalent groups using covariates. An R package for the kernel method of test equating called kequate is presented. Included in the package are also diagnostic tools aiding in the search for a proper log-linear model in the pre-smoothing step for use in conjunction with the R function glm().


Introduction
One of the main concerns when using standardized achievement tests is that they are fair. In order to ensure fairness when a test is given at different points in time, or in different versions of the same standardized test, a statistical procedure known as equating is used. The ultimate goal of equating is to adjust scores on different test forms so that the test forms can be used interchangeably (Kolen and Brennan 2004).
The kernel method of test equating is a single unified approach to observed-score test equating, usually presented as a process involving five different steps: pre-smoothing, score probability estimation, continuization, computation of the equating function, and computation of the standard errors of the equating function (von Davier, Holland, and Thayer 2004). The method has a number of advantages over other observed-score test equating methods. In particular, it provides explicit formulas for the standard errors of equating in different designs and directly uses information from the pre-smoothing step in the estimation of these. Kernel equating can also handle equating using covariates in a non-equivalent groups setting and provides a method to compare two different equatings using the standard error of the difference between two equating functions. Since the kernel method of test equating is a unified equating framework with large applicability both for the testing industry and the research community, it is of great interest to create a software package which anyone interested in equating can use.
The aim of this paper is to introduce package kequate (Andersson, Bränberg, and Wiberg 2013), an implementation of the kernel method of test equating using five different data collection designs in the statistical programming environment R (R Core Team 2013). For an introduction to R see Venables, Smith, and R Core Team (2013). The content of the package consists of the entirety of the equating aspects of the book The Kernel Method of Test Equating by von Davier et al. (2004). In addition, our implementation of the kernel method of test equating in R includes the option to use information from an item response theory (IRT) model to conduct an IRT observed-score equating and the option to use unsmoothed input frequencies directly, to enable the comparison of different approaches to observed-score test equating.
The paper is structured as follows. In Section 2 a brief introduction to the kernel equating framework is given. Section 3 introduces the functionality of kequate, and Section 4 provides examples of the main functions. Section 5 contains some concluding remarks and presents possible future additions to the package.

The kernel method of test equating
This section gives a brief description of the kernel method of test equating. For a complete description, see the book by von Davier et al. (2004). However, before we can go through the steps of kernel equating, we need to describe the different data collection designs that are available in the package. The first four are standard data collection designs (see, e.g., Kolen and Brennan 2004; von Davier et al. 2004). The last data collection design is less common and is used if we have additional information which is correlated with the test scores. For a detailed description please refer to Bränberg (2010) and Bränberg and Wiberg (2011).

Data collection designs
We have incorporated the possibility of five different data collection designs:
The equivalent groups design (EG): Two independent random samples are drawn from a common population of test takers, P, and the test form X is administered to one sample while test form Y is administered to the other sample. No test takers take both X and Y.
The single group design (SG): Two test forms X and Y are administered to the same group of test takers drawn from a single population P. All test takers take both X and Y.
The counterbalanced design (CB): Two test forms X and Y are administered to the same group of test takers drawn from a single population P. One part of the group first takes test form X and then test form Y. The other part of the group takes the test forms in a counterbalanced order, i.e., first test form Y and then test form X. This could also be viewed as two EG designs or as two SG designs.
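The non-equivalent groups with anchor test design (NEAT): A sample of test takers from population P is administered test form X, and another sample of test takers from population Q is administered test form Y. Both samples are also administered a set of common (i.e., anchor) items (test form A). With the NEAT design there are two commonly used equating methods: post-stratification equating (PSE) and chain equating (CE).
The non-equivalent groups with covariates design (NEC): A sample of test takers from population P is administered test form X, and another sample of test takers from population Q is administered test form Y. Instead of an anchor test, background information on the test takers that is correlated with the test scores is used.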

The kernel method of test equating
Following the notation in von Davier et al. (2004), let X and Y be the names of the two test forms to be equated and $X$ and $Y$ the scores on X and Y. Let $T$ be the target population for which the equating is to be performed. We will assume that the test takers taking the tests are random samples from a population of test takers, so $X$ and $Y$ are regarded as random variables. Observations on $X$ will be denoted by $x_j$ for $j = 1, \ldots, J$. Observations on $Y$ will be denoted by $y_k$ for $k = 1, \ldots, K$. If $X$ and $Y$ are number-right scores, $J$ and $K$ will be the number of items plus one.
We will use
$$r_j = P(X = x_j \mid T)$$
for the probability of a randomly selected individual in population $T$ scoring $x_j$ on test X, and
$$s_k = P(Y = y_k \mid T)$$
for the probability of a randomly selected individual in population $T$ scoring $y_k$ on test Y.
The goal is to find the link between $X$ and $Y$ in the form of an equipercentile equating function in the target population $T$, the population on which the equating is to be done. The equipercentile equating function is defined in terms of the cumulative distribution functions (CDFs) of $X$ and $Y$ in the target population. Let
$$F(x) = P(X \le x \mid T)$$
and
$$G(y) = P(Y \le y \mid T)$$
be the CDFs of $X$ and $Y$ over the target population $T$. If the two CDFs are continuous and strictly increasing, then the equipercentile equating function of $X$ to $Y$ is defined by
$$e_Y(x) = G^{-1}(F(x)). \tag{5}$$
With test score data, the CDFs are discrete step functions, so they have to be made continuous in some way. In traditional equipercentile equating this is done using linear interpolation, but kernel equating handles this issue by employing a kernel method instead.
The kernel method of test equating includes five steps: pre-smoothing, estimation of the score probabilities, continuization, equating, and computing the standard error of equating (SEE) and the standard error of equating difference (SEED). Most of the steps which comprise what is called the kernel method of test equating were available before the kernel framework was developed, see, e.g., Angoff (1984) for a description of equipercentile equating using linear interpolation and Fairbank (1987) for a discussion on pre-smoothing in equating. SEEs were also derived by Lord (1982) and Jarjoura and Kolen (1985), but prior to the introduction of kernel equating SEEs were not available when using pre-smoothing (Holland, King, and Thayer 1989).
Step 1: Pre-smoothing
In pre-smoothing, a statistical model is fitted to the empirical distribution obtained from the sampled data. We assume that many of the irregularities seen in the empirical distributions are due to sampling error, and the goal of smoothing is to reduce this error. In equating, the raw data are two sets of univariate, bivariate or multivariate discrete distributions (depending on the data collection design). One way to perform pre-smoothing is by fitting a polynomial log-linear model to the relative frequencies obtained from the raw data. We will show this for the NEAT design. For details the interested reader is referred to, e.g., Holland and Thayer (2000) or von Davier et al. (2004).
In the NEAT design each test taker has a score on one of the test forms and a score on an anchor test. Let $A$ be the score on the anchor test form A. Observations on $A$ will be denoted by $a_l$ for $l = 1, \ldots, L$. Let $n_{Xjl}$ be the number of test takers with $X = x_j$ and $A = a_l$, and $n_{Ykl}$ be the number of test takers with $Y = y_k$ and $A = a_l$. We assume that $n_{XA} = (n_{X11}, \ldots, n_{XJL})^t$ and $n_{YA} = (n_{Y11}, \ldots, n_{YKL})^t$ are independent and that they each have a multinomial distribution. The log likelihood function for $X$ is given by
$$L_X = c_X + \sum_{j} \sum_{l} n_{Xjl} \log(p_{jl}),$$
where $c_X$ is a constant that does not depend on the cell probabilities and $p_{jl} = P(X = x_j, A = a_l \mid T)$. The target population $T$ is a mixture of the two populations P and Q, $T = wP + (1 - w)Q$, where $w$ is selected from the interval $[0, 1]$.
The log-linear model for $p_{jl}$ is given by
$$\log(p_{jl}) = \alpha + \sum_{i=1}^{T_X} \beta_{Xi} x_j^i + \sum_{i=1}^{T_A} \beta_{Ai} a_l^i + \sum_{i=1}^{I_X} \sum_{i'=1}^{I_A} \beta_{XAii'} x_j^i a_l^{i'},$$
where $\alpha$ is a normalizing constant, $T_X$ and $T_A$ are the number of univariate moments for tests X and A, respectively, and $I_X$ and $I_A$ are the number of cross-moments for the tests X and A, respectively. The log likelihood function for $Y$ and the log-linear model for $q_{kl} = P(Y = y_k, A = a_l \mid T)$ can be written in a similar way. The log-linear models can also contain additional parameters, to take care of lumps and spikes in the marginal distributions. The specification of such models is, however, not discussed further herein (the interested reader is referred to von Davier et al. 2004).
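As an illustration of the pre-smoothing step, the following is a minimal sketch of fitting a polynomial log-linear model to bivariate score frequencies with glm(). The data are simulated, all object names (grid, fit, p_hat) are hypothetical, and the choice of two univariate moments per test and one cross-moment is made only to keep the example short.

```r
# Hypothetical simulated sample: scores on X (0..20) and on the anchor A (0..10).
set.seed(1)
n <- 1000
ability <- rnorm(n)
x <- rbinom(n, size = 20, prob = plogis(ability))
a <- rbinom(n, size = 10, prob = plogis(ability))

# All (x, a) combinations, sorted first by A and then by X, with observed counts.
grid <- expand.grid(X = 0:20, A = 0:10)
grid$freq <- as.vector(table(factor(x, levels = 0:20), factor(a, levels = 0:10)))

# Log-linear model: two univariate moments for each test and one cross-moment.
fit <- glm(freq ~ I(X) + I(X^2) + I(A) + I(A^2) + I(X):I(A),
           family = poisson, data = grid, x = TRUE)

# Fitted cell probabilities arranged as a J x L matrix (rows: X, columns: A).
p_hat <- matrix(fit$fitted.values / n, nrow = 21)
```

Because the model contains an intercept and polynomial terms, the fitted distribution reproduces the corresponding observed moments, which is the property that makes this family of models attractive for pre-smoothing.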
Step 2: Estimation of the score probabilities
The score probabilities are obtained from the estimated score distributions from step 1. The most important part of step 2 is the definition and use of the design function. The design function is a function mapping the (estimated) population score distributions into (estimates of) $r$ and $s$, where $r = (r_1, r_2, \ldots, r_J)^t$ and $s = (s_1, s_2, \ldots, s_K)^t$. The function varies between data collection designs. For example, in an EG design it is simply the identity function, whereas for PSE in a NEAT design the design function is given by
$$\begin{pmatrix} r \\ s \end{pmatrix} = \begin{pmatrix} \sum_{l} \left[ w + (1 - w) \dfrac{\sum_{k} q_{kl}}{\sum_{j} p_{jl}} \right] p_l \\[2ex] \sum_{l} \left[ (1 - w) + w \dfrac{\sum_{j} p_{jl}}{\sum_{k} q_{kl}} \right] q_l \end{pmatrix},$$
where $p_l = (p_{1l}, p_{2l}, \ldots, p_{Jl})^t$ and $q_l = (q_{1l}, q_{2l}, \ldots, q_{Kl})^t$.
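The post-stratification design function can be sketched in a few lines of base R. The helper pse_design() and the toy matrices P and Q below are hypothetical and are not part of kequate; P holds the joint probabilities of X and A, Q those of Y and A, and w is the weight of population P in the target population.

```r
# Post-stratification design function (illustrative helper, not part of kequate).
# P: J x L matrix of p_jl, Q: K x L matrix of q_kl, w: weight of population P in T.
pse_design <- function(P, Q, w = 0.5) {
  tP <- colSums(P)  # anchor score distribution in population P
  tQ <- colSums(Q)  # anchor score distribution in population Q
  r <- as.vector(P %*% (w + (1 - w) * tQ / tP))
  s <- as.vector(Q %*% ((1 - w) + w * tP / tQ))
  list(r = r, s = s)
}

# Toy example with J = K = 3 score values and L = 2 anchor score values.
P <- matrix(c(0.10, 0.20, 0.10, 0.15, 0.25, 0.20), nrow = 3)
Q <- matrix(c(0.20, 0.20, 0.10, 0.10, 0.20, 0.20), nrow = 3)
out <- pse_design(P, Q, w = 0.5)
```

With w = 1 the target population is P itself, and r reduces to the marginal distribution of X in P, i.e., rowSums(P).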
Step 3: Continuization
Test score distributions are discrete, and the definition of the equipercentile equating function given in Equation 5 cannot be used unless we deal with this discreteness in some way. Prior to the development of kernel equating, linear interpolation was employed to obtain continuous CDFs from the discrete CDFs (Kolen and Brennan 2004). In kernel equating continuous CDFs are used as approximations to the estimated discrete step-function CDFs generated in the pre-smoothing step. Following von Davier et al. (2004), we will use a Gaussian kernel. Logistic and uniform kernels have also been described in the literature (Lee and von Davier 2011) and are available as options in package kequate. In what follows, only the formulas for $X$ are shown, but the computations for $Y$ are analogous. The discrete CDF $F(x)$ is approximated by
$$\hat{F}_{h_X}(x) = \sum_{j} \hat{r}_j \, \Phi\left( \frac{x - a_X x_j - (1 - a_X) \mu_X}{a_X h_X} \right),$$
where $\hat{r}_j$ is the estimate of $r_j$ from step 2, $\mu_X = \sum_{j} x_j r_j$ is the mean of $X$ in the target population $T$, $h_X$ is the bandwidth, and $\Phi(\cdot)$ is the standard normal distribution function. The constant $a_X$ is defined as
$$a_X = \sqrt{\frac{\sigma_X^2}{\sigma_X^2 + h_X^2}},$$
where $\sigma_X^2 = \sum_{j} (x_j - \mu_X)^2 r_j$ is the variance of $X$ in the target population $T$.
There are several ways of choosing the bandwidth $h_X$. We want the density functions to be as smooth as possible without losing the characteristics of the distributions. We recommend the use of a penalty function to deal with this problem, see von Davier et al. (2004). For $h_X$ the penalty function is given by
$$PEN(h_X) = \sum_{j} \left( \hat{r}_j - \hat{f}_{h_X}(x_j) \right)^2 + \kappa \sum_{j} B_j, \tag{11}$$
where $\hat{f}_{h_X}(x)$ is the estimated density function, i.e., the derivative of $\hat{F}_{h_X}(x)$, and $\kappa$ is a constant. $B_j$ is an indicator that is equal to one if the derivative of the density function is negative a little to the left of $x_j$ and positive a little to the right of $x_j$, or if the derivative is positive a little to the left of $x_j$ and negative a little to the right of $x_j$. Otherwise, $B_j$ is equal to zero.
With a bandwidth that minimizes $PEN(h_X)$ in Equation 11, the estimated continuous density function $\hat{f}_{h_X}(x)$ will be a good approximation of the discrete distribution of $X$, without too many modes.
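The continuization step can be sketched as follows for a Gaussian kernel. The helpers cont_par(), kernel_cdf(), kernel_pdf() and pen1() are hypothetical names, the score probabilities are taken from a binomial distribution purely for illustration, and only the first term of the penalty function is minimized (i.e., the constant multiplying the second term is set to zero). This is a sketch of the formulas above, not the optimization routine used in kequate.

```r
# Mean of the scores and the constant a_X for a given bandwidth h.
cont_par <- function(scores, r, h) {
  mu <- sum(scores * r)
  sigma2 <- sum((scores - mu)^2 * r)
  list(mu = mu, a = sqrt(sigma2 / (sigma2 + h^2)))
}

# Continuized CDF with a Gaussian kernel, evaluated at each element of x.
kernel_cdf <- function(x, scores, r, h) {
  cp <- cont_par(scores, r, h)
  sapply(x, function(xi)
    sum(r * pnorm((xi - cp$a * scores - (1 - cp$a) * cp$mu) / (cp$a * h))))
}

# Corresponding density, i.e., the derivative of the continuized CDF.
kernel_pdf <- function(x, scores, r, h) {
  cp <- cont_par(scores, r, h)
  sapply(x, function(xi)
    sum(r * dnorm((xi - cp$a * scores - (1 - cp$a) * cp$mu) / (cp$a * h)) / (cp$a * h)))
}

# First penalty term: squared distance between r and the density at the score points.
pen1 <- function(h, scores, r) sum((r - kernel_pdf(scores, scores, r, h))^2)

scores <- 0:20
r <- dbinom(scores, 20, 0.6)  # hypothetical score probabilities
h_opt <- optimize(pen1, interval = c(0.1, 3), scores = scores, r = r)$minimum
```

By construction the continuized distribution has the same mean and variance as the discrete distribution, which is why the constant a_X appears in the formula.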
Step 4: Equating
Assume that we are interested in equating $X$ to $Y$. If we use the continuized CDFs described previously, we can define the kernel equating function as
$$\hat{e}_Y(x) = \hat{G}_{h_Y}^{-1}\left( \hat{F}_{h_X}(x) \right),$$
which is analogous to the equipercentile equating function defined in Equation 5.
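A minimal sketch of the equating step, assuming Gaussian-kernel continuization with fixed bandwidths: the continuized CDF of Y is inverted numerically with uniroot(). The functions kcdf() and equate_x(), the bandwidths of 0.6 and the binomial score probabilities are all hypothetical choices made for illustration.

```r
# Continuized CDF with a Gaussian kernel at a single point x.
kcdf <- function(x, scores, prob, h) {
  mu <- sum(scores * prob)
  s2 <- sum((scores - mu)^2 * prob)
  a <- sqrt(s2 / (s2 + h^2))
  sum(prob * pnorm((x - a * scores - (1 - a) * mu) / (a * h)))
}

# Kernel equating of one score x: solve G_hY(y) = F_hX(x) for y.
equate_x <- function(x, xs, r, hx, ys, s, hy) {
  target <- kcdf(x, xs, r, hx)
  uniroot(function(y) kcdf(y, ys, s, hy) - target,
          interval = c(min(ys) - 10, max(ys) + 10), tol = 1e-10)$root
}

xs <- 0:20
ys <- 0:20
r <- dbinom(xs, 20, 0.55)  # hypothetical score probabilities for X
s <- dbinom(ys, 20, 0.60)  # hypothetical score probabilities for Y
eq <- sapply(xs, function(x) equate_x(x, xs, r, 0.6, ys, s, 0.6))
```

Since both continuized CDFs are strictly increasing, the resulting equated values are strictly increasing in x, and equating a test to itself returns the original score.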
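Step 5: Calculating the SEE and the SEED
One of the advantages of the kernel method of test equating is that it provides a neat way to compute the SEE. The SEE for equating $X$ to $Y$ is given by
$$SEE_Y(x) = \sqrt{\mathrm{Var}\left( \hat{e}_Y(x) \right)}.$$
In kernel equating the δ-method is used to compute an estimate of the SEE. Let $R$ and $S$ be the vectors of pre-smoothed score distributions. If $R$ and $S$ are estimated independently, the covariance can be written as
$$\mathrm{Cov}\begin{pmatrix} \hat{R} \\ \hat{S} \end{pmatrix} = \begin{pmatrix} C_R C_R^t & 0 \\ 0 & C_S C_S^t \end{pmatrix} = C C^t,$$
where
$$C = \begin{pmatrix} C_R & 0 \\ 0 & C_S \end{pmatrix}.$$
The pre-smoothed score distributions are transformed into $r$ and $s$ using the design function. The Jacobian of this function is
$$J_{DF} = \begin{pmatrix} \dfrac{\partial r}{\partial R} & \dfrac{\partial r}{\partial S} \\[2ex] \dfrac{\partial s}{\partial R} & \dfrac{\partial s}{\partial S} \end{pmatrix}.$$
In the final step of kernel equating, estimates of $r$ and $s$ are used in the equating function to calculate equated scores. The Jacobian of the equating function is given by
$$J_{e_Y} = \left( \frac{\partial e_Y}{\partial r}, \frac{\partial e_Y}{\partial s} \right).$$
If $(\hat{R}, \hat{S})^t$ is approximately normally distributed with mean $(R, S)^t$ and the covariance given above, then
$$\mathrm{Var}\left( \hat{e}_Y(x) \right) = \left\| J_{e_Y} J_{DF} C \right\|^2$$
and
$$SEE_Y(x) = \left\| J_{e_Y} J_{DF} C \right\|,$$
where $\|\upsilon\|$ denotes the Euclidean norm of vector $\upsilon$.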
The SEED, which can be used to compare different kernel equating functions, is defined as
$$SEED_Y(x) = \sqrt{\mathrm{Var}\left( \hat{e}_1(x) - \hat{e}_2(x) \right)} = \left\| J_{e_1} J_{DF_1} C - J_{e_2} J_{DF_2} C \right\|,$$
i.e., the Euclidean norm of the difference between the two vectors $J_{e_1} J_{DF_1} C$ and $J_{e_2} J_{DF_2} C$.
The equating function is designed to transform the continuous approximation of the distribution of $X$ into the continuous approximation of the distribution of $Y$. In order to diagnose the effectiveness of the equating function, we need to consider what this transformation does to the discrete distribution of $X$. One way of doing this is to compare the moments of the distribution of the equated scores with the moments of the distribution of $Y$. Following von Davier et al. (2004), we use the percent relative error (PRE) in the $p$-th moments, the PRE($p$), which is defined as
$$PRE(p) = 100 \, \frac{\mu_p\left( e_Y(X) \right) - \mu_p(Y)}{\mu_p(Y)},$$
where $\mu_p(Y) = \sum_{k} (y_k)^p s_k$ and $\mu_p(e_Y(X)) = \sum_{j} (e_Y(x_j))^p r_j$.
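The percent relative error can be computed directly from the equated values and the two sets of score probabilities. The helper pre() below is a hypothetical sketch, not the function used inside kequate, and a simple linear equating function with binomial score probabilities is used as a toy example.

```r
# Percent relative error in the p-th moment (illustrative helper).
# eq: equated values e_Y(x_j); r: score probabilities for X;
# ys, s: score values and score probabilities for Y.
pre <- function(p, eq, r, ys, s) {
  mu_eq <- sum(eq^p * r)
  mu_y <- sum(ys^p * s)
  100 * (mu_eq - mu_y) / mu_y
}

xs <- 0:20
ys <- 0:20
r <- dbinom(xs, 20, 0.55)  # hypothetical score probabilities for X
s <- dbinom(ys, 20, 0.60)  # hypothetical score probabilities for Y

# Toy equating function: a linear equating that matches the mean and variance of Y.
mu_x <- sum(xs * r)
sd_x <- sqrt(sum((xs - mu_x)^2 * r))
mu_y <- sum(ys * s)
sd_y <- sqrt(sum((ys - mu_y)^2 * s))
eq_lin <- mu_y + sd_y / sd_x * (xs - mu_x)

pre_values <- sapply(1:4, function(p) pre(p, eq_lin, r, ys, s))
```

Because the linear equating function reproduces the first two moments of Y exactly, the first two PRE values are zero up to rounding error, while the higher moments generally differ.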

kequate for R
The package kequate for R enables the equating of two parallel tests with the kernel method of equating for the EG, SG, CB, NEAT PSE, NEAT CE and NEC designs. The package kequate can use 'glm' objects created using the R function glm() (from package stats; R Core Team 2013) as input arguments and estimate the equating function and associated standard errors directly from the information contained therein. The S4 system of classes and methods (Chambers 2008), a more formal and rigorous way of handling objects in R, is used in package kequate, providing methods for the generic functions plot() and summary() for a number of newly defined classes.
The main function of the package is kequate(), which enables the equating of two parallel tests using the previously defined equating designs. The function kequate() has the following formal function call:
kequate(design, ...)
where design is a character vector indicating the design used and ... should contain the additional arguments which depend partly on the design chosen. Explanations of each argument that may be supplied to kequate() are collected in Tables 1 and 2.
The arguments containing the score probabilities and design matrices that are supplied to kequate() can either be objects of class 'glm' or design matrices and estimated probability vectors/matrices. For ease of use, it is recommended to estimate the log-linear models using the R function glm() and use the 'glm' objects as input to kequate(). The estimation of log-linear models using glm() is not covered extensively in this article. The interested reader is referred to Holland and Thayer (2000) and the R help files. Optional arguments to specify the continuization parameters directly are also available for all equating designs. In addition, the option exists to only conduct a linear equating and an option to use unsmoothed input frequencies. There is also the option of selecting the kernel to be used.

[Tables 1 and 2: Arguments supplied to kequate(). Recoverable entries: optional arguments to specify the continuization parameters manually; wcb (CB design), the weighting of the two test groups in a counterbalanced design, default 1/2.]
The package kequate also enables IRT observed-score equating using the kernel equating framework. This is accomplished by supplying matrices of probabilities of answering each question correctly for each ability level on two parallel tests X and Y, as estimated beforehand using an IRT model.
The package kequate creates an object of class 'keout' which includes information about the equating. To access information from such an object, a number of get*-functions are available. They are described in Table 3. Methods for objects of class 'keout' are also available for the functions plot() and summary().
Additionally, the function genseed() can be used to compare any two equatings that utilize the same log-linear models. It takes as input two objects created by kequate() and calculates the SEED between them. A useful comparison is, for example, between a chain equating and a post-stratification equating in the NEAT design. A method for the function plot() is implemented for the objects created by genseed().
The package also includes a function kefreq() to tabulate frequency data from individual test score data and functions FTres() and cdist() to be used when specifying the log-linear pre-smoothing models. FTres() calculates the Freeman-Tukey residuals given a specified log-linear model, and cdist() calculates the conditional means, variances, skewnesses and kurtoses of the tests to be equated given an anchor test, for both the fitted distributions and the observed distributions.

Examples
We exemplify the main function kequate() by equating using the EG, NEAT, and NEC designs. The function calls for the other designs are very similar, only having other required arguments. We also demonstrate how to conduct an IRT observed-score equating.

EG design
Let the parallel tests X and Y have common score vectors 0:20. The tests are each administered to a randomized group drawn from the same population, thus we have an EG design. The data used in this example are from Chapter 7 of von Davier et al. (2004), and we have specified log-linear models identical to those in the book using the glm() function in R. Thus, two objects FXEGglm and FYEGglm have been created. We give the summary of the log-linear model for test X below.

R> summary(FXEGglm)
Call:
glm(formula = freq ~ I(X) + I(X^2), family = poisson, data = FXEG, x = TRUE)

In kequate the function FTres() can be used to calculate the Freeman-Tukey residuals often considered in pre-smoothing. From our log-linear model for test X, we create a numeric vector containing the Freeman-Tukey residuals by writing:

R> Xres <- FTres(FXEGglm$y, FXEGglm$fitted.values)
We plot the resulting vector, which can be seen in Figure 1. If the model fits the data well, the Freeman-Tukey residuals are approximately normally distributed, which is tenable in this case. To then equate the two tests using an equipercentile equating with pre-smoothing, we call the function kequate() as follows:

R> keEG <- kequate("EG", 0:20, 0:20, FXEGglm, FYEGglm)

This will create an R object keEG containing information about the equating, which can be retrieved using the functions described in Table 3. To print useful information about the equating, we can utilize the summary() function. With the EG example above, we write:

R> summary(keEG)

The summary() function can be used in kequate to print information from any object of class 'keout'. The output is similar for all designs. The first part contains information about the score range and bandwidths. The second part contains the equating function with its standard error. Finally, the PRE is given.
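For readers who want to see what the Freeman-Tukey residuals used at the start of this example look like in base R, the following sketch computes them from a fitted Poisson log-linear model. It assumes the usual definition of the residuals from the log-linear smoothing literature; the helper ft_resid() and the simulated frequencies are hypothetical and are not taken from the package source.

```r
# Freeman-Tukey residuals under the usual definition (an assumption here):
# sqrt(n) + sqrt(n + 1) - sqrt(4 * m + 1), with observed counts n and fitted counts m.
ft_resid <- function(obs, fit) sqrt(obs) + sqrt(obs + 1) - sqrt(4 * fit + 1)

# Hypothetical frequencies for the scores 0..20 and a quadratic log-linear fit.
set.seed(2)
dat <- data.frame(X = 0:20)
dat$freq <- as.vector(rmultinom(1, size = 1500, prob = dbinom(0:20, 20, 0.5)))
fit <- glm(freq ~ I(X) + I(X^2), family = poisson, data = dat, x = TRUE)

# One residual per score value, computed from observed and fitted counts.
res <- ft_resid(fit$y, fit$fitted.values)
```

A normal quantile plot of res, e.g., with qqnorm(res), gives a quick visual check of the approximate normality mentioned above.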
With the EG design, it is also possible to equate two tests using the full kernel equating framework with observed data instead of pre-smoothed data. The additional argument smoothed = FALSE needs to be given to kequate() in such a case. As an example, by using information from the object created in the equating with pre-smoothing, we can write:

R> rEGobs <- getScores(keEG)$X$r
R> sEGobs <- getScores(keEG)$Y$s
R> NEG <- getScores(keEG)$N
R> MEG <- getScores(keEG)$M
R> keEGobs <- kequate("EG", 0:20, 0:20, rEGobs, sEGobs, N = NEG, M = MEG,
+    smoothed = FALSE)

The object created contains similar information to an object from an equating with pre-smoothed data.
IRT observed-score equating (IRT-OSE) is also enabled in kequate, using the arguments irtx and irty. We let irt1 and irt2 be matrices where each column represents an ability level in an IRT model and each row represents a question on the test to be equated. Each cell in the matrix should then contain the estimated probability of answering a question on the parallel tests correctly for a certain ability level. To equate using IRT-OSE, we write:

R> keEGirt <- kequate("EG", 0:20, 0:20, FXEGglm, FYEGglm, irtx = irt2,
+    irty = irt1)

This function call will conduct an IRT-OSE in the kernel equating framework in addition to a regular equipercentile equating. It is possible to use unsmoothed frequencies while conducting an IRT-OSE. Specifying linear = TRUE will instruct kequate() to do a linear equating for both the regular method and for the IRT-OSE. Using IRT-OSE is not limited to an EG design. It can be used as a supplement in any of the designs available in kequate.
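To make the layout of the matrices irt1 and irt2 concrete, the following sketch builds such a matrix from a two-parameter logistic model and derives number-correct score probabilities from it with the Lord-Wingersky recursion. The item parameters, ability levels and weights are invented for illustration, and the source does not state how kequate processes the matrices internally, so the recursion is shown only as the standard way of obtaining observed-score distributions from an IRT model.

```r
# Hypothetical 2PL item parameters for a 20-item test and 9 ability levels.
set.seed(4)
a_par <- runif(20, 0.8, 1.6)          # discriminations (assumed)
b_par <- rnorm(20)                    # difficulties (assumed)
theta <- seq(-2, 2, length.out = 9)   # ability levels (assumed)
wts <- dnorm(theta)
wts <- wts / sum(wts)                 # normalized weights of the ability levels

# Rows are items and columns are ability levels, as described in the text.
irt_mat <- sapply(theta, function(th) plogis(a_par * (th - b_par)))

# Lord-Wingersky recursion: number-correct score distribution at one ability level.
score_dist <- function(p) {
  d <- 1
  for (p_i in p) d <- c(d * (1 - p_i), 0) + c(0, d * p_i)
  d
}

# Marginal probabilities of the scores 0..20, averaged over the ability levels.
r_irt <- as.vector(apply(irt_mat, 2, score_dist) %*% wts)
```

The vector r_irt plays the same role as the score probabilities obtained from a log-linear model, which is what allows an IRT-based equating to be carried out within the kernel framework.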

NEAT design
To illustrate how to do an equating in a NEAT design, we utilize a simulated data set provided with kequate. With the same data we also show how the function kefreq() can be used to tabulate data from individual test takers and how the function cdist() can be used to evaluate a log-linear model. In this example test X and test Y have the same score value vectors 0:20, and test A (the anchor test) has the score value vector 0:10. We now wish to equate tests X and Y. The R object bivar1 is a data frame with columns X and A of length 1000 containing the score on test X and test A for each individual. Similarly, bivar2 is a data frame with columns Y and A of length 1000. For all designs that utilize bivariate frequencies, the data must be sorted first by the score vector for A and then by the score vector for X. In the SG design, the data must be sorted first by the score vector for test Y and then by the score vector for test X. To create data sorted in the manner appropriate for usage with kequate, we write:

R> freq1 <- kefreq(bivar1$X, 0:20, bivar1$A, 0:10)
R> freq2 <- kefreq(bivar2$Y, 0:20, bivar2$A, 0:10)

The created objects freq1 and freq2 are data frames with three columns: frequency, X, and A, sorted first by the A column and then by the X column. The data frame created by kefreq() can then be used in the glm() model specification. Alternatively, the frequency column can be converted into relative frequencies and utilized to equate tests using observed relative frequencies directly. In this example we use pre-smoothing and assume that the 'glm' objects glmsim1 and glmsim2 have been created. The summary of glmsim1 is given below. We fitted the model using five univariate moments for the score values of the test to be equated and four moments for the score values of the anchor test. The first four cross-moments between the test scores were also added. Adding additional parameters did not improve the model fit much.
To evaluate a log-linear model for bivariate test score frequencies, it is a good idea to compare the conditional mean, variance, skewness and kurtosis of the observed and fitted frequencies. It is desirable to maintain the properties of the observed frequencies in the specified model. In kequate the function cdist() can be used to calculate the conditional parameters of observed and fitted frequencies. The input to cdist() consists of two matrices of bivariate frequencies (one for the fitted and one for the observed frequencies) and the two score value vectors.
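The idea behind comparing conditional parameters can be sketched in base R. The example below tabulates simulated individual scores into a bivariate frequency matrix and computes the conditional mean and variance of X for each anchor score. The helper cond_moments() is hypothetical and covers only the first two conditional moments; it is not a reimplementation of cdist(), which also reports skewness and kurtosis.

```r
# Hypothetical individual scores on X (0..20) and on the anchor A (0..10).
set.seed(3)
theta <- rnorm(1000)
x <- rbinom(1000, 20, plogis(theta))
a <- rbinom(1000, 10, plogis(theta))

# Bivariate frequencies as a 21 x 11 matrix (rows: X scores, columns: A scores).
obs <- matrix(as.vector(table(factor(x, levels = 0:20), factor(a, levels = 0:10))),
              nrow = 21)

# Conditional mean and variance of X given each anchor score
# (NaN is returned for an anchor score that nobody obtained).
cond_moments <- function(freq, scores) {
  w <- sweep(freq, 2, colSums(freq), "/")  # conditional distribution of X given A
  m <- colSums(w * scores)
  v <- colSums(w * scores^2) - m^2
  cbind(mean = m, var = v)
}

cm <- cond_moments(obs, 0:20)
```

Applying the same function to a matrix of fitted frequencies from a log-linear model and comparing the two results row by row gives the kind of diagnostic described above.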
To specify the necessary input for tests X and A in our example, we write:

R> EGPest <- matrix(glmsim1$fitted.values, nrow = 21)
R> EGPobs <- matrix(glmsim1$y, nrow = 21)

We then use these objects and the score value vectors to calculate the conditional parameters and plot the result:

R> NEATdistP <- cdist(EGPest, EGPobs, 0:20, 0:10)
R> plot(NEATdistP)

The resulting object NEATdistP contains all the conditional parameters, but the plot shows only the conditional mean and variance of the respective tests and distributions. The plot is given in Figure 2, where it can be seen that the conditional mean is very close between the observed and fitted distributions but that the conditional variance is not as well maintained in the fitted distribution. To use the log-linear models specified above to equate the two tests in a NEAT PSE design and also display the summary, we write:

R> eqNEATPSE <- kequate("NEAT_PSE", 0:20, 0:20, glmsim1, glmsim2)
R> summary(eqNEATPSE)

The results can also be plotted by writing:

R> plot(eqNEATPSE)
The resulting graph can be seen in Figure 3, where the first plot compares the score values on X with the equated values, and where the second plot gives the standard error of the equated values for each score value of X. The same type of graph is plotted for all equating designs.
Chain equating can also be used in kequate. To equate the same tests as in the NEAT case above but this time using CE, we write:

R> eqNEATCE <- kequate("NEAT_CE", 0:20, 0:20, 0:10, glmsim1, glmsim2)

Given two different equating functions derived from the same log-linear models, the SEED between two equatings can be calculated. In kequate, the function genseed() takes as input two objects of class 'keout' and calculates the SEED between two kernel equipercentile or linear equatings. By default the kernel equipercentile equatings are used. To instead compare two linear equatings to each other, the logical argument linear = TRUE should be used when calling genseed(). The output from genseed() is an object of class 'genseed' which can be plotted using plot(), creating a suitable plot of the difference between the equating functions and the associated SEED. To compare the NEAT PSE and NEAT CE equatings given above and plot the results, we supply the objects eqNEATPSE and eqNEATCE to genseed() and call plot() on the returned object. The resulting figure can be seen in Figure 4. The difference between the equating functions is outside of the error bands for many score values, indicating that the equatings differ significantly from each other.
In the above function calls, the default settings have been used. Under the default settings, both a KE-equipercentile equating and a linear equating are done. The continuization parameters will by default be set to the optimal value in the KE-equipercentile case and to 1000 · std error for the test scores in the linear case. It is possible to choose these parameters manually by specifying additional arguments in the function call. With a NEAT PSE design there are four continuization parameters to consider: hx, hy, hxlin, and hylin.

NEC design
In the NEC design, instead of using an anchor test to enable the equating of two tests when the groups taking the test are not equivalent, we utilize background information on the individuals taking the tests. In the example used here, an equating is made of two instances of a part of the Swedish Scholastic Assessment Test (variable names testX and testY) with the aid of covariates indicating high school math grade (variable name mattea1) and type of high school education (utb1). Using the function glm() in R, the objects NECYglm and NECXglm for each test have been specified.

R> summary(NECYglm)
Call:
glm(formula = frequency ~ I(testY) + I(testY^2) + I(testY^3) + I(mattea1) + I(mattea1^2) + factor(utb1) + I(testY):I(mattea1) + I(testY):factor(utb1) + I(mattea1):factor(utb1), family = "poisson", data = testdata2, x = TRUE)

We included three univariate moments for the test scores and two moments for the grade in mathematics. Additionally, interaction terms were added between the test scores and the covariates and also between the covariates. The resulting model fits the data well. A similar model was fitted for the scores of the other test administration. We now equate the two versions of the test by writing:

R> NECeq <- kequate("NEC", 0:22, 0:22, NECYglm, NECXglm)

The results from the summary() function are given below, showing that the two tests are fairly equal in difficulty when we have conditioned on relevant background variables. For lower scores it appears that test X is slightly more difficult, while at the higher end test Y is slightly more difficult. Due to the large sample sizes for the two tests, the estimated standard errors are small and our equating is quite reliable.
In addition to the default Gaussian kernel, kequate enables the usage of logistic and uniform kernels. To utilize a different kernel, the argument kernel is specified in the kequate() function call. Below, the previously defined log-linear models are used to equate the two tests in the NEC design using a logistic and a uniform kernel.

R> summary(NECeq)
R> NECeqL <- kequate("NEC", 0:22, 0:22, NECYglm, NECXglm,
+    kernel = "logistic")
R> NECeqU <- kequate("NEC", 0:22, 0:22, NECYglm, NECXglm,
+    kernel = "uniform")

In this case the equating function is almost identical between the three kernels, but there are some slight differences in the standard error of equating, which can be seen in Figure 5. For all designs it is also possible to specify the constants KPEN and wpen used in finding the optimal continuization parameters. Defaults are KPEN = 0 and wpen = 1/4. Additionally, the logical argument linear can be used to specify that only a linear equating is to be performed, where the default is linear = FALSE.
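To illustrate how the choice of kernel enters the continuization, the sketch below generalizes the Gaussian continuized CDF to logistic and uniform kernels, following the general form described by Lee and von Davier (2011), in which the variance of the kernel distribution enters the constant a_X. The function kernel_cdf() is hypothetical, the standard logistic distribution and the uniform distribution on (-1/2, 1/2) are assumed as kernels, and these scalings need not coincide with the defaults used in kequate.

```r
# Continuized CDF for three kernels; var_V is the variance of the kernel distribution.
kernel_cdf <- function(x, scores, r, h,
                       kernel = c("gaussian", "logistic", "uniform")) {
  kernel <- match.arg(kernel)
  K <- switch(kernel,
              gaussian = pnorm,
              logistic = plogis,
              uniform = function(z) punif(z, -0.5, 0.5))
  var_V <- switch(kernel, gaussian = 1, logistic = pi^2 / 3, uniform = 1 / 12)
  mu <- sum(scores * r)
  sigma2 <- sum((scores - mu)^2 * r)
  a <- sqrt(sigma2 / (sigma2 + var_V * h^2))
  sapply(x, function(xi) sum(r * K((xi - a * scores - (1 - a) * mu) / (a * h))))
}

scores <- 0:20
r <- dbinom(scores, 20, 0.6)  # hypothetical score probabilities
grid <- seq(-5, 25, by = 0.5)
cdfs <- sapply(c("gaussian", "logistic", "uniform"),
               function(k) kernel_cdf(grid, scores, r, h = 0.6, kernel = k))
```

Plotting the three columns of cdfs against grid shows how similar the continuized CDFs are, which is in line with the nearly identical equating functions reported above.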

Concluding remarks
In standardized achievement tests the most essential part is for all tests to be fair to test takers and between test takers. Since standardized tests are typically given at different time periods and with different test forms, it is essential to make sure that the test form a test taker is given does not affect his or her results. In this paper the R package kequate was presented in order to implement and make available the kernel method of test equating. This method can be used by researchers, the testing industry, and practitioners, i.e., anyone with an interest in equating. In addition, the package includes a new extension of the kernel method for the case when we have collateral information. Finally, IRT observed-score equating was added to allow for comparisons with a well-known and frequently used equating method. In the future the kernel method of test equating might be extended to incorporate more methods, and in those cases it will be easy to implement new methods in this package.