intsvy : An R Package for Analyzing International Large-Scale Assessment Data

This paper introduces intsvy, an R package for working with international assessment data (e.g., PISA, TIMSS, PIRLS). The package includes functions for importing data, performing data analysis, and visualizing results. The paper describes the underlying methodology and provides real data examples. Tools for importing data allow useRs to select variables from student, home, school, and teacher survey instruments as well as data for specific countries. Data analysis functions take into account the complex sample design (with replicate weights) and rotated test forms (with plausible values of achievement scores) in the calculation of point estimates and standard errors of means, standard deviations, regression coefficients, correlation coefficients, and frequency tables. Visualization tools present data aggregates in standardized graphical form.


Introduction
International large-scale assessment (LSA) studies measure student performance through standardized achievement tests and administer questionnaires to collect data on students, their families, and schools that shed light on the mechanisms responsible for student performance in a number of countries. The results have received a great deal of attention from researchers and policymakers around the world and have had significant impact on educational policy and on the educational debate. The Programme for International Student Assessment (PISA), the Trends in International Mathematics and Science Study (TIMSS), and the Progress in International Reading Literacy Study (PIRLS) stand out for their impact, comparative trend data, and number of participating countries. More recently, attention has also been directed towards the International Computer and Information Literacy Study (ICILS) and the Programme for the International Assessment of Adult Competencies (PIAAC). The data from PISA, TIMSS, PIRLS, ICILS, and PIAAC are publicly available, but their use is somewhat limited by the analytical tools available for handling the complex design of LSA studies.
The design of international LSA studies involves complex sampling and testing procedures that have consequences for the analysis stage. Sampling is conducted in two stages: Schools are selected in the first stage and students in the second stage. Testing uses a rotated design consisting of different test versions comparable through a common core of items. Datasets contain sampling variables (e.g., replicate weights) and plausible values of achievement scores in order to account for the complex sampling and test design, respectively. Traditional statistical procedures cannot handle these design complexities. Further, the organization of public datasets from TIMSS and PIRLS in a large number of files by country and survey instrument is not straightforward for users and requires commercial software alternatives (e.g., IDB Analyzer, IEA Hamburg 2017, in combination with SPSS, IBM Corporation 2015) in order to merge and select data. The R (R Core Team 2017) package intsvy (Caro and Biecek 2017) facilitates access to international assessment data by providing tools for importing data and conducting analysis while soundly considering the sample and test design in the calculation of statistics and associated standard errors. intsvy is an acronym for international surveys and the package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=intsvy.

Complex design of international LSA
Obtaining point estimates of any statistic of interest θ (e.g., mean, correlation, percentage, regression coefficient) is not particularly complicated with international assessment data. Standard procedures weighted by the total sampling weight can be used to calculate θ for the observed data. For student performance, the final estimate of group-level performance is the average of the estimates obtained from the plausible values:

\hat{\theta} = \frac{1}{M} \sum_{i=1}^{M} \hat{\theta}_i, \qquad (1)

where M is the number of imputations, typically 5 in international assessments, \hat{\theta}_i is the estimate obtained with plausible value i, and \hat{\theta} is the final estimate of student performance.
What is particularly challenging is the calculation of the standard error of θ, that is, the uncertainty associated with its estimation. This is because the complex test and sampling design introduce two sources of error in the estimation of θ: imputation error and sampling error, respectively. And these errors cannot be calculated with standard routines of statistical software. The calculation of correct standard errors is important for making valid comparisons of performance between countries or boys and girls, for example. It is for this reason that specialized tools like the intsvy package are required.

Rotated test design
The total item pool of international assessments consists of hundreds of items that demand hours of testing time in order to produce valid and reliable measures of student achievement constructs. Clearly, it is not feasible to administer a test covering the entire item pool, owing to logistical constraints, student fatigue, and overall testing time. International assessments employ a rotated test design in order to achieve a balance between validity and reasonable testing time. Test items are arranged into clusters that in turn are distributed between booklets administered to students. Clusters are distributed such that it is possible to link test booklets through clusters in common. Cluster linkage between booklets ensures the comparability of results between students and reporting on the same scale. Rotated test forms introduce technical complexities in the estimation of student performance, since students respond only to a subset of items, the ones in the booklet, but inferences on student performance are made as if the students had responded to the entire assessment through plausible value techniques (von Davier, Gonzalez, and Mislevy 2009).
The plausible values approach combines item response theory and latent regression techniques to produce unbiased estimates of student performance at the population level. Plausible values are random draws from the estimated posterior distribution of student performance given student responses to the subset of test items and background information collected in questionnaires. Importantly, plausible values are not used to infer performance at the individual level, since students responded only to a subset of the items and measurement errors at the individual level tend to be large. The average of plausible value estimates was calculated in Equation 1. The imputation variance reflects the uncertainty in the estimation associated with making multiple imputations of plausible values based on the posterior distribution of student performance. The formula of the imputation variance, VAR_imp[\hat{\theta}], is as follows (Little and Rubin 2002):

\mathrm{VAR}_{imp}[\hat{\theta}] = \frac{1}{M-1} \sum_{i=1}^{M} (\hat{\theta}_i - \hat{\theta})^2.
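As a worked illustration of Equation 1 and the imputation variance, the sketch below combines five hypothetical country-mean estimates, one per plausible value. The numbers are invented for illustration; intsvy performs these steps internally.

```r
# Hypothetical country means computed from each of the M = 5 plausible values
pv_means <- c(489.2, 490.1, 488.7, 489.9, 489.5)  # invented values
M <- length(pv_means)

# Final point estimate: average over plausible values (Equation 1)
theta_hat <- mean(pv_means)

# Imputation variance across plausible-value estimates (Little and Rubin 2002)
var_imp <- sum((pv_means - theta_hat)^2) / (M - 1)

c(estimate = theta_hat, imputation_variance = var_imp)
```

The spread of the five estimates is small here, so the imputation variance contributes little to the total error; with real achievement data the contribution can be substantial.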

Complex sample design
Student samples in international LSA are selected in two stages: Schools are sampled in the first stage and students within the school in the second stage. For example, 15-year-olds are sampled randomly within schools in PISA and intact classes within schools are sampled randomly in TIMSS and PIRLS. The sampling error takes into account the uncertainty related to the sample selection, as different samples of schools and students from the population do not necessarily yield the same estimates. The sampling error formula under two-stage sampling cannot assume that observations are independent as in random sampling, because students within schools tend to share similar characteristics, for example, family socio-economic status (SES) and the instructional setting. Compared to random sampling, the dependency of observations within schools in two-stage sampling tends to reduce the amount of information and increase the uncertainty of estimates, that is, the standard error. For example, a two-stage sample of 100 students per school in 10 schools will likely yield less information than a random sample of 1000 students. In one extreme scenario, if all students within schools are identical, the two-stage sample will represent 10 students and not 1000. In the other extreme, if all students within schools are uncorrelated, the two-stage sample size will be 1000. In real data the dependency of observations lies between these two scenarios (i.e., an effective sample size between 10 and 1000 students).
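The effective-sample-size argument above can be quantified with the standard Kish design-effect formula, deff = 1 + (m − 1)ρ, where m is the cluster size and ρ the intraclass correlation. This formula is not part of intsvy itself; the sketch simply makes the two extremes in the text concrete.

```r
# Design effect for clusters of m students with intraclass correlation rho
deff <- function(rho, m = 100) 1 + (m - 1) * rho

# Two-stage sample of 10 schools x 100 students = 1000 students:
# effective sample size for the two extremes in the text plus an
# intermediate value of rho
n <- 1000
round(n / deff(rho = c(0, 0.1, 1)))
```

With ρ = 0 the 1000 observations are all informative, with ρ = 1 only the 10 schools are, and a realistic intermediate value (here ρ = 0.1) lands in between.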
Replicate weights are used in international LSA to calculate sampling errors. A replicate weight represents a sample of schools. The PISA dataset, for example, contains 80 replicate weights that represent 80 different school samples. An estimate (e.g., mean, percentage, regression coefficient) can be obtained for each sample. The variability of estimates across samples or replicate weights indicates the uncertainty due to school sample selection or the sampling error. Like multilevel models, replicate weights estimation introduces randomness in the selection of schools. Multilevel models do it by introducing random effects and replicate weights estimation by creating different samples in the data while maintaining the traditional ordinary least squares (OLS) model. From this perspective, replicate weights can be regarded as a case of adapting the data to the model and multilevel models as one of adapting the model to the data. Further, school sample variation with replicate weights of international LSA is not entirely random but takes into account stratification (e.g., one school is selected at random within each stratum for each replicate weight). As a result, multilevel models and replicate weights estimation do not yield exactly the same results. To the extent that multilevel models do not take into account stratification information in random effects, they tend to produce standard errors that are larger than for regression analysis using replicate weights. There are different replication techniques for two-stage sampling. TIMSS and PIRLS employ jackknife repeated replication (JRR) and PISA employs balanced repeated replication (BRR) with Fay's modification. The principles underpinning these techniques and worked examples are presented in technical reports of international assessments (e.g., OECD 2014b). Here we will just present the formulas.
The sampling variance for PIRLS and TIMSS is:

\mathrm{VAR}_{samp}[\hat{\theta}] = \sum_{r=1}^{R} (\hat{\theta}_r - \hat{\theta})^2.

The sampling variance in PISA is:

\mathrm{VAR}_{samp}[\hat{\theta}] = \frac{1}{R(1-k)^2} \sum_{r=1}^{R} (\hat{\theta}_r - \hat{\theta})^2,

where \hat{\theta}_r is the estimate obtained with replicate weight r, k = 0.5 is Fay's factor, and R is the number of replicate weights: 75 jackknife replicate weights in PIRLS and TIMSS and 80 BRR replicate weights in PISA. For PIAAC, estimation is slightly more complicated because different replication methods and numbers of replications were used in different countries. Thus the general formula for the sampling variance in PIAAC is:

\mathrm{VAR}_{samp}[\hat{\theta}] = c \sum_{r=1}^{R} (\hat{\theta}_r - \hat{\theta})^2,

where the factor c depends on the replication method used in each country.

Standard error formula
The total standard error for single observed variables in international assessment data is equal to the sampling error. For the plausible values of student performance, the standard error additionally takes into account the imputation error. The total variance combines the sampling error and the imputation error as follows:

\mathrm{VAR}[\hat{\theta}] = \mathrm{VAR}_{samp}[\hat{\theta}] + \left(1 + \frac{1}{M}\right) \mathrm{VAR}_{imp}[\hat{\theta}].

The standard error is the square root:

\mathrm{SE}(\hat{\theta}) = \sqrt{\mathrm{VAR}[\hat{\theta}]}.
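Putting the pieces together, the total standard error of a plausible-value statistic can be sketched as follows; the two variance components are invented for illustration.

```r
# Invented variance components for illustration
var_samp <- 4.0    # sampling variance from replicate weights
var_imp  <- 0.312  # imputation variance from M = 5 plausible values
M <- 5

# Total variance and standard error
var_total <- var_samp + (1 + 1/M) * var_imp
se_total <- sqrt(var_total)
se_total
```

This is the quantity that intsvy reports alongside each point estimate.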

Overview of the package
There are different statistical tools for conducting analysis with international assessment data while handling replicate weights and plausible values. The IEA has produced the International Database IDB Analyzer, an SPSS add-on application for importing and analyzing data from IEA studies (e.g., PIRLS, TIMSS) and PISA. The National Center for Education Statistics (NCES) has developed the International Data Explorer (National Center for Education Statistics 2017), a web-based tool for creating tables and charts with data from PISA, PIRLS, TIMSS, and PIAAC. The OECD has published SPSS and SAS (SAS Institute Inc. 2013) macros for conducting analysis with PISA (OECD 2009). Mplus (Muthén and Muthén 1998-2017) is able to perform structural equation modeling while incorporating replicate weights. In Stata (StataCorp. 2015), the REPEST (Avvisati and Keslair 2014) and PV (Macdonald 2008) modules handle plausible values and replicate weights with IEA and OECD data. Non-commercial alternatives in R for analyzing survey data include the packages survey (Lumley 2004), BIFIEsurvey (BIFIE 2017), lavaan.survey (Oberski 2014), and the http://www.asdfree.com/ code repository (Damico 2015). Moreover, the packages DAKS (Ünlü and Sargin 2010) and multilevelPSA (Bryer and Pruzek 2011) include additional functionality for psychometric analyses.
Package intsvy provides a non-commercial and extendible alternative to the IDB Analyzer. Unlike available packages in R for survey analysis, intsvy is tailored towards the analysis of international assessment data specifically. For example, as with the IDB Analyzer, an important purpose of the package is to provide functions to import data from studies conducted by the International Association for the Evaluation of Educational Achievement (IEA), such as TIMSS and PIRLS. Also, analysis functions calculate estimates by education system, percentages of students by international benchmarks (e.g., TIMSS and PIRLS) and proficiency levels (e.g., PISA), estimate percentiles for achievement scores with plausible values, and implicitly assume the replication method used, for example BRR for PISA and JRR with one plausible value used for estimation of sampling error in TIMSS and PIRLS. That is, the useR is not required to enter study-specific parameters (e.g., the replication method, names of weight variables and plausible values) in the analysis or to know in-depth study-specific estimation procedures. With that, intsvy facilitates access and analysis of international assessments. At the same time, study-specific parameters can be modified and the package can be extended to handle data from other studies.
Package intsvy includes functions for importing data and for data analysis. Data importation functions include intsvy.var.label for printing variable names and variable labels by instrument as well as names of participating countries, and intsvy.select.merge for selecting and merging data into a single data frame. Analysis functions include intsvy.mean.pv for calculating means with plausible values, intsvy.mean for calculating means of observed variables, intsvy.table for frequency tables, intsvy.reg and intsvy.reg.pv for linear regression, intsvy.log and intsvy.log.pv for logistic regression, and intsvy.per.pv for percentiles of achievement scores. Alternatively, study-specific functions (e.g., pisa.reg.pv, timss.table) that call the generic functions (e.g., intsvy.reg.pv, intsvy.table) can be used. For example, the following two calls produce the same output of average mathematics scores by country using PISA data, one with the study-specific function pisa.mean.pv and the other with the generic function intsvy.mean.pv:

R> pisa.mean.pv(pvlabel = "MATH", by = "CNT", data = pisa)
R> intsvy.mean.pv(pvnames = paste0("PV", 1:5, "MATH"), by = "CNT",
+    data = pisa, config = pisa_conf)
The architecture of the package is presented in Table 1.

Select and merge data
Package intsvy provides tools for selecting and importing data into R. Data can be imported in two steps. First, the generic function intsvy.var.label facilitates data selection by reporting variable names, variable labels, and names of participating countries in available datasets. Secondly, the generic function intsvy.select.merge produces a single data frame for selected variables and countries. Sampling variables (i.e., replicate weights and total weights) and plausible variables are selected automatically. Alternatively, study-specific functions (e.g., pisa.var.label, pirls.select.merge) can be used.

TIMSS, PIRLS, and ICILS
Variable names, variable labels, and name abbreviations of countries in the PIRLS 2011 datasets are printed with:

R> pirls.var.label(folder = file.path(getwd(), "PIRLS 2011"))

The folder argument indicates where the multiple data files are located. The output is automatically stored in a text file located in the working directory (i.e., getwd()). The location and name of the output file can be modified with the output and name arguments.
Alternatively, the same output with data characteristics can be produced with the generic intsvy.var.label function,

R> intsvy.var.label(folder = file.path(getwd(), "PIRLS 2011"),
+    config = pirls_conf)

where the argument config = pirls_conf provides specific parameters for the PIRLS study.
Subsequently, selected data of specific variables and countries can be imported into a single data frame using intsvy.select.merge or study-specific functions (such as, for example, timssg8.select.merge, timssg4.select.merge, and pirls.select.merge). Data importing tools are particularly useful for TIMSS, PIRLS, and ICILS because original datasets available from the IEA Data Repository (http://rms.iea-dpc.org/) are organized in a large number of data files by country, school grade, and survey instrument (e.g., student questionnaire, home questionnaire, teacher questionnaire) and useRs are usually not familiar with the data administrative structure.

PISA and PIAAC
The data from PISA has a different structure. Original datasets available from the OECD (http://www.oecd.org/pisa/pisaproducts/pisa2012database-downloadabledata.htm) are organized in large files for the student, school, and parent questionnaire containing data for all participating countries. Accordingly, study-specific functions to describe (i.e., pisa.var.label) and import (i.e., pisa.select.merge) the data have a different structure with arguments for entering names of original data files directly.
Packages with consecutive releases of PISA data are named PISA2000lite, PISA2003lite, PISA2006lite, PISA2009lite, and PISA2012lite, while the package with PIAAC data is named PIAAC. For example, the following code installs the package with the PISA 2012 data:

R> library("devtools")
R> install_github("pbiecek/PISA2012lite")

Dictionaries with variable names are available in the student2012dict, school2012dict, and parent2012dict vectors. With the aid of the grep function it is possible to find a desired variable, for example the variable with the number of books at home.
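The pattern is the same as for the PIAAC dictionary shown below: grep over a named character vector of labels. So that the sketch is self-contained, a two-element toy vector stands in for the real student2012dict; the variable names and labels here are illustrative only.

```r
# Toy stand-in for the student2012dict vector from PISA2012lite
student2012dict <- c(ST04Q01 = "Gender",
                     ST28Q01 = "How many books at home")  # illustrative only

# Find the variable whose label mentions books; value = TRUE returns the
# matching labels (named by their variable names) instead of indices
grep(student2012dict, pattern = "books", value = TRUE)
```

Because grep's value = TRUE returns the matched elements of the vector with their names, the result identifies both the variable name and its label.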

The PIAAC package is installed analogously:

R> library("devtools")
R> install_github("pbiecek/PIAAC")
A single data frame with the PIAAC data is available as piaac, while a dictionary of variable names is stored in the piaacdict vector.
R> data("piaacdict", package = "PIAAC")
R> grep(piaacdict, pattern = "Number of books", value = TRUE)

                                J_Q08
"Background - Number of books at home"

A frequency table with the number of books at home is produced by:

R> data("piaac", package = "PIAAC")
R> piaac.table(variable = "J_Q08", data = piaac)


Average achievement scores

Means with plausible values are calculated with study-specific functions such as pisa.mean.pv. Three main arguments are supplied by the useR: pvlabel, by, and data. Argument pvlabel indicates the part of the label in common for the plausible values variables (e.g., "READ", "MATH"). Argument by defines the level of grouping for the analysis (e.g., "CNT") and may contain more than one level (e.g., c("CNT", "SEX")). And argument data defines the dataset to be used in the analysis. Alternatively, the generic function intsvy.mean.pv can be used.

PISA and PIAAC
For example, in PISA 2012, the average math performance by education system and associated standard errors can be calculated as follows (see OECD 2014a, p. 305):

R> pisa.mean.pv(pvlabel = "MATH", by = "CNT", data = pisa)

The argument pvlabel = "MATH" refers to the common name suffix of the variables containing the plausible values: PV1MATH, PV2MATH, PV3MATH, PV4MATH, and PV5MATH. For reading and science, this argument should be changed to pvlabel = "READ" or pvlabel = "SCIE", respectively.
The same output can be produced with:

R> intsvy.mean.pv(pvnames = paste0("PV", 1:5, "MATH"), by = "CNT",
+    data = pisa, config = pisa_conf)

where the structure is similar to pisa.mean.pv but the names of the plausible values are entered directly in pvnames and specific parameters for the PISA dataset are entered in the config argument.
More levels of grouping can be included in the analysis. For example the following code produces results by education system (CNT) and the student's sex (ST04Q01; 1 = female, 2 = male), while exporting results (export = TRUE) into a comma-separated value (CSV) file (see OECD 2014a, p. 305).
R> pisa.mean.pv(pvlabel = "MATH", by = c("CNT", "ST04Q01"),
+    data = pisa, export = TRUE, name = "PISA mean by sex",
+    folder = "C:/PISA/PISA 2012/Results")

The name of the resulting CSV file is PISA mean by sex.csv and it is located in the folder C:/PISA/PISA 2012/Results. It can be imported directly into a spreadsheet for further analysis or formatted for publication.

Average estimates without plausible values
Means and standard errors for variables without plausible values, that is, for all of the other variables in the datasets, can be calculated with the functions pisa.mean, piaac.mean, timss.mean, and pirls.mean, or with the generic function intsvy.mean.

TIMSS and PIRLS
For TIMSS 2011, the average of the index Students Like Learning Mathematics (BSBGSLM) by education system can be calculated with:

R> timss.mean(variable = "BSBGSLM", by = "IDCNTRYL", data = timss8g)

As before, the generic function intsvy.mean can be used to reproduce the same output.

Regression analysis
Regression analysis is performed by functions pisa.reg.pv, timss.reg.pv, pirls.reg.pv, and the generic function intsvy.reg.pv.

PISA and PIAAC
Differences in mean performance calculated previously for boys and girls can be tested for statistical significance using a regression approach. For example, significance tests can be conducted in PISA 2012 as follows (see OECD 2014a, p. 305):

R> pisa$SEX[pisa$ST04Q01 == 1] <- "female"
R> pisa$SEX[pisa$ST04Q01 == 2] <- "male"
R> pisa.reg.pv(pvlabel = "MATH", x = "SEX", by = "CNT", data = pisa)

The same output can be produced with the generic function:

R> intsvy.reg.pv(pvlabel = "MATH", x = "SEX", by = "CNT",
+    data = pisa, config = pisa_conf)

Argument x defines the independent variable(s), in this case SEX, but more variables can be included separated by commas (e.g., x = c("SEX", "ESCS")). The output is a list with regression results by education system. Coefficient SEXmale captures differences between boys and girls and its t value indicates whether they are statistically significant.
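The plausible-value part of what these regression functions do internally can be sketched on toy data: fit one OLS regression per plausible value, average the coefficients, and use the spread of the per-value coefficients for the imputation variance. This is a simplified sketch with invented data; the real functions additionally use replicate weights to obtain the sampling variance.

```r
# Toy data: one predictor and M = 5 invented plausible values per student
set.seed(42)
n <- 200
M <- 5
escs <- rnorm(n)  # stand-in for a predictor such as ESCS
pv <- replicate(M, 500 + 30 * escs + rnorm(n, sd = 80))

# One regression per plausible value; collect the slope estimates
slopes <- apply(pv, 2, function(y) coef(lm(y ~ escs))[["escs"]])

# Combined slope and its imputation variance (Little and Rubin 2002)
b_hat <- mean(slopes)
var_imp <- sum((slopes - b_hat)^2) / (M - 1)
c(slope = b_hat, imputation_variance = var_imp)
```

The combined slope recovers the value used to generate the toy data (30) up to sampling noise, which is exactly the behavior one expects from the pooled plausible-value estimate.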
Regression results including replicate estimates and residuals can be stored in an object and retrieved for further analysis. For example, pisa_ses contains results of a regression of mathematics performance on the student's sex and the index of economic, social, and cultural status (ESCS):

R> (pisa_ses <- pisa.reg.pv(pvlabel = "MATH", x = c("SEX", "ESCS"),
+    by = "CNT", data = pisa))

The internal structure of the object is displayed with:

R> str(pisa_ses)
The object contains a list with five elements, one for each education system. In turn, each element is a list containing five further elements, for example:

R> names(pisa_ses[["POL"]])
[1] "replicates" "residuals"  "var.w"      "var.b"      "reg"

Here var.w and var.b contain the within variance (i.e., sampling error) and the between variance (i.e., imputation error) of the regression coefficients, reg is a data frame with the final regression results, and replicates and residuals are again lists with five elements, one for each plausible value, containing replicate estimates and residuals.

With plausible values, the following code estimates the probability of being above proficiency level 5 in mathematics as a function of ESCS. The argument cutoff in intsvy.log.pv defines the level at which the plausible values are dichotomized, in this case 606.99, the lowest score at proficiency level 5. The binary dependent variable takes the value of one for scores above the cutoff and the value of zero for scores below or equal to the cutoff.
R> intsvy.log.pv(pvlabel = "MATH", cutoff = 606.99, x = "ESCS", by = "CNT",
+    data = pisa, config = pisa_conf)

The output reports odds ratios and associated confidence intervals in addition to coefficients, standard errors, and t values. The same output can be produced with:

R> pisa.log.pv(pvlabel = "MATH", cutoff = 606.99, x = "ESCS",
+    by = "CNT", data = pisa)

It is also possible to run a logistic regression without plausible values. We could, for example, estimate a regression of skipping class or school (the binary dependent variable SKIP) on having arrived late for school (the independent variable LATE). The logistic regression model can be estimated with the generic intsvy.log or with:

R> pisa.log(y = "SKIP", x = "LATE", by = "CNT", data = pisa)

A regression with literacy scores as the dependent variable and the participant's sex as the independent variable can be estimated analogously for PIAAC data.
As before, regression results can be stored in an object for further analysis. We will run the previous regressions again adding one independent variable, BSBGSLM in TIMSS, which is an index of how much students like learning mathematics, and ASBHELA in PIRLS which is the index of early literacy activities at home.

Frequency tables
Frequency tables are produced with study-specific functions (e.g., pisa.table, timss.table, pirls.table, piaac.table) or with the generic function intsvy.table. For example, the following code produces the frequency and percentage of students in each school grade level (i.e., variable = "ST01Q01") by education system in PISA 2012 (see OECD 2014a, p. 274):

R> pisa.table(variable = "ST01Q01", by = "CNT", data = pisa)

With TIMSS data, it is possible to calculate the percentage of students according to how much they like learning mathematics (1 = like learning mathematics; 2 = somewhat like learning mathematics; 3 = do not like learning mathematics) as reported by the students themselves, using timss.table (see Foy et al. 2013, p. 29). And using school-level data, we can calculate the percentage of students in schools classified by socio-economic composition (1 = more affluent; 2 = neither more affluent nor more disadvantaged; 3 = more disadvantaged) as reported by principals (see Foy et al. 2013, p. 36). The following code calculates specific percentiles for mathematics achievement in TIMSS:

R> timss.per.pv(pvlabel = "BSMMAT", per = c(5, 10, 25, 50, 75, 90, 95),
+    by = "IDCNTRYL", data = timss8g)

As before, the same results can be reproduced with the intsvy.per.pv function.

Data visualization
The functions presented above allow users to estimate averages, frequencies, and regression coefficients precisely, together with their standard errors. Since large tables filled with numbers can be difficult to understand at first sight, intsvy provides functions for data visualization that facilitate the interpretation of results.

Frequency tables
The plot method for 'intsvy.table' objects produces a ggplot2-based barplot that summarizes frequency tables. Optional arguments for this plot method are stacked (whether or not bars are stacked) and se (whether or not standard errors are plotted).
The following example calculates and plots two tables using the PIAAC dataset. The first is a plot of the age structure (see Figure 1) and the second a plot of the age structure by country and gender (see Figure 2).

Average achievement scores
Functions intsvy.mean.pv and intsvy.mean, as well as the associated study-specific functions (e.g., pisa.mean.pv, timss.mean), produce objects of the class 'intsvy.mean'. The associated plot method produces a ggplot2-based dotplot that presents the calculated averages and their standard errors.
Optional arguments for the plot method are sort (whether or not groups are sorted by their averages) and se (whether or not standard errors are plotted).
The following example calculates and plots average numeracy performance by country (see Figure 6) and by country and age group (see Figure 7) based on the PIAAC dataset.

Regression analysis
Functions intsvy.reg.pv and intsvy.reg produce objects of the class 'intsvy.reg'. The associated plot method produces a ggplot2-based dotplot that summarizes regression coefficients and their standard errors.
Optional arguments for the plot method are sort (whether or not groups are sorted by coefficient value) and se (whether or not standard errors are plotted).
The following example calculates and plots regression coefficients based on the PIAAC dataset (see Figure 10). A corresponding example for TIMSS regresses mathematics scores on gender and the Students Like Learning Mathematics index (BSBGSLM):

R> timss_like <- timss.reg.pv(pvlabel = "BSMMAT", by = "IDCNTRYL",
+    x = c("SEX", "BSBGSLM"), data = timss8g)
R> plot(timss_like, vars = "BSBGSLM")

Figure 11: Graphical summary of regression models. This example presents outcomes for regression models with mathematics scores as the dependent variable and gender (SEX) and the economic, social, and cultural status index (ESCS) as independent variables, based on the PISA dataset.

Figure 13: Graphical summary of regression models. This example presents outcomes for regression models with reading scores as the dependent variable and gender (SEX) and the index of early literacy activities at home (ASBHELA) as independent variables, based on the PIRLS dataset.

Summary
This article introduced intsvy and demonstrated its use with data from PISA, PIRLS, TIMSS, ICILS, and PIAAC. Package intsvy provides another alternative within R to soundly handle data from international LSA. In addition to analysis and visualization tools, the package includes functions for merging and importing data, which are particularly handy for TIMSS and PIRLS. The package can be extended to handle datasets from different international assessment studies. There are several limitations and plans for incorporating new features in future releases of this package. Currently, intsvy handles missing data using listwise deletion, cannot analyze trend data from international LSA, and cannot perform tests of statistical significance beyond those provided by regressions.