BTLLasso: A Common Framework and Software Package for the Inclusion and Selection of Covariates in Bradley-Terry Models

In paired comparison models, the inclusion of covariates is a tool to account for the heterogeneity of preferences and to investigate which characteristics determine the preferences. Although methods for the selection of variables have been proposed, no coherent framework that combines all possible types of covariates is available. Three different types of covariates can occur in paired comparisons: covariates can vary over the subjects, over the objects, or over both the subjects and the objects of the paired comparisons. This paper gives an overview of all possible types of covariates in paired comparisons and introduces a general framework to include covariate effects in Bradley-Terry models. For each type of covariate, appropriate penalty terms are proposed that allow for sparser models and, therefore, easier interpretation. The whole framework is implemented in the R package BTLLasso. The main functionality and the visualization tools of the package are introduced and illustrated by real data sets.


Introduction
Paired comparisons have a long tradition in statistics, with roots dating back to Zermelo (1929). In many applications the objective is to find the underlying preference scale by presenting objects or items in pairs. The method has been used in various areas, for example in psychology to measure the intensity or attractiveness of stimuli, in marketing to evaluate the attractiveness of brands, and in the social sciences to investigate value orientation, see, e.g., Francis, Dittrich, Hatzinger, and Penn (2002). Paired comparisons are also found in sports whenever two players or teams compete. Then the non-observable scale to be found refers to the strengths of the competitors.
The most widely used model for paired comparison data is the Bradley-Terry model. Originally proposed by Bradley and Terry (1952), it is also referred to as the Bradley-Terry-Luce (BTL) model, indicating the strong connection to Luce's choice axiom formulated in Luce (1959). Luce's choice axiom states that the decision between two objects is not influenced by other objects, which is also known as the independence of irrelevant alternatives. Over the years various extensions have been proposed allowing for dependencies among responses, time dependence or simultaneous ranking with respect to more than one attribute. Overviews were given by Bradley (1976), David (1988) and more recently by Cattelan (2012).
The main problem when applying the Bradley-Terry model is that it implies severe restrictions and assumptions. In particular, it assumes that there is a one-dimensional latent scale that represents the dominance, strength or attractiveness of objects. In experiments with several persons or, more generally, several subjects, the BTL model is typically fitted under the assumption that the strengths of the objects are equal for all subjects. Since the heterogeneity across subjects is ignored, the model often yields a bad fit. The assumption of a latent scale becomes much weaker if one assumes that the strengths of objects depend on covariates. The covariates can be subject-specific or object-specific. The former refer to characteristics of the subjects who choose between alternatives/objects, the latter to characteristics of the objects that are compared. Also subject-object-specific covariates, that is, characteristics that vary both over subjects and over objects, are possible. Explicit modeling of heterogeneity is rare and usually restricted to few covariates, see Turner and Firth (2012) or Francis, Dittrich, and Hatzinger (2010). Software packages for the analysis of paired comparisons that are available as R packages (R Core Team 2019) are BradleyTerry2 (Turner and Firth 2012) and prefmod (Hatzinger and Dittrich 2012). Both allow for the integration of parametric effects of object-specific and subject-specific covariates into paired comparison models. To make models with object-specific covariates identifiable, package BradleyTerry2 uses random effects for the main object parameters. In package prefmod, the object parameters are excluded. However, the packages do not contain built-in procedures for the selection and clustering of covariate effects and do not use regularized estimation. Also, neither of the packages allows for object-specific order effects.
More recently, methods that are able to handle a larger number of explanatory variables have been proposed. Casalicchio, Tutz, and Schauberger (2015) presented a boosting approach and developed the corresponding R package ordBTL (Casalicchio 2014). The approach is restricted to subject-specific covariates. Also, the boosting approach (in contrast to the penalization proposed in this work) is not designed to cluster effects. Strobl, Wickelmaier, and Zeileis (2011) used recursive partitioning techniques and implemented partitioned paired comparison models in the R package psychotree (Zeileis, Strobl, Wickelmaier, Komboz, and Kopf 2018). This approach generates sub-spaces of the predictor space where identical Bradley-Terry models hold; the models differ between the sub-spaces. The results can be visualized by trees, which provide a very intuitive and user-friendly interpretation when the main focus is to find sub-groups of subjects with equal preferences. This approach, too, is restricted to subject-specific covariates.
The objective of the present paper is to develop a common framework for the inclusion and selection of covariates in BTL models based on penalization techniques. The inclusion of covariates serves two purposes: it accounts for heterogeneity in the population, yielding more realistic models, and it allows one to investigate which variables determine the strengths of objects. The proposed methods are implemented in the R package BTLLasso (Schauberger 2019) available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=BTLLasso. In the following, the use of the package is also described and illustrated.

The Bradley-Terry model
In the following the simple binary Bradley-Terry model and extended versions that allow for ordinal responses are briefly sketched.

The basic model
Assuming a set of objects {a_1, ..., a_m}, in its most simple form the (binary) Bradley-Terry model is given by

P(a_r > a_s) = P(Y_(r,s) = 1) = exp(γ_r − γ_s) / (1 + exp(γ_r − γ_s)).   (1)

The response of the model represents the probability that a certain object a_r dominates (or is preferred over) another object a_s, a_r > a_s. The response is formalized in the random variable Y_(r,s), which is defined to be Y_(r,s) = 1 if a_r is preferred over a_s and Y_(r,s) = 0 otherwise. The parameters γ_r, r = 1, ..., m, represent the attractiveness or strength of the respective objects. For identifiability, a restriction on the parameters is needed, for example ∑_{r=1}^m γ_r = 0 or γ_m = 0.
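To make the model concrete, model (1) can be evaluated in a few lines; the following is an illustrative Python sketch, not code from the BTLLasso package:

```python
import math

def bt_prob(gamma_r, gamma_s):
    """P(a_r > a_s) in the binary Bradley-Terry model (1)."""
    return 1.0 / (1.0 + math.exp(-(gamma_r - gamma_s)))
```

Two objects of equal strength win with probability 0.5 each, and bt_prob(γ_r, γ_s) + bt_prob(γ_s, γ_r) = 1, as the model requires.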

Bradley-Terry models with ordered response categories
In many applications the dominance of one of the objects is quite naturally observed on an ordered scale. For example, in sport competitions one often has to account for draws. Early extensions of the BTL model include at least the possibility of ties, see Rao and Kupper (1967), Glenn and David (1960) and Davidson (1970). General models for ordered responses that allow, for example, for a 5-point scale (much better, slightly better, equal, slightly worse, much worse) were proposed by Tutz (1986) and Agresti (1992). For K response categories, the model parameterizes the cumulative probabilities

P(Y_(r,s) ≤ k) = exp(θ_k + γ_r − γ_s) / (1 + exp(θ_k + γ_r − γ_s)),   (2)

with k = 1, ..., K denoting the possible response categories. The parameters θ_k represent so-called threshold parameters for the single response categories; they determine the preference for specific categories. In particular, Y_(r,s) = 1 represents the maximal preference for object a_r over a_s and Y_(r,s) = K represents the maximal preference for object a_s over a_r. For ordinal paired comparisons it can be assumed that the response categories have a symmetric interpretation so that P(Y_(r,s) = k) = P(Y_(s,r) = K − k + 1) holds. Therefore, the threshold parameters should be restricted. One postulates θ_k = −θ_{K−k} and, if K is even, θ_{K/2} = 0 to guarantee symmetric probabilities. The threshold for the last category is fixed to θ_K = ∞ so that P(Y_(r,s) ≤ K) = 1 holds. The probability for a single response category can be derived from the difference between two adjacent cumulative probabilities, P(Y_(r,s) = k) = P(Y_(r,s) ≤ k) − P(Y_(r,s) ≤ k − 1). To guarantee non-negative probabilities for the single response categories one restricts θ_1 ≤ θ_2 ≤ ... ≤ θ_K. The ordinal Bradley-Terry model is a cumulative logit model and can be estimated using methods from this general framework.
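The cumulative model and the symmetry constraint can be checked numerically. The sketch below (plain Python, with hypothetical threshold values chosen only for illustration) computes the single-category probabilities for K = 5:

```python
import math

def category_probs(thresholds, gamma_r, gamma_s):
    """P(Y_(r,s) = k) for k = 1..K from the cumulative logit model.

    thresholds: theta_1 <= ... <= theta_K with theta_K = inf,
    here chosen to satisfy the symmetry restriction theta_k = -theta_{K-k}.
    """
    cum = [1.0 / (1.0 + math.exp(-(t + gamma_r - gamma_s))) for t in thresholds]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

# Hypothetical symmetric thresholds for K = 5: theta_1 = -theta_4, theta_2 = -theta_3
theta = [-2.0, -0.5, 0.5, 2.0, math.inf]
```

The probabilities sum to one, and P(Y_(r,s) = k) = P(Y_(s,r) = K − k + 1) holds, as claimed above.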

Bradley-Terry models with order effects
In some specific paired comparisons it can be decisive in which order the competing objects are presented. Typical examples are sports events, for example, matches in the German Bundesliga, which are considered as an exemplary data set in this article. In this application, the first team of the ordered tuple (a_r, a_s) is the team playing at its home ground, where it might have a home advantage over its opponent. Therefore, the assumption that the response categories are symmetric does not hold anymore and the model needs to be adapted accordingly. Extending the basic models by an additional parameter δ yields the binary Bradley-Terry model with an order effect

P(Y_(r,s) = 1) = exp(δ + γ_r − γ_s) / (1 + exp(δ + γ_r − γ_s))

and the ordinal model with an order effect

P(Y_(r,s) ≤ k) = exp(δ + θ_k + γ_r − γ_s) / (1 + exp(δ + θ_k + γ_r − γ_s)).   (3)

Here, δ denotes the order effect, which is simply incorporated into the design matrix by an additional intercept column. If δ > 0, it increases the probability of the first object a_r to win the comparison or, in the case of an ordinal response, to achieve a superior result. Given the order effect, the symmetry assumption for the response categories still holds.
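A small numerical check (again an illustrative sketch, not package code) shows how a positive order effect shifts the binary model in favor of the first-named object:

```python
import math

def bt_prob_order(delta, gamma_r, gamma_s):
    """P(Y_(r,s) = 1) in the binary Bradley-Terry model with order effect delta."""
    return 1.0 / (1.0 + math.exp(-(delta + gamma_r - gamma_s)))
```

With δ = 0 the model reduces to model (1); with δ > 0 even two equally strong teams no longer win with probability 0.5 each.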

The general Bradley-Terry model with explanatories
In many applications covariates that characterize the subjects and/or the objects are available. These can be used to model the heterogeneity in the population and to investigate which variables determine the choice between alternatives. We first describe two applications that will be used later, the respective data sets can be found in the package BTLLasso.

Election data
The German Longitudinal Election Study (GLES) is a long-term study of the German electoral process, see Rattinger, Roßteutscher, Schmitt-Beck, Weßels, and Wolf (2014). It collects pre- and post-election data for several federal elections. The specific data we are using here originate from the pre-election survey for the German federal election in 2013. In this specific part of the study, 2003 persons were asked to rate the most important parties (CDU/CSU, SPD, Greens, Left Party, FDP) for the upcoming federal election on a scale from −5 to +5. The rating scores Z_ir reflect the general opinion of participant i on party a_r, with +5 representing a very positive and −5 representing a very negative opinion. The original scores were transformed into paired comparisons by building all pairwise differences of the scores of all five parties per person. The transformation of rating scales to ordered paired comparison data was proposed by Dittrich, Francis, Hatzinger, and Katzenbeisser (2007). They also describe in detail the advantages of the transformation for the analysis of rating scales. Using this procedure one ends up with ordered paired comparisons with values between −10 and 10.
The response was narrowed down to an ordered response with five categories. The data now represent paired comparisons between all parties measured on an ordered five-point scale:

Z_ir − Z_is ∈ {6, ..., 10} ↦ Y_i(r,s) = 1: "Strong preference of party a_r over party a_s."
Z_ir − Z_is ∈ {1, ..., 5} ↦ Y_i(r,s) = 2: "Weak preference of party a_r over party a_s."
Z_ir − Z_is = 0 ↦ Y_i(r,s) = 3: "Equal opinions on parties a_r and a_s."
Z_ir − Z_is ∈ {−5, ..., −1} ↦ Y_i(r,s) = 4: "Weak preference of party a_s over party a_r."
Z_ir − Z_is ∈ {−10, ..., −6} ↦ Y_i(r,s) = 5: "Strong preference of party a_s over party a_r."

There is a crucial difference compared to the simple paired comparison models from Section 2. In addition to the compared objects a_r and a_s, the response Y_i(r,s) now also depends on the subject i that executes the comparison between both objects.
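The mapping from rating differences to the five response categories is easy to express in code; the following sketch assumes integer differences in −10, ..., 10:

```python
def to_category(diff):
    """Map Z_ir - Z_is (an integer in -10..10) to the 5-point response Y_i(r,s)."""
    if diff >= 6:
        return 1  # strong preference for a_r
    if diff >= 1:
        return 2  # weak preference for a_r
    if diff == 0:
        return 3  # equal opinions
    if diff >= -5:
        return 4  # weak preference for a_s
    return 5      # strong preference for a_s
```

Note that the mapping is symmetric: reversing the sign of the difference mirrors the category, consistent with the symmetric interpretation of the response categories.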
The transformation of rating scale data to paired comparisons is not without problems. First and foremost, when calculating the differences between two ratings Z_ir − Z_is one assumes more than an ordinal scale level. Although this is frequently done in the literature, it is certainly not optimal. In the present application the effect of building differences is mitigated by using a strongly coarsened response scale. The 21-point scale produced by differences in Likert ratings is narrowed down to a 5-point scale, for which the ordinal scale level is assumed. The fewer categories one uses for the resulting paired comparisons, the weaker are the assumptions with respect to the scale level of the ratings. A three-point scale built from just distinguishing between positive, negative and zero differences would fully respect the ordinal nature of the scales. A scale of this type was used by Dittrich et al. (2007). Second, paired comparisons from rating scales are strictly transitive. This entails that for a single subject certain patterns of paired comparisons are not observed. Dittrich et al. (2007) explicitly account for the reduced number of possible patterns that can be derived from the ratings. Lastly, the use of rating scales could contribute to making paired comparisons dependent. Dependence of observations in paired comparison data stemming from various individuals is a general problem. Dittrich, Hatzinger, and Katzenbeisser (2002) use interaction terms that account for the dependencies of pairs of paired comparisons when both refer to a joint object. In Dittrich et al. (2007) these interaction terms are included for paired comparison data obtained from rating scales. The current approach avoids this rather parameter-intensive parameterization and tries to account for dependencies by including various subject-specific covariates.
Two different types of covariates are available in the GLES data set, namely subject-specific covariates and subject-object-specific covariates. While the first type of covariates only characterizes the participants of the study, the second type characterizes both the participants and the parties. A more detailed distinction between the possible types of covariates in paired comparisons follows in Section 3.2. For the sake of simplicity, we restrict the analysis to include only the following three covariates (out of eight covariates in the data set GLES in the package) that characterize the participants of the study: • Age: age of participant in years.
The main question regarding these variables will be if and how they influence the preferences for single parties. In addition to these person-specific covariates, three covariates are available that vary both over the persons and over the parties. These variables measure the (self-perceived) distances between participants and the single parties with respect to various political topics. These so-called political dimensions refer to the socio-economic dimension and the topics of immigration and climate change. The participants are supposed to report both their self-perception and their perception of the single parties toward a topic on a Likert scale with 11 ordered levels. For example, for the topic of climate change the Likert scale offers a graded response between the following statements: Level 1: Fight against climate change should take precedence, even if it impairs economic growth.
Level 11: Economic growth should take precedence, even if it impairs the fight against climate change.
The absolute difference between these perceptions can be seen as the (self-perceived) absolute distance between the person (subject) and the party (object) with respect to a political topic. Therefore, the variables SocEc, Immigration and Climate are available separately for all five parties.

Football data
The second data set we consider consists of data from the 2015/2016 season of the German Bundesliga. The German Bundesliga is played as a double round robin between 18 teams. Table 1 shows the final table of the season 2015/16 together with the abbreviations of the teams used in the application later. As in all three previous seasons, Bayern München won the championship. VfB Stuttgart and Hannover 96 were relegated to the second division.
In the data set, one match is treated as a paired comparison of the two teams with respect to their playing abilities. In total, a Bundesliga season has 34 match days with 9 matches per match day. Therefore, in total 306 matches are observed. The response variable Y_i(r,s) represents the outcome of a match between team a_r (as the home team) and team a_s on match day i. We use a 5-point scale defined by • Distance: total distance run (in km).
The research question here is to find out whether Distance and/or BallPossession are decisive variables for the strength of a team and whether the effects differ across teams. Besides these two variables, also the average market values of the players of the single teams are known. The market values are constant across the whole season and, in contrast to the previous variables, are not specific to single match days but only to single teams. Therefore, distance and ball possession are subject-object-specific covariates while market values are object-specific.

Model specification
In general, when modeling paired comparison data three different types of covariates have to be distinguished. All three types of covariates occur in the exemplary data presented in Section 3.1 and will be described in detail in the following. However, first a clear distinction between subjects and objects in paired comparisons has to be made. In our notation, objects refer to the units that are compared in a paired comparison experiment. For example, in the GLES data from Section 3.1, the parties are the objects which vary in their attractiveness to voters. In football matches (as considered in the Bundesliga data from Section 3.1), the teams are the objects that are compared to measure their playing abilities. In contrast, the subjects in paired comparisons refer to the units that actively make a decision concerning the dominance of objects in a paired comparison. In the election example the voters (or rather the participants of the study) represent the subjects. In football, the definition of subjects is less obvious, but also here subjects can be identified. We consider the day of the match or the match itself, which can be characterized by covariates, as the subject of the paired comparison.
In the following, we distinguish between subject-specific covariates x_i, object-specific covariates z_r and subject-object-specific covariates z_ir. In the general framework proposed here, those covariates can be included (simultaneously) into Bradley-Terry models. For that purpose we consider the (ordinal) Bradley-Terry model in a generalized form. Let Y_i(r,s) ∈ {1, ..., K} denote the response of subject i if the objects (a_r, a_s) are presented. The ordering in the tuple (a_r, a_s) is not arbitrary; it may represent the ordering of presentation (a_r first, a_s second) or the location of the meeting of teams (a_r at home, a_s away). Then the model has the form

P(Y_i(r,s) ≤ k) = exp(δ + θ_k + γ_ir − γ_is) / (1 + exp(δ + θ_k + γ_ir − γ_is)).   (4)

In contrast to model (3), the global object parameters γ_r are replaced by subject-specific parameters γ_ir for subjects i, i = 1, ..., n, and objects a_r, r = 1, ..., m. The extension allows to model heterogeneity of subjects by including object-specific and subject-(object-)specific covariates. Generally, for all possible covariates we assume a linear dependence structure, with the linear predictors of the model given by

η_i(r,s)k = δ + θ_k + γ_ir − γ_is,

but with the strength parameters γ_ir and γ_is depending on explanatory variables.

Subject-specific covariates
Subject-specific covariates characterize the subjects that perform the comparisons between the objects. In the case of the election data, the participants of the study represent the subjects. Covariates like age, gender or unemployment status of the participants are treated as subject-specific covariates, which may modify the attractiveness of parties. This is obtained by assuming that the attractiveness parameters of the parties (the strength parameters γ_ir) have the form

γ_ir = β_r0 + x_i^T β_r,

where x_i is a vector of subject-specific covariates. The inclusion of subject-specific covariates into a paired comparison model automatically requires object-specific effects β_r, which vary over objects or alternatives. In the election study, for example, one wants to know if and how the attractiveness of parties depends on the gender of the voters. Consequently, for every party a distinct (object-specific) gender effect has to be specified. Accordingly, the attractiveness of object a_r depends on the covariates x_i of subject i. The parameter β_rj represents the increase of attractiveness of object a_r if the jth variable increases by one unit. To obtain identifiability of the parameters, symmetric side constraints are used, i.e., one postulates ∑_{r=1}^m β_rj = 0 for all j.
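As a numerical illustration of this parameterization (plain Python with made-up coefficients, not package code), the strengths γ_ir for one subject follow directly from the intercepts and the object-specific effects; under the symmetric side constraints they sum to zero over the objects:

```python
def subject_strengths(beta0, betas, x):
    """gamma_ir = beta_r0 + x_i^T beta_r for all objects r, for one subject with covariates x."""
    return [b0 + sum(b * xj for b, xj in zip(br, x))
            for b0, br in zip(beta0, betas)]

# Made-up example: 3 objects, 1 covariate; intercepts and effects obey the zero-sum constraints
beta0 = [1.0, -0.4, -0.6]
betas = [[0.4], [-0.1], [-0.3]]
```

Because both the intercepts and each column of effects sum to zero over the objects, the resulting strengths do as well, mirroring the side constraints of the model.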
The inclusion of subject-specific covariates strongly weakens the assumptions of the model. It is no longer assumed that the strengths of the objects are fixed and equal for all subjects when they choose between a pair of objects. The model explicitly accounts for heterogeneity over subjects. Subject-specific covariates in paired comparisons were considered before by Turner and Firth (2012), Francis et al. (2002) or Francis et al. (2010). More recently, Casalicchio et al. (2015) presented a boosting approach and Schauberger and Tutz (2017) a penalization approach based on an L_1 penalty. Both are able to include explanatory variables and select the relevant ones. An alternative approach has been proposed by Strobl et al. (2011). It is based on recursive partitioning techniques (also known as trees) and automatically selects the relevant variables among a potentially large set of variables.

Subject-object-specific covariates
Covariates may also vary both over subjects and objects, yielding subject-object-specific covariates. For example, in election studies such as the election data presented above, it is frequently of interest how political issues determine the preference for specific parties. The so-called spatial election theory assumes that each party is characterized by a position in a finite-dimensional space, with each dimension corresponding to a political issue. Spatial election theory then explains the party choice of voters by a utility function that depends on the distance between the voter's own position and that of the parties within the space of policy dimensions. For details see, for example, Thurner and Eymann (2000) and Mauerer, Pössnecker, Thurner, and Tutz (2015). In the election data, the covariates SocEc, Immigration and Climate measure the distance between participants and the single parties on three different political dimensions and are, therefore, considered as subject-object-specific covariates.
The effect of subject-object-specific covariates can be modeled in the same way as the effect of subject-specific covariates. With z_ir denoting the corresponding covariate vector one uses

γ_ir = β_r0 + z_ir^T α_r.

We use the notation z_ir for the vector and α_r for the weights to distinguish the cases of subject-specific and subject-object-specific covariates.
In contrast to subject-specific covariates, subject-object-specific covariates can also be modeled with global effects, which yields the simpler form

γ_ir = β_r0 + z_ir^T α.

The underlying assumption is that the effect is equal across all objects. That means, for example, that the distances between the positions of respondents and parties on an issue like climate change have the same relevance for all parties. With the usual constraint on the intercepts, for example ∑_{r=1}^m β_r0 = 0, the parameters are identified if there is enough variation over subjects, even in the case of object-specific effects. This may be checked by investigating whether the corresponding design matrix for all paired comparisons has full rank. However, due to the penalization techniques we will use for estimation, the parameters will be estimable even in cases where the design matrix might not have full rank, provided the penalization is strong enough.
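Under a global effect α, only differences of strengths enter the model, and these are antisymmetric in the two objects. A minimal sketch with illustrative values:

```python
def strength_difference(beta0_r, beta0_s, z_ir, z_is, alpha):
    """gamma_ir - gamma_is = beta_r0 - beta_s0 + (z_ir - z_is)^T alpha (global effect)."""
    return beta0_r - beta0_s + sum(a * (zr - zs)
                                   for a, zr, zs in zip(alpha, z_ir, z_is))
```

Swapping the roles of a_r and a_s flips the sign of the difference, which is exactly what keeps the symmetric interpretation of the response categories consistent.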

Object-specific covariates
It is more difficult to include pure object-specific covariates. Object-specific covariates characterize the compared objects but are constant across the subjects. In the football data, the market value is an example of such an object-specific variable because the market values vary across teams but are constant across match days. Similar to the procedure proposed here, Tutz and Schauberger (2015) included an object-specific covariate in an analysis of the German Bundesliga.
The contribution of the object-specific covariates z_r to the strength of object a_r may be specified by γ_ir = γ_r = β_r0 + z_r^T τ, where τ is a global parameter. Then z_r^T τ represents the part of the attractiveness that is due to the covariates and β_r0 represents the part that is not explained by covariates. For example, in the election data, if one is interested in whether governing makes parties attractive, one can use a dummy variable z = 1 for parties in the government and z = 0 for parties of the opposition. Then, τ represents the importance of being part of the government for the attractiveness of a party. In the case of object-specific covariates, no heterogeneity across subjects is modeled.
It is obvious that this parameterization leads to identifiability problems and additional side constraints are needed. Let us consider the simple example of a binary covariate; as before, let z = 1 indicate that a party is part of the government and z = 0 that it is not. Then, the parties are nested within the factor z. Additional side constraints have to restrict the parameters within the blocks formed by z = 1 and z = 0. One may use ∑_{j∈M_1} β_j0 = 0, where M_1 and M_0 form a partition of the parties, with M_1 collecting all parties with z = 1 and M_0 all parties with z = 0. Identifiability issues will be considered in more detail in the following sections.

Intercepts
In most models, in addition to the various covariate effects, also object-specific intercepts β_r0 are needed. Without covariates, the intercepts correspond to the regular strength/ability parameters from the basic Bradley-Terry models (1) or (2). Also for the intercept parameters, the additional side constraint ∑_{r=1}^m β_r0 = 0 is necessary for identifiability.

Order effects
As already mentioned in Section 2, in some circumstances the order of the competing objects can be decisive and order effects are necessary. Instead of a global order effect δ, which is equal across all objects, also object-specific order effects δ_r are possible. For example, in the German Bundesliga an order effect is clearly necessary as it is well known that a team playing at its home ground has an advantage over the away team. If object-specific order effects are considered, one allows for different home effects across the teams.
To sum up all options for specifying the components in the general paired comparison model (4), Table 2 collects all types of covariates and all possible parameterizations we consider.

Representation as generalized linear model
It is straightforward to embed the considered models into the framework of generalized linear models (GLMs). The basic model (4) is a so-called cumulative ordinal regression model, which is a multivariate GLM. It has the form

log(P(Y_i(r,s) ≤ k) / P(Y_i(r,s) > k)) = η_i(r,s)k

with linear predictors

η_i(r,s)k = δ + θ_k + γ_ir − γ_is,   k = 1, ..., K − 1.

(Table 2: covariate types and possible effect types, including a global order effect δ or object-specific order effects δ_r.)

All the models considered in the previous section are also multivariate GLMs with more structured predictors. For example, in the case of subject-object-specific covariates with a global effect α the predictor has the form

η_i(r,s)k = δ + θ_k + β_r0 − β_s0 + (z_ir − z_is)^T α.

As far as estimation by maximum likelihood (ML) is concerned, the only difference lies in the structure of the linear predictor and the design matrix, which does not only consist of dummies for the objects as in the basic model.
Thus, in principle all the tools provided by GLMs, including tests, computation of standard errors, and algorithms for the maximization of the log-likelihood, are available. For the theory and computational tools, see, for example, Fahrmeir and Tutz (1997) or Tutz (2012).

Regularized estimation
In BTL models the inclusion of covariates yields models that can contain a huge number of parameters. If one includes, for example, a p-dimensional vector of subject-specific covariates, one has to estimate a (p + 1)-dimensional parameter vector (β_r0, β_r^T) for each object, which already for a moderate number of objects and a moderate size of the covariate vector yields a large number of parameters to estimate. Therefore, in applications ML estimates tend to be unstable or may not even exist.
One of the main features of package BTLLasso is the selection of relevant terms and the implicit reduction of the complexity of the model. In the simplest case selection refers to the selection of variables, since typically in high-dimensional settings not all of the explanatory variables have an impact on the preference of objects. However, model complexity can also be reduced by identifying clusters of objects that share the same strength or variable effects.
For the selection of relevant terms, we propose to use a penalized likelihood approach. Instead of maximizing the log-likelihood l(ξ), one maximizes the penalized log-likelihood

l_p(ξ) = l(ξ) − λ J(ξ),

where l(ξ) is the usual log-likelihood with ξ denoting a vector that contains all the parameters of the model, and J(ξ) is a penalty term that penalizes specific structures in the parameter vector.
The parameter λ is a tuning parameter that specifies how seriously the penalty term has to be taken. For the extreme choice λ = 0 one obtains the ML estimate. The choice of the tuning parameter is discussed later.
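For the binary model, the penalized criterion is easy to write down explicitly. The sketch below (illustrative Python, using a fusion penalty on the strength parameters as an example of J(ξ)) evaluates the penalized log-likelihood for given parameters:

```python
import math

def loglik(gamma, results):
    """Binary Bradley-Terry log-likelihood; results holds triples (r, s, y),
    with y = 1 if object r won the comparison and y = 0 otherwise."""
    ll = 0.0
    for r, s, y in results:
        p = 1.0 / (1.0 + math.exp(-(gamma[r] - gamma[s])))
        ll += math.log(p if y == 1 else 1.0 - p)
    return ll

def penalized_loglik(gamma, results, lam):
    """l(xi) - lambda * J(xi), with J the sum of absolute pairwise strength differences."""
    m = len(gamma)
    J = sum(abs(gamma[r] - gamma[s]) for r in range(m) for s in range(r + 1, m))
    return loglik(gamma, results) - lam * J
```

For λ = 0 the criterion coincides with the unpenalized log-likelihood; for larger λ, parameter vectors with many distinct strengths are penalized more heavily.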

Penalties
Whatever the type of the variables and the components included in the model, the threshold parameters θ k of the ordinal model are never penalized. For the other model components we propose fusion and selection penalties, which are given in the following.
Clustering and selection of intercepts

For the object-specific intercepts, clustering and selection are obtained by penalizing all pairwise differences,

P_1(β_10, ..., β_m0) = ∑_{r<s} |β_r0 − β_s0|.

For λ → ∞ one obtains one big cluster with β_10 = ··· = β_m0. Due to the symmetric side constraint, that means that β_10 = ··· = β_m0 = 0 and all intercept parameters are eliminated from the model. For an appropriately chosen tuning parameter λ, ideally the clusters of objects that share equal intercepts are found. If no covariates are included, the intercepts are equivalent to the total strengths of the objects. Therefore, the penalty is a tool to identify the objects that really have to be distinguished. If explanatory variables are included, the interpretation of the clusters refers to the strength or attractiveness left after accounting for the effects of the variables.
For pure object-specific covariates the attractiveness has the form γ_ir = β_r0 + z_r^T τ. In this case, the penalty provides a way to obtain unique parameter estimates despite the non-identifiability of the model. The penalty serves to find out how much of the attractiveness of an object is due to observed characteristics of the objects. When used in addition to the regular side constraint ∑_{r=1}^m β_rj = 0, it restricts the intercepts and, therefore, the individual attractiveness that is left if one accounts for the object-specific covariates. If the tuning parameter gets large, λ → ∞, all strength parameters β_r0 are estimated as identical and the total strength is determined solely by z_r^T τ. The corresponding model assumes that the abilities depend on explanatory variables only. This restricted model was proposed by Springall (1973); however, it is hardly appropriate when only a limited number of explanatory variables is available. Package BTLLasso can also be used to estimate the ranking lasso as proposed by Masarotto and Varin (2012). This is obtained by applying the penalty P_1 described above to a setting without additional covariates.
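The effect of the tuning parameter on the intercept fusion can be demonstrated with two objects, where the penalty reduces to λ|γ_1 − γ_2|. The following self-contained sketch (toy data and a crude grid search, not the package's estimation algorithm) shows that the fitted strength difference shrinks to exactly zero once λ is large enough:

```python
import math

# Toy data: object 1 beat object 2 in 7 of 10 binary comparisons.
WINS, N = 7, 10

def penalized(d, lam):
    """Profile penalized log-likelihood in the strength difference d = gamma_1 - gamma_2."""
    p = 1.0 / (1.0 + math.exp(-d))
    return WINS * math.log(p) + (N - WINS) * math.log(1.0 - p) - lam * abs(d)

def fitted_difference(lam):
    grid = [i / 1000.0 for i in range(-3000, 3001)]  # crude grid search on [-3, 3]
    return max(grid, key=lambda d: penalized(d, lam))
```

For λ = 0 the maximizer is the ML estimate log(7/3) ≈ 0.85; for large λ the two strengths are fused (difference exactly zero), i.e., the objects fall into one cluster.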

Clustering and selection of variable effects
For object-specific effects, which are used for subject-specific covariates, the penalty P_2(β_1, ..., β_m) = ∑_{j=1}^{p_x} ∑_{r<s} |β_rj − β_sj| yields clusters of objects that share the same effect for the single variables (denoted by j). Let us consider the variable gender in the preference for parties. If β_rj is the same for all r ∈ R, comparisons between parties from a set of parties R do not depend on gender. However, if β_rj = β^(1)_j for r ∈ R^(1) and β_rj = β^(2)_j for r ∈ R^(2), gender distinguishes between the groups of parties R^(1) and R^(2). As in the case of the intercept parameters, side constraints like ∑_{r=1}^m β_rj = 0 for every covariate j are needed. If λ increases, successively all effects corresponding to single covariates are merged into one cluster and the corresponding covariates are eliminated from the model as all effects become zero. For an appropriately chosen λ, some of the covariates are deleted and, for the others, clusters of objects that share the same effect are identified.
For global effects, which are used for object-specific or subject-object-specific covariates, one uses the penalty P_3(τ) = ∑_j |τ_j|. It is a simple lasso penalty that restricts the absolute values of the parameters. For λ → ∞ all corresponding covariates are eliminated from the model.
For object-specific parameters that are used for subject-object-specific covariates, we use the modified penalty P_4(α_1, ..., α_m) = ∑_{j=1}^{p_1} ( ∑_{r<s} |α_rj − α_sj| + ν_1 ∑_{r=1}^m |α_rj| ), ν_1 ∈ {0, 1}. For this kind of parameterization, no further side constraints are needed. Therefore, the respective penalty is composed of two parts. The first part penalizes all absolute pairwise differences between the coefficients corresponding to one covariate; the second part (active for ν_1 = 1) penalizes the absolute values of all parameters. By specifying ν_1, the user has the option to decide whether only clustering of the objects with regard to the covariates is intended (ν_1 = 0) or whether, additionally, selection of the parameters should be enabled (ν_1 = 1). For λ → ∞, the specification of ν_1 leads to different results. While for ν_1 = 0 one ends up with global effects (equal for all objects but not zero), for ν_1 = 1 all respective covariates are eliminated from the model.
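To make the two parts of P_4 concrete, the following sketch evaluates the penalty for a small, hypothetical coefficient matrix (plain Python, illustrative only, not package code):

```python
from itertools import combinations

def penalty_P4(alpha, nu1=1):
    """P_4: per covariate, all pairwise differences of the object-specific
    coefficients plus (for nu1 = 1) their absolute values.
    alpha is a list of m rows (objects), each with p_1 entries (covariates)."""
    total = 0.0
    for col in zip(*alpha):  # iterate over covariates
        total += sum(abs(a - b) for a, b in combinations(col, 2))
        total += nu1 * sum(abs(a) for a in col)
    return total

# Three objects, one covariate: two coefficients fused, one distinct.
alpha = [[1.0], [1.0], [0.0]]
print(penalty_P4(alpha, nu1=0))  # pure clustering part: 0 + 1 + 1 = 2.0
print(penalty_P4(alpha, nu1=1))  # plus selection part |1|+|1|+|0| = 4.0
```

With ν_1 = 0 only the differences are shrunk (fusing the objects), while ν_1 = 1 additionally shrinks the fused value itself towards zero.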

Clustering and selection of order effects
The order effects are penalized by the terms P_5(δ) = |δ| and P_6(δ_1, ..., δ_m) = ∑_{r<s} |δ_r − δ_s| + ν_2 ∑_{r=1}^m |δ_r|, ν_2 ∈ {0, 1}, depending on whether a global order effect δ or an object-specific order effect δ_r is specified. The first penalty P_5 is used if one starts with one order effect that does not vary over objects; it only serves the purpose of restricting the size of the effect, which, in particular, may be zero. Penalty P_6 for object-specific order effects allows one to identify clusters of objects that share the same order effect. Additionally, the parameter ν_2 allows for a decision between solely clustering the order effects (ν_2 = 0) or additionally selecting the order effects (ν_2 = 1). With growing tuning parameter λ, for ν_2 = 0 one ends up with one global order effect while for ν_2 = 1 all order effects can be eliminated from the model.

Combining penalties
For the sake of simplicity the adaptive weights have been omitted in the presentation of the penalty terms. Finally, all different penalty terms are combined into the joint penalty P = λ ∑_l ψ_l P_l, where the ψ_l are penalty-specific weights. Although different tuning parameters for the different penalty terms are conceivable, the optimization of several tuning parameters would become overly complicated. Therefore, all penalty terms are combined into one joint penalty controlled by the joint tuning parameter λ. In order to allow for comparability of the different penalty terms, two requirements have to be fulfilled. First, all covariates have to be transformed to comparable scales. For that purpose,
• the subject-specific covariates are scaled so that each vector x_.1, ..., x_.p_x referring to one covariate across all subjects has variance 1;
• the subject-object-specific covariates are scaled so that each vector z_..1, ..., z_..p_1 referring to one covariate across all subjects and all objects has variance 1;
• the object-specific covariates are scaled so that each covariate vector z_.1, ..., z_.p_2 referring to one covariate across all objects has variance 1.
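The variance-1 scaling of a single covariate vector can be sketched as follows (a minimal Python illustration; the package may use a different variance convention than the sample variance assumed here):

```python
def scale_to_unit_variance(x):
    """Divide a covariate vector by its sample standard deviation so that
    the scaled vector has variance 1 (illustrative sketch, not package code)."""
    n = len(x)
    mean = sum(x) / n
    sd = (sum((v - mean) ** 2 for v in x) / (n - 1)) ** 0.5
    return [v / sd for v in x]

x_scaled = scale_to_unit_variance([2.0, 4.0, 6.0, 8.0])
# x_scaled now has sample variance 1, so its penalty contribution is
# comparable to that of any other scaled covariate.
```

After this transformation the sizes of the penalized coefficients, and hence the penalty contributions, are comparable across covariates.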
By default, the scaling listed above is executed automatically and can be controlled by the argument scale in ctrl.BTLLasso(). It should be noted that all print() and plot() commands include the option rescale to specify whether the estimates should be rescaled to the original scale or not. While the rescaled estimates can be interpreted directly with respect to the corresponding covariates, only the scaled parameters can be compared to each other in terms of effect sizes.
Second, the different penalty terms have to be weighted with the weights ψ_l according to the number of penalty terms and the number of free parameters they include, following the principle of weighting penalties proposed by Bondell and Reich (2009) and Oelker and Tutz (2017). The weights are applied if weight.penalties = TRUE in ctrl.BTLLasso() and are multiplied with the (adaptive lasso) weights if adaptive = TRUE. The optimal tuning parameter is chosen by k-fold cross-validation, which is described in more detail in the following section.

Implementation
In package BTLLasso, penalized Fisher scoring is used to fit the proposed models. Generally, all implemented penalties listed in Table 3 are L_1 penalties and, therefore, not differentiable. However, following the suggestions of Fan and Li (2001) and Oelker and Tutz (2017), L_1 penalties can be approximated by quadratic terms. Quadratic penalty terms are differentiable and can, therefore, easily be incorporated into a (penalized) Fisher scoring algorithm. A first implementation of quadratically approximated L_1 penalties can be found in the R package gvcm.cat (Oelker 2015), but that package does not contain an implementation for cumulative logit models. For shorter computation times, the fitting algorithm itself is implemented in C++ and integrated into R using the packages Rcpp (Eddelbuettel and François 2011; Eddelbuettel 2013) and RcppArmadillo (Eddelbuettel and Sanderson 2014).
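The idea behind the quadratic approximation can be illustrated with the common smooth surrogate |x| ≈ √(x² + c) for a small constant c (a generic sketch of the approach of Fan and Li (2001) and Oelker and Tutz (2017); the exact approximation used in BTLLasso may differ):

```python
import math

def abs_approx(x, c=1e-8):
    """Differentiable approximation of |x|; its derivative
    x / sqrt(x^2 + c) exists everywhere, including at x = 0."""
    return math.sqrt(x * x + c)

print(abs_approx(0.0))  # 1e-4: smooth at the kink of |x|
print(abs_approx(2.0))  # ~2.0: close to |x| away from zero
```

Because the surrogate is twice differentiable, the penalized score and information can be evaluated as usual, and the approximation error vanishes as c → 0.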

Use of package BTLLasso
The main fitting functions are BTLLasso() and cv.BTLLasso(), which carry out the penalized estimation of the specified model for a pre-specified grid of tuning parameters λ. The grid of tuning parameters can be specified via the argument lambda or, if lambda = NULL, is created automatically so that the maximal value of λ sets all penalized values or differences to exactly zero. Compared to BTLLasso(), cv.BTLLasso() additionally performs cross-validation to find the optimal value of the tuning parameter. The function ctrl.BTLLasso() provides many different options to control the model fit.
Of course, simple Bradley-Terry models without covariates as described in Section 2 can also be computed with package BTLLasso. Therefore, package BTLLasso can also be a useful tool for the computation of Bradley-Terry models with ordinal response or for models with (possibly object-specific) order effects. For that purpose, all arguments to supply covariates (X, Z1 and Z2) and all penalties can be ignored.

Argument  Covariate type                                     Effect type      Parameter  Penalty
X         subject-specific x_i                               object-specific  β_rj       P_2
Z1        subject-object-specific z_ir                       object-specific  α_rj       P_4
Z2        object-specific z_r / subject-object-specific z_ir global           τ_j        P_3

Table 4: Overview of the three possible arguments to supply covariates in BTLLasso.

Model specification
In the following, it is described which functions and arguments are needed to specify a paired comparison model with package BTLLasso. More details and illustrations can be found in Section 6 where specific examples based on exemplary data sets are presented.

Response
Both in BTLLasso() and cv.BTLLasso(), the response (i.e., the paired comparisons) is specified using the argument Y. To create a response object the function response.BTLLasso() is used where one needs to declare the arguments response, first.object, second.object and (possibly) subject. Only if a subject is specified, the paired comparisons can be related to subject-specific or subject-object-specific covariates. To facilitate the compatibility of the package with the psychotree package, response.BTLLasso() can also take a 'paircomp' object (the standard object class used in psychotree) as its only argument (argument response). Then, all necessary information is extracted automatically and stored in a 'response.BTLLasso' object. The argument order.vec is explained in the paragraph on order effects.

Covariates
In the functions BTLLasso() and cv.BTLLasso(), three different arguments are provided to specify all covariates that can be incorporated in a model, namely the arguments X, Z1 and Z2. The argument X contains all subject-specific covariates x_i; the number of subject-specific covariates (the number of columns of X) is denoted by p_x. In contrast, Z1 and Z2 contain (subject-)object-specific covariates. In particular, Z2 contains all object-specific covariates z_r or subject-object-specific covariates z_ir which are modeled with global parameters τ. In total, Z2 contains p_2 covariates. Z1 contains p_1 subject-object-specific covariates z_ir modeled with object-specific parameters α_r. Table 4 contains an overview of all possible assignments of covariates.

Order effects
In principle, the order effect (home advantage in sport events) could also be considered to be a special case of a subject-object-specific covariate. In the example of Bundesliga matches, the teams are considered to be the objects and the match days are considered to be the subjects. Therefore, one can create a subject-object-specific covariate z_ir representing the home effect, either using effect coding or dummy coding. With effect coding, one specifies z_ir = 1 if team a_r is the home team on match day i and z_is = −1 for the away team a_s; for all other teams one sets z_it = 0, t ∉ {r, s}. With dummy coding, one specifies z_ir = 1 if team a_r is the home team on match day i and z_it = 0, t ≠ r, otherwise. As subject-object-specific covariates, order effects can either be parameterized with a global parameter δ (global home effect) or with team-specific effects δ_r (team-specific home effects). Technically, order effects could be integrated into the modeling framework as subject-object-specific covariates. Nevertheless, they play a special role in the modeling of paired comparisons. Even if no characteristics of objects or subjects are available, one might want to include order effects when modeling the data because ignoring the presence of an order effect may lead to a severe bias in the estimates. Then one implicitly uses subject-object-specific covariates, maybe without being aware of using this type of covariate.
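The two coding schemes can be sketched as follows (plain Python with a hypothetical helper, teams indexed 0, ..., m−1; this is not a BTLLasso function):

```python
def home_covariate(m, home, away, effect_coding=True):
    """Build the home-advantage covariate for one match: with effect coding
    the home team gets +1 and the away team -1; with dummy coding only the
    home team gets 1. All remaining teams get 0."""
    z = [0.0] * m
    z[home] = 1.0
    if effect_coding:
        z[away] = -1.0
    return z

print(home_covariate(4, home=0, away=2, effect_coding=True))
print(home_covariate(4, home=0, away=2, effect_coding=False))
```

With effect coding the home advantage is split symmetrically between the two competitors, whereas dummy coding attributes it entirely to the home team.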
In package BTLLasso, order effects are already implemented and can be applied using the arguments order.effect and object.order.effect from the ctrl.BTLLasso() function for global and object-specific order effects, respectively. In the case of object-specific order effects, the option order.center = TRUE allows for the use of effect coding instead of dummy coding in the design matrix. In specific cases it might occur that an order effect is present only in some of the paired comparisons while being absent in the other paired comparisons. For example, in the UEFA Champions League all matches are played in the home stadium of one of the teams with the exception of the final match (which is played on neutral ground). To account for such a case, the function response.BTLLasso() contains the argument with.order. This argument can be used to specify a Boolean vector (of the same length as the response) that indicates which paired comparisons have an order effect.

Intercepts
By default, object-specific intercepts are included in the model. To eliminate the intercepts, the option include.intercepts = FALSE from ctrl.BTLLasso() can be used.

Specification of penalties
In Section 4.1 several possible penalties were proposed which can be applied to reduce the complexity of the model and to make the model easier to interpret. All these penalties can be applied to the specified model using specific options within ctrl.BTLLasso(). In particular, the user can specify the following arguments:
• penalize.intercepts: (de-)activates penalty P_1 for the intercepts.
• penalize.X: (de-)activates penalty P_2 for the effects of subject-specific covariates.
• penalize.Z2: (de-)activates penalty P_3 for the global effects of the covariates in Z2.
• penalize.Z1.diffs and penalize.Z1.absolute: (de-)activate the difference part and the absolute-value part (ν_1 = 1) of penalty P_4, respectively.
• penalize.order.effect.absolute and penalize.order.effect.diffs: (de-)activate penalty P_5 (or the absolute-value part of P_6, ν_2 = 1) and the difference part of P_6, respectively.
The arguments can either be set to TRUE or FALSE and, thereby, (de-)activate the respective penalties. The arguments penalize.X, penalize.Z1.diffs, penalize.Z1.absolute and penalize.Z2 can also be used differently. Instead of setting them to TRUE or FALSE, they can also take character vectors as input. Then, the character vectors contain the variable names corresponding to the parameters that should be penalized. As an example, we look at the GLES data where X consists of the three variables Age, Gender and Abitur. Then, specifying penalize.X = c("Age", "Gender") prevents the parameters corresponding to Abitur from being penalized. The arguments penalize.X = c("Age", "Gender", "Abitur") and penalize.X = TRUE are equivalent.

Cross-validation
To perform cross-validation of the specified model the function cv.BTLLasso() is used. By default, this is done as a 10-fold cross-validation (folds = 10) and can be parallelized on several cores using the argument cores. One can choose between two different options for the cross-validation criterion, namely the ranked probability score (RPS) and the deviance. For ordinal response y ∈ {1, ..., K} the RPS (Gneiting and Raftery 2007) is recommended; it can be denoted by RPS = ∑_{k=1}^{K−1} (π(k) − I(y ≤ k))², where π(k) represents the cumulative probability π(k) = P(y ≤ k) and I(·) denotes the indicator function. In contrast to the deviance, the RPS takes advantage of the ordinal structure of the response.
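As a sketch, the RPS for one observation can be computed from the predicted category probabilities like this (illustrative Python, not package code; the k = K term is zero whenever the probabilities sum to one):

```python
def rps(probs, y):
    """Ranked probability score for ordinal y in {1, ..., K} given
    predicted category probabilities probs = (p_1, ..., p_K)."""
    cum, score = 0.0, 0.0
    for k, p_k in enumerate(probs, start=1):
        cum += p_k                                   # pi(k) = P(Y <= k)
        score += (cum - (1.0 if y <= k else 0.0)) ** 2
    return score

# A forecast concentrated on the true category is not penalized at all ...
print(rps([0.0, 1.0, 0.0], y=2))  # 0.0
# ... and, unlike the deviance, the RPS punishes mass on distant
# categories more than mass on adjacent ones:
print(rps([0.0, 1.0, 0.0], y=3))  # 1.0
print(rps([1.0, 0.0, 0.0], y=3))  # 2.0
```

The last two calls show why the RPS respects the ordinal structure: misplacing probability mass two categories away from the truth costs more than misplacing it by one category.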
The advantage of cross-validation over information-based selection criteria is that it avoids the problem of determining the effective degrees of freedom. However, in specific data settings, in particular if no covariates are involved, it might be necessary to adapt the number of folds. If one fits, for example, a classical Bradley-Terry model, in which only strength parameters are estimated, employing cross-validation may be problematic since one might remove all comparisons in which a specific object is involved. Then the strength parameter of this object cannot be estimated. Let us consider in more detail the case of a Bradley-Terry model with order effects in a complete design. Then the number of pairwise comparisons is m(m − 1) and each object occurs in 2(m − 1) paired comparisons. If one uses 10-fold cross-validation, m(m − 1)/10 paired comparisons are left out. Then the critical case, in which the strength parameter of an object cannot be estimated, can occur if m(m − 1)/10 > 2(m − 1), or equivalently, m/10 > 2. For small to moderate m this will not happen. For example, if m = 10 one has 90 paired comparisons and each object is involved in 18 comparisons. When using 10-fold cross-validation one leaves out 9 paired comparisons, which means that for each object at least 9 paired comparisons are left. However, if m is large and therefore m/10 > 2 holds, not all parameters might be estimable. If, for example, m = 30 there is some danger that some parameters cannot be estimated. Then one has to increase the number of folds. The problem occurs in particular for large m, which is the rare case in paired comparison models. Nevertheless, an error message is returned whenever one of the parameters is not estimable. Then, the user is asked to use other starting values or to increase the number of folds for cross-validation.
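The fold-size argument above can be checked directly (a small Python sketch of the arithmetic, not package code):

```python
def fold_is_safe(m, folds=10):
    """In a complete design with order effects there are m*(m-1) comparisons
    and each object appears in 2*(m-1) of them. A left-out fold of size
    m*(m-1)/folds can contain all comparisons of one object once
    m*(m-1)/folds > 2*(m-1), i.e. once m > 2*folds."""
    return m * (m - 1) / folds <= 2 * (m - 1)

print(fold_is_safe(10))            # True: fold size 9 vs. 18 comparisons per object
print(fold_is_safe(30))            # False: fold size 87 vs. 58 comparisons per object
print(fold_is_safe(30, folds=20))  # True again with more folds
```

The last call illustrates the remedy recommended in the text: increasing the number of folds shrinks the left-out fold below the number of comparisons per object.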

Bootstrap intervals
To obtain bootstrap intervals for the parameters, a bootstrap method is implemented in the function boot.BTLLasso(). In each of the B bootstrap iterations the method is applied to a data set randomly sampled from the original data set. In every iteration the optimal tuning parameter is determined separately by cross-validation to fully account for the additional variability induced by the cross-validation. Therefore, the bootstrap procedure can be very time-consuming, although parallelization is possible using the argument cores. In general, the model generated by cv.BTLLasso() is supplied to boot.BTLLasso(). To speed up the computation of the bootstrap, a new (shorter) vector of tuning parameters lambda can additionally be supplied. Finally, from the bootstrap estimates the requested empirical quantiles are calculated and bootstrap intervals can be obtained and visualized by a print() and a plot() method.
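The last step, turning bootstrap estimates into intervals, follows the usual percentile recipe (generic Python sketch; the exact quantile convention used by boot.BTLLasso() may differ):

```python
def percentile_interval(estimates, level=0.95):
    """Empirical percentile interval from B bootstrap estimates of one
    parameter, using a simple order-statistic quantile."""
    s = sorted(estimates)
    tail = (1.0 - level) / 2.0

    def quantile(q):
        return s[min(int(q * len(s)), len(s) - 1)]

    return quantile(tail), quantile(1.0 - tail)

# With estimates 1, ..., 100 the 95% interval cuts off both 2.5% tails.
print(percentile_interval(list(range(1, 101))))  # (3, 98)
```

Each parameter gets its own interval from its B bootstrap estimates, which is what the print() and plot() methods then report and draw.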

Visualization and supportive methods
In addition to the aforementioned plot() method for empirical bootstrap intervals, two further functions for the visualization of the results are provided. The main visualization tool is the plot() method for 'BTLLasso' and 'cv.BTLLasso' objects. This method plots the coefficient paths of every single estimate produced by BTLLasso() or cv.BTLLasso() along the tuning parameter. With plot(), separate plots are generated for every covariate (and for intercepts and order effects). These plots are especially helpful to investigate the clustering behavior of the penalties along the tuning parameter and to investigate different covariate effects across objects. In contrast, the second visualization function, called paths(), generates one path per covariate and, therefore, only a single plot containing all paths. The paths visualize either the penalty terms for the single model components (y.axis = "penalty") or the L_2 norm (y.axis = "L2") of the vector of all respective parameters, e.g., the L_2 norm of all parameters corresponding to one covariate. This type of visualization helps to compare effect sizes or variable importance between different variables or model components. If a cross-validated model is supplied instead of the regular model, the optimal model according to cross-validation is additionally highlighted. Both visualization techniques will be illustrated in the following section.
In addition to these visualization methods, the package also provides several further supportive methods to analyze and process the results of the main functions of the package. First of all, for all objects generated by boot.BTLLasso(), BTLLasso() or cv.BTLLasso(), print() methods exist to print the most important results of the respective objects to the console. Besides that, one can apply coef() and logLik() to 'cv.BTLLasso' objects. A very helpful method is predict(), which can be applied both to 'BTLLasso' and 'cv.BTLLasso' objects. The argument type can be specified as type = "link" to return the linear predictors, type = "response" to return probabilities for the single response categories, and type = "trait" to return the traits/abilities separately for both competing objects. predict() can be applied both to new data and to the original data used to estimate the model.

Applications
In the following section, the presented methods are applied to the exemplary data sets of the German election study and the German Bundesliga as described in Section 3.1.

Election data
In the election data, both subject-specific and subject-object-specific covariates are available.
In contrast to Schauberger and Tutz (2017), where only subject-specific covariates are considered, here both types of covariates are included in the model simultaneously. We choose to allow for object-specific effects for the subject-object-specific covariates SocEc, Immigration and Climate, which have to be supplied to cv.BTLLasso() via the Z1 argument. The subject-specific covariates Age, Gender and Abitur are supplied via X. The optimal cross-validated model is then fitted with a call of the form mcv_GLES <- cv.BTLLasso(Y = Y, X = X, Z1 = Z1). For our model, we wish to penalize both the covariates included in X and Z1. Furthermore, no order effect is necessary as the order of the parties is random in this application. These model assumptions match the default settings in ctrl.BTLLasso() and, therefore, no changes have to be applied to the control argument.
Subsequently, the estimated parameter paths can be plotted using the commands

R> plot(mcv_GLES, which = 5:7, plots_per_page = 3)
R> plot(mcv_GLES, which = 2:4, plots_per_page = 3)

separately for all covariates as depicted in Figure 1. Here, plots_per_page specifies the number of plots placed on one page. The default is plots_per_page = 1, where each model component is placed on a separate page; in an interactive R session each page is switched by hand. The which argument specifies the model components to plot; we omitted here the first component, which contains the (unpenalized) intercepts. In Figure 1, the optimal model according to the 10-fold cross-validation is highlighted by the red dashed line.
The parameter paths are drawn along the tuning parameter or, rather, log(λ + 1). At first sight, a clear difference between the subject-object-specific covariates in Figure 1a and the subject-specific covariates in Figure 1b becomes obvious. Apart from the restriction implied by the penalty, the coefficients in Figure 1a are not restricted, while the coefficients in Figure 1b have to sum up to zero for every covariate. Therefore, the estimates for subject-object-specific covariates like the attitude towards climate change can be different from zero but still equal for all parties, which is not possible (and not sensible) for variables like Gender or Age.
It can be seen that (according to the red dashed line indicating the optimal model according to cross-validation) a rather non-sparse model is chosen. Most of the possible covariate effects differ across parties. For the subject-object-specific variables, all effects are negative. This result is very intuitive; it simply states that for a higher (self-perceived) distance between voter and party on a specific issue the attractiveness of the party for the voter decreases. For the socio-economic dimension, three clusters of parties are distinguished, namely {FDP, SPD, Greens}, {CDU/CSU} and {Left Party}. In this case a cluster means that for these parties the distance between party and voter is equally important; it is most important for the attractiveness of the Left Party and less important for the attractiveness of FDP, SPD and Greens. Comparing the effect sizes, the issue of immigration has the lowest effect. For the climate issue, the differences between the parties are rather high. Also for the subject-specific covariates rather many different clusters are found for the single covariates, four clusters for Age and Abitur and three clusters for Gender. For example, the attractiveness of CDU/CSU increases with growing age while the attractiveness of Greens and Left Party decreases. Figure 2 is created by the paths() function and shows the L_2 norms of all covariates along the tuning parameter. The L_2 norm measures the size of each vector which collects all estimates corresponding to one covariate. Therefore, it can be seen as a measure of variable importance as long as the estimates for the scaled covariates are used. It can be seen that the subject-object-specific covariates representing the different political dimensions are more important than the subject-specific variables Age, Gender and Abitur.
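The L_2 norm used as the variable-importance measure here is simply the Euclidean length of the coefficient vector belonging to one covariate (illustrative Python, not package code):

```python
def l2_norm(coefs):
    """Euclidean norm of all estimates belonging to one covariate."""
    return sum(c * c for c in coefs) ** 0.5

# A covariate whose party-specific effects are all shrunk to zero has
# norm 0 and drops out of the importance comparison.
print(l2_norm([0.0, 0.0, 0.0, 0.0, 0.0]))  # 0.0
print(l2_norm([3.0, 4.0]))                 # 5.0
```

Comparing these norms across covariates is only meaningful for the scaled covariates, as noted above.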
To conclude the analysis of the election data, we present how to generate empirical bootstrap intervals. The bootstrap estimation is executed with boot.BTLLasso(), whose main argument is the object generated from cross-validation. Furthermore, B specifies the number of bootstrap samples and cores specifies the number of cores to use. The commands

R> plot(mboot_GLES, which = 5:7, plots_per_page = 3)
R> plot(mboot_GLES, which = 2:4, plots_per_page = 3)

are used to visualize the empirical bootstrap intervals which are shown in Figure 3.
In general, the bootstrap intervals illustrate the variability of the single estimates and provide hints as to whether single estimates are distinctly different. For example, the bootstrap intervals for Age show that for this variable distinct differences exist between the single parties, while this does not seem to be the case for Gender.

Football data
For the data from the German Bundesliga, two different models are fitted for illustration. In the first model, the subject-object-specific variables Distance and BallPossession are used to investigate if these variables influence the strength of the different football teams. In the second model, the influence of the object-specific variable market value is investigated.
The first model contains four different model components in total, namely team-specific home effects (i.e., object-specific order effects), intercepts and team-specific effects for both subject-object-specific variables Distance and BallPossession. The model is computed using the following code:

R> ctrl.buli <- ctrl.BTLLasso(object.order.effect = TRUE,
+   name.order = "Home", penalize.order.effect.diffs = TRUE,
+   penalize.order.effect.absolute = FALSE, order.center = TRUE)
R> set.seed(1860)
R> mcv_buli <- cv.BTLLasso(Y = Y, Z1 = Z1, control = ctrl.buli)

The decisive part is the definition of the control argument using ctrl.BTLLasso(). In football matches, unarguably a home effect is present, which is equivalent to the order effect in paired comparisons because the first-named team is the home team. As we do not doubt the general presence of a home effect, the absolute values of the home effects of the single teams are not penalized (penalize.order.effect.absolute = FALSE). Only the differences between the single home effects are penalized (penalize.order.effect.diffs = TRUE) to find out whether all teams share the same home effect or if (clusters of) single teams have distinct home effects. In addition, the covariates in Z1 are centered for a better interpretability of the intercepts (which are not penalized). Figure 4 shows the coefficient paths for the four model components along log(λ + 1) created by plot(). It can be seen that the optimal model in this case is very sparse. All teams share the same home effect. Furthermore, all teams share the same, quite strong positive effect of Distance. That means that a strong running performance is strongly associated with a better performance of a team. For BallPossession, most teams (except for Hertha BSC Berlin) share the same negative effect. Therefore, higher values of BallPossession are associated with rather worse results. Overall, the path plots illustrate the behavior of the penalty terms.
Both for the home effects and for the effects for Distance and BallPossession the penalty forms clusters of teams with equal effects. With decreasing values of the tuning parameter, the clusters are decomposed into smaller clusters ending up with singleton clusters for each team.
The second model contains only three model components, namely the home effects, the intercepts and the effect of the object-specific covariate market value. The model is computed using the following code:

R> data("Buli1516", package = "BTLLasso")
R> Y <- Buli1516$Y5
R> Z2 <- Buli1516$Z2
R> ctrl.buli2 <- ctrl.BTLLasso(object.order.effect = TRUE,
+   name.order = "Home", penalize.order.effect.diffs = TRUE,
+   penalize.order.effect.absolute = FALSE, order.center = TRUE,
+   penalize.intercepts = TRUE)
R> set.seed(1860)
R> mcv_buli2 <- cv.BTLLasso(Y = Y, Z2 = Z2, control = ctrl.buli2)

In contrast to the previous model, now the intercepts are also penalized due to the identifiability issues with object-specific covariates described in Section 3.2. Therefore, the question answered by the model is whether the market value (which is not penalized) is sufficient to explain the match outcomes or if, additionally, home effects and intercepts are necessary for (clusters of) single teams. Figure 5 shows the coefficient paths for this model created by plot(). Obviously, the market value has a positive effect on the performance of a team. Both for the home effects and for the intercepts, several clusters occur in the optimal model. For the home effects, seven different clusters evolve, with {MGB, WOB} and {DAR} having the best and the worst home effect, respectively. For the intercepts, five clusters are found, with Borussia Dortmund forming a cluster of its own with the highest value.

Summary
The R package BTLLasso presented in this work provides a general framework for the inclusion of different kinds of covariates in paired comparison models. Covariates can vary over subjects or objects or both over subjects and objects of paired comparisons. In addition, (possibly object-specific) order effects can be included. The paired comparisons can be both binary and ordinal. To obtain sparser models and easier interpretation, suitable penalty terms for all model components are provided. For all model components with object-specific effects, the absolute pairwise differences of the parameters can be penalized to force clustering of objects with respect to the corresponding model component. The package contains functions for cross-validation, bootstrap intervals and for the visualization of results. The package also provides exemplary data sets that illustrate the possible applications of BTLLasso.