Non-Standard Semiparametric Regression via BRugs

We provide several illustrations of Bayesian semiparametric regression analyses in the BRugs package. BRugs facilitates use of the BUGS inference engine from the R computing environment and allows analyses to be managed using scripts. The examples are chosen to represent an array of non-standard situations, for which mixed model software is not viable. The situations include: the response variable being outside of the one-parameter exponential family, data subject to missingness, data subject to measurement error and parameters entering the model via an index.


Introduction
Semiparametric regression is an enhancement of parametric regression that uses penalized spline basis functions to achieve greater flexibility. Many semiparametric regression models have useful formulations as hierarchical Bayesian models, with variance component parameters used to control the degrees of freedom of smooth functions; see, for example, Chapter 16 of Ruppert et al. (2003) and Brezger and Lang (2006). An immediate pay-off is that Markov chain Monte Carlo (MCMC) software for hierarchical Bayesian models can be used for fitting and inference. This fact has been exploited by, for example, Crainiceanu et al. (2005), Gurrin et al. (2005) and Zhao et al. (2006) in various semiparametric regression contexts via the WinBUGS Bayesian inference package (Lunn et al. 2000; Spiegelhalter et al. 2003b), a Microsoft Windows interface to the BUGS inference engine (Spiegelhalter et al. 2003a).
In this article we focus on non-standard semiparametric regression situations. By 'non-standard' we mean variants of semiparametric regression that fall outside the conventional set-up, in which the response distributions are in the one-parameter exponential family and all data are cleanly observed. Examples of non-standard situations include (1) over-dispersed count responses that warrant more flexible distributions, such as those in the negative binomial family, (2) predictors being subject to measurement error, in which case the semiparametric regression model is augmented by a measurement error model, and (3) the presence of an index inside a smooth function. Through a suite of examples we demonstrate how the BUGS inference engine is able to deal effectively with such non-standard situations.
We access BUGS using the package BRugs (Ligges et al. 2009) in the R computing environment (R Development Core Team 2010). Employment of BRugs has the advantage that an entire analysis can be managed using a single R script. Because R is used at the front-end and back-end of the analysis, one can take advantage of R's functionality for data input and pre-processing, as well as for summary and graphical display. The package R2WinBUGS (Sturtz et al. 2005) has capabilities similar to those of BRugs.
Currently, BRugs runs only on the Windows operating system and communicates with the version of BUGS known as OpenBUGS (Thomas et al. 2006). The BRugs package is not available from the Comprehensive R Archive Network (CRAN), but is located at the CRANextras repository: http://www.stats.ox.ac.uk/pub/RWin/src/contrib/.
The frequentist counterpart of this general approach to semiparametric regression involves mixed model representations of penalized splines. Ngo and Wand (2004) provided numerous examples of semiparametric analyses using the mixed model software modules PROC MIXED in SAS and lme() in S-PLUS. (The R version of lme() was not used in Ngo and Wand (2004), but the S-PLUS code given there requires only minor modification to run in R.) Unfortunately, conventional mixed model software does not cater for departures from standard situations. At the time of this writing, BUGS is essentially the only established software product that supports non-standard semiparametric regression analysis. Another possible contender is the alternative MCMC-based package rjags (Plummer 2003, 2009), although we have not explored its use. The software package BayesX (Brezger et al. 2005) also supports Bayesian semiparametric regression, but it is not able to handle some of the non-standard situations treated here.
An undercurrent of the use of BUGS is the embedding of non-standard semiparametric regression within a graphical models framework. See Wand (2009) for more details on this viewpoint of semiparametric regression, in which graphical models, and associated inference engines, have great potential for streamlining complicated analyses.
Section 2 lays down the infrastructure required for the rest of the paper. Each of the following sections deals with a specific non-standard semiparametric regression setting and an illustrative example. In Section 9 we address the issue of formal convergence diagnosis. Conclusions are summarized in Section 10.

Preparatory infrastructure
Before embarking on the examples we lay out the mathematical infrastructure on which they are based.

Distributional notation
The density function of a random vector x is denoted by [x]. The conditional density function of y given x is denoted by [y|x]. If, for 1 ≤ i ≤ n, y_i has distribution D_i and the y_i are independent, then we write y_i ind.∼ D_i. Table 1 lists the distributions used in this article, giving the density function in x and the abbreviation for each distribution.

Penalized splines
In many semiparametric regression contexts, nonparametric functional relationships are handled through the modelling mechanism

   f(x) = β_0 + β_1 x + Σ_{k=1}^K u_k z_k(x),   u_k | σ_u ind.∼ N(0, σ_u²).

Here z_1, ..., z_K are a set of spline basis functions. Our default is the z_k corresponding to O'Sullivan splines, as described in Wand and Ormerod (2008), which provide a close approximation to smoothing splines. In some of our examples the z_k need to be computed inside BUGS, and the computational procedures required for O'Sullivan splines (e.g. spectral decomposition) are not available there. For this reason, we also use simple truncated line basis functions: z_k(x) = (x − κ_k)_+ for some knot sequence κ_1, ..., κ_K, where x_+ ≡ max(x, 0). The knots are usually taken to be equally spaced over the range of the predictor. For most function estimation situations, including all of those in the examples of this article, K = 25 is an adequate number of knots.
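To make the truncated line construction concrete, the following R sketch (ours, not taken from the scripts accompanying this paper; the function name trLineBasis and the default K = 25 are our choices) builds the basis matrix with equally spaced interior knots:

   # Truncated line spline basis: z_k(x) = (x - kappa_k)_+ .
   # Knots are equally spaced over the range of x, excluding the endpoints.
   trLineBasis <- function(x, K = 25)
   {
      knots <- seq(min(x), max(x), length = K + 2)[-c(1, K + 2)]
      Z <- outer(x, knots, function(x, kappa) pmax(x - kappa, 0))
      return(Z)
   }

   x <- runif(100)
   Z <- trLineBasis(x)    # 100 x 25 basis matrix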
In penalized spline smoothing σ_u plays the role of a smoothing parameter. However, a more meaningful measure of the amount of smoothing is the effective degrees of freedom (Buja et al. 1989), sometimes shortened to degrees of freedom. For the Bayesian Gaussian response model

   y | β, u, σ_ε ∼ N(Cν, σ_ε² I_n),   ν ≡ (β_0, β_1, u_1, ..., u_K)ᵀ,

where C = [1  x_i  z_1(x_i) ··· z_K(x_i)]_{1≤i≤n} is the design matrix and I_d is the d × d identity matrix, an appropriate definition of degrees of freedom in this Bayesian context, by analogy with effective degrees of freedom in frequentist contexts, is

   df(σ_u, σ_ε) = tr{(CᵀC + (σ_ε²/σ_u²) D)⁻¹ CᵀC},   D ≡ diag(0, 0, 1, ..., 1).   (1)

For deviations from this simplest situation, such as those considered in Sections 4 and 6, the appropriate definition of effective degrees of freedom is somewhat thorny. However, we continue to use (1) as a reasonable measure of the amount of smoothing.
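As an illustration, (1) can be evaluated in R along the following lines (a sketch under our naming conventions; C is the design matrix defined above and K the number of spline basis functions):

   effectiveDF <- function(C, K, sigsq.eps, sigsq.u)
   {
      # D penalizes only the spline coefficients, not the linear part.
      D <- diag(c(0, 0, rep(1, K)))
      CTC <- crossprod(C)    # C^T C
      sum(diag(solve(CTC + (sigsq.eps/sigsq.u)*D, CTC)))
   }

Its value decreases monotonically from 2 + K (no smoothing) towards 2 (a linear fit) as σ_ε²/σ_u² increases.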

Standardization and default priors
As a general rule, we standardize all continuous variables of interest before commencing Bayesian analyses. This makes the priors scale invariant and can also lead to better behaviour of the MCMC. Standardization is especially important for the real data examples, where the measurements are recorded on several different scales. The examples involving simulated data are such that standardization makes little difference, so this step is omitted to keep the code simpler.
In each of the examples we do not have prior knowledge about the model parameters, so we use non-informative priors. The default prior for a fixed effects parameter vector β is a diffuse Gaussian distribution: β ∼ N(0, σ_β² I), with σ_β² = 10^8. The default prior for a standard deviation parameter σ is σ ∼ Half-Cauchy(A), with A = 25. This is consistent with recommendations given in Gelman (2006) for achieving non-informativeness for variance parameters. BUGS does not offer direct specification of Half-Cauchy distributions. We get around this by using the result: if a ∼ N(0, A²) and b ∼ N(0, 1) are independent, then |a/b| ∼ Half-Cauchy(A). It follows that we can obtain a Half-Cauchy(25) variate by dividing a N(0, 625) random variate by an independent standard normal one and taking the absolute value. This fact is exploited in the BUGS code in each of the examples. Since BUGS works with precision parameters τ = 1/σ², the N(0, 625) numerator is specified with precision 1/625 = 0.0016, and the resulting σ is converted back to a precision for use in the likelihood. The examples to follow use only these single hyperparameter choices (σ_β² = 10^8 and A = 25) for reasons of brevity. In practice, it is recommended that sensitivity checks be conducted on the hyperparameters.
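In BUGS model code this construction takes roughly the following form (a minimal fragment with node names of our choosing):

   numer ~ dnorm(0, 0.0016)       # N(0, 25^2) numerator; precision 1/625
   denom ~ dnorm(0, 1)            # standard normal denominator
   sigma <- abs(numer/denom)      # sigma ~ Half-Cauchy(25)
   tauSigma <- pow(sigma, -2)     # precision 1/sigma^2, as required by BUGS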

Markov chain Monte Carlo defaults
All Bayesian models are fitted using standardized versions of the continuous variables. Unless otherwise stated, each example uses a Markov chain Monte Carlo burn-in period of 5000 iterations, followed by a further 5000 iterations that are thinned by a factor of 5, resulting in samples of size 1000 being retained for inference.
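These defaults correspond to BRugs calls of roughly the following form (a sketch; the file names and monitored node names are placeholders). In BRugs, modelUpdate(n, thin = m) carries out n × m iterations and retains every m-th one:

   library(BRugs)
   modelCheck("model.txt")              # check BUGS model syntax
   modelData("data.txt")                # load the data
   modelCompile(numChains = 1)          # compile the model
   modelGenInits()                      # generate initial values
   modelUpdate(5000)                    # burn-in of 5000 iterations
   samplesSet(c("beta0", "sigEps"))     # set monitors (illustrative node names)
   modelUpdate(1000, thin = 5)          # 5000 further iterations, keeping every 5th
   beta0MCMC <- samplesSample("beta0")  # retained sample of size 1000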

Negative binomial additive model
In this example we revisit the data set and model given in Thurston et al. (2000). The response data consist of adduct counts (adductCount) for 77 former smokers in a lung cancer study (source: Wiencke et al. 1999). Four predictors are available; these include ageInit (the age of smoking initiation) and yearsSinceQuit (years since quitting smoking). As explained in Thurston et al. (2000), adductCount is over-dispersed, so a Poisson additive model is not realistic. They make a case for a model of the form

   adductCount_i | κ ind.∼ Negative-Binomial(μ_i, κ),   log(μ_i) = β_0 + f_1(x_1i) + f_2(x_2i) + f_3(x_3i) + f_4(x_4i),   (2)

where x_1i, ..., x_4i denote the four predictors and the f_j s are arbitrary smooth functions. Thurston et al. (2000) devised kernel methods to fit negative binomial additive models such as (2). Instead, we take a hierarchical Bayesian approach with penalized spline modelling for the f_j s (Section 2.2).
Section 2.3 describes the choice of priors for the fixed effects and standard deviation parameters. The prior for the shape parameter κ is motivated by the result

   y | g ∼ Poisson(μg), g ∼ Gamma(κ, κ)   implies   y ∼ Negative-Binomial(μ, κ).   (3)

The coefficient of variation of the latent gamma variable g is ω ≡ 1/√κ. This means that κ is the squared reciprocal of the standard-deviation-like parameter ω. Based on the advice in Gelman (2006), this suggests placing on ω the same Half-Cauchy(25) prior placed on the other standard deviation parameters. We settled on ω ∼ Half-Cauchy(25).
Even though BUGS has a command named dnegbin for specification of negative binomial distributions, it has restrictions, such as the shape parameter needing to be an integer. Therefore, we use the latent gamma random variable representation (3) to specify a negative binomial node.
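The relevant part of the BUGS model specification is then roughly as follows (our illustrative fragment; eta[i] stands for the additive predictor built from the penalized spline components, and omega arises from the ratio-of-normals construction of Section 2.3):

   for (i in 1:n)
   {
      g[i] ~ dgamma(kappa, kappa)      # latent Gamma(kappa, kappa) variable
      mu[i] <- exp(eta[i])*g[i]        # conditional Poisson mean
      y[i] ~ dpois(mu[i])              # marginally, y[i] ~ Negative-Binomial
   }
   kappa <- pow(omega, -2)             # kappa = 1/omega^2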
In Figure 2 we summarize the MCMC output, both graphically and numerically. The plots in the 'lag 1' column show each MCMC sample plotted against its lag-1 values, and those in the 'acf' column show the sample autocorrelation function of the MCMC sample. These two sets of plots allow quick visual appreciation of the 'stickiness' of the chains. The 95% credible interval for κ is (0.52, 0.97), which is consistent with over-dispersion (the Poisson distribution is the limiting case of the negative binomial as κ → ∞). The credible intervals for the four σ_u parameters are bounded away from zero, which is indicative of all four effects being non-linear.
In Section 9 we carry out some formal convergence diagnostics for the third σ_u parameter. Figure 3 shows the behaviour of the MCMC for the estimation of the f_j s at the quartiles of the corresponding predictor. All chains are seen to be well-behaved.
Recently, we became aware of the R package gamlss (Stasinopoulos and Rigby 2007). The gamlss documentation indicates support for negative binomial additive models, although we are yet to explore this in any depth.

Nonparametric regression with missingness
In this section we consider the simple nonparametric regression setting

   y_i = f(x_i) + ε_i,   ε_i ind.∼ N(0, σ_ε²),   (4)

for a smooth function f. Assume that the x_i s can be modelled as coming from a normal distribution with mean μ_x and variance σ_x², but are subject to missingness. An appropriate hierarchical Bayesian model for this situation is

   y_i | x_i, β_0, β_1, u, σ_ε ind.∼ N(β_0 + β_1 x_i + Σ_{k=1}^K u_k z_k(x_i), σ_ε²),
   x_i | μ_x, σ_x ind.∼ N(μ_x, σ_x²),   u_k | σ_u ind.∼ N(0, σ_u²),   (5)

with the default priors of Section 2.3 on β_0, β_1, σ_u, σ_ε, μ_x and σ_x. We illustrate BRugs fitting of (5) to simulated data, with 20% of the x_i s missing completely at random.
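The key point for BUGS is that missing x_i s are simply coded as NA in the data; because each x[i] is assigned a distribution, BUGS treats the missing values as unobserved stochastic nodes and imputes them within the MCMC. A fragment of the model code (ours, with illustrative node names) is:

   for (i in 1:n)
   {
      x[i] ~ dnorm(mu.x, tau.x)    # NA entries of x are imputed
      for (k in 1:K)
      {
         # truncated line basis computed inside BUGS,
         # since the missing x[i]s are stochastic
         Z[i,k] <- (x[i] - knot[k])*step(x[i] - knot[k])
      }
      mean.y[i] <- beta0 + beta1*x[i] + inprod(u[], Z[i,])
      y[i] ~ dnorm(mean.y[i], tau.eps)
   }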
The upper panels of Figure 5 summarize the MCMC output produced by BUGS for the nodes μ_x, σ_x and σ_ε, as well as for f evaluated at the quartiles of x. The true values (6), from which the data were simulated, are shown as vertical dashed lines in the posterior density plots. The middle panel monitors the effective degrees of freedom for the estimation of f. The chains are seen to be reasonably well-behaved. Lastly, we study some of the output for the unobserved x_i s, which we denote here by x_i^mis. Five components were chosen at random and their MCMC summaries are shown in Figure 6. Interestingly, the posterior densities of some of the x_i^mis s are multimodal. This arises from the periodic nature of the underlying signal: knowledge about the ordinate manifests in the posterior of each x_i^mis as two or three clumps of probability mass corresponding, roughly, to a horizontal slicing of f at that ordinate.
In this example we have stayed with the simplest semiparametric regression setting to elucidate the missing data aspects. More elaborate semiparametric regression models with missing data can be handled similarly.

Nonparametric regression with measurement error
In the previous section the x_i s in (5) were subject to missingness. Now suppose instead that they are subject to measurement error. Rather than observing x_i we observe

   w_i = x_i + z_i,   z_i ind.∼ N(0, σ_z²),   (7)

with the z_i independent of the x_i s. The contamination variance, σ_z², is assumed to be known. This is an instance of nonparametric regression with measurement error; Carroll et al. (2006) is a recent survey of this and related topics. A hierarchical Bayesian model for (4) and (7) is obtained from (5) by treating the unobserved x_i s as latent variables and adding the measurement error component:

   y_i | x_i, β_0, β_1, u, σ_ε ind.∼ N(β_0 + β_1 x_i + Σ_{k=1}^K u_k z_k(x_i), σ_ε²),
   w_i | x_i ind.∼ N(x_i, σ_z²),   x_i | μ_x, σ_x ind.∼ N(μ_x, σ_x²),   (8)

together with the default priors of Section 2.3. Models of type (8) were first formulated by Berry et al. (2002).
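In BUGS, model (8) differs from the missing data model of Section 4 mainly through the extra node for the contaminated observations (fragment ours; tau.z = 1/σ_z² is passed in as known data):

   for (i in 1:n)
   {
      x[i] ~ dnorm(mu.x, tau.x)    # unobserved true predictor
      w[i] ~ dnorm(x[i], tau.z)    # observed contaminated version, as in (7)
   }
   # tau.z is fixed, since the contamination variance is assumed known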
The upper panels of Figure 8 are the analogue of Figure 5 for the current measurement error example. Once again, the chains are seen to be reasonably well-behaved and true parameters are inside the 95% credible sets.
Even though BRugs and BUGS provide quite pleasing results for this example, these results come at a price in terms of computing time. While most of the other examples in this article take minutes to run on our computers, this example takes about one day.

Robust nonparametric regression via the t distribution
Robustification of regression methodology has a large literature, some of which is surveyed in Rousseeuw and Leroy (1987). An attractive model-based approach is to use the t distribution for the responses since, for low values of the degrees of freedom parameter, gross outliers occur with moderate probability. An early reference on t-distribution-based robust regression is Lange et al. (1989). Recently, Staudenmayer et al. (2009) described a penalized spline mixed model approach to nonparametric regression using the t distribution. They took a maximum likelihood approach, with an EM algorithm used for fitting. In this section we instead take a hierarchical Bayesian approach and show how BRugs can be used for effective fitting and inference.
Suppose that we observe regression data (x_i, y_i), 1 ≤ i ≤ n, but that the y_i s are subject to occasional outlying values. Then an appropriate model is

   y_i ind.∼ t(f(x_i), σ_ε, ν)

for a smooth function f, where t(μ, σ, ν) denotes the t distribution with location μ, scale σ and ν degrees of freedom. The BUGS command dt supports specification of t-distributed nodes, but not all three parameters can be treated stochastically. Instead we use the result

   y | g ∼ N(μ, σ²/g), g ∼ Gamma(ν/2, ν/2)   implies   y ∼ t(μ, σ, ν).

Following Verdinelli and Wasserman (1991), we put a Beta prior on the reciprocal of ν; specifically, υ ∼ Beta(1.75, 2.5), where υ = 1/ν.
Their justification for this prior is that it corresponds to a fair amount of uncertainty about ν, without placing too much probability near normality. The full Bayesian penalized spline model combines this t likelihood, with f modelled as in Section 2.2, with the priors σ_u ∼ Half-Cauchy(25), σ_ε ∼ Half-Cauchy(25) and υ ∼ Beta(1.75, 2.5), υ = 1/ν.
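A BUGS fragment implementing the scale mixture representation is given below (ours, illustrative; mean.y[i] denotes the penalized spline fit at x_i):

   for (i in 1:n)
   {
      g[i] ~ dgamma(halfNu, halfNu)        # g[i] ~ Gamma(nu/2, nu/2)
      tau.y[i] <- g[i]*tau.eps             # precision of y[i] given g[i]
      y[i] ~ dnorm(mean.y[i], tau.y[i])    # marginally, y[i] ~ t(mean.y[i], sig.eps, nu)
   }
   halfNu <- 0.5/upsilon                   # nu/2, with nu = 1/upsilon
   upsilon ~ dbeta(1.75, 2.5)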
Graphical assessment of the MCMC and inferential summaries are provided by Figure 10. Mixing is seen to be excellent for this example. In addition, the posterior density functions conform quite well with the true parameter values.

Generalized partially linear single index model
The generalized partially linear single index model was devised by Carroll et al. (1997) for flexible dependence of a categorical response variable on a single index. The binary response version of the model takes the general form

   logit{P(y_i = 1)} = f(αᵀx_i) + βᵀz_i,   (12)

where the y_i are the responses, the x_i are vectors containing measurements on a set of continuous predictors and the z_i contain other covariates. The function f is a smooth, but otherwise arbitrary, function of the single index αᵀx. For identifiability we impose the constraint ‖α‖ = 1. Carroll et al. (1997) applied a version of (12) to data on coronary heart disease (CHD) from the Framingham Heart Study (Kannel et al. 1986), with local polynomial smoothing used for the estimation of f. Here we take a hierarchical Bayesian approach with f estimated via penalized splines. The model is

   logit{P(y_i = 1)} = f(α_1 age_i + α_2 trBlood_i + α_3 logChol_i) + β smoker_i,

where age denotes the patient's age, trBlood = log(systolic blood pressure − 25), logChol is the logarithm of cholesterol level and smoker is an indicator of the patient being a smoker. As done by Carroll et al. (1997), we worked with versions of age, trBlood and logChol that were first linearly transformed to the unit interval.
We impose the restriction ‖α‖ = 1 by working with spherical coordinates:

   α = (cos φ, sin φ cos θ, sin φ sin θ),

which has unit norm for all values of φ and θ. The prior distributions for φ and θ are taken to be independent uniforms on (0, π) and (0, 2π), respectively. Since the single index depends on the parameter values, the basis functions for the estimation of f have to be computed inside BUGS. For this reason we worked with the truncated line model of Section 2.2, with priors β_0, β_1 ind.∼ N(0, σ_β²).
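The corresponding BUGS fragment (ours, illustrative; betaSmoker and knot are node names of our choosing) computes α from (φ, θ), forms the index and then builds the truncated line basis inside the model:

   phi ~ dunif(0, 3.14159)
   theta ~ dunif(0, 6.28319)
   alpha[1] <- cos(phi)
   alpha[2] <- sin(phi)*cos(theta)
   alpha[3] <- sin(phi)*sin(theta)        # ensures ||alpha|| = 1
   for (i in 1:n)
   {
      index[i] <- alpha[1]*age[i] + alpha[2]*trBlood[i] + alpha[3]*logChol[i]
      for (k in 1:K)
      {
         Z[i,k] <- (index[i] - knot[k])*step(index[i] - knot[k])
      }
      logit(p[i]) <- beta0 + beta1*index[i] + inprod(u[], Z[i,]) + betaSmoker*smoker[i]
      y[i] ~ dbern(p[i])
   }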
The estimated function f of the single index is plotted in Figure 12. It is monotonic, but has a pronounced non-linear shape. Carroll et al. (1997) obtained a similar estimate of f using a local polynomial approach.

Generalized additive model with measurement error
As was done in Section 5, we augment a semiparametric regression model with a measurement error model. In this case there are several covariates and the response variable is a count, so a generalized additive model is appropriate.
The data consist of daily measurements on weather, air pollution and mortality in Milan, Italy, over the period 1980-1989. The response variable is the total number of deaths (mort). Four predictor variables are available, the last of which may contain measurement error:
day: The sequential day number.
temp: The average temperature.
humid: The average relative humidity.
TSP: The natural logarithm of the measured total suspended particles.
As in the simulated example in Section 5, the TSP_i, for 1 ≤ i ≤ 3652, are taken to be unbiased but more variable versions of the true measures of total suspended particles (trueTSP) that we would have liked to have observed. That is,

   TSP_i | trueTSP_i ind.∼ N(trueTSP_i, σ_z²).

With the choice σ_z² = 0.1, Figure 13 shows the resulting fitted functions for model (13) applied to the Milan mortality data. As σ_z² increases, the slope of the fitted function in the lower right panel of Figure 13, depicting the relationship between the response mort and the estimated true measures of total suspended particles trueTSP, becomes steeper. The other fitted functions change very little with different values of σ_z². A graphical and numerical summary of the MCMC output is provided in Figure 14. The credible intervals for the three σ_u parameters are bounded away from zero, meaning that the three corresponding effects are non-linear. This is in agreement with the plots in Figure 13. Figure 15 shows the behaviour of the MCMC for the estimation of the f_j s at the three quartiles of the corresponding predictor, and for the estimation of the linear coefficient β of the estimated trueTSP at the three quartiles of the estimated trueTSP. All the chains can be seen to be well-behaved, indicating satisfactory convergence.
As was the case with the nonparametric regression with measurement error example of Section 5, these results take a considerable amount of computing time to produce. The analyst must be prepared to wait when the dataset is large, as for the Milan mortality dataset with its 3652 observations. On our computers, this example took about six days to run.

Formal convergence diagnostics
An issue that we have left aside until now is formal diagnosis of convergence. Consider, for example, the standard deviation parameter corresponding to f_3(yearsSinceQuit) in the negative binomial additive model example of Section 3. We denote this parameter by σ_u3. The trace plot for this parameter in Figure 2 shows some upwards movement in the chain towards the end. Does this mean that the chain has not yet converged? Or is this movement in keeping with the posterior distribution of σ_u3? We now address this question using some established convergence diagnostics.
Several methods have been proposed for diagnosing chain convergence. Perhaps the most popular are those developed in Gelman and Rubin (1992) and Brooks and Gelman (1998), which have become known as Brooks-Gelman-Rubin statistics. The essence of their approach is to run several chains and compare the combined-chain behaviour with the within-chain behaviour. Figure 16 illustrates the use of the Brooks-Gelman-Rubin statistic R_interval (Brooks and Gelman 1998), based on 3 chains for σ_u3 with different starting values.
Convergence, according to this diagnostic, is seen to occur after about 30000 iterations. Since we used a burn-in of 100000 iterations for all examples, Figure 16 provides some reassurance for the validity of the inference for σ_u3 given in Figure 2.
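Diagnostics of this type can be computed from BRugs output using, for example, the coda package (a sketch; the three vectors of σ_u3 draws are placeholders obtained from three separately initialized runs). BRugs also provides the samplesBgr() function for producing such plots directly from monitored nodes when multiple chains are run in one session.

   library(coda)
   # sigU3.run1, sigU3.run2, sigU3.run3: draws of sigma.u3 from three chains
   chains <- mcmc.list(mcmc(sigU3.run1), mcmc(sigU3.run2), mcmc(sigU3.run3))
   gelman.diag(chains)    # Gelman-Rubin potential scale reduction factor
   gelman.plot(chains)    # evolution of the diagnostic, akin to Figure 16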
A complete Bayesian/MCMC analysis requires multiple chain runs and graphical checks of the type presented in Figure 16 for all key parameters in the model.

Conclusions
We have demonstrated that complex semiparametric regression scenarios can be handled using the BUGS inference engine. When combined with BRugs, analysis and inference for a particular problem can be managed using a single computer script. Aside from the time factor, there is no firm barrier to the complexity that can be handled via this approach. We envisage the time factor becoming less of an issue as Bayesian inference technology and computing power continue to improve.

Disclaimer
Views expressed in this paper are those of the authors, and do not necessarily represent those of the Australian Bureau of Statistics. Where quoted, they should be attributed clearly to the authors.

A. R and BRugs scripts
Scripts, functions and data sets for running each of the examples accompany this paper (see Table 2).