The R Package tipsae: Tools for Mapping Proportions and Indicators on the Unit Interval

The tipsae package implements a set of small area estimation tools for mapping proportions and indicators defined on the unit interval. It provides small area models specified at the area level, including the classical beta regression, the zero- and/or one-inflated beta, and the flexible beta models, possibly accounting for spatial and/or temporal dependency structures. The models, developed within a Bayesian framework, are estimated through the Stan language, allowing fast estimation and customized parallel computing. Additional features of the tipsae package, such as diagnostics, visualization and exporting functions, as well as variance smoothing and benchmarking routines, support the user through the entire process of estimation, validation and outcome presentation. A shiny application with a user-friendly interface further eases the implementation of Bayesian models for small area analysis.


Introduction
The growing demand for timely and reliable statistical estimates leads to the extensive exploitation of survey data at an increasingly fine level of disaggregation. However, the domains or areas of study often differ from the ones for which the survey was originally planned, leading to possibly unreliable direct estimates due to observation-poor samples. Small area estimation (SAE) tackles this problem by providing a set of indirect estimation techniques that rely on external information, borrowing strength across areas and increasing the efficiency of the estimates (Rao and Molina 2015). Indirect estimators based on explicit regression models are labeled model-based estimators and assume a relationship between the target variable and the explanatory variables that remains constant across areas. Classical small area models embrace two basic linear mixed models: the Fay-Herriot model (Fay and Herriot 1979) and the Battese-Harter-Fuller model (Battese, Harter, and Fuller 1988), which are foundational for the strands of area-level and unit-level models, respectively. While the former relates area-specific target quantities to area covariates, the latter relates individual observations of the underlying variables of interest to individual covariates.
Hereafter, we focus on area-level models due to their practical convenience. In fact, they only require covariates aggregated at the area level and account for design-based properties, in contrast with unit-level models, which generally need auxiliary information available for the entire population. In area-level contexts, a well-established body of literature concentrates on Gaussian models. However, many quantities of interest have specific features that are not considered in the Gaussian setting and need to be accounted for, such as a bounded support and markedly skewed or heavy-tailed distributions. Specifically, we focus on unit interval responses, common in SAE modeling because of the growing need for releases of rates and proportions in official statistics, such as the head-count ratio for poverty mapping (Fabrizi, Ferrante, Pacei, and Trivisano 2011) or health insurance coverage rates (Bauder, Luery, and Szelepka 2015), not to mention other measures of interest defined in (0, 1) or [0, 1], such as some inequality measures (e.g., the Gini index).
In this regard, two different bodies of literature revolve around linear mixed models, possibly with suitable transformations (Rao and Molina 2015), and beta regression models (Janicki 2020). The first approach is widely used, as the Fay-Herriot model is often a good option when the response variable is not close to the boundaries and/or the auxiliary variables have strong predictive power. We recall the works by Marhuenda, Molina, and Morales (2013); Marhuenda, Morales, and del Carmen Pardo (2014); Morales, Pagliarella, and Salvatore (2015); Esteban, Morales, Pérez, and Santamaría (2012); and Esteban, Lombardía, López-Vizcaíno, Morales, and Pérez (2020). The second strand focuses on classical beta regression, both in the univariate case (Liu, Lahiri, and Kalton 2007; Bauder et al. 2015; Fabrizi and Trivisano 2016; Giovinazzi and Cocchi 2021) and in the multivariate one (Fabrizi et al. 2011; Souza and Moura 2016), considering also zero and/or one inflated extensions (Wieczorek, Nugent, and Hawala 2012; Fabrizi, Ferrante, and Trivisano 2016, 2020). Lastly, a beta mixture approach in SAE has been proposed by De Nicolò, Ferrante, and Pacei (2022b).
Considering the SAE field as a whole, there is a clear imbalance between the plethora of methodological proposals in the academic literature and the tight circle of methods actually used by official statistics institutes and applied researchers. A bridge-building process between the methodological and applied fields is needed, involving collaboration, dissemination, and the development of user-friendly tools that facilitate the most demanding steps. With the latter aim, several routines for SAE have been released by developer teams in R (R Core Team 2023), SAS (SAS Institute Inc. 2003), SPSS (IBM Corporation 2010), and Stata (Stata Corporation 2007). Our focus is on R routines for reasons of flexibility and availability, as well as for the rich set of complementary tools. Several R packages have been developed to implement SAE methods; in the following, we attempt to provide a clear overview focusing on model-based ones.
In general, the most complete released packages are:
• sae (Molina and Marhuenda 2015). It implements a wide range of small area methods from a frequentist perspective, including both area-level and unit-level models.
• mcmcsae (Boonstra 2024). It comprises hierarchical area-level and unit-level models estimated via Markov chain Monte Carlo (MCMC) simulation, allowing for spatial and temporal dependencies. It includes different prior settings, model diagnostics, and posterior predictive check functions.
Among existing packages, only the emdi package (Schmid, Bruckschen, Salvati, and Zbiranski 2017) directly accounts for unit interval responses at the area level, by providing the arcsine transformation in a Gaussian setting. Thus, while a Fay-Herriot model for unit interval responses may be implemented via existing packages, beta-based small area models lack a proper implementation.
Note that estimating beta mixed regression models is possible in R through other packages that, however, cannot easily accommodate the peculiarities of small area models, e.g., the assumption of a known dispersion parameter and popular structured random effects. Within the frequentist framework, the betareg package (Cribari-Neto and Zeileis 2010) is worth mentioning. On the other hand, the function stan_betareg() of the rstanarm package (Goodrich, Gabry, Ali, and Brilleman 2024) can be employed to fit beta models in the Bayesian setting. Lastly, the zoib package (Liu and Kong 2015) allows fitting Bayesian zero/one inflated beta models.
The tipsae package (De Nicolò and Gardini 2024) aims at filling this gap by implementing beta-based small area models specified at the area level for measures that can assume values in the (0, 1), [0, 1), (0, 1], and [0, 1] intervals. We decided to operate in a Bayesian fashion in order to exploit the advantages brought by approaching this inferential framework via MCMC methods. For instance, it is possible to easily manage non-Gaussian assumptions, incorporate structured random effects, obtain straightforward estimates for out-of-sample areas, and capture the uncertainty about all target parameters through posterior inference. Nowadays, several tools are available to implement Bayesian models with probabilistic languages: our choice falls on Stan (Carpenter et al. 2017), which can be easily employed to fit statistical models within R packages thanks to the tools provided by the rstantools package (Gabry, Goodrich, Lysy, and Johnson 2024).
The main features of the tipsae package are the following:
• It includes a variety of area-level models based on the beta likelihood. Besides the standard beta regression model, zero and/or one inflated beta (ZOIB) and flexible beta models can be chosen. Moreover, particular dependence structures can be modeled, including spatial and/or temporal random effects.
• It implements an efficient Hamiltonian Monte Carlo (HMC) fitting algorithm and customized parallel computing imported from rstan (Stan Development Team 2024). We also tested other languages that build MCMC samplers, and Stan turned out to be the most efficient one for beta regression models. Such models are particularly tricky to handle: the location and scale parameters are non-orthogonal (Ferrari and Cribari-Neto 2004), since the re-parametrization of the beta distribution introduces correlation between them.
• The 'stanfit' S4 object produced by the rstan package can be exploited to check convergence, monitor sampler diagnostics, and, lastly, perform an exhaustive posterior analysis, relying on existing tools such as the loo (Vehtari et al. 2024) and bayesplot (Gabry and Mahr 2024) packages. In this way, users familiar with posterior predictive checks can carefully assess the model performance.
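As a hedged sketch of this workflow: the slot holding the 'stanfit' object (here written fit$stanfit) and the parameter name "sigma_v" are assumptions for illustration, not the package's documented interface.

```r
# Hypothetical sketch: inspecting the underlying 'stanfit' object with
# rstan, bayesplot and loo (slot and parameter names are assumed)
rstan::check_hmc_diagnostics(fit$stanfit)      # divergences, treedepth, E-BFMI
bayesplot::mcmc_trace(as.array(fit$stanfit),   # trace plots for convergence
                      pars = "sigma_v")
loo::loo(loo::extract_log_lik(fit$stanfit))    # LOOIC, if log_lik is stored
```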
• Specific diagnostics for small area models are produced by ad-hoc functions addressing the most relevant aspects to investigate within the SAE framework. We implemented both visualization tools for graphical assessment and functions that easily export the final results. Moreover, variance smoothing routines and benchmarking procedures are also provided; to the best of our knowledge, the former tool is not available in any existing SAE package.
• To further facilitate the workflow for non-expert users of R, a shiny application (Chang et al. 2023) with an intuitive graphical user interface can be launched through runShiny_tipsae(). The application assists the user in carrying out a complete SAE analysis, exploiting all the main features of the tipsae package.
The paper is organized as follows: the covered models and implemented methodology are set out in Section 2, the datasets made available in the package are presented in Section 3, while Section 4 provides a step-by-step description of the inputs and outputs of the available functions. Section 5 outlines the features of the shiny application and, eventually, Section 6 contains some concluding remarks, discussing possible extensions.

Methodology
In this section, the theory behind the statistical methods implemented in the tipsae package is summarized. The main aspects are those related to the area-level models for indices and proportions that can be estimated using the function fit_sae().
From now on, we consider a finite population of size N that is partitioned into D small areas having sizes N_1, …, N_D. We are interested in estimating a generic measure defined on the unit interval, denoted as θ_d, d = 1, …, D. To this aim, a random sample of size n is drawn from the whole population using a possibly complex survey design, obtaining subsamples of sizes n_1, …, n_D, one for each domain. Among them, we define the first D̄ domains, with D̄ ≤ D, as the ones actually observed, i.e., with n_d > 0. The observations recorded at the individual level are aggregated to produce the direct estimates y_d, which are stored in the vector y and are the observed determinations of the direct estimator Y_d for the quantity of interest θ_d, with d = 1, …, D̄. The Bayesian area-level model is specified for Y_d, including also a set of auxiliary variables x_d, which are assumed to be available for each domain.
The details about the statistical models that can be set through the argument likelihood are discussed in Section 2.1. Furthermore, a small area model usually includes random effects in the linear predictor. The random effect part, hereafter indicated with e_d, can incorporate a temporal and/or a spatial dependency structure, as will be discussed in Section 2.2, devoted to the prior specification settings. In addition, different prior assumptions can be specified for the unstructured random effects, allowing for robust and shrinkage priors.
In small area models, the dispersion parameters are generally assumed as given and previously estimated from the data. Separate estimation could involve a smoothing procedure to refine the sampling variance estimates and reduce their errors. Section 2.3 describes the proposed algorithms to carry out this step if required. Eventually, Section 2.4 outlines the main aspects of posterior inference: we mainly focus on the out-of-sample treatment, the diagnostics and goodness-of-fit tools employed to validate or select the models and, lastly, the benchmarking procedures complementing an SAE analysis.

Area-level models: Likelihoods
The statistical models available in tipsae are set out in the following sections, whereas a comprehensive overview of the key quantities under each model is provided in Table 1. In particular, we specify the response support, the conditional expectation constituting the predictor for θ_d, the conditional variance, the allowed parametrizations, and the out-of-sample predictor (denoted with θ_d^oos). From now on, η indicates the vector of all the model parameters.

The beta model
Let us consider the mean-precision parametrization of the beta random variable (Ferrari and Cribari-Neto 2004): if Y ∼ Beta(µφ, (1 − µ)φ), then its probability density function is

f(y; µ, φ) = [Γ(φ) / (Γ(µφ) Γ((1 − µ)φ))] y^(µφ − 1) (1 − y)^((1 − µ)φ − 1),  y ∈ (0, 1),

where µ ∈ (0, 1) is the location parameter and φ ∈ (0, +∞) is the dispersion one. In the SAE context, the beta regression area-level model is usually specified as

Y_d | µ_d, φ_d ∼ Beta(µ_d φ_d, (1 − µ_d)φ_d),  logit(µ_d) = x_d⊤β + e_d,  d = 1, …, D̄,

where β is the vector of regression coefficients and φ_d is the area-specific dispersion parameter, usually assumed to be known to guarantee identifiability. Recalling the expression of V[Y_d | η] from Table 1, it can be shown that, when the target response is a proportion, the parameter φ_d is related to the effective sample size, i.e., the corresponding sample size under simple random sampling (Janicki 2020). For a more complete explanation of these aspects, we refer to the discussion in Section 2.3. On the other hand, if a generic indicator (e.g., the Gini index) is considered, the meaning of φ_d becomes less clear. For this reason, we let the user specify the model parametrization (argument type_disp), choosing between:
• the "neff" option, in which an estimate of the effective sample size φ_d + 1 is provided;
• the "var" option, in which an estimate of the sampling variance of the direct estimator, i.e., V[Y_d], is used. In this case, the parameters φ_d are retrieved using the relations in Table 1, replacing V[Y_d | η] with the variance estimate and substantially changing the model parametrization.
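Under the "var" option, the beta variance relation V[Y_d | η] = µ_d(1 − µ_d)/(φ_d + 1) can be inverted to recover the dispersion parameter. A minimal sketch, using hypothetical direct estimates as plug-in values for µ_d:

```r
# Hypothetical direct estimates and their sampling variances
y_d <- c(0.12, 0.34, 0.08)
V_d <- c(0.0015, 0.0030, 0.0009)

# Invert V[Y_d | eta] = mu_d (1 - mu_d) / (phi_d + 1), plugging y_d in for mu_d
phi_d <- y_d * (1 - y_d) / V_d - 1

# Under the "neff" parametrization, phi_d + 1 plays the role of the
# effective sample size
n_eff <- phi_d + 1
```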

The flexible beta model
When the distribution of the response is characterized by heavy tails and/or high skewness, the standard beta regression could fail in properly modeling Y_d (Bayes, Bazán, and García 2012; Migliorati, Di Brisco, and Ongaro 2018). To improve the model performance in these conditions, the standard beta distribution can be replaced by the flexible beta distribution. The flexible beta small area model has been proposed by De Nicolò et al. (2022b). It is defined as a mixture of two beta random variables having a common dispersion parameter φ_d:

Y_d | µ_1d, µ_2d, p, φ_d ∼ p Beta(µ_1d φ_d, (1 − µ_1d)φ_d) + (1 − p) Beta(µ_2d φ_d, (1 − µ_2d)φ_d),

where p is the mixing probability and µ_1d, µ_2d are the means of the mixture components. In this case, only the direct estimator variance (i.e., type_disp = "var") can be used as input to determine the dispersion parameter of the model.

The zero/one inflated beta model
The supports of the beta and flexible beta models do not include the extremes 0 and 1. However, in some applications, zero and one values are observed, and a model able to encompass them is required. Therefore, following Wieczorek et al. (2012), we include in the package the ZOIB model, specified as:

Y_d = 0 with probability p_d^z,  Y_d = 1 with probability p_d^o,
Y_d | µ_d, φ_d ∼ Beta(µ_d φ_d, (1 − µ_d)φ_d) with probability 1 − p_d^z − p_d^o,

where p_d^z and p_d^o denote the probabilities of observing zero and one values, respectively. They are modeled by means of logit regression models having coefficients β^z and β^o. The notation 1{A} defines the indicator function that assumes value 1 if the event A is observed, and 0 otherwise. The user can specify a model that accounts for both zeroes and ones by setting likelihood = "Infbeta01"; however, simpler versions inflating only the ones or the zeroes are also available ("Infbeta1" and "Infbeta0", respectively). The relevant quantities for each version of the ZOIB model are listed in Table 1. For further details, see Ospina and Ferrari (2010).
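A hedged sketch of a ZOIB fit follows: the arguments likelihood, type_disp and prior_reff are described in this section, while the remaining argument names (formula_fixed, data, domains, disp_direct) and the columns of df are assumptions made for illustration.

```r
library(tipsae)

# Illustrative call (argument names other than likelihood, type_disp and
# prior_reff are assumed, as are the data frame columns)
fit <- fit_sae(formula_fixed = hcr ~ x1 + x2,
               data        = df,          # one row per domain
               domains     = "domain",    # column with area labels
               likelihood  = "Infbeta01", # zero- and one-inflated beta
               type_disp   = "neff",      # dispersion given as effective size
               disp_direct = "neff_d",    # column with the known dispersion
               prior_reff  = "normal")    # unstructured random effect prior
summary(fit)
```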

Prior distributions
To facilitate practitioners, standard wide-range prior distributions are assumed for the parameters included in the model. Starting from the priors for the regression coefficients, classical Gaussian priors are specified. To avoid issues related to possibly different magnitudes, the auxiliary variables are standardized, and the posterior results must be interpreted accordingly. Hence, the priors for the intercept β_0 and the regression coefficients β_j are zero-mean Gaussians,

β_0 ∼ N(0, h_c²),  β_j ∼ N(0, h_c²),

with h_c = 2.5 as the default option, following suggestions from the popular rstanarm package (Goodrich et al. 2024). Such a value can be changed through the scale_prior argument. Note that the same prior setting is also assumed for the coefficients β^z and β^o involved in the ZOIB models. As regards the flexible beta model, we additionally specify suitable priors for the mixing probability p and for the difference between the means of the mixture components. The priors for the random effects are discussed in the following, considering the cases of unstructured random effects, spatially structured random effects, and temporal random effects.

Unstructured random effects
The basic assumption on the random effect is e_d = v_d, where v_d is an unstructured area-specific random effect accounting for deviations from the synthetic predictor. We propose three different strategies to specify its prior distribution, which can be chosen through the prior_reff argument of fit_sae(). Firstly, a zero-mean normal prior with scale σ_v is considered ("normal" option, default), with a half-normal prior on σ_v, in line with Gelman (2006):

v_d | σ_v ∼ N(0, σ_v²),  σ_v ∼ N⁺(0, h_v²).

The default choice h_v = 2.5 makes such a half-normal prior usually weakly informative, compared to the scale of the random effects. However, the scale_prior argument allows tuning this value, which might also speed up the computational algorithms.
When the covariates have poor explanatory power, it is possible to observe, in some domains, large deviations of the predicted value from the observed one, requiring a more flexible handling of the random effect through a robust prior. Among those proposed in the literature, we implement the one introduced by Figueroa-Zúñiga, Arellano-Valle, and Ferrari (2013), and previously considered in the small area framework by Fabrizi and Trivisano (2016). It consists of a Student's t prior with an exponential hyperprior for the degrees of freedom ν and a half-normal hyperprior for the scale σ_v ("t" option):

v_d | ν, σ_v ∼ t(ν, 0, σ_v).

The notation t(ν, 0, σ_v) indicates a Student's t distribution with ν degrees of freedom, location parameter equal to 0, and scale σ_v.
In other cases, in the presence of very informative covariates, the variability of the small area parameters may not require the inclusion of a random effect term (Datta, Hall, and Mandal 2011b). Therefore, the variance gamma shrinkage prior introduced by Brown and Griffin (2010), and implemented in a small area application by Fabrizi, Ferrante, and Trivisano (2018), is included as a prior choice for v_d ("VG" option). This option enables shrinking to 0 the random effects related to a subset of the areas, mimicking the behaviour of a spike-and-slab prior. Following Fabrizi et al. (2018) and De Nicolò, Fabrizi, and Gardini (2022a), we propose a general choice of hyperparameters under which v_d | ψ_d, σ_v ∼ N(0, ψ_d σ_v²), with gamma-distributed ψ_d. It can be noted that the independent ψ_d are local scales, whereas σ_v is a global scale hyperparameter.

Spatially structured random effects
By setting the argument spatial_error equal to TRUE, the user can add a spatially structured effect s_d to the linear predictor, leading to the formulation e_d = v_d + s_d. For the vector s = (s_1, …, s_D), we assume an intrinsic conditional autoregressive (ICAR) prior (Besag, York, and Mollié 1991), i.e., an improper prior whose density is proportional to

exp( − s⊤K s / (2σ_s²) ),

i.e., s ∼ N(0, σ_s² K⁻), where K⁻ is the generalized inverse of the singular precision matrix K. To describe its structure, we first define K = D − W, where D is a diagonal matrix containing the number of connections for each area and W is the adjacency matrix (the generic entry w_ij is 1 if areas i and j are adjacent and 0 otherwise). Following Freni-Sterrantino, Ventrucci, and Rue (2018), the actual precision matrix K is obtained with a scaling procedure aimed at reducing the impact of the structure on the prior variability, taking into account the possible presence of G ≥ 1 disconnected graphs in the model (e.g., islands). Note that G − 1 dummy variables are added to the linear predictor in order to obtain island-specific means, placing a sum-to-zero constraint on the random effects related to the same island. Islands defined by singleton areas are also allowed, even if they do not constitute a graph counted in G. Lastly, a half-normal prior with variance h_s² is fixed for the hyperparameter σ_s. For further details on the implementation of ICAR priors in Stan, see Morris, Wheeler-Martin, Simpson, Mooney, Gelman, and DiMaggio (2019).
To include a spatially structured effect, an object of class 'SpatialPolygonsDataFrame' (from the sp package, Bivand, Pebesma, and Gomez-Rubio 2013) is required as input of the spatial_df argument. Furthermore, the argument domains_spatial_df must receive as input a string containing the name of a column in spatial_df@data. Such a variable contains the labels of the areas in spatial_df, which need to match the names in the domains column of data. For this reason, the user must carefully check the coherence of the denominations in the two objects.
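A quick coherence check of the two sets of labels can be carried out before fitting; the column names used below (NAME_DISTRICT in the shapefile, domain in the survey data) are purely illustrative:

```r
# Illustrative check that domain labels match between the
# 'SpatialPolygonsDataFrame' and the survey data (column names assumed)
lab_shape <- as.character(spatial_df@data$NAME_DISTRICT)
lab_data  <- unique(as.character(df$domain))

setdiff(lab_data, lab_shape)  # domains in the data without a polygon
setdiff(lab_shape, lab_data)  # polygons without an in-sample domain
```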

Temporally structured random effects
If multiple observations of the target indicator are available for different time periods, a suitable model can be specified in order to borrow strength from time repetitions. In this framework, a second subscript must be added to the notation: Y_dt indicates the direct estimator for area d at time t = 1, …, T, whereas e_dt is the random effect component in the linear predictor. The user can choose to add a temporal random effect u_dt to the unstructured one (e_dt = v_d + u_dt) by setting temporal_error = TRUE and declaring in temporal_variable the name of the dataset column that contains the times of the observations. Such a variable must be numeric, and the underlying assumption is that the time periods are all equispaced. Note that all the areas must appear in the dataset the same number of times; hence, possible missing observations need to be included in the dataset, specifying NA in the columns related to survey information. If both temporal and spatial random effects are declared in fit_sae(), then a spatio-temporal model is fitted, removing the unstructured random effect (e_dt = s_d + u_dt).
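The requirement that every area appears once per time point can be met by completing the panel before fitting; a minimal base-R sketch, with illustrative column names (domain, year, plus survey columns):

```r
# Complete the panel so that every (domain, year) pair is present;
# survey columns of unobserved pairs are filled with NA (column names assumed)
full_grid <- expand.grid(domain = unique(df$domain),
                         year   = sort(unique(df$year)))
df_full <- merge(full_grid, df, by = c("domain", "year"), all.x = TRUE)
```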
As the prior for the sequence of random effects {u_dt}_t, we specify a random walk prior of order 1, assuming independence among the areas (Rao and Molina 2015). It represents a flexible prior that can be defined recursively as

u_dt | u_d,t−1 ∼ N(u_d,t−1, σ_u²),  t = 2, …, T,

implicitly assuming a uniform improper prior on u_d1. Sum-to-zero constraints are placed on each area-specific time sequence, to guarantee the identifiability of all the parameters in the linear predictor. Again, a half-normal prior with variance h_u² is fixed for the hyperparameter σ_u, and the contribution of the correlation structure to the prior variability is mitigated by adopting a scaling procedure (Riebler, Sørbye, Simpson, and Rue 2016).

Data pre-processing
Area-level models require given sampling variances V[Y_d] for each domain as an input. In small-sample contexts, the estimates of such sampling variances can be imprecise and unreliable. As this may affect model performance, their unreliability can be mitigated via different approaches. The most common ones are the treatment of the variance estimates by means of a smoothing procedure and the estimation of alternative quantities, such as the so-called effective sample size in the case of proportions (Chen, Sartore, Benecha, Bejleri, and Nandram 2022).
Let us consider that, under simple random sampling, a general variance function has the following structure:

V_SRS[Y_d] = f(θ_d) / n_d,

where n_d is the sample size. Note that, if the target quantity is a proportion, then f(θ_d) = θ_d(1 − θ_d). When dealing with complex survey designs, the selection process invariably introduces a correlation structure in the data. In this way, the information actually available may be lower than that provided by a sample of the same size under simple random sampling. This is formalized by the effective sample size ñ_d, which denotes the amount of effective information provided by the sample; generally, ñ_d < n_d holds. It can be estimated as ñ_d = n_d / deff_d, where deff_d is the design effect, defined as the ratio between the complex design-based variance and the variance under simple random sampling. Clearly, under simple random sampling, ñ_d equals n_d.
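As a toy illustration of these quantities, the classic unequal-weighting form of the design effect (an individual-level simplification of Kish's formula, not the package's unit-level version; the weights are hypothetical) can be computed as:

```r
# Hypothetical individual design weights for one area
w <- c(1.2, 0.8, 1.5, 2.0, 0.9)

n_d    <- length(w)
deff_d <- n_d * sum(w^2) / sum(w)^2  # Kish's unequal-weighting design effect
n_eff  <- n_d / deff_d               # effective sample size (n_eff <= n_d)
```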
We propose the smoothing() function, which performs both variance smoothing and ñ_d estimation through three different methods. The first method, "kish", enables the estimation of the effective sample sizes through the use of survey weights. When survey weights are not accessible and/or raw estimates of the sampling variance (e.g., from a bootstrap) are available, the remaining two methods, "ols" and "gls", perform a variance smoothing procedure on the raw estimates, allowing for different variance functions (argument var_function) and also providing the related effective sample sizes in the case of proportions. In particular, the argument method allows choosing among:
• "kish", implementing the area-specific design effect estimation proposed by Kish (1992). It employs solely the design weights and requires an additional data frame as input of the survey_data argument, whose structure is specified in Section 4.3. The area-specific design effect is estimated as

deff_d = n_d Σ_{h∈d} W_dh² / n_dh,

where h refers to a generic sampling unit in area d (e.g., the household). Indicating with the subscript c the generic individual in sampling unit h, we define W_dh = Ñ_dh / Ñ_d, Ñ_dh = Σ_{c∈h} w_dhc, Ñ_d = Σ_{h∈d} w_dh, and n_d = Σ_{h∈d} n_dh. We denote with w_dh and n_dh the design weight and the sample size of unit h in area d, respectively, while w_dhc is the individual design weight. Thus, the design-based variance can be defined as V[Y_d] = deff_d f(θ_d)/n_d, while ñ_d = n_d / deff_d. This method has already been used in small area contexts by Wieczorek and Hawala (2011) and Liu et al. (2007). Kalton, Brick, and Le (2005) found this approximation accurate for proportions ranging between 0.2 and 0.8.
• "ols", implementing a variance smoothing model using a generalized variance function approach, as in Fabrizi et al. (2011) and Fabrizi and Trivisano (2016). Considering the design-based variance as V[Y_d] = f(θ_d)/(ψ n_d), the smoothing procedure is based on the assumption that the design effect does not vary across areas. Denoting by V̂raw[Y_d] a raw estimator of the complex survey variance, the following smoothing equation is specified:

V̂raw[Y_d] = (1/ψ) f(y_d)/n_d + ε_d,   (1)

where ψ = 1/deff and the ε_d are zero-mean, homoscedastic residuals. The model is estimated using ordinary least squares via the gls() function from the nlme package (Pinheiro, Bates, DebRoy, Sarkar, and R Core Team 2023), providing the smoothed dispersion parameters ñ_d = ψ̂ n_d and the refined variance estimates Ṽ[Y_d] = f(y_d)/ñ_d.
• "gls", extending the "ols" method to the case of heteroskedasticity of the error component ε_d of Equation 1. The default method assumes a heteroskedastic error with a power variance function on the absolute fitted values (see Pinheiro et al. 2023, for further details).
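Outside the package, the idea behind the "ols" and "gls" methods can be sketched with nlme; the data frame columns (raw_var, y, n) and the exact regression form are assumptions for illustration, not the smoothing() implementation:

```r
library(nlme)

# "ols"-style fit: regress raw variances on the SRS variance function;
# the single coefficient estimates deff = 1/psi (columns assumed)
sm_ols  <- gls(raw_var ~ 0 + I(y * (1 - y) / n), data = df)
psi     <- 1 / coef(sm_ols)   # psi = 1/deff
n_tilde <- psi * df$n         # smoothed effective sample sizes

# "gls"-style fit: allow heteroskedastic residuals via a power variance
# function of the fitted values
sm_gls <- gls(raw_var ~ 0 + I(y * (1 - y) / n), data = df,
              weights = varPower())
```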
We remark that, when the response is a proportion, φ̂_d = ñ_d − 1; otherwise, only the variance specification can be considered. The output estimates are ready to be used as known parameters in an area-level model, and they need to be added to the analyzed 'data.frame' object.

Posterior inference
We are interested in making posterior inference on θ_d. Since we are not dealing with conjugate models, not even conditionally, posterior inference is carried out through MCMC draws.
As a point estimate, the optimal Bayes estimator of θ_d under quadratic loss is considered, i.e., the posterior mean. We indicate it with the notation

θ̂_d^HB = E[θ_d | y],   (2)

where HB stands for hierarchical Bayes. The point estimates can be complemented with uncertainty measures such as the posterior standard deviation and credible intervals, determined by the quantiles of the posterior distribution. The generic method summary(), applied to an S3 object of class 'fitsae', produces by default point estimates (posterior mean and median) and credible intervals (at the 95% and 50% levels) for the predictors, the basic model parameters, and the random effects.

Out-of-sample treatment
The package automatically provides out-of-sample predictions, which are made available through the export() function. In practice, when the direct estimate for an area is missing but auxiliary information is observed, the area can be included in the dataset with its direct estimate labeled as NA. In this way, the functions of the package draw a sample from the posterior distribution of the predictor. This feature is available for all the considered likelihoods, except for the flexible beta one since, in this specific case, θ_d depends on its sampling variance, which is not available for out-of-sample areas.
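Concretely, an out-of-sample area can be appended to the input data frame as a row with observed covariates and NA survey information; all the column names below are illustrative:

```r
# Append an out-of-sample area: covariates observed, survey columns NA
# (column names are illustrative, not the package's required ones)
oos_row <- data.frame(domain = "AreaX", hcr = NA, neff_d = NA,
                      x1 = 0.7, x2 = 1.2)
df_all <- rbind(df, oos_row)
```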
Recalling that θ_d^oos, d = D̄ + 1, …, D (with D̄ the number of in-sample domains), denotes the out-of-sample target quantities, their predictors are reported in Table 1. Note that they depend on e_d: when spatial and temporal dependencies are defined, s_d and u_dt gain information from the assumed correlation structure, whereas v_d is always drawn from a zero-mean distribution, contributing only to the posterior variability of θ_d^oos. Exploiting the MCMC estimation framework, it is possible to obtain a sample from the posterior of θ_d^oos by combining the samples drawn from the posteriors of the involved parameters. Eventually, the point predictor defined in (2) holds also for out-of-sample observations, together with the other posterior summaries.

Diagnostics and goodness-of-fit tools
The method summary() returns, in addition, goodness-of-fit and model validation diagnostics, as well as SAE-specific diagnostics.In the following, we provide a brief theoretical overview of such measures.
One of the main advantages of estimating models within the Bayesian framework is the plethora of tools that allow investigating model performance. Among the most relevant ones are those relying on the posterior predictive distribution, which we denote with Y•_d | y, d = 1, …, D̄. Area-specific Bayesian p values (BP_d) are computed under the following discrepancy measure (You and Rao 2002; Fabrizi et al. 2011):

BP_d = P(Y•_d < y_d | y).

In the absence of systematic deviations, the expected Bayesian p value is 0.5, whereas values near 0 or 1 highlight issues of over-estimation and under-estimation, respectively.
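Given a matrix of posterior predictive draws, such p values can be approximated by Monte Carlo; a sketch assuming y_rep is an S × D̄ matrix of draws (here simulated for illustration) and y the vector of direct estimates:

```r
# Hypothetical posterior predictive draws: S = 1000 draws for 3 areas
set.seed(1)
y     <- c(0.12, 0.34, 0.08)
y_rep <- matrix(rbeta(3000, 2, 8), nrow = 1000, ncol = 3)

# Monte Carlo approximation of BP_d = P(Y*_d < y_d | y)
BP_d <- colMeans(sweep(y_rep, 2, y, FUN = "<"))
```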
Information criteria are widely used in Bayesian inference to compare models with different specifications, e.g., diverse distributional assumptions, random effect structures, or covariates. Following Vehtari, Gelman, and Gabry (2017), we consider the approximate leave-one-out cross-validation information criterion (LOOIC) computed using Pareto-smoothed importance sampling. It can be retrieved through the loo package (Vehtari et al. 2024) and is provided together with the approximate standard errors for the estimated predictive errors.
Stepping into SAE-specific diagnostics, the standard deviation reduction (SDR_d) indicator is commonly used to assess the decrease in uncertainty associated with the employment of a small area model. It is obtained by evaluating

SDR_d = 1 − sd(θ_d | y) / (E[V[Y_d | η] | y])^(1/2),

where the denominator is defined in this way when type_disp = "neff", taking into account the fact that V[Y_d | η] has a posterior distribution to be summarized. Conversely, if type_disp = "var", the denominator is replaced by V[Y_d]^(1/2). This diagnostic has to be considered with caution when performing model selection, since it does not account for the different biases that can affect distinct model-based estimators.
Lastly, the shrinkage bound rate (SBR) is computed, measuring whether model-based estimates lie between the direct estimate and the synthetic estimate of θ_d, i.e., the estimate implied by the regression part of the model alone. In fact, in the standard Fay-Herriot model the shrinking process is clearly identified by the shape of the best linear unbiased predictor for known values of β and σ²_v, namely θ̃_d = γ_d Y_d + (1 − γ_d) x_d⊤β with γ_d = σ²_v / (σ²_v + V[Y_d]), a convex combination of the direct and synthetic estimates. Beta regression models do not provide a closed-form predictor, since the conditional distribution of θ_d, ∀d = 1, ..., D, does not belong to a standard family. Janicki (2020) shows that, in a beta regression model with standard diffuse priors, θ̂HB_d is nonetheless expected to lie between its direct and synthetic counterparts. Thus, checking whether model estimates fall inside this bound can yield important insights into the shrinking process and the consistency of the estimators.

Benchmarking procedure
The benchmark() function gives the chance to perform a benchmarking procedure on model-based estimates. Small area models do not guarantee coherence of the obtained estimates with respect to aggregates known in the population (external benchmark) or retrieved from direct estimates (internal benchmark). In particular, the latter quantities could refer to a larger geographical area or a larger socio-demographic group of which the target domains are a subset, and might therefore be reliable. This feature may introduce drawbacks in many situations (e.g., when small area estimates are used to allocate funding), and exact benchmarking is required to avoid surpluses or shortfalls (Datta, Ghosh, Steorts, and Maples 2011a; Zhang and Bryant 2020).
A standard approach in the Bayesian literature is the one proposed by Datta et al. (2011a), which is implemented in the benchmark() function. It consists of an ex-post treatment of posterior area estimates, such as those in formula (2). In this case, the point estimates obtained by the fit_sae() function are adjusted, leading to a new set of estimates θ̂BM_d, d = 1, ..., D. They minimize the posterior risk under a weighted squared error loss while satisfying the benchmarking constraints. This procedure could solely target the point estimators (single benchmarking) or, alternatively, also the ensemble variability (double benchmarking). When in-sample areas are treated and a single benchmarking is performed, an estimate of the overall posterior risk is provided.
The procedure requires the definition of an area-specific set of weights w_d which, in the case of proportions, are the population shares w_d = N_d / Σ_l N_l. Denoting the benchmark value by B:
• The "ratio" method provides benchmarked estimates of the form θ̂BM_d = θ̂HB_d + (r_d / s)(B − Σ_d w_d θ̂HB_d) (Equation 6), with r_d = θ̂HB_d and s = Σ_d w_d θ̂HB_d, i.e., each estimate is rescaled by the factor B / Σ_d w_d θ̂HB_d; the posterior risk for the whole set of benchmarked estimates is given in Equation 7.
• The "raking" method provides the benchmarked estimate in Equation 6 and the posterior risk in Equation 7 with r_d = 1 and s = 1.
• The "double" method extends this procedure accounting for a further benchmark on the weighted ensemble variability The benchmarking procedure can be performed also in the case of temporal or spatio-temporal models by specifying multiple benchmarks related to different time periods.
Lastly, we remark that such methods, even if widely used and easy to implement, have some drawbacks, as summarized in Okonek and Wakefield (2022). For example, since the benchmarked estimates are obtained under a weighted squared error loss, they are not guaranteed to lie in the unit interval. Furthermore, the uncertainty of the estimates cannot be computed for each area, as the method is not fully Bayesian.
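As a plain-R illustration of the adjustments above (a sketch written directly from the formulas, not the package's internal code; the function name benchmark_sketch is ours), the "ratio" and "raking" corrections can be expressed as:

```r
# Single benchmarking adjustments of a vector of HB point estimates,
# following theta_BM = theta_HB + (r/s) * (B - sum(w * theta_HB)).
benchmark_sketch <- function(est, shares, bench,
                             method = c("ratio", "raking")) {
  method <- match.arg(method)
  s_hat <- sum(shares * est)              # weighted mean of HB estimates
  if (method == "ratio") {
    r <- est; s <- s_hat                  # r_d = theta_HB_d, s = sum_d w_d theta_HB_d
  } else {
    r <- rep(1, length(est)); s <- 1      # raking: r_d = 1, s = 1
  }
  est + (r / s) * (bench - s_hat)
}

est    <- c(0.10, 0.15, 0.20)
shares <- c(0.5, 0.3, 0.2)
bm     <- benchmark_sketch(est, shares, bench = 0.13, method = "ratio")
sum(shares * bm)                          # weighted mean now equals the benchmark 0.13
```

Note that the "ratio" branch reduces to a multiplicative rescaling of each estimate by B / Σ_d w_d θ̂HB_d, while "raking" shifts all estimates by a common additive constant.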

Datasets
In the SAE field, data typically come from multiple sources. Direct estimators and their sampling variances typically result from survey data aggregated at the area level, while covariates come from census and/or administrative/register sources. As a consequence, explanatory variables, aggregated at the area level, are required to be defined at the population level, i.e., without error, and to be potentially correlated with the target variable. To outline the workflow of the tipsae package, its functions are illustrated in Section 4 and applied to an example dataset released within the package. The whole dataset is named emilia and consists of a panel on poverty mapping concerning 38 health districts within the Emilia-Romagna region, located in north-east Italy, with annual observations recorded from 2014 to 2018. We built it starting from model-based estimates and related coefficients of variation freely available on the Emilia-Romagna region website. Since it is used for illustrative purposes only, such estimates are treated as unreliable direct estimates requiring an SAE procedure.
We considered the head-count ratio estimates as direct estimates ($hcr) and their associated variances as sampling variances ($vars). A fake standardized covariate $x has been generated. We also provide area sample sizes ($n), population sizes ($pop), province identifiers ($prov), years ($year) and health district names ($id). The emilia dataset can be loaded as follows.
R> library("tipsae")
R> data("emilia")
R> head(emilia)
A cross-sectional subset concerning a single year (2016) is taken from emilia, for illustrating non-temporal models: it is named emilia_cs and can be loaded as follows.
R> data("emilia_cs")

Workflow
In this section, a typical flow of an SAE analysis is outlined with step-by-step instructions, showing the potential of the tipsae tools. As illustrated by the flowchart in Figure 1, the package is structured into three parts that relate to: model building and fitting (in green, Section 4.1), diagnostics and results displaying (in yellow, Section 4.2), and complementary tools for SAE analysis (in blue, Section 4.3). Figure 1 also displays the possible connections with external functions, drawn with dashed arrows, which are useful to further exploit the produced objects.

Model building and fitting
The first step of the workflow represents the core of our package, concerning the estimation of models with the diverse extensions and parametrizations defined in Section 2. The single function fit_sae() allows users to construct personalized models and fit them using Stan routines, called through the sampling() function of the rstan package. It also allows customized parallel computing when the model runs on multiple chains. A simple parallelization can be set up using the following command, which sets the number of R processes equal to the number of CPU cores.
R> options(mc.cores = parallel::detectCores())
The function setDefaultClusterOptions() from the parallel package can be used to change the default options for parallelization. For further details, see the rstan guidelines.
A complete list of the input arguments of the fit_sae() function is reported in Table 2. Among them, domain_size indicates the data column containing the domain sizes (optional); for out-of-sample areas, sizes must be NA. Since the emilia_cs dataset contains the sampling variance as a measure of dispersion, disp_direct must be set equal to "var", implying a mean-variance parametrization. Moreover, the domain_size argument has to be specified in order to obtain visual design-consistency diagnostics in the subsequent plotting function.
In practice, the estimation is carried out by running the fit_sae() function as follows. For the sake of reproducibility, we set seed = 0.
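A call along the following lines fits the default beta model (a sketch based on the arguments discussed above; column names refer to emilia_cs, and argument defaults may differ across package versions):

```r
library("tipsae")
data("emilia_cs")

# Beta model with default priors; disp_direct points to the sampling
# variances, hence type_disp = "var" (mean-variance parametrization)
fit_beta <- fit_sae(formula_fixed = hcr ~ x,
                    domains = "id",
                    disp_direct = "vars",
                    type_disp = "var",
                    domain_size = "n",
                    data = emilia_cs,
                    seed = 0)
```

The returned object, of class 'fitsae', is the input of the diagnostic and summary tools described next.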
Different models can be estimated relying on diverse assumptions and subsequently compared with each other. For example, we assume a flexible beta likelihood and a variance gamma shrinking prior for the unstructured random effect, in order to propose a more flexible model for the data. Given the increased complexity of the model assumptions, more HMC iterations are required, together with a higher proposal acceptance probability (adapt_delta).
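A sketch of the corresponding call might look as follows; the values for likelihood, prior_reff, iter and adapt_delta are illustrative assumptions:

```r
# Flexible beta likelihood with a variance gamma shrinking prior on the
# unstructured random effect; more iterations and a higher adapt_delta
# to cope with the richer model
fit_FB <- fit_sae(formula_fixed = hcr ~ x,
                  domains = "id",
                  disp_direct = "vars",
                  type_disp = "var",
                  domain_size = "n",
                  data = emilia_cs,
                  likelihood = "flexbeta",
                  prior_reff = "VG",
                  iter = 4000,
                  adapt_delta = 0.95,
                  seed = 0)
```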

Diagnostics and results displaying
After the MCMC sampling, a careful check of algorithm convergence is required in order to validate the posterior results. To this aim, our suggestion is to exploit the plethora of diagnostic methods implemented for 'stanfit' objects within the bayesplot package. For example, the following code generates the trace plots related to the fit_beta model, as in Figure 2, which are useful to visually inspect the convergence of the chains to a unique stationary distribution.
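A possible sketch of such a call, assuming the underlying 'stanfit' object is stored in the $stanfit slot of the fitted object and that the regression coefficients are named "beta[1]" and "beta[2]" in the Stan model (both assumptions on our part):

```r
library("bayesplot")

# Trace plots of the regression coefficients of the beta model
mcmc_trace(as.array(fit_beta$stanfit), pars = c("beta[1]", "beta[2]"))
```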
In addition, small area diagnostics are required at this stage, to check whether the results meet specific properties that are desirable in this context. Such diagnostic measures can be obtained through the summary() method applied to 'fitsae' objects.
Besides the printed output, the method produces an object of class 'summary_fitsae' which contains the relevant information for posterior inference. The argument probs allows specifying the quantiles of interest to be reported as posterior summary measures. The logical argument compute_loo determines whether the LOOIC should be computed.
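A sketch of the corresponding call, with the quantiles and the LOOIC computation set explicitly (the argument values are illustrative):

```r
# Posterior summaries with 95% credible intervals and LOOIC
summ_beta <- summary(fit_beta,
                     probs = c(0.025, 0.5, 0.975),
                     compute_loo = TRUE)
```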

R> summ_beta
Warnings: Some Pareto k diagnostic values are too high.See help('pareto-k-diagnostic') for details.
Another element that can be employed in external functions to assess model goodness of fit is $y_rep, an array of values generated from the posterior predictive distribution, which enables the implementation of posterior predictive checks through the bayesplot package. The observed data, required for the checks, can be extracted through the $direct_est element. The following code compares the empirical densities of the samples generated under the considered models, reported in Figure 3. Lastly, all the posterior summaries related to the random effects are stored in the $raneff element, a list of 'data.frame' objects, one for each type: $unstructured, $temporal, and $spatial. These outputs may be exploited to produce meaningful plots, e.g., the caterpillar plot of Figure 4, created via the following code.
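The posterior predictive check of Figure 3 can be sketched with bayesplot, using the $y_rep and $direct_est elements described above (we assume here that $y_rep is a draws-by-areas matrix, and subset the first 100 draws for readability):

```r
library("bayesplot")

# Empirical density of the direct estimates against the densities of
# draws from the posterior predictive distribution
ppc_dens_overlay(y = summ_beta$direct_est,
                 yrep = summ_beta$y_rep[1:100, ])
```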

Ad-hoc plot functions
Our package comes equipped with ad-hoc functions for visual diagnostics. The S3 object of class 'summary_fitsae' can be used as input of the plot() and density() methods, as well as of the map() function.
The generic method plot() provides, in a grid (default) or sequence, (a) a scatterplot of direct estimates versus model-based estimates, visually capturing the shrinking process, (b) a histogram of Bayesian p values, (c) a boxplot of standard deviation reduction values, and, if area sample sizes are provided as input to fit_sae(), (d) a scatterplot of model residuals versus sample sizes, in order to check for design consistency, i.e., as sizes increase, residuals should converge to zero. The following code line produces Figure 5.

R> plot(summ_beta)
The method density() provides, in a grid (default) or sequence, the density plot of direct estimates versus HB model estimates, and the density plot of the standardized posterior means of the random effects versus the standard normal density, in order to check the Gaussian assumption. Figure 6 is produced as output of the following command.

R> density(summ_beta)
Lastly, the map() function enables the investigation of the analysed phenomenon by accounting for its geographical dimension, if any. More in detail, a 'SpatialPolygonsDataFrame' object from the sp package should be provided as input to the spatial_df argument. The spatial_id_domains argument must receive the name of the spatial_df variable containing the area denominations, in order to correctly match the areas. If such names match those provided through the original dataset, no extra argument is required. Otherwise, the match_names argument should receive a two-column 'data.frame': the first column with the original data coding (domains) and the second with the corresponding spatial_df labels. The feature displayed on the map can be defined through the quantity argument, choosing among HB model estimates (HB_est), direct estimates (Direct_est), posterior standard deviations (SD), and benchmarked estimates (Bench_est) when a 'benchmark_fitsae' class object is given as input (see Section 4.3). The following code loads the Emilia-Romagna health districts shapefile and produces the maps in Figure 7, showing model-based estimates and their posterior standard deviations.
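A sketch of those calls, assuming the shapefile is shipped with the package as emilia_shp and that its district names are stored in the NAME_DISTRICT variable (as in the benchmarking example of Section 4.3):

```r
data("emilia_shp")

# Map of HB model estimates on the health districts
map(x = summ_beta,
    spatial_df = emilia_shp,
    spatial_id_domains = "NAME_DISTRICT",
    quantity = "HB_est")

# Map of the corresponding posterior standard deviations
map(x = summ_beta,
    spatial_df = emilia_shp,
    spatial_id_domains = "NAME_DISTRICT",
    quantity = "SD")
```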
A function for exporting such results in CSV format is also available, named export(). It requires an 'estimate_fitsae' object and a character string naming the output file (argument file). It is also possible to indicate whether to export the results of both in-sample and out-of-sample areas (default, type = "all"), or only of in-sample or out-of-sample areas ("in" or "out", respectively), as follows.
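For instance, the exporting step might be sketched as follows, where est_beta stands for a previously created 'estimate_fitsae' object (the object and file names are hypothetical):

```r
# Export only the in-sample area results to a CSV file
export(est_beta, file = "results_beta.csv", type = "in")
```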

Complementary tools
Complementary tools for small area analysis provided by the package are the smoothing and benchmarking functions. The smoothing() function allows for the pre-processing of sampling variance estimates and the retrieval of effective sample sizes, as described in Section 2.3. After its usage, the output results have to be incorporated into the dataset used as input of the fit_sae() function. The smoothing() function requires as input the data including the direct estimates, whose variable name has to be specified in the direct_estimates argument, the method to be used among "ols", "gls" and "kish" (method), and the specification of a variance function f(θ) through the var_function argument. The default option (NULL) for f(θ) matches the proportion case, being equal to θ(1 − θ), while for other measures it can differ widely: for instance, the Gini index variance can be approximated by f(θ) = θ²(1 − θ²) (Fabrizi and Trivisano 2016), and therefore the object function(x) x^2 * (1 - x^2) has to be provided to the var_function argument. If method "ols" or "gls" is chosen, the function requires the raw variance estimates (argument raw_variance), the area sample sizes (areas_sample_sizes) and, possibly, additional covariates (additional_covariates), all of them being column names of the 'data.frame' provided to the data argument. On the other hand, method "kish" requires the domain names (area_id, as a column name in data) and the specification of an additional dataset (survey_data), defined at the sampling unit level (e.g., households). That dataset must include sampling weights (weights), unit sizes (sizes) and domain names (survey_area_id), in order to allow for matching. The output is an object of class 'smoothing_fitsae', a list of vectors including dispersion estimates: the variance and, if no alternative variance function is specified, φ̂_d. If the "ols" or "gls" method has been selected, the list also incorporates an object of class 'gls' from the nlme package, ready to be further explored through additional nlme tools. The plot() method is available for 'smoothing_fitsae' objects, showing a boxplot of variance estimates when effective sample sizes are estimated through the "kish" method, or a scatterplot of both original and smoothed estimates versus sample sizes when variance smoothing is performed through "ols" or "gls".
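A smoothing step on emilia_cs might then be sketched as follows; the Gini variance function is included only to show the var_function mechanism, while the smoothing call itself uses the default proportion case (argument names as described above, exact defaults may vary):

```r
# Variance function for the Gini index, f(theta) = theta^2 * (1 - theta^2),
# to be passed to var_function when smoothing Gini-type indicators
gini_var <- function(x) x^2 * (1 - x^2)

# GLS variance smoothing for the head-count ratio; var_function is left
# at its default NULL, i.e., the proportion case f(theta) = theta * (1 - theta)
smoo <- smoothing(data = emilia_cs,
                  direct_estimates = "hcr",
                  method = "gls",
                  raw_variance = "vars",
                  areas_sample_sizes = "n")
plot(smoo)   # smoothed versus original variance estimates
```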
R> shares <- emilia_cs$pop / sum(emilia_cs$pop)
R> bmk <- benchmark(x = summ_beta, bench = 0.13, share = shares,
+                   method = "raking")
R> map(x = bmk, spatial_df = emilia_shp,
+     spatial_id_domains = "NAME_DISTRICT")
Benchmarking can be performed on the whole set of areas (default option) or on a subset of them. In the latter case, the vector containing the names of the considered areas has to be indicated through the areas argument. Moreover, the function automatically handles out-of-sample estimates if they are involved in the benchmarking procedure. Benchmarked estimates and the posterior risk are stored within an object of class 'benchmark_fitsae'.

Spatio-temporal examples
As explained in Section 2, it is possible to fit models that incorporate a spatial dependency structure, a temporal dependency structure, or both. The first extension, useful when the domains of interest are geographical entities, relaxes the assumption of spatial independence. Commonly, the boundaries across areas are set arbitrarily, and thus it can be reasonable to assume that the quantities of interest in neighboring areas are correlated. This can happen when dealing with data where the spatial dimension is relevant, e.g., agricultural, environmental, economic and epidemiological analyses. A spatial extension can be implemented through the fit_sae() function by setting the spatial_error argument to TRUE and supplying an object of class 'SpatialPolygonsDataFrame' to the spatial_df argument, taking care that its areas are ordered as in the data object.
When dealing with panel data, such as the emilia dataset, a temporal dependency structure has to be taken into account due to the presence of repeated measures across time. A temporal model can be implemented by setting the temporal_error argument to TRUE and by providing the name of the temporal variable of the dataset through the temporal_variable argument.
To fit a spatio-temporal model, the emilia dataset, together with the shapefile stored in emilia_shp, must be loaded.

R> data("emilia")
R> data("emilia_shp")
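The spatio-temporal fit can then be sketched by enabling both error structures (argument names as described above; this is a sketch, not the paper's exact call):

```r
# Spatio-temporal model: temporal random effects indexed by "year",
# spatial effects built from the emilia_shp polygons
fit_st <- fit_sae(formula_fixed = hcr ~ x,
                  domains = "id",
                  disp_direct = "vars",
                  type_disp = "var",
                  data = emilia,
                  temporal_error = TRUE,
                  temporal_variable = "year",
                  spatial_error = TRUE,
                  spatial_df = emilia_shp,
                  seed = 0)
```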
Notice that the shiny app does not include all the package tools. Specifically, the benchmarking procedure has not been implemented, and the smoothing procedure does not include the "kish" method. Such options may, however, be included in future releases of the package.

Conclusions and future developments
The tipsae package is a dedicated tool for mapping proportions and indicators defined on the unit interval, widely used to measure, for instance, unemployment, educational attainment and disease prevalence. To the best of our knowledge, it is the first package implementing beta-based small area methods, which are particularly suited to unit-interval responses. Such methods, developed within a Bayesian framework, come equipped with a set of diagnostics and complementary tools, as well as visualizing and exporting functions. The features of the tipsae package assist the user in carrying out a complete SAE analysis through the entire process of estimation, validation and results presentation, making the application of Bayesian algorithms and complex SAE methods straightforward. A shiny application with a user-friendly interface can be launched to further simplify the process.
Additional features to be integrated into future releases include, firstly, the implementation of shrinking priors for the regression coefficients, useful for variable selection when several covariates are employed. Secondly, the zero- and/or one-inflated beta version already implemented could fail when very few zero or one values are observed; a possible extension could thus comprise further flexible alternatives. Lastly, other directions may focus on model extensions for variance shrinking (You and Chapman 2006; Sugasawa, Tamae, and Kubokawa 2017), which relax the assumption of a known dispersion parameter, and for covariates measured with error (Arima, Datta, and Liseo 2015).

θ̂HB_d converges to the direct estimate as V(Y_d | η) → 0 and to the synthetic estimate as σ²_v → 0. The first property has also been proved by Fabrizi et al. (2020). However, θ̂HB_d is not bounded by its convergence limits, and the conjecture Y_d < θ̂HB_d < p*_d is expected to hold only for V(Y_d | η) sufficiently small. In the benchmarking weights, N_d is the population size for area d, and in what follows the constraint is indicated with B. The benchmark() function allows performing three different benchmarking methods, according to the argument method. For the "double" method, the simultaneous constraints are Σ_d w_d θ̂BM_d = B and Σ_d w_d (θ̂BM_d − B)² = H, where H is a prespecified value of the variability of the benchmarked estimators, from which the resulting estimate θ̂BM_d follows.

Figure 2: Trace plots of the parameters β_0 and β_1 of the beta regression model.

Figure 3: The empirical densities from posterior predictive samples (y_rep) versus that of the observed data (y).

Figure 4: Caterpillar plot of unstructured random effects from the beta regression model.

Table 2: Further inputs for the sampling() function; "Unstructured" for h_v, "Spatial" for h_s, "Temporal" for h_u and "Coeff." for h_c.

Figure 1: Flowchart describing the structure of the tools implemented in the tipsae package.