modelsummary: Data and Model Summaries in R

modelsummary is a package to summarize data and statistical models in R. It supports over one hundred types of models out-of-the-box, and allows users to report the results of those models side-by-side in a table, or in coefficient plots. It makes it easy to execute common tasks such as computing robust standard errors, adding significance stars, and manipulating coefficient and model labels. Beyond model summaries, the package also includes a suite of tools to produce highly flexible data summary tables, such as dataset overviews, correlation matrices, (multi-level) cross-tabulations, and balance tables (also known as "Table 1"). The appearance of the tables produced by modelsummary can be customized using external packages such as kableExtra, gt, flextable, or huxtable; the plots can be customized using ggplot2. Tables can be exported to many output formats, including HTML, LaTeX, Text/Markdown, Microsoft Word, PowerPoint, Excel, RTF, PDF, and image files. Tables and plots can be embedded seamlessly in rmarkdown, knitr, or Sweave dynamic documents. The modelsummary package is designed to be simple, robust, modular, and extensible.


Introduction
Data analysts often communicate their results using regression tables, coefficient plots, descriptive summaries, balance tables, crosstabs, or correlation tables. Creating these tables and plots can be a time-consuming and aggravating process. The modelsummary package eases this burden by allowing users to create a wide range of publication-ready data and model summaries under one roof, using a simple, consistent, and powerful set of functions.
The modelsummary package follows four key design principles: simplicity, robustness, modularity, and extensibility. In the spirit of simplicity, all of the package's functions use a simple and consistent user interface; the number of arguments per function is limited; and each argument is well documented and accompanied by copious examples.
The modelsummary package includes two main families of functions: the first produces data summaries, and the second produces model summaries. Table 1 gives an overview of the package's main functions, all of which share a simple and consistent user interface. This makes modelsummary easy to learn for users who seek an integrated tool for scientific communication. The package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=modelsummary. The next section illustrates the benefits of the package with practical examples.

Illustrations
To illustrate how to use the modelsummary package, we will consider two datasets. The first holds information about penguins near Palmer Station in Antarctica (Horst, Hill, and Gorman 2020). The second tracks the results of RuPaul's Drag Race, a televised reality competition series (Miller 2022). Both datasets are available as standalone R packages and at the Rdatasets archive, a website which hosts over 1700 free datasets in CSV format: https://vincentarelbundock.github.io/Rdatasets.

To begin the data exploration, we use datasummary_skim. This function was heavily inspired by the skimr package (Waring, Quinn, McNamara, Arino de la Rubia, Zhu, and Ellis 2022). It gives a high-level overview of the dataset, with key descriptive statistics and an inline histogram. This code produces Table 2:

R> datasummary_skim(penguins)

By default, datasummary_skim only summarizes continuous variables, but this behavior can be altered with the type argument.
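For instance, a call along these lines would summarize the categorical variables instead. This is a sketch; we assume the penguins data frame is loaded and contains factor variables such as Species and Sex:

```r
library(modelsummary)

# Summarize categorical (factor/character) variables instead of the
# continuous ones: counts and percentages for each category.
datasummary_skim(penguins, type = "categorical")
```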
After "skimming" the data, it is often useful to report descriptive statistics for subgroups of the sample. In an experimental setting, for example, the analyst may want to verify that covariates are balanced between treatment arms, and might like to highlight differences in means for important variables. This kind of table is colloquially referred to as "Table 1" or "Balance Table." To create such a table, we use the datasummary_balance command.    The first argument of datasummary_balance is a one-sided formula which identifies the groups that we want to compare. The output of this command is shown in Table 3: R> datasummary_balance(~Sex, data = penguins) Under the hood, datasummary_balance uses the estimatr package (Blair, Cooper, Coppock, Humphreys, and Sonnet 2022) to estimate standard errors around the differences in means. As a result, the function can produce estimates that automatically take into account clusters, weights, or blocked experimental designs.
To explore the relationships between numeric variables, it is often useful to create a correlation table. The datasummary_correlation function makes this easy. It can report different kinds of correlations (e.g., Pearson, Kendall, Spearman), and allows users to supply their own methods. This code produces Table 4:

R> datasummary_correlation(penguins)
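For example, the method argument can request a different correlation coefficient. A sketch, assuming the penguins data frame is available:

```r
library(modelsummary)

# Report Spearman rank correlations instead of the default Pearson
# correlations. The method argument also accepts "kendall" or a
# user-supplied function.
datasummary_correlation(penguins, method = "spearman")
```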
The datasummary_crosstab function can help us explore the relationships between categorical variables. The first argument of this function is a two-sided formula, where the left-hand side represents the row variable and the right-hand side identifies the column variables. For example, to draw a cross-tab of the Island variable against the Species variable, we could type:

R> datasummary_crosstab(Island ~ Species, data = penguins)

The datasummary_crosstab function can also produce multi-level crosstabs. To achieve this, we use the asterisk (*) as a nesting operator. The code below produces Table 5, which shows counts and shares of penguins by Island, Sex, and Species:

R> datasummary_crosstab(Island ~ Sex * Species, data = penguins)

If Tables 2-5 do not meet our particular needs, we can turn to the datasummary function. datasummary is a general purpose tool built on top of the tables package (Murdoch 2020). It can make crosstabs and data summaries, and it can display the output of virtually any function in the R language.
The datasummary function builds a table by reference to a two-sided formula: the left side defines rows and the right side defines columns. The terms of the formula represent variables and the functions that we want to apply to those variables. Terms linked by a + sign are displayed one after the other, in the order in which they enter the formula.
For example, if we want to display the Flipper and Body Mass variables on separate rows, and the variables' means and standard deviations in distinct columns, we type:

R> datasummary(Flipper + `Body Mass` ~ Mean + SD, data = penguins)

The command above produces Table 6a. Two aspects of this code are noteworthy. First, when a variable name includes spaces (e.g., Body Mass), we can enclose it in backticks in the datasummary formula. Second, the Mean and SD terms of the formula above are convenience functions supplied by the modelsummary package. These functions call the corresponding mean and sd functions in base R, but set na.rm = TRUE by default. Since the Flipper and Body Mass variables of the penguins dataset include missing observations, using the base R mean and sd functions in the datasummary formula (with their na.rm = FALSE default) would produce a table filled with missing values.

datasummary also allows users to compute statistics for different subgroups of the data. To achieve this, we use the asterisk (*) as a nesting operator, connecting a factor variable to a function: Sex * Mean will calculate the mean of each variable, for each value of the Sex variable. This code produces Table 6c:

R> datasummary(Flipper + Bill ~ Sex * Mean + SD, data = penguins)

The * nesting operator can be "distributed" across terms with parentheses. For instance, Sex * (Mean + SD) will calculate both the means and standard deviations of our variables, for each value of Sex. This code produces Table 6d:

R> datasummary(Flipper + Bill ~ Sex * (Mean + SD), data = penguins)

Thanks to the tables package, datasummary supports a series of "pseudo-functions" which can be used in formulae. When a formula includes factor or character variables, we can use N to count the number of observations in each category; Percent() to compute percentages; and the number 1 to represent a "total" category. These pseudo-functions can be useful to create highly customized cross-tabs.
For example, this code produces Table 6e:

R> datasummary(Island + 1 ~ N + Percent(), data = penguins)

Here again, we can use the * nesting operator to count penguins in subgroups. Inserting Sex * Species in the formula will compute statistics for each sex/species subgroup. Heading is another useful pseudo-function which allows us to rename columns and rows. This code produces Table 6f:

R> datasummary(Sex * Species ~ Heading("#") * N + Heading("%") * Percent(),
+   data = penguins)

In sum, the modelsummary package includes a convenient set of "templates" to produce common tables: skim, balance, correlation, and cross-tabs. Thanks to the work of Murdoch (2020), users can also leverage a powerful formula syntax to build highly customized tables using the datasummary function.
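The formula syntax also accepts arbitrary user-defined summary functions: any function which takes a numeric vector and returns a single value can appear in a datasummary formula. A sketch, in which the Range helper is hypothetical and we assume the penguins data frame used above:

```r
library(modelsummary)

# A user-defined summary function for use in a datasummary formula.
Range <- function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE)

# Display means and ranges for two variables, one per row.
datasummary(Flipper + `Body Mass` ~ Mean + Range, data = penguins)
```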

Model summaries
The second family of functions in the modelsummary package is designed to help users communicate the results of statistical models. The modelsummary function produces tables to summarize one or many models side-by-side. The modelplot function draws coefficient plots (dot and whisker). Both functions are highly customizable, and they support over one hundred types of statistical models out-of-the-box.
To illustrate, we load the dragracer dataset, which includes information about contestants' performance in each episode of RuPaul's Drag Race:

R> data("rpdr_contep", package = "dragracer")
R> dragracer <- rpdr_contep

Columns of this dataset include a contestant's rank in each episode (rank), Mini Challenge winners (minichalw), Miss Congeniality titles (missc), a season identifier (season), and the position of each episode within a season (episode).
Our analysis begins by using the lm function to estimate a linear model with rank as the dependent variable. We then use the modelsummary function to summarize the findings. The output of this code is displayed in Table 7:

R> mod <- lm(rank ~ minichalw + missc + episode, data = dragracer)
R> modelsummary(mod, gof_omit = "RMSE")

Now suppose we want to compare three models: the simple linear model, a linear mixed effects model with random intercepts by season, and a generalized (Poisson) linear mixed effects model.
We also want those models to be clearly labelled in the table. To do this, we use the lmer and glmer functions from the lme4 package to estimate the mixed effects models, and we store everything in a named list:

R> library("lme4")
R> models <- list(
+   "LM" = lm(rank ~ minichalw + episode, data = dragracer),
+   "LMER" = lmer(rank ~ minichalw + missc + episode + (1 | season),
+     data = dragracer),
+   "GLMER" = glmer(rank ~ minichalw + missc + episode + (1 | season),
+     data = dragracer, family = poisson))

Named or unnamed lists of models can be fed directly to modelsummary. We can also use the align = "lddd" argument to left-align the first column and dot-align the other ones (each character represents a column). 1 This code produces Table 8:

R> modelsummary(models, output = "latex", align = "lddd")

We can improve and customize this table by altering the argument values of the modelsummary function. Assign nicer labels to the coefficients of our models by passing a named vector to coef_rename. Change the number of digits with fmt. Set the type of (classical, robust, or clustered) standard errors to display for each model using the vcov argument. Feed a regular expression to gof_omit to omit all statistics from the bottom panel, except the number of observations and the standard errors identifier. 2 Change the width of the confidence intervals with conf_level. Drop the uncertainty statistics in parentheses by setting statistic = NULL. Define a caption with the title argument. Add a note to the bottom of the table with the notes argument.
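Applied to the single linear model mod estimated above, such a customization might look like the following sketch; the coefficient labels, digit count, and note text are illustrative, not taken from the article:

```r
library(modelsummary)

# Illustrative customization: nicer labels, two digits, robust SEs,
# a trimmed bottom panel, a caption, and a table note.
modelsummary(mod,
  coef_rename = c("minichalw" = "Mini Challenge winner",
                  "missc"     = "Miss Congeniality",
                  "episode"   = "Episode"),
  fmt      = 2,
  vcov     = "robust",
  gof_omit = "R2|AIC|BIC|Log.Lik|F|RMSE",
  title    = "Models of contestant rank.",
  notes    = "Robust standard errors in parentheses.")
```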
A useful feature of the modelsummary function is that it can leverage the glue package to accept interpreted string literals (Hester 2022). Users can thus define exactly how they want coefficient or uncertainty estimates to be displayed in their tables, by using the glue curly braces syntax. For example, to display a confidence interval in brackets next to the estimate, we can set estimate = "{estimate} [{conf.low}, {conf.high}]". In this expression, the estimate, conf.low, and conf.high values follow the naming convention established by the broom package. Users can see a list of available values by applying the modelsummary::get_estimates function to one of their models.
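For instance, building on the mod object estimated above, one could display each estimate with a bracketed confidence interval on a single row, and suppress the default standard error row. A sketch:

```r
library(modelsummary)

# Glue syntax: "estimate [lower, upper]" on one row per coefficient.
# statistic = NULL drops the uncertainty row shown in parentheses.
modelsummary(mod,
  estimate  = "{estimate} [{conf.low}, {conf.high}]",
  statistic = NULL)
```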
In the current application, we want to compare six different confidence intervals for a single model. Therefore, we create a named list which includes the same model six times. Then, we call the modelplot function, and customize the plot's appearance with ggplot2's scale_color_brewer, guides, and theme functions. The result of this code is shown in Figure 2:

R> library("ggplot2")
R> mod_list <- lapply(vcov_list, function(x) mod)
R> modelplot(mod_list, vcov = vcov_list, conf_level = 0.99,
+   coef_omit = "Intercept|episode", coef_rename = coef_labels) +
+   scale_color_brewer(palette = "Dark2") +
+   guides(color = guide_legend(reverse = TRUE)) +
+   theme(text = element_text(family = "Times"))

Note that when the analyst supplies a single model but multiple vcov entries, the modelsummary and modelplot functions will automatically "recycle" the model by repeating it as many times as necessary. However, the resulting table or plot would not be as nicely labeled as Figure 2.
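The vcov_list object is not shown in this excerpt. It could be defined, for example, as a named list of six variance specifications; the particular choices below are assumptions for illustration, mixing estimator names with a one-sided clustering formula:

```r
# Six variance specifications to compare in the coefficient plot:
# character strings name common estimators, and a one-sided formula
# requests standard errors clustered by season.
vcov_list <- list(
  "Classical"          = "classical",
  "HC0"                = "HC0",
  "HC1"                = "HC1",
  "HC2"                = "HC2",
  "HC3"                = "HC3",
  "Clustered (season)" = ~season)
```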
By default, when users add layers to a ggplot2 plot using the + operator, new geoms will be added on top of the default point range. We can add ggplot2 geoms in the background of a plot (e.g., a vertical line at 0) using the background argument.
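A minimal sketch of the background argument, assuming a fitted model object mod:

```r
library(ggplot2)
library(modelsummary)

# Draw a vertical reference line at zero *behind* the point-range
# geoms by passing a list of geoms to the background argument.
modelplot(mod,
  background = list(geom_vline(xintercept = 0, color = "orange")))
```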

Customizing tables
One of the major benefits of modelsummary is that the tables it produces are compatible with four of the most popular table-making packages in the R ecosystem: kableExtra, gt, flextable, and huxtable. By default, the functions produce kableExtra tables, which means that we can use the |> operator to pass our tables to that package's functions for customization.
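For example, a table can be piped directly into kableExtra styling functions. The styling choices below are arbitrary, and we assume the mod object estimated earlier:

```r
library(modelsummary)
library(kableExtra)

# The default kableExtra output can be customized with that package's
# functions: here we bold one row and add a spanning column header.
modelsummary(mod) |>
  row_spec(3, bold = TRUE, background = "lightblue") |>
  add_header_above(c(" " = 1, "Rank model" = 1))
```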

Saving and exporting
The tables produced by the datasummary and modelsummary families of functions can be saved and exported to a wide array of formats, including HTML, LaTeX, Text/Markdown, Microsoft Word, PowerPoint, Excel, RTF, PDF, and image files.
When compiling R Markdown documents, modelsummary infers the target format and sets the output argument automatically, such that no further user intervention is required. An R Markdown document with simple calls like modelsummary(mod) or datasummary_skim(mtcars) will typically compile to PDF or HTML without modification.
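Outside of dynamic documents, a table can be written directly to file by passing a path to the output argument; the file extension determines the format. The file names below are illustrative:

```r
library(modelsummary)

# The extension of the output path selects the table factory and format.
modelsummary(mod, output = "table.docx")          # Microsoft Word
modelsummary(mod, output = "table.tex")           # LaTeX
datasummary_skim(penguins, output = "skim.html")  # HTML
```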

Package internals
The modelsummary package is designed to be modular and extensible. Figure 4 gives a schematic representation of its internal structure.
The left branch of Figure 4 gives an overview of the code used to summarize statistical models. The box at the top represents the user interface. The modelsummary and modelplot functions are harmoniously designed, in the sense that they accept almost all of the same arguments. When these functions are called, user inputs are validated with the checkmate package, a dependency-free argument checking tool which returns helpful error messages on failure (Lang 2017).
The next step is to extract results from model objects. The get_estimates function tries to extract estimates with broom::tidy, and then falls back to parameters::parameters. The get_gof function tries to extract goodness-of-fit statistics with broom::glance, and then falls back to performance::performance. The order of priority between broom (Robinson et al. 2022), performance (Lüdecke, Ben-Shachar, Patil, Waggoner, and Makowski 2021), and parameters (Lüdecke et al. 2020) can be modified by changing the modelsummary_get global option. The get_estimates and get_gof functions were designed for internal use by modelsummary, but they are exported to the namespace for users who need a versatile and standardized way to extract raw results from over a hundred distinct object types.
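For instance, assuming any fitted model object mod, the extracted results use standardized broom-style names; exact columns and values depend on the model:

```r
library(modelsummary)

# Coefficient-level results: a data frame with columns such as
# term, estimate, std.error, statistic, p.value, conf.low, conf.high.
est <- get_estimates(mod)

# Model-level goodness-of-fit statistics: one row, one column per
# statistic (e.g., nobs, r.squared, AIC).
gof <- get_gof(mod)
```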
Once statistical results are extracted, modelsummary transforms the data to suit the desired output format: multiple models are merged; coefficients and statistics are renamed, omitted, and/or sorted; numeric values are rounded; robust standard errors are computed; significance stars are added; etc.
Finally, modelsummary infers the output format that users need by looking at the output argument, and it feeds the data to a "table factory" function. By default, the factory builds HTML, LaTeX, and Markdown tables using kableExtra, but users can request a different object type by changing the output argument. If the user specifies a valid file path as output (e.g., output = "file.tex"), an appropriate table factory is selected based on the file extension, and the table is saved to file automatically.

Data summaries
In the right branch of Figure 4, we see that functions in the data summary family go through a similar process. However, instead of extracting results from statistical models, the datasummary functions use the tables package to compute statistical summaries from a dataset. Then, internal functions from the modelsummary package transform the results slightly, before feeding them to a table factory.
The key benefit of the modular design described in Figure 4 is that both families of functions are funnelled to the same table factories. This means that the model and data summary functions can have very similar user interfaces, and that the resulting tables can be customized and saved in exactly the same ways.
Another benefit of the modular approach is that modelsummary is very easy to extend. As of version 0.9.4, tables can be exported using four different table-making packages. Adding support for new table factories is a trivial task, often requiring less than 50 lines of code. The next section shows that it is also very easy to add support for new models and statistics.

Extending and customizing modelsummary
There are many ways to extend and customize the modelsummary package or the outputs of its functions. Here, we consider a few: supporting new statistical models, transforming the numerical results of a model, and adding custom statistics.

Support for new statistical models
The modelsummary package supports over one hundred statistical models out-of-the-box.
To add support for a new model type, users can define two S3 methods (tidy and glance) which conform to the specification described on the broom package website: https://broom.tidymodels.org/

The tidy method is a function called tidy.CLASSNAME, which accepts a statistical model of class 'CLASSNAME', and returns a data frame with one row per term/coefficient, and distinct columns with standardized names: term, estimate, std.error, statistic, p.value, conf.low, conf.high. For example, a minimal tidy method to extract results from a model of class 'lm' could be:

R> tidy.lm <- function(x, ...) {
+   out <- data.frame(
+     term = names(coef(x)),
+     estimate = coef(x),
+     std.error = sqrt(diag(vcov(x))))
+   return(out)
+ }

The glance method is a function called glance.CLASSNAME, which accepts a statistical model of class 'CLASSNAME', and returns a data frame with a single row, and one model characteristic per column. For example, a minimal glance method to extract information from a model of class 'lm' could be:

R> glance.lm <- function(x, ...) {
+   out <- data.frame(
+     nobs = nobs(x),
+     r.squared = summary(x)$r.squared)
+   return(out)
+ }

The minimalist methods given above are superfluous because modelsummary already supports lm models by default. But they illustrate the general point: as soon as valid tidy.CLASSNAME and glance.CLASSNAME methods are defined, modelsummary automatically supports all models of the relevant class. When those methods are defined, calling modelsummary(mod) should just work.
Users who define tidy and glance methods to support new statistical models are strongly encouraged to give back to the community by submitting their methods for inclusion in the broom package. Interested readers are also encouraged to visit the parameters and performance websites to learn how they can support the work of those who develop these essential infrastructure packages.
Two other extension strategies deserve a note. First, since broom supports the objects produced by the coeftest function of the lmtest package (Zeileis and Hothorn 2002), any model supported by that package will automatically be supported by modelsummary. All that users need to do is apply coeftest to the model before feeding the result to modelsummary. Second, modelsummary allows users to summarize arbitrary data by storing them in a list of class modelsummary_list. See details on the modelsummary website: https://vincentarelbundock.github.io/modelsummary/.
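The modelsummary_list mechanism can be sketched as follows; the terms and numbers below are made up for illustration:

```r
library(modelsummary)

# Wrap arbitrary estimates in the broom-style tidy/glance structure,
# then tag the list with the modelsummary_list class.
mod_custom <- list(
  tidy = data.frame(
    term      = c("(Intercept)", "x"),
    estimate  = c(1.2, -0.4),
    std.error = c(0.3, 0.1)),
  glance = data.frame(nobs = 100))
class(mod_custom) <- "modelsummary_list"

# modelsummary now treats the list like any fitted model.
modelsummary(mod_custom)
```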

Transformations
Analysts often wish to transform their model estimates before reporting them. Some transformations are so common that packages like broom offer built-in machinery to execute them. For example, it is common to exponentiate the coefficient estimates produced by a logistic regression model, and the broom::tidy function includes an exponentiate argument to do just that. This argument can be supplied directly to modelsummary, which will use the ellipsis "..." argument to push through the request to broom::tidy. This code estimates a logistic regression model and draws a table with exponentiated coefficients and confidence intervals:

R> mod_logit <- glm(vs ~ hp + mpg, data = mtcars, family = binomial)
R> modelsummary(mod_logit, exponentiate = TRUE, statistic = "conf.int")

For deeper customization, the modelsummary package offers an alternative mechanism: defining tidy_custom and glance_custom S3 methods. These methods follow the same specification as the tidy and glance methods described in Section 4.1. When they are defined, their output will override the default values extracted from the models being summarized.

Adding custom statistics
The mechanism described in the previous section can also be used to add custom statistics to a table. For example, many researchers want to adjust the p values that they report to account for multiple comparisons. The p.adjust function in R can calculate many types of corrected p values, adjusted following the methods of Bonferroni (1936), Holm (1979), and others.
To adjust the p values of a linear regression model, we define a new tidy_custom.lm method which returns a data frame with one column called term and another column with the new statistic we wish to report. We can also add statistics to the bottom of the table by defining a glance_custom.lm method which returns a data frame with one row and one piece of information per column. For instance, if the analyst plans to conduct 10 tests with the minichalw coefficient, they could write:

R> tidy_custom.lm <- function(x, ...) {
+   out <- broom::tidy(x)
+   out$bonferroni <- p.adjust(out$p.value, n = 10, method = "bonferroni")
+   out$holm <- p.adjust(out$p.value, n = 10, method = "holm")
+   return(out)
+ }
R> glance_custom.lm <- function(x, ...) {
+   out <- data.frame("Num.Comparisons" = "10", "Model" = class(x)[1])
+   return(out)
+ }

Then, we call modelsummary and use glue strings in the statistic argument to label the different p values. To focus on the minichalw variable, we use the coef_map argument, which allows users to select, reorder, and rename a subset of variables.
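Such a call might look like the following sketch; the statistic labels and the coefficient name mapping are illustrative:

```r
library(modelsummary)

# Report the raw and adjusted p values on separate labeled rows, and
# keep only the minichalw coefficient, renamed for readability.
modelsummary(mod,
  statistic = c("p = {p.value}",
                "p (Bonferroni) = {bonferroni}",
                "p (Holm) = {holm}"),
  coef_map = c("minichalw" = "Mini Challenge winner"))
```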

Conclusion and comparison
The modelsummary package is a useful addition to this thriving ecosystem. First, modelsummary introduces a powerful set of functions which can produce both data and model summaries, using a simple and consistent user interface. Second, by using both broom and easystats to extract estimation results, modelsummary supports more model types than any other R package released to date. Third, modelsummary makes it easier than most other packages to execute common tasks such as displaying clustered standard errors, or deeply customizing the display of results (via the tidy_custom and glance_custom mechanisms). Fourth, modelsummary can export tables using several specialized table-making packages.

With respect to data summaries, modelsummary strikes a balance between two general approaches. The first approach is exemplified by the tables package, on which modelsummary relies heavily. tables offers a general purpose tool to create summary tables which can be exported to LaTeX, HTML, and kableExtra formats. The datasummary family of functions in modelsummary builds on that foundation by (a) offering convenient "templates" for common use-cases such as balance tables or crosstabs; (b) expanding the range of output formats; and (c) integrating tables's formula syntax in a wider ecosystem with a harmonized user interface.
The second approach for data summaries can be seen in packages such as skimr, tableone, table1, and furniture. These packages overlap with some of the functions introduced in this article; indeed, they have directly inspired many of modelsummary's own features. These packages tend to offer a series of hard-coded "templates" to execute common tasks such as building balance tables or dataset overviews. However, they do not offer the same kind of formula language to create highly customized tables; they can export to fewer output formats; they offer less flexibility to customize the appearance of tables; and they do not share a user interface with functions which can summarize statistical models in addition to raw data.
With respect to model summaries, there are several alternative packages to consider. The first are the popular stargazer (Hlavac 2022) and texreg (Leifeld 2013) packages. Both offer integrated solutions to extract, reshape, and display statistical results in HTML, LaTeX, and text formats. By handling all of these steps themselves, they obviate the need to call on dependencies. One drawback of this approach is complexity: to master the stargazer function, analysts must sift through the documentation for 85 distinct arguments. Similarly, the texreg function has 48 distinct arguments. Another cost is flexibility: although package developers have made tremendous efforts to allow customization, the stargazer and texreg functions remain less flexible and powerful than dedicated table-drawing packages like kableExtra or gt. Finally, although texreg offers a package-specific mechanism to support new models, the modelsummary approach is arguably easier, more general, and more standardized (see Section 4.1). The stargazer package also seems to pose particular challenges for maintainability and development: the whole package appears to consist of a single 7000-line function, with a large number of hard-coded variables.
huxtable is a general-purpose table-making package which can also extract and display results from statistical models. modelsummary supports this package as one of its output formats, by setting output = "huxtable". This means that huxtable functions can be used to customize the appearance of a modelsummary table. When used as a standalone regression table-maker, the main drawbacks of huxtable are that its results customization functions are less flexible than modelsummary's, and that the HTML and LaTeX code it generates is not designed to be human-readable or hand-editable.
memisc is a package which can summarize the results of statistical models (Elff 2021). It supports fewer model types than modelsummary, but produces good-looking text, LaTeX, and HTML tables. This package's main focus area is the analysis of survey data, and it offers many utilities to handle labeled data and to overcome survey-specific challenges, such as displaying clustered variance estimates.
Finally, the gtsummary package is emblematic of a new generation of R packages in this space, similar in spirit to modelsummary: it uses broom to extract results from model objects, it can export to several table-making packages, and it includes many functions to produce data summaries (e.g., "Table 1"). By default, gtsummary produces tables that look like those we typically see in peer reviewed journals in the life sciences. If users do not like modelsummary, this would be a good place to look next.
In sum, whereas several packages offer overlapping functionality, modelsummary offers an attractive combination of features, thanks to its simplicity, flexibility, robustness, and its strategy of leveraging the great work of the R community.