salmon: A Symbolic Linear Regression Package for Python

One of the most attractive features of R is its linear modeling capabilities. We describe a Python package, salmon, that brings the best of R's linear modeling functionality to Python in a Pythonic way -- by providing composable objects for specifying and fitting linear models. This object-oriented design also enables other features that enhance ease-of-use, such as automatic visualizations and intelligent model building.


Introduction
Linear models are ubiquitous in statistics, science, engineering, and machine learning. A linear model assumes that the expected value of a response variable y is a linear function of explanatory variables x_1, ..., x_p:

E[y] = β_0 + β_1 x_1 + ... + β_p x_p,    (1)

for coefficients β_0, β_1, ..., β_p. Model (1) is more flexible and general than it may first appear. For example, linear models do not have to be linear in the original explanatory variables. If we add higher-order polynomial terms to a linear model, then we can also model non-linear effects, as in

E[y] = β_0 + β_1 x_1 + β_2 x_1^2 + β_3 x_1^3 + β_4 x_2.    (2)

How are linear models used? First, they can be used for description. For example, we may want to emphasize trends in a scatterplot by superimposing a best-fit line on the points. Second, they may be used for prediction. Once the coefficients β_0, ..., β_p have been estimated, the linear model can be used to predict the value of the response y for a new observation where only the explanatory variables are known. Finally, linear regression can be used for inference. The coefficients may encode information about nature, such as the causal effect of one variable on another, in which case we need hypothesis tests and confidence intervals for the model parameters.
Because of these different use cases, several solutions for fitting linear models have emerged. Figure 1 illustrates the tradeoffs. At one extreme are point-and-click software packages, like JMP (SAS Institute Inc. 2019) and Minitab (Minitab Inc. 2017), which fit linear models and provide automatic visualizations. Although these packages have much built-in functionality, they are not easily extensible. For example, complicated data cleaning and wrangling often have to be done outside these software packages.
At the other extreme are programming languages for scientific computing, like MATLAB (The MathWorks Inc. 2021) and Python (Van Rossum et al. 2011). To fit a linear model, users have to manually construct the matrices to be passed to a least-squares solver. To obtain predictions from the fitted model, users must implement the matrix multiplications. Although these environments can be powerful, users have to keep track of low-level details that distract from the modeling.
Libraries within these languages ease the burden somewhat. For example, scikit-learn is a Python library for machine learning that provides a consistent API (application programming interface) for specifying, fitting, and predicting using a linear regression model (Pedregosa et al. 2011). However, it provides little help with the other two uses of linear regression, description and inference, offering neither visualizations nor uncertainty estimates. Also, the LinearRegression model in scikit-learn assumes that the explanatory variables have already been transformed into the numerical matrix that will be passed to a least-squares solver, so users must manually transform the variables or else define the transformations as part of the model. R (R Core Team 2023) occupies a middle ground between the two extremes. On the one hand, it is a full-fledged programming language. On the other, it provides a high-level API for specifying and fitting linear regression models through formulas and the lm function. For example, Model 2 could be specified and fit in R as

R> lm(y ~ poly(x1, 3) + x2, data)

Although R provides automatic diagnostic plots, it offers limited visualizations of the fitted model, in comparison with point-and-click software packages such as JMP and Minitab.
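For concreteness, a sketch of the manual scikit-learn workflow for Model 2 follows; the arrays x1, x2, y and their new-data counterparts are hypothetical, and raw powers stand in for R's orthogonal polynomials:

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.column_stack([x1, x1**2, x1**3, x2])     # build the design matrix by hand
>>> fitted = LinearRegression().fit(X, y)           # scikit-learn adds the intercept
>>> X_new = np.column_stack([x1_new, x1_new**2, x1_new**3, x2_new])   # repeat the transformations
>>> predictions = fitted.predict(X_new)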
patsy (Smith et al. 2022) and statsmodels (Seabold and Perktold 2010) are Python libraries that port R-style modeling to Python. Like R, statsmodels provides some automatic diagnostic plots but not visualizations of the fitted model. However, R formula syntax is not interpretable by Python, so formulas have to be specified as strings:

>>> smf.ols("y ~ poly(x1, 3) + x2", data)

This creates problems when the column names do not meet the language's rules for variable names. For example, columns with names such as "weight.in.kg" or "person's height" would need to be wrapped in patsy's "quote" object, Q:

>>> smf.ols("y ~ Q('weight.in.kg') + Q('person\'s height')", data)

Because the Q constructor takes a string as input, we have strings within strings, hence the need for two different kinds of string delimiters. If the variable name itself includes an apostrophe or a quotation mark, then that character needs to be escaped. An object-oriented approach to model specification, described in this paper, avoids these complications, allowing any object that is a valid column name in a pandas DataFrame to be used directly in the model specification, without requiring a workaround for problematic variable names. Furthermore, having objects associated with each variable in a model makes it easy to specify customization options for each variable (e.g., the baseline level for a categorical variable).
In this paper, we describe salmon, a Python package for linear regression that offers an object-oriented interface for specifying linear models. The two main contributions of salmon are:

1. Providing an R-like (but Pythonic) API for specifying models.

2. Producing appropriate visualizations of the models, bridging the gap with point-and-click packages, like JMP and Minitab.
Its philosophy and design are similar to other statistical packages in Python, such as Symbulate (Ross and Sun 2019). Throughout, we provide comparisons of model specification in salmon with the similar formula syntax of R.
The easiest way to obtain salmon is to install it via pip:

pip install salmon-lm

It can also be installed from source at http://github.com/ajboyd2/salmon.

Model building and fitting
We introduce the design and syntax of salmon by way of a case study involving house sale prices. We assume that all salmon objects and methods have already been imported into the global namespace, as follows:

>>> from salmon import Q, C, Log, Poly, LinearModel
We will start with the simplest possible model, which assumes the sale price is a linear function of just the square footage:

E[Price($)] = β_0 + β_1 · Sq.Ft.

This simple linear regression model can be specified in salmon as sketched below; when printed, the model is displayed in a formula-like notation:

Price($) ~ 1 + Sq.Ft.
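A sketch of the specification (the exact call is our reconstruction, following the LinearModel(explanatory, response) pattern used for the other models in this section):

>>> simple_model = LinearModel(Q("Sq.Ft."), Q("Price($)"))
>>> print(simple_model)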
Notice that a quantitative variable is specified by creating a Quantitative object, or Q for short, with the name of the column in the DataFrame. Alternatively, we could have created a generic variable using V and let salmon infer the type. Either way, these variable objects become the explanatory (x) and response (y) components of a 'LinearModel' object.
An intercept is added by default, as evidenced by the constant term 1 in the printout. To specify a model without an intercept, we could either insert -1 into the expression (which mirrors the formula syntax of R) or specify intercept = False:

Price($) ~ Sq.Ft.
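As a sketch of the two options just described (the placement of -1 and the intercept keyword within the call is our assumption), either of the following would drop the intercept:

>>> no_intercept_model = LinearModel(Q("Sq.Ft.") - 1, Q("Price($)"))
>>> no_intercept_model = LinearModel(Q("Sq.Ft."), Q("Price($)"), intercept=False)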
To fit the simple_model above to data, we call the .fit() method and pass in a pandas DataFrame containing those variables:

>>> simple_model.fit(data)

Notice that .fit() returns the standard regression output, containing the coefficients, their standard errors, the t statistic and p value for testing β_j = 0, and the associated 95% confidence intervals, stored in a DataFrame for easy display and access. From the regression output, Sq.Ft. appears to have substantial explanatory power, but to be sure, we should visualize the model. A model can be visualized using the .plot() method; salmon will automatically choose an appropriate visualization:

>>> simple_model.plot()

This line of code produces a scatterplot of the data in black, with the fitted regression line superimposed in red (Figure 2). To produce a similar plot from a model that was fit in R, scikit-learn, or statsmodels, we would have had to first create a grid of "test" x values, use the fitted model to predict the value of the response at each of those x values, and then plot these predictions as a line using some plotting library.
Confidence and prediction intervals can be added to any model visualization by passing the desired error rate α (so that the confidence level is 1 − α) to the arguments confidence_band or prediction_band in .plot():

>>> alpha_val = 0.05
>>> simple_model.plot(confidence_band=alpha_val)

In Figure 3, the points appear to "fan out" from the fitted model as square footage increases. This suggests that the assumption of constant variance (homoskedasticity) may be violated. One fix is to transform the response variable. In other words, we can instead fit the model

E[log(Price($))] = β_0 + β_1 · Sq.Ft.,    (4)

which can be accomplished by literally transforming the response, as sketched after this paragraph. Compare with R, where the same model would be specified as

R> log(`Price($)`) ~ `Sq.Ft.`

Note that the backticks are only necessary because the variable names contain special characters (e.g., spaces and a dollar sign).
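A sketch of the transformed-response model (the name simple_log_model is used later in this section; the exact call is our reconstruction):

>>> simple_log_model = LinearModel(Q("Sq.Ft."), Log(Q("Price($)")))
>>> simple_log_model.fit(data)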
salmon supports a number of different transformations by default, such as:

• Natural logarithm: Log(X)
            Coefficient         SE        t      p       2.5%      97.5%
Sq. Ft.        5.906e-4   1.055e-5    55.96  0.000   5.699e-4   6.113e-4
Intercept         11.14     0.0166    670.6  0.000      11.11      11.17

By default, salmon plots the variables in the original (untransformed) space, which is usually desired. After all, we are interested in Price($), not log(Price($)). However, the log-space view can be helpful for checking whether the linear regression assumptions are met. To produce a plot in the transformed space, specify transformed_y_space = True.
>>> simple_log_model.plot(confidence_band=alpha_val,
...                       transformed_y_space=True)

From Figure 5, it is clear that the fanning is indeed reduced in log-space, as we had hoped! However, the model may be further improved by adding a quadratic term to capture the parabolic relationship. We create a new model for this:

E[log(Price($))] = β_0 + β_1 · Sq.Ft. + β_2 · (Sq.Ft.)²    (5)

A polynomial model, like Model 5, can be specified in salmon using the Poly() class, much as one uses the poly() function in R:

>>> poly_model = LinearModel(Poly(Q("Sq.Ft."), 2), Log(Q("Price($)")))
>>> print(poly_model)

The p value for the quadratic term (Sq.Ft.)² is practically zero, which suggests that the fit is substantially improved by adding the quadratic term.
So far, the models have only used square footage. It is worth investigating whether the fit can be improved by adding the two categorical variables in the data set: the presence of a fireplace and the overall style of house. The former is binary, with a value of "Yes" indicating that the house has a fireplace. The latter can take three different values: "1 Story", "2 Story", and "Other".
In R, this model would be specified as

R> log(`Price($)`) ~ `Style` + `Fire?`

The backquotes are necessary because of the non-standard column names. Also, the code above assumes that Style and Fire? are unambiguously categorical. If they could be confused for quantitative variables, they would have to be explicitly cast to factors:

R> log(`Price($)`) ~ as.factor(`Style`) + as.factor(`Fire?`)

This is an example where salmon's insistence on explicit variable types using C and Q saves typing in the long run. To specify that these variables are categorical, we create a Categorical object (or C for short).
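A sketch of the categorical-only model, Model 6 (the name simple_cat_model comes from the Figure 7 caption; combining terms with + is our assumption about salmon's expression syntax):

>>> simple_cat_model = LinearModel(C("Style") + C("Fire?"), Log(Q("Price($)")))
>>> simple_cat_model.fit(data)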
As the regression output shows, salmon automatically chooses a dummy/one-hot encoding of the levels and drops a baseline level. This behavior can be customized by specifying additional arguments to C. For example, salmon chose "1 Story" to be the baseline level for Style. To make "Other" the baseline level, we would specify the variable as C("Style", baseline = "Other"). Changing the baseline level does not affect the predictions from the model but does affect the interpretation of the coefficients.
Model 6 assumes no interaction between the factors, which is why the fitted lines in Figure 7 are parallel. A model with an interaction term can be specified as sketched below. The same model would be specified in R as log(`Price($)`) ~ `Style` * `Fire?`; note that R uses * to mean factor crossing, rather than multiplication.
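A sketch of Model 7 (the use of * for the interaction term, alongside + for the main effects, is our assumption):

>>> interaction_model = LinearModel(C("Style") + C("Fire?") + C("Style") * C("Fire?"),
...                                 Log(Q("Price($)")))
>>> interaction_model.fit(data)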
>>> interaction_model.plot(confidence_band=alpha_val,
...                        transformed_y_space=True)

In Figure 8, we see that the visualization of the fitted Model 7 is the classical interaction plot (Faraway 2016). Even when we allow for an interaction, the lines are still roughly parallel, suggesting that the interaction is weak. Now, let's combine these categorical variables with the square footage into a single model. For simplicity, we omit the quadratic term. The resulting model would be specified in R as log(`Price($)`) ~ `Sq.Ft.` + `Style` * `Fire?`.
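A sketch of Model 8, mirroring the R formula above (again, the use of + and * to combine terms is our assumption):

>>> quant_cat_model = LinearModel(Q("Sq.Ft.") + C("Style") + C("Fire?")
...                               + C("Style") * C("Fire?"),
...                               Log(Q("Price($)")))
>>> quant_cat_model.fit(data)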

>>> quant_cat_model.plot(confidence_band = alpha_val)
However, Model 8 assumed no interaction between the quantitative variable and the categorical variables. As one final step, we consider interacting the square footage with the categorical variables, which is equivalent to fitting a separate model for each possible combination of the categorical variables. The sheer number of interaction terms would make this model difficult to specify, but because the variables are represented as composable objects in salmon, they can be stored in variables and reused, as illustrated below.
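A sketch of how this reuse might look; the stored variables and the explicit list of interaction terms are our illustration of the idea rather than salmon's required syntax:

>>> sqft = Q("Sq.Ft.")
>>> style, fire = C("Style"), C("Fire?")
>>> full_model = LinearModel(sqft + style + fire + style * fire
...                          + sqft * style + sqft * fire + sqft * style * fire,
...                          Log(Q("Price($)")))
>>> full_model.fit(data)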
Predictions for new observations are obtained with the .predict() method. The new data must be a pandas DataFrame whose column names match the variable names in the model specification. In most cases, the new data will be in the same format as the data used to fit the model, so this requirement will automatically be satisfied. In our example below, the new DataFrame, new_data, is simply the same as data.

>>> full_model.predict(new_data)
Accompanying confidence or prediction intervals can be obtained by passing the desired error rate α to confidence_interval or prediction_interval:

>>> full_model.predict(new_data, prediction_interval=alpha_val)

Occasionally, it is useful to add variables to a regression model without estimating their coefficients. In salmon, this is achieved by simply shifting the response variable, as illustrated below.
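As an illustration only (the choice of offset variable and the subtraction syntax are our assumptions, not the paper's actual example), an offset with a coefficient fixed at 1 might be specified by subtracting it from the response:

>>> offset_model = LinearModel(C("Style") + C("Fire?"),
...                            Log(Q("Price($)")) - Q("Sq.Ft."))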
Compare this with R, where an offset is specified using the offset() function inside the model formula. Anecdotally, we have evidence that including an offset term by shifting the response is more intuitive. One of our colleagues, who uses R but was unfamiliar with the offset() function, attempted to implement offsets by shifting the response, which does not work in R. This motivated our decision to support shifting the response variable in salmon.

Model diagnostics
The previous section explained how to build regression models; this section focuses on how to evaluate them. There are both visual and analytical diagnostics; we only attempt to highlight a few examples of each. For a full description of the API, please refer to the online documentation.
In the following examples, we will check whether the assumptions are satisfied for Model 9, full_model, as well as compare it to Model 8, quant_cat_model, which is a nested model. Both models were defined in the previous section.

Visual Diagnostics
In Equation 1, we only made assumptions about the conditional expectation of the response y, given the explanatory variables x. However, statistical inferences for linear regression require assumptions about the entire conditional distribution, not just the expectation. Perhaps the easiest way to specify this conditional distribution is to write

y = β_0 + β_1 x_1 + ... + β_p x_p + ε,

where the error term ε is independent of x and assumed to follow a Normal(0, σ²) distribution.
One can easily verify that this model is a special case of Model 1.
The assumption that the errors are independent and normally distributed with constant variance is usually assessed by inspecting the residuals. In salmon, the residual diagnostic plots can be generated directly from the fitted model:

>>> full_model.plot_residual_diagnostics()
Four plots of the residuals are produced, as shown in Figure 11: a normal Q-Q plot, a histogram, a scatterplot of the fitted values versus the residuals, and a line plot of the residuals versus their order in the data set. The residuals appear to be skewed to the left, but otherwise there do not appear to be obvious violations of the assumptions. The line plot of the residuals versus their order is not as useful for this data set, since the row ordering is arbitrary, but it could potentially reveal violations of independence in data sets where row order is meaningful, such as time series data.
To investigate the assumptions further, we can produce residual plots and partial regression plots by calling the methods model.residual_plots() and model.partial_plots(), respectively. These will each produce one plot per explanatory variable.

Analytical Diagnostics
Visual diagnostics are in the eye of the beholder, so analytical diagnostics are equally important. One common, if not always the most useful, diagnostic of a model's fit is the R² value.
>>> full_model.r_squared()
0.652748378553996

One problem with R² is that it never decreases as variables are added to the model. A better diagnostic is Adjusted-R², which accounts for the number of variables. To calculate Adjusted-R², we specify the argument adjusted = True. Besides R², AIC and BIC are also available.
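For example, a call based on the argument described above might look like:

>>> full_model.r_squared(adjusted=True)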
A more formal way of evaluating a model is to test it against another model. The anova command, when called on a single model, returns the results of the omnibus F test, as well as the partial F test for each individual variable in the model.

>>> from salmon import anova
>>> anova(full_model)
The inclusion of interactions between the quantitative variable (square footage) and the categorical variables (fireplace and house style) adds complexity to the model that may not be supported by the data. To test this, we can perform a partial F test comparing the full model to a reduced model without these interaction terms, using the anova command again, except passing in two fitted models. This mirrors the behavior of the anova() function in R.

>>> anova(full_model, quant_cat_model)
The results of the partial F test confirm that the interactions significantly improve the fit of the model to the data.
Now, the selected model is guaranteed not to violate standard model-building practices. For example, if an interaction term is included, the corresponding main effects will be included as well.

Integration with Python ecosystem
The salmon package is deliberately built on top of pandas, in order to leverage the power of its DataFrame for organizing heterogeneous data (McKinney 2010). It expects pandas DataFrames as inputs, and it outputs results as pandas DataFrames. However, salmon is also integrated with other packages within the Python data science ecosystem.
First, the 'LinearModel' class in salmon implements .fit() and .predict() methods, so any 'LinearModel' can be used as a scikit-learn estimator. This means that all of scikit-learn's routines for model selection and evaluation, such as cross-validation, will work with models that are created in salmon (Buitinck et al. 2013).
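For instance, a fold-by-fold evaluation might be sketched as follows; we assume the case-study DataFrame data and the model-building pattern from the previous sections, and we omit the scoring step because the exact format of .predict()'s output is not reproduced here:

>>> from sklearn.model_selection import KFold
>>> for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(data):
...     fold_model = LinearModel(Q("Sq.Ft."), Log(Q("Price($)")))
...     fold_model.fit(data.iloc[train_idx])              # refit on the training fold
...     preds = fold_model.predict(data.iloc[test_idx])   # predict on the held-out fold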
Second, salmon plots are produced using matplotlib. These plots can be further edited and customized in matplotlib because every plotting command in salmon returns a matplotlib Figure (Hunter 2007).
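For instance, a returned figure can be adjusted and saved with the usual matplotlib calls (a sketch using the simple model from the case study):

>>> fig = simple_model.plot()
>>> fig.axes[0].set_title("Sale price versus square footage")   # tweak the automatic plot
>>> fig.savefig("price_vs_sqft.png", dpi=300)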

Timing and numerical comparisons
We conducted timing comparisons between salmon, statsmodels, and R. The results are shown in Table 13. The salmon package is consistently faster than statsmodels in all scenarios. It is also faster than R on large examples (with p = 1000 predictors), although it is still slower than R on small examples, perhaps because formulas are native to R. The speed of salmon on large examples is likely due to the careful handling of the linear algebra, some details of which are described in Appendix A.2.

We also compare the numerical stability of salmon, statsmodels, and R on a synthetic example. First, we generate a grid of n = 1000 evenly spaced values in the range [10^8 − 1, 10^8 + 1] for the explanatory variable x. Then, we generate the response variable y according to the relation

y_i = 2 + 2 x_i + ε_i,   i = 1, ..., n,    (11)

where ε_1, ..., ε_n are i.i.d. Normal(0, 1). Model 11 describes a simple linear regression model with β_0 = β_1 = 2. The large values of x make it nearly collinear with the intercept.

The estimated parameters (β̂_0 and β̂_1) from the three packages are shown in Table 14. R fails to estimate a coefficient at all. The behavior of statsmodels is more troubling; it throws a warning, but proceeds to report incorrect values with a high degree of precision. Only salmon returns the correct values, due to careful handling of numerical linear algebra (see Appendix A.2 for more details). Note that although the intercept is suspiciously large, this is because the standard error of the intercept is very large. Recall that the standard deviation of the intercept estimate is σ · sqrt(1/n + x̄² / Σ_i (x_i − x̄)²) ≈ 5471751 for the parameters of our simulation. The Jupyter notebooks to reproduce these experiments can be found in the supplementary materials and on GitHub at https://github.com/ajboyd2/salmon/tree/master/paper_outputs.
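The data-generating process for Equation 11 can be reproduced in a few lines of NumPy; the random seed and the condition-number check are our own additions, included to show why the problem is numerically difficult:

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> x = np.linspace(1e8 - 1, 1e8 + 1, 1000)      # nearly collinear with the intercept column
>>> y = 2 + 2 * x + rng.normal(size=1000)        # Equation 11 with beta_0 = beta_1 = 2
>>> X = np.column_stack([np.ones_like(x), x])    # design matrix with an intercept column
>>> np.linalg.cond(X)                            # enormous condition number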

Conclusion
The salmon package allows for convenient and intuitive building, use, and diagnosis of linear regression models within Python. As our examples have demonstrated, the consistent interface allows for a linear regression experience in Python that reproduces, and in some cases enhances, the experience in R.
Future work on salmon will focus on extending the package's capabilities by implementing other types of models beyond linear regression, such as generalized linear models, as well as providing additional tests that are supported in R, such as linear contrasts.

A.1. Expression tree
A goal for salmon was to represent variables in an object-oriented fashion that also allows easy manipulation and transformation. The solution was to represent an expression of variables as an abstract syntax tree, borrowing heavily from the design of programming languages. Every expression of variables can be modeled as a tree in which each inner node is an operation, such as addition, multiplication, or transformation, and each leaf is a variable (either quantitative or categorical). Such an expression can be written equivalently as code or drawn as a tree. This design allows for a recursive implementation when defining operations on variables. We use the class 'Expression' to denote the most general, all-encompassing structure of variables; the exact object composition scheme can be described with a BNF grammar.
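To make the idea concrete, the following toy sketch (not salmon's actual implementation) shows how an expression tree with variable leaves and operation nodes can be composed and traversed recursively:

>>> class Var:
...     def __init__(self, name):
...         self.name = name
...     def __add__(self, other):
...         return Sum(self, other)       # composing variables builds a tree
...     def columns(self):
...         return [self.name]            # a leaf contributes its own column
...
>>> class Sum:
...     def __init__(self, left, right):
...         self.left, self.right = left, right
...     def columns(self):
...         return self.left.columns() + self.right.columns()   # recurse into children
...
>>> (Var("Sq.Ft.") + Var("Style")).columns()
['Sq.Ft.', 'Style']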

A.2. Model fitting
Our general strategy for fitting the linear regression model is similar to R's (R Core Team 2023). We first calculate a QR decomposition of the design matrix, X = QR, which allows us to calculate the regression coefficients β̂ by solving the system Rβ̂ = Q⊤y, and the estimated covariance matrix Σ̂ = σ̂²(X⊤X)⁻¹ by solving the system R⊤R M = I for M = (X⊤X)⁻¹ (since X⊤X = R⊤R).
When the model includes an intercept, R calculates the coefficients by appending a column of ones to the design matrix: X̃ = [1 X].
We take a slightly different approach, centering all of the variables first. The centered versions of the design matrix X and the response variable y are

X_c = X − 1x̄⊤   and   y_c = y − ȳ1,

where x̄ is the vector of column means of X and ȳ is the mean of y. The slope coefficients are then estimated from the centered data, and the intercept is recovered as β̂_0 = ȳ − x̄⊤β̂.
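A minimal NumPy sketch of the QR-based strategy described above (our own illustration, not salmon's source code):

>>> import numpy as np
>>> def qr_fit(X, y):
...     Q, R = np.linalg.qr(X)                        # thin QR decomposition of the design matrix
...     beta = np.linalg.solve(R, Q.T @ y)            # solve R beta = Q^T y
...     XtX_inv = np.linalg.solve(R.T @ R, np.eye(R.shape[1]))   # (X^T X)^{-1} via R^T R
...     resid = y - X @ beta
...     sigma2 = resid @ resid / (X.shape[0] - X.shape[1])       # unbiased estimate of sigma^2
...     return beta, sigma2 * XtX_inv                 # coefficients and their covariance matrix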

Figure 1: Tradeoffs of different solutions for fitting linear models.

Table 3: Coefficients and inferences for the fitted Model 4, simple_log_model.

Figure 4: Visualization of the fitted Model 4, simple_log_model, with confidence band.By default, salmon plots the original variables, rather than the transformed variables.

Figure 5: Two visualizations of Model 4, simple_log_model.The plot on the left shows the original response (Price($)), while the plot on the right shows the transformed response (log(Price($))).

Figure 6: Two visualizations of the fitted polynomial Model 5, poly_model.The plot on the left shows the original response (Sale Price), while the plot on the right shows the transformed response (log(Price($))).

Figure 7: Two visualizations of Model 6, simple_cat_model, which contains two categorical predictors.The plot on the left shows the original response (Price($)), while the plot on the right shows the transformed response (log(Price($))).

Figure 8: Two visualizations of Model 7, interaction_model, which contains two categorical predictors and their interaction.The plot on the left shows the original response (Sale Price), while the plot on the right shows the transformed response (log(Sale Price($))).

Figure 9: Visualization of the fitted Model 8, quant_cat_model.The quantitative variable is on the x axis, while each unique combination of the categorical variables is represented by a different line and color.
Figure 10: Visualization of the fitted Model 9, full_model, which includes an interaction between the quantitative and categorical variables.

Figure 11: Residual diagnostic plots for the fitted regression model.

Table 7: Coefficients and inferences for Model 8, quant_cat_model. (The confidence intervals are not shown for space considerations.)

Table 8: Coefficients and inferences for Model

Table 9: First five predicted values from full_model using the DataFrame new_data.

Table 10: First five predicted values, with 95% prediction intervals, from full_model using the DataFrame new_data.

Table 13: Timing comparisons of salmon and competitors on four models. For each model, we time two tasks: 1) fitting the model (including inferential statistics) and 2) using the fitted model to make predictions. All reported times are in milliseconds.

Table 14: Estimated parameters (with standard errors in parentheses) from salmon and competitors on synthetic data simulated according to Equation 11. Only salmon recovers the correct coefficients. R fails to produce a coefficient at all, while statsmodels returns incorrect coefficients with a high degree of confidence.