Informed Bayesian Inference for the A/B Test

Booming in business and a staple analysis in medical trials, the A/B test assesses the effect of an intervention or treatment by comparing its success rate with that of a control condition. Across many practical applications, it is desirable that (1) evidence can be obtained in favor of the null hypothesis that the treatment is ineffective; (2) evidence can be monitored as the data accumulate; (3) expert prior knowledge can be taken into account. Most existing approaches do not fulfill these desiderata. Here we describe a Bayesian A/B procedure based on Kass and Vaidyanathan (1992) that allows one to monitor the evidence for the hypotheses that the treatment has either a positive effect, a negative effect, or, crucially, no effect. Furthermore, this approach enables one to incorporate expert knowledge about the relative prior plausibility of the rival hypotheses and about the expected size of the effect, given that it is non-zero. To facilitate the wider adoption of this Bayesian procedure we developed the abtest package in R. We illustrate the package options and the associated statistical results with a synthetic example.


Introduction
Does the modification of a company website increase the number of online purchases? Does a new drug result in a lower mortality rate? These are just two examples of the kinds of questions that can be addressed with A/B testing, a procedure popular not only in business and clinical trials, but also in fields such as psychology, neuroscience, and biology. An A/B test compares the success rates of two options or treatment arms, A and B, and can therefore be conceptualized as a test for a difference between two proportions (Little 1989). Typically, options A and B correspond to a control condition and an intervention or treatment of interest.
Regardless of the specific field of application, we believe three general desiderata for A/B tests can be identified. First, it is desirable that evidence can be obtained in favor of the null hypothesis that there is no difference between options A and B. For instance, suppose a programmer alters code that should leave the appearance of a website unaffected. An A/B test may be conducted to confirm that the code changes did not lead to unintended consequences. Alternatively, suppose that a cheaper drug is introduced as a replacement of the standard drug; here, an A/B test may confirm that the cheaper drug is as effective as the drug that is currently standard.
Second, it is desirable that evidence can be monitored as the data accumulate. Data collection can be time-consuming and expensive, and interim tests allow one to assess whether the results in hand are already sufficiently compelling or whether additional data ought to be obtained. There is also an ethical aspect to this desideratum, one that is particularly pronounced in case of new clinical treatments that are potentially beneficial or harmful; it is unethical to withhold treatment that interim analysis shows to be beneficial, just as it is unethical to continue to administer a treatment that interim analysis shows to be harmful (e.g., Armitage 1960; see also Ware 1989 and the accompanying discussion).
Third, it is desirable that expert knowledge can be taken into account (e.g., O'Hagan 2019). In many A/B testing applications, there exists considerable expert knowledge about what size of effect to expect. For instance, the effect of website changes on conversion rates is often less than 0.5% (Berman et al. 2018). Incorporating such expert knowledge into the statistical analysis will yield a more targeted test.
Unfortunately, the majority of A/B testing procedures that are currently in vogue do not fulfill the above desiderata. Specifically, many companies apply standard p-value-based null hypothesis significance testing to assess whether or not options A and B differ. This approach cannot distinguish between absence of evidence (i.e., the data are inconclusive) and evidence of absence (i.e., the data provide support for the null hypothesis that options A and B do not differ; e.g., Dienes 2014). Furthermore, although common practice, sequentially monitoring the uncorrected p-value (and stopping data collection as soon as the p-value is smaller than some fixed α-level) invalidates the analysis (e.g., Feller 1940). However, there exist valid classical sequential procedures that enable one to monitor a corrected p-value as data accumulate (e.g., Malek et al. 2017). For instance, Optimizely, one of the leading commercial A/B testing platforms, has recently implemented an alternative p-value-based approach that allows users to continuously monitor the test outcome (Johari et al. 2017). Nevertheless, these sequential p-value-based procedures retain the inability to quantify evidence for the absence of an effect. Furthermore, (sequential) p-value-based A/B testing does not allow one to incorporate expert knowledge into the statistical analysis in a straightforward manner.
An alternative A/B testing approach that has become more popular of late is Bayesian estimation. For instance, VWO, another leading A/B testing platform, has recently implemented a Bayesian estimation approach (Stucchio 2015). Since Bayesian inference is immune to optional stopping (Berger and Wolpert 1988), this approach allows one to monitor the analysis output as data accumulate. A Bayesian estimation approach also enables the incorporation of expert knowledge via the specification of a prior distribution that captures the expert's knowledge about a parameter of interest. However, this approach operates under the assumption that an effect exists (since a continuous prior assigns zero probability to a single null value) and consequently does not allow one to obtain evidence in favor of the null hypothesis of no effect. Furthermore, the currently used Bayesian estimation approaches typically assign independent priors to the success probabilities of the control and treatment condition, a practice that was critiqued by Howard (1998).2

To overcome the limitations of the current A/B tests we developed the abtest package in R (R Core Team 2019). The abtest package implements Bayesian inference for the A/B test, using informed prior distributions that induce a dependency between the two success probabilities. The analysis approach is based on a model by Kass and Vaidyanathan (1992); for alternative approaches see Deng et al. (2016), Jamil et al. (2017), Pham-Gia et al. (2017), and Skorski (2019). The implemented Bayesian procedure allows users (1) to obtain evidence in favor of the null hypothesis (e.g., Berger and Delampady 1987; Wagenmakers et al. 2018); (2) to monitor the evidence as the data accumulate (e.g., Rouder 2014); and (3) to elicit and incorporate expert prior knowledge (e.g., O'Hagan 2019). The abtest package thus fulfills all three desiderata mentioned above.

2 Howard (1998) illustrated the problem with independent priors as follows: "do English or Scots cattle have a higher proportion of cows infected with a certain virus? Suppose we were informed (before collecting any data) that the proportion of English cows infected was 0.8. With independent uniform priors we would now give H1 (p1 > p2) a probability of 0.8 (because the chance that p2 > 0.8 is still 0.2). In very many cases this would not be appropriate. Often we will believe (for example) that if p1 is 80%, p2 will be near 80% as well and will be almost equally likely to be larger or smaller." (p. 363)
The abtest package provides functionality for both hypothesis testing and parameter estimation. In line with Jeffreys (1939) and Fisher (1928), we believe that testing and estimation are complementary activities (Haaf et al. 2019): before a parameter is estimated, it should be tested whether there is anything to justify estimation at all. Jeffreys (1939, p. 345) related this principle to Occam's razor: "variation must be taken as random until there is positive evidence to the contrary" (see also Kass and Raftery 1995, section 8.1). However, some researchers and practitioners oppose this idea, for instance because they believe that one should replace hypothesis testing with parameter estimation (e.g., Gelman and Rubin 1995; Cumming 2014). Nevertheless, the abtest package may also be useful for researchers without an interest in hypothesis testing, since the package can also be used exclusively for Bayesian parameter estimation (and prior elicitation).
This article is organized as follows: The next section discusses the implementation details of the Bayesian A/B test procedure used in abtest. Subsequently, the functionality of the abtest package and the practical benefits of the implemented approach are demonstrated using a synthetic example. The article ends with concluding comments.

Implementation details
The Bayesian A/B test implemented in the abtest package is based on Kass and Vaidyanathan (1992, section 3, "Testing Equality of Two Binomial Proportions"). An appendix with detailed derivations is available at https://osf.io/t3ajr/.

Model
Let y1 denote the number of successes for option A, with n1 denoting the corresponding total number of observations for option A. Similarly, y2 denotes the number of successes for option B, with n2 denoting the corresponding total number of observations for option B. The Bayesian A/B test model based on Kass and Vaidyanathan (1992) is specified as follows:3

y1 ∼ Binomial(n1, p1),
y2 ∼ Binomial(n2, p2),
log(p1 / (1 − p1)) = β − ψ/2,
log(p2 / (1 − p2)) = β + ψ/2.   (1)

3 Note that this is equivalent to a logistic regression model with a binary covariate (i.e., group membership) that is coded using ±0.5.
Therefore, the model assumes that y 1 and y 2 follow binomial distributions with success probabilities p 1 and p 2 . These probabilities are functions of the two model parameters, β and ψ. Specifically, the log odds corresponding to p 1 are given by β − ψ/2 and the log odds corresponding to p 2 are given by β + ψ/2. The nuisance parameter β corresponds to the grand mean of the log odds and the test-relevant parameter ψ corresponds to the log odds ratio. When ψ is positive, this implies that p 2 > p 1 (i.e., option B has a higher success probability than option A); when ψ is negative this implies that p 2 < p 1 (i.e., option B has a lower success probability than option A).
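This parameterization is easy to verify numerically. The following sketch (our own illustration, not part of abtest) maps a hypothetical grand mean and log odds ratio to the two success probabilities and recovers ψ as the log odds ratio:

```r
# Map the grand mean beta and log odds ratio psi to success probabilities.
logit_to_prob <- function(x) 1 / (1 + exp(-x))  # inverse logit (same as plogis)

beta <- 0.2   # grand mean of the log odds (nuisance parameter)
psi  <- 0.5   # log odds ratio (test-relevant parameter)

p1 <- logit_to_prob(beta - psi / 2)  # success probability for option A
p2 <- logit_to_prob(beta + psi / 2)  # success probability for option B

# The log odds ratio of p2 versus p1 recovers psi:
log((p2 / (1 - p2)) / (p1 / (1 - p1)))  # equals 0.5
```

Since ψ is positive here, p2 exceeds p1, in line with the interpretation given above.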

Hypotheses
The abtest package enables both estimation of the model parameters and testing of hypotheses about the test-relevant log odds ratio parameter ψ. There are four hypotheses that are of potential interest: 1. The null hypothesis H 0 which states that the success probabilities p 1 and p 2 are identical, that is, p 1 = p 2 . This is equivalent to H 0 : ψ = 0. This hypothesis corresponds to the claim that there is no difference between options A and B (i.e., the "A/A test").
2. The two-sided alternative hypothesis H 1 which states that the two success probabilities p 1 and p 2 are not equal (i.e., p 1 = p 2 ), but does not specify which of the two is larger. This is equivalent to H 1 : ψ = 0. This hypothesis corresponds to the claim that options A and B differ but it is not specified which one yields more successes.
3. The one-sided hypothesis H + which states that the second success probability p 2 is larger than the first success probability p 1 . This is equivalent to H + : ψ > 0. This hypothesis corresponds to the claim that option B yields more successes than option A.
4. The one-sided hypothesis H − which states that the first success probability p 1 is larger than the second success probability p 2 . This is equivalent to H − : ψ < 0. This hypothesis corresponds to the claim that option A yields more successes than option B.
Researchers who conduct an A/B test are usually interested in answering the question: Does option B yield more successes than option A (i.e., H + ), fewer successes than option A (i.e., H − ), or is there no difference between options A and B (i.e., H 0 )? Therefore, it may be argued that the hypotheses of interest are typically H + , H − , and H 0 . Consequently, by default, only these three hypotheses are assigned non-zero prior probability in the abtest package. Specifically, a default prior probability of .50 is assigned to the hypothesis that there is no effect (i.e., H 0 ), and the remaining prior probability is split evenly across the hypothesis that there is a positive effect (i.e., H + receives .25) and a negative effect (i.e., H − also receives .25). The user may change these default prior probabilities to custom values. Table 1 provides an overview of five qualitatively different tests that can be conducted by assigning prior probabilities to hypotheses in certain ways. The first column displays the default setting that assigns probability .50 to the null hypothesis and splits the remaining probability evenly across H + and H − . The second column displays a prior probability assignment that implements an undirected test (i.e., H 0 is compared to the undirected H 1 ). The third column displays a prior probability assignment for testing whether the effect is non-existent or positive. The fourth column displays a prior probability assignment for testing whether the effect is non-existent or negative. Finally, the fifth column displays a prior probability assignment for a test of direction, that is, for testing whether the effect is positive or negative. This last setting may be of interest whenever the null hypothesis is a priori deemed implausible, uninteresting, or irrelevant.
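How prior probabilities and Bayes factors combine into posterior probabilities can be sketched in a few lines of R. The Bayes factors below are hypothetical values chosen purely for illustration (each expressed relative to H0):

```r
# Posterior model probabilities from prior probabilities and Bayes factors,
# where every Bayes factor uses the same reference hypothesis H0.
posterior_probs <- function(prior_prob, bf_vs_h0) {
  unnorm <- prior_prob * bf_vs_h0   # prior probability times Bayes factor
  unnorm / sum(unnorm)              # normalize so the probabilities sum to one
}

prior_prob <- c(H0 = 0.50, Hplus = 0.25, Hminus = 0.25)  # default setting
bf_vs_h0   <- c(H0 = 1, Hplus = 0.138, Hminus = 0.492)   # hypothetical values

post <- posterior_probs(prior_prob, bf_vs_h0)
round(post, 3)
```

Setting a hypothesis's prior probability to zero removes it from the comparison entirely, which is how the five test variants described above arise from one and the same mechanism.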

Parameter priors
The abtest package assigns normal priors to the model parameters: β ∼ N (µ β , σ 2 β ) and ψ ∼ N (µ ψ , σ 2 ψ ). As illustrated in the example below, these priors result in a dependency in the implied prior for the success probabilities p 1 and p 2 , which is generally desirable (Howard 1998).
For the one-sided hypotheses H + and H − , the prior on ψ is truncated at zero. Specifically, for H + , the prior on ψ is a truncated normal distribution with parameters µ ψ and σ ψ and lower bound at zero. For H − , the prior on ψ is a truncated normal distribution with parameters µ ψ and σ ψ and upper bound at zero. These normal priors are computationally convenient and sufficiently flexible to encode a wide range of prior information.
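A truncated normal prior such as the one under H + can be sampled with the inverse-cdf method. The following sketch (our own illustration, not package code) draws from a standard normal prior on ψ truncated to positive values:

```r
# Draw from a normal prior on psi truncated to psi > 0 (as under H+),
# using the inverse-cdf method; mu and sigma are the untruncated parameters.
rtruncnorm_pos <- function(n, mu, sigma) {
  p0 <- pnorm(0, mean = mu, sd = sigma)  # mass of the untruncated normal below zero
  u  <- runif(n, min = p0, max = 1)      # uniforms mapped into the upper tail
  qnorm(u, mean = mu, sd = sigma)
}

set.seed(1)
psi_plus <- rtruncnorm_pos(1e5, mu = 0, sigma = 1)
c(min = min(psi_plus), mean = mean(psi_plus))
```

All draws respect the truncation bound; for the H − case one would map the uniforms into the lower tail instead.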
By default, the abtest package assigns standard normal priors to both β and ψ. For the nuisance parameter β, a standard normal prior results in a relatively flat implied prior on p 1 and p 2 when ψ = 0. Generally, the choice of a prior for the nuisance parameter β is relatively inconsequential (Kass and Vaidyanathan 1992). In contrast, the prior on the test-relevant parameter ψ is consequential, as it defines the extent to which the hypotheses of interest differ from H 0 . Our choice of a default standard normal prior on the test-relevant parameter ψ is motivated by the fact that a zero-centered prior does not favor either of the two options A or B a priori. Furthermore, the standard deviation of 1 results in a prior distribution that assigns mass to a wide range of reasonable log odds ratios (Chen et al. 2010) without being so uninformative that the results unduly favor H 0 (Bartlett 1957; Lindley 1957). However, large changes in the prior standard deviation of the test-relevant parameter may result in large changes in the results, as the prior standard deviation governs the degree to which the hypothesis of interest makes predictions that differ from H 0 . To include prior knowledge about the expected results, the abtest package allows the user to change the default values of the prior distributions for the nuisance parameter β and the test-relevant parameter ψ, either by changing the location of the normal prior distribution, the scale, or both.

Encoding prior information
A straightforward way to encode prior information about the model parameters is to set µ β , σ β , µ ψ , and σ ψ directly. However, it may sometimes be easier to specify prior distributions based on quantities such as the (log) odds ratio, relative risk (i.e., p 2 /p 1 , the ratio of the success probabilities in conditions B and A), and absolute risk (i.e., p 2 − p 1 , the difference of the success probabilities in conditions B and A). The elicit_prior function allows users to encode prior information about a quantity of interest (either log odds ratio, odds ratio, relative risk, or absolute risk). The function assumes that the prior on β is not the primary target of prior elicitation and is fixed by the user a priori (using the arguments mu_beta and sigma_beta), for instance to a standard normal prior, which corresponds to a relatively flat implied prior on p 1 and p 2 when ψ = 0.
To encode prior information, the user needs to provide quantiles for a quantity of interest. Let q_i, i = 1, ..., I denote the values of I quantiles provided by the user and let prob_i, i = 1, ..., I denote the corresponding probabilities (e.g., for the median, prob_i = 0.5). Least-squares minimization is used to obtain µ ψ and σ ψ as follows:

(µ̂ ψ , σ̂ ψ ) = argmin_{µ ψ , σ ψ } Σ_{i=1}^{I} ( F(q_i; µ ψ , σ ψ ) − prob_i )²,

where F(·; µ ψ , σ ψ ) corresponds to the cumulative distribution function (cdf) for the quantity of interest implied by the normal prior on ψ. For some quantities, this cdf also depends on the prior for β; however, as described above, it is assumed that µ β and σ β are fixed a priori.
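When the elicited quantity is the log odds ratio itself, F is simply the normal cdf, so the minimization can be sketched directly with optim. The following is a simplified stand-in for elicit_prior, restricted to that case, and the quantile values are hypothetical:

```r
# Least-squares elicitation of (mu_psi, sigma_psi) from user-supplied
# quantiles of the log odds ratio; F(.; mu, sigma) is then simply pnorm.
elicit_logor <- function(q, prob) {
  loss <- function(par) {
    mu <- par[1]; sigma <- exp(par[2])  # log-parameterize to keep sigma > 0
    sum((pnorm(q, mean = mu, sd = sigma) - prob)^2)
  }
  fit <- optim(c(0, 0), loss)
  c(mu_psi = fit$par[1], sigma_psi = exp(fit$par[2]))
}

# Hypothetical elicitation: median 0.3, central 95% interval from 0.05 to 0.55
res <- elicit_logor(q = c(0.05, 0.3, 0.55), prob = c(0.025, 0.5, 0.975))
res
```

Because the three quantiles here are symmetric around 0.3, the fitted prior is (unsurprisingly) centered at µ ψ ≈ 0.3; for asymmetric quantiles the least-squares fit finds the closest achievable normal prior.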
Computing Bayes factors

Following Kass and Vaidyanathan (1992), abtest computes the Bayes factors by approximating the marginal likelihoods of H 0 and H 1 with Laplace approximations. These Laplace approximations work well in practice, even for sample sizes that are extremely small. As a demonstration, for a range of synthetic data sets we computed the (log of the) Bayes factor BF 10 which compares H 1 to H 0 using these Laplace approximations and, as a comparison, also using bridge sampling (Meng and Wong 1996; Gronau et al. in press). The priors on β and ψ were standard normal distributions. Figure 1 displays the results and confirms that the Laplace approximation yields accurate results, even for sample sizes as small as n 1 = n 2 = 5.
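A generic Laplace approximation to the log marginal likelihood under H 1 can be sketched as follows. This is our own minimal implementation (not the package internals), using standard normal priors and illustrative counts:

```r
# Log of the unnormalized posterior (likelihood times prior) under H1,
# with theta = (beta, psi) and standard normal priors on both parameters.
log_post <- function(theta, y1, n1, y2, n2) {
  beta <- theta[1]; psi <- theta[2]
  p1 <- plogis(beta - psi / 2)
  p2 <- plogis(beta + psi / 2)
  dbinom(y1, n1, p1, log = TRUE) + dbinom(y2, n2, p2, log = TRUE) +
    dnorm(beta, 0, 1, log = TRUE) + dnorm(psi, 0, 1, log = TRUE)
}

# Laplace approximation: log p(y) ~ log p(y, theta_hat) + (d/2) log(2 pi)
#                                   - (1/2) log |H|,
# where H is the Hessian of the negative log posterior at the mode theta_hat.
laplace_logml <- function(y1, n1, y2, n2) {
  neg <- function(theta) -log_post(theta, y1, n1, y2, n2)
  fit <- optim(c(0, 0), neg, hessian = TRUE)
  -fit$value + log(2 * pi) -
    0.5 * as.numeric(determinant(fit$hessian)$modulus)
}

lml <- laplace_logml(y1 = 249, n1 = 500, y2 = 269, n2 = 500)
lml
```

The counts 249/500 and 269/500 are the ones from the example later in the article; against a brute-force grid integration over (β, ψ), the approximation agrees to well within numerical tolerance.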
For the one-sided hypotheses H + and H − , Laplace approximations did not appear to yield accurate results for small sample sizes, even after removing the constraint on ψ through the parameterization (β, ξ) = (β, log(ψ)) for H + and (β, ξ) = (β, log(−ψ)) for H − . The abtest package therefore uses importance sampling to increase the accuracy of the Laplace approximations when computing the marginal likelihoods for H + and H − . Specifically, a Laplace approximation is used to approximate the mode and covariance matrix of the posterior. The importance density is then given by a multivariate t distribution with location set to the approximated posterior mode, scale matrix set to the approximated posterior covariance matrix, and five degrees of freedom (note that the user can change the degrees of freedom). The marginal likelihood for H + is then estimated as follows:

p(y | H + ) ≈ (1/S) Σ_{s=1}^{S} [ p(y | β_s, ξ_s) π + (β_s, ξ_s) / g_is(β_s, ξ_s) ],

where {(β_s, ξ_s)}_{s=1}^{S} denotes S samples from the multivariate t importance density g_is, and

π + (β, ξ) = N(β; µ β , σ² β ) N + (exp(ξ); µ ψ , σ² ψ ) exp(ξ),

where N(x; y, z) denotes the probability density function of a normal distribution with mean y and variance z that is evaluated at x. Furthermore, N + (x; y, z) denotes the density of a normal distribution that is truncated to allow only positive values for x, and the factor exp(ξ) is the Jacobian of the transformation ψ = exp(ξ). The marginal likelihood for H − is computed analogously.

Figure 1: Comparison of the Laplace approximation and bridge sampling for computing the (log of the) Bayes factor BF 10 . We considered all possible combinations of n 1 ∈ {5, 10, 20, 50, 100} and n 2 ∈ {5, 10, 20, 50, 100}. For each of the n 1 -n 2 combinations, we considered all possible combinations of y 1 ∈ {n 1 /5, 2n 1 /5, 3n 1 /5, 4n 1 /5} and y 2 ∈ {n 2 /5, 2n 2 /5, 3n 2 /5, 4n 2 /5}. The results reveal that the two methods yield highly similar results, even when sample size is very small.
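The importance-sampling estimator can be illustrated in a deliberately simplified one-dimensional setting: ψ is the only parameter (β is omitted), the data are made up, and the t proposal is centered at the posterior mode found by a Laplace-style step. This is a sketch of the general technique, not the package's two-dimensional implementation:

```r
# One-dimensional importance-sampling sketch of a marginal likelihood:
# y ~ Binomial(n, plogis(psi)) with psi ~ N(0, 1); made-up data.
set.seed(1)
y <- 27; n <- 50

log_joint <- function(psi) {
  dbinom(y, n, plogis(psi), log = TRUE) + dnorm(psi, 0, 1, log = TRUE)
}

# Laplace-style step: locate the posterior mode and local curvature.
opt <- optimize(log_joint, interval = c(-5, 5), maximum = TRUE)
loc <- opt$maximum
h   <- 1e-3
curv <- -(log_joint(loc + h) - 2 * log_joint(loc) + log_joint(loc - h)) / h^2
sc   <- sqrt(1 / curv)

# Importance sampling with a t proposal (5 degrees of freedom) at the mode.
S    <- 1e5
psi  <- loc + sc * rt(S, df = 5)
logw <- log_joint(psi) -
  (dt((psi - loc) / sc, df = 5, log = TRUE) - log(sc))
marglik <- mean(exp(logw))
marglik
```

The heavy tails of the t proposal keep the importance weights well behaved even where the posterior is slightly non-normal, which is exactly the failure mode of a plain Laplace approximation.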

Obtaining posterior samples
In a Bayesian A/B test application, one may not only be interested in testing hypotheses, but also in obtaining posterior samples for the model parameters under H 1 , H + , and H − . The abtest package allows the user to obtain posterior samples using sampling importance resampling (e.g., Robert and Casella 2010). Specifically, posterior samples for H + are obtained as follows (samples for the other hypotheses are obtained in an analogous manner):
1. Generate S samples from the multivariate t proposal distribution mentioned before, denoted by {(β_s, ξ_s)}_{s=1}^{S}.
2. Compute the unnormalized importance weight of each sample, that is, the ratio of the unnormalized posterior density (likelihood times prior) and the proposal density, evaluated at that sample.
3. Normalize the importance weights so that they sum to one, yielding the normalized weights v_s.
4. Resample (with replacement) from the samples obtained from the importance density according to the normalized importance weights v_s, which yields (approximate) samples from the posterior distribution.
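The four steps can be sketched for a one-parameter model. The data are made up, and the proposal's location and scale are plugged in by hand rather than obtained from a Laplace approximation, as they would be in the package:

```r
# Sampling importance resampling (SIR) sketch for a one-parameter posterior:
# y ~ Binomial(n, plogis(psi)) with a N(0, 1) prior on psi; made-up data.
set.seed(1)
y <- 27; n <- 50

log_joint <- function(psi) {
  dbinom(y, n, plogis(psi), log = TRUE) + dnorm(psi, 0, 1, log = TRUE)
}

# 1. Draw S samples from a heavy-tailed t proposal (location and scale are
#    assumed here; in abtest they come from a Laplace approximation).
S <- 1e5; loc <- 0.15; sc <- 0.3
psi <- loc + sc * rt(S, df = 5)

# 2. Unnormalized log importance weights: target (likelihood times prior)
#    over proposal density.
logw <- log_joint(psi) - (dt((psi - loc) / sc, df = 5, log = TRUE) - log(sc))

# 3. Normalize the weights (subtracting the max avoids numerical underflow).
v <- exp(logw - max(logw)); v <- v / sum(v)

# 4. Resample with replacement according to the normalized weights.
post <- sample(psi, size = S, replace = TRUE, prob = v)
c(mean = mean(post), sd = sd(post))
```

The resampled values approximate the posterior of ψ; the same recipe carries over to the bivariate (β, ξ) case used by the package.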

Example: effectiveness of resilience training
Suppose the managers of a large consultancy firm are interested in reducing the number of employees who quit within the first six months, possibly due to the high stress involved in the job. A coaching company offers a resilience training and claims that this training greatly reduces the number of employees who quit. Implementing the training for all newly hired employees would be expensive, and some of the managers are not completely convinced that the training is at all effective. Therefore, the managers decide to run an A/B test in which half of a sample of newly hired employees will receive the training and the other half will not. The dependent variable is whether or not an employee quit within the first six months (1 = still on the job, 0 = quit).

Prior specification
Before commencing the A/B test, the managers ask the coaching company to specify how effective they believe the training will be. The coaching company claims that, based on past experience with the training, they expect the proportion of employees who do not quit within the first six months to be 15% larger for the group who received the training, with a 95% uncertainty interval ranging from a 2.5% benefit to a 27.5% benefit. Assuming that the claimed 15% corresponds to the prior median, this expectation corresponds to a median absolute risk (i.e., p 2 − p 1 ) of 0.15 with a 95% uncertainty interval ranging from 0.025 to 0.275. The elicit_prior function can be used to encode this prior information:

R> library("abtest")
R> prior_par <- elicit_prior(q = c(0.025, 0.15, 0.275),
+                            prob = c(.025, .5, .975),
+                            what = "arisk")

The obtained prior on the absolute risk can be visualized as follows:

R> plot_prior(prior_par, what = "arisk")
The resulting graph is shown in the top panel of Figure 2. The user can also visualize the (implied) prior for other quantities. For instance, the prior on the log odds ratio (middle panel of Figure 2) is obtained as follows:

R> plot_prior(prior_par, what = "logor")
The implied prior on the success probabilities p 1 and p 2 (bottom panel of Figure 2) is obtained as follows:

R> plot_prior(prior_par, what = "p1p2")
The bottom panel of Figure 2 illustrates that there is a dependency between p 1 and p 2 , which is arguably desirable (Howard 1998): when one of the success probabilities is very small (large), it is likely that the other one will also be small (large).

Hypothesis testing
After having specified the prior distribution for the test-relevant parameter, the consultancy firm starts to collect data. These (synthetic) data are included in the abtest package (i.e., seqdata) and consist of a total of 1,000 observations (500 in each group). The number of employees still on the job after six months is 249 in the group without training and 269 in the trained group. Therefore, the observed success probabilities are p̂ 1 = .498 in the control group and p̂ 2 = .538 in the group that received training. Consequently, the observed success probabilities suggest that there is a positive effect of the training of 4%; however, a statistical analysis is required to assess whether this observed difference is statistically compelling. The ab_test function can be used to conduct a Bayesian A/B test as follows:

R> data("seqdata")
R> set.seed(1)
R> ab <- ab_test(data = seqdata, prior_par = prior_par)

Figure 2: The top panel displays the prior distribution for the absolute risk which corresponds to the difference between the probability of still being on the job for the trained and the non-trained employees (i.e., p 2 − p 1 ). The middle panel shows the prior distribution for the log odds ratio parameter ψ. The bottom panel displays the implied joint prior distribution for the success probabilities p 1 and p 2 . The bottom panel illustrates that the two success probabilities are assigned dependent priors. Furthermore, most prior mass is above the main diagonal, which represents the coaching company's prior expectation that the training is successful.

Printing the resulting ab object displays the test results. The first part of the output presents Bayes factors in favor of the hypotheses H 1 , H + , and H − , where the reference hypothesis (i.e., the denominator of the Bayes factor) is H 0 . Since all three Bayes factors are smaller than 1, they all indicate evidence in favor of the null hypothesis of no effect.
The next part of the output displays the prior probabilities of the hypotheses with non-zero prior probability. As explained before, the default setting assigns probability .50 to the null hypothesis and splits the remaining probability evenly across H + and H − . The user can change this default setting via the prior_prob argument (e.g., to assign non-zero probability to H 1 ). The final part of the output displays the posterior probabilities of the hypotheses with non-zero prior probability. The posterior probability of the null hypothesis H 0 indicates that the data have increased the plausibility of the null hypothesis from .50 to .76. Furthermore, the data have decreased the plausibility of both H + and H − .
As an aside, it may appear paradoxical that the data indicate a 4% positive effect of the training and yet the posterior probability of H − is larger than that of H + . The reason for this result is that the company's prior was overly ambitious, and H + is penalized for having predicted effects that are much too large. Furthermore, note that the test-relevant prior distribution under H − is obtained by truncating the prior on ψ at zero and renormalizing. Since the company's prior assigns almost all mass to positive log odds ratio values, renormalizing the negative part of the distribution results in a prior that is highly similar to H 0 ; this explains why H − receives non-trivial posterior probability. These considerations underscore the fact that the outcome of a Bayesian analysis is always relative to the specific set of models (and associated prior distributions) under consideration. Because highly informed priors can exert a large influence on the results, it is generally wise to examine the robustness of the conclusions by executing the default analysis as well. This analysis is reported in the online appendix available at https://osf.io/t3ajr/.
The abtest package allows users to visualize the posterior probabilities of the hypotheses by means of a probability wheel (Figure 3), which here displays P(H 0 | data) = 0.760, P(H − | data) = 0.187, and P(H + | data) = 0.053:

R> prob_wheel(ab)
Overall, the data support the hypothesis that the training is ineffective over the company's hypothesis that the training is highly effective. The Bayes factor for H 0 over H + equals 1/0.138 ≈ 7.2, which indicates moderate evidence (Jeffreys 1939, Appendix I).
Since the data set is of a sequential nature, it may be of interest to consider not only the result based on all observations, but also to conduct a sequential analysis that tracks the evidential flow as a function of the total number of observations (i.e., the number of observations across both groups). This sequential analysis can be conducted as follows:

R> plot_sequential(ab, thin = 4)

Setting the thin argument to 4 indicates that the evidence is computed after every 4th observation. Thinning can be useful to speed up the analysis in case the data set is very large or in case observations arrive in batches. Figure 4 displays the result of the sequential analysis. The posterior probability of each hypothesis with non-zero prior probability is plotted as a function of the total number of observations. At the top, two probability wheels visualize the prior probabilities of the hypotheses and the posterior probabilities of the hypotheses based on all available data. Figure 4 shows that after some initial fluctuation, adding more observations increased the probability of the null hypothesis that there is no effect of the training.

Parameter estimation
The data indicate evidence in favor of the null hypothesis versus the hypothesis that the training is highly effective, leaving open the possibility that the training does have an effect, but of a more modest size than the company anticipated. To assess this possibility one may investigate the potential size of the effect under the assumption that the effect is non-zero. For parameter estimation, we generally prefer to investigate the posterior distribution for the unconstrained alternative hypothesis H 1 ; however, the abtest package also provides posterior samples and plotting functionality for the constrained hypotheses H + and H − .
The top panel of Figure 5 displays the posterior distribution for the absolute risk (i.e., p 2 − p 1 ) that can be obtained as follows:

R> plot_posterior(ab, what = "arisk")

The top panel of Figure 5 shows the prior distribution as a dotted line and the posterior distribution (with 95% central credible interval) as a solid line. The plot indicates that, under the assumption that the difference between the two success probabilities is not exactly zero, it is likely to be smaller than expected: the posterior median is 0.067 and the 95% central credible interval ranges from 0.011 to 0.122.
The middle panel of Figure 5 displays the posterior distribution for the log odds ratio ψ that can be obtained as follows:

R> plot_posterior(ab, what = "logor")

The middle panel of Figure 5 indicates that, given that the log odds ratio is not exactly zero, it is likely to be between 0.043 and 0.492, with a posterior median of 0.267.
It may also be of interest to consider the marginal posterior distributions of the success probabilities p 1 and p 2 . This plot can be produced as follows:

R> plot_posterior(ab, what = "p1p2")

The bottom panel of Figure 5 displays the resulting plot. In this example, p 1 and p 2 correspond to the probability of still being on the job after six months for the non-trained employees and the employees who received the training, respectively. The bottom panel of Figure 5 indicates that the posterior median for p 1 is 0.485, with a 95% credible interval ranging from 0.443 to 0.527, and the posterior median for p 2 is 0.551, with a 95% credible interval ranging from 0.509 to 0.592.
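Given posterior samples of (β, ψ), such summaries follow directly. In this sketch, synthetic normal draws stand in for the samples that abtest provides; their means and standard deviations are hypothetical values chosen to roughly match the medians reported above:

```r
# Summaries of derived quantities from posterior samples of (beta, psi).
# Hypothetical synthetic draws stand in for real posterior samples.
set.seed(1)
beta <- rnorm(1e5, mean = 0.07, sd = 0.06)
psi  <- rnorm(1e5, mean = 0.27, sd = 0.11)

p1 <- plogis(beta - psi / 2)  # probability of staying, control group
p2 <- plogis(beta + psi / 2)  # probability of staying, trained group

# Posterior median and 95% central credible interval of the absolute risk
round(quantile(p2 - p1, probs = c(0.025, 0.5, 0.975)), 3)
```

The same transformation yields credible intervals for the relative risk (p2 / p1) or the odds ratio without any further sampling.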
In sum, this synthetic data set offers modest evidence in favor of the null hypothesis which states that the training is not effective over the hypothesis that the training is highly effective; nevertheless, the consultancy firm should probably continue to collect data in order to obtain more compelling evidence before deciding whether or not the training should be implemented.
If the true effect is as small as 4%, continued testing will ultimately show compelling evidence for H + over H 0 . Note that continued testing is trivial in the Bayesian framework: the results can simply be updated as new observations arrive.

Concluding Comments
In this article, we have introduced the abtest package that implements both Bayesian hypothesis testing and Bayesian estimation for the A/B test using informed priors. The procedure allows users to (1) obtain evidence in favor of the null hypothesis; (2) monitor the evidence as data accumulate; and (3) elicit and incorporate expert prior distributions. We hope that the provided analysis approach is useful across different fields that apply A/B testing on a routine basis, particularly business and medicine.
Despite the practical benefits that the package offers right now, there are areas for future improvement. For instance, abtest currently allows users to compare two groups; however, there are applications in which one may be interested in simultaneously comparing more than two groups. Furthermore, at the moment, abtest expects the dependent variable to be binary. Nevertheless, in certain scenarios, it may be more natural to compare the two groups based on a continuous outcome variable. This scenario resembles an independent samples t-test for which well-established Bayesian procedures exist (e.g., Rouder et al. 2009; Ly et al. 2016). Moreover, currently, the abtest package does not provide functions for generating predictions. Note, however, that users can generate predictions in a straightforward manner themselves based on the posterior samples that are provided by abtest. The implementation also does not allow users to incorporate utilities explicitly (e.g., Lindley 1985). However, again, based on the provided posterior probabilities and posterior samples, users who wish to take into account utilities may do so in a relatively straightforward way. A more structural limitation is that abtest has been developed to analyze A/B test data, but not to run the A/B test experiment itself.
In sum, A/B testing is ubiquitous in business and medicine. Here we have demonstrated how the abtest package enables relatively complete Bayesian inference including the capability to obtain support for the null, continuously monitor the results, and elicit and incorporate expert prior knowledge. Hopefully, this approach forms a basis for evidence-based conclusions that will benefit both businesses and patients.