bbl : Boltzmann Bayes Learner for High-Dimensional Inference with Discrete Predictors in R

Non-regression-based inference, such as discriminant analysis, can account for the effect of predictor distributions that may be significant in big data modeling. We describe bbl, an R package for Boltzmann Bayes learning, which enables comprehensive supervised learning of the association between a large number of categorical predictors and multi-level response variables. Its basic underlying statistical model is a collection of (fully visible) Boltzmann machines, inferred separately for each distinct response level. The algorithm reduces to the naive Bayes learner when interactions are ignored. We illustrate example use cases for various scenarios, ranging from the modeling of a relatively small set of factors with heterogeneous levels to those with hundreds or more predictors with uniform levels, such as image or genomic data. We show how bbl explicitly quantifies the extra power provided by interactions via the higher predictive performance of the model. In comparison to deep learning-based methods such as restricted Boltzmann machines, bbl-trained models can be interpreted directly via their bias and interaction parameters.


Introduction
Many supervised learning tasks involve modeling discrete response variables y using predictors x that can occupy categorical factor levels (Hastie, Tibshirani, and Friedman 2009). Ideally, one would model the joint distribution P(x, y) via maximum likelihood,

Θ̂ = arg max_Θ [ln P(x, y | Θ)],

to find the parameters Θ. Regression-based methods use P(x, y) = P(y|x)P(x) ≈ P(y|x). Many rigorous formal results known for regression coefficients facilitate the interpretation of their significance. An alternative is to use P(x, y) = P(x|y)P(y) and fit P(x|y). Since y is low-dimensional, this approach can capture extra information not accessible from regression when there are many covarying predictors. To make predictions for y using P(x|y), one uses Bayes' formula. Examples include linear and quadratic discriminant analyses (Hastie et al. 2009, pp. 106-119) for continuous x. For discrete x, naive Bayes is the simplest approach, where the covariance among x is ignored via

P(x|y) ≈ Π_i P(x_i | y).    (1)

In this paper, we focus on supervised learners that take into account the high-dimensional nature of P(x|y) beyond the naive Bayes-level description given by Equation 1. Namely, a suitable parametrization is provided by the Boltzmann machine (Ackley, Hinton, and Sejnowski 1985), which for simple binary predictors x_i ∈ {0, 1} takes the form

P(x|y) = (1/Z_y) exp[ Σ_i h_i^{(y)} x_i + Σ_{i<j} J_{ij}^{(y)} x_i x_j ],    (2)

where Z_y is the normalization constant, or partition function. Equation 2 is the Gibbs distribution for Ising-type models in statistical mechanics (Chandler 1987). The two sets of parameters h_i^{(y)} and J_{ij}^{(y)} represent single-variable and two-point interaction effects, respectively. When the latter vanish, the model reduces to the naive Bayes classifier. Although exact inference of Equation 2 from data is in general not possible, recent developments have led to many accurate and practically usable approximation schemes (Hyvärinen 2006; Morcos et al. 2011; Nguyen and Wood 2016a,b; Nguyen, Zecchina, and Berg 2017), making its use in supervised learning a viable alternative to regression methods. Two approximation methods available for use are pseudo-likelihood inference (Besag 1975) and mean field theory (Chandler 1987; Nguyen et al. 2017).
A recently described package, BoltzMM, can fit the ("fully visible") Boltzmann machine given by Equation 2 to data using pseudo-likelihood inference (Jones, Bagnall, and Nguyen 2019a; Jones, Nguyen, and Bagnall 2019b). In contrast, classifiers based on this class of models remain largely unexplored. Supervised learners using statistical models of the type of Equation 2 usually take the form of restricted Boltzmann machines instead (Hinton 2012), where (visible) predictors are augmented by hidden units and interactions are zero except between visible and hidden units. The main drawback of such layered Boltzmann machine learners, as is common in all deep learning algorithms, is the difficulty in interpreting trained models. In contrast, with the fully visible architecture, J_{ij}^{(y)} in Equation 2, if inferred with sufficient power while avoiding overfitting, has a direct interpretation as the interaction between two variables.
We refer to such learning/prediction algorithms using a generalized version of Equation 2 as Boltzmann Bayes inference. An implementation specific to genomic single-nucleotide polymorphism (SNP) data (two response groups, e.g., case and control, and uniform three-level predictors, i.e., allele counts x_i ∈ {0, 1, 2}) has been reported previously (Woo, Yu, Kumar, Gold, and Reifman 2016). However, this C++ software was geared specifically toward genome-wide association studies and is not suitable for use in more general settings. We introduce an R package bbl (Boltzmann Bayes Learner; Woo 2022), available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=bbl, which uses both R and C++ for usability and performance, allowing the user to train and test statistical models in a variety of different usage settings.

Model and algorithm
For completeness and for reference to the software features described in Section 3, we summarize in this section the key relevant formulas (Woo et al. 2016) used by bbl, generalized such that each predictor can have a varying number of factor levels.

Model description
The discrete response y_k for an instance k takes factor values y among G ≥ 2 groups; e.g., y ∈ {case, control} with G = 2; k = 1, . . ., n denotes the sample (or configuration) index. We also introduce weights w_k, each of which is the integer number of times the corresponding configuration was observed in the data, such that Σ_k w_k = n_s is the total sample size. If the data take the form of one entry per observation, w_k = 1 and n = n_s. The use of frequencies w_k can lead to more efficient learning when the number of predictors is relatively small. We use the symbol y interchangeably for a particular factor value and for the generic response variable.
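As a small base-R illustration of the two equivalent layouts (bbl also ships its own utility, freq2raw, used in Section 3), consider the Titanic data; variable names here are illustrative:

R> titanic <- as.data.frame(Titanic)      # one row per configuration, with Freq = w_k
R> titanic_long <- titanic[rep(seq_len(nrow(titanic)), titanic$Freq), 1:4]
R> nrow(titanic_long)                     # total sample size n_s (one row per observation)
R> titanic_freq <- as.data.frame(xtabs(~ Class + Sex + Age + Survived,
+    data = titanic_long))
R> sum(titanic_freq$Freq) == nrow(titanic_long)   # the two layouts carry the same counts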
The model attempts to connect the response y to a set of predictors represented by x, with elements x_i; the observed data for an instance k are denoted by x_k. We assume that predictor variables take discrete factor levels, each with a distinct effect on responses, e.g., x_i ∈ {a, t, g, c} for DNA sequence data. The overall log-likelihood is

ln L = Σ_y Σ_{k∈K_y} w_k ln P(x_k, y) ≡ Σ_y L_y,

where the second summation restricts k to the set K_y of all k values for which y_k = y, and

L_y = Σ_{k∈K_y} w_k [ ln P(x_k | y) + ln P(y) ] = Σ_{k∈K_y} w_k ln P(x_k | y) + n_y ln p_y,    (3)

where p_y ≡ P(y) is the marginal distribution of y and n_y = Σ_{k∈K_y} w_k is the size of group y.
In the parametrization we adopt for the first term in Equation 3, the group-specific predictor distribution is written as

P(x | y) = (1/Z_y) exp[ Σ_i h_i^{(y)}(x_i) + Σ_{i<j} J_{ij}^{(y)}(x_i, x_j) ].    (4)

The number of parameters (d.f.) per group y in Θ_y = {h_i^{(y)}(x), J_{ij}^{(y)}(x, x')} is

Σ_i (L_i − 1) + Σ_{i<j} (L_i − 1)(L_j − 1),

where L_i is the total number of levels of factor x_i, which contributes one fewer parameter to the d.f. because one of its levels can be taken as the reference, with the rest measured against it. Internally, bbl orders the factor levels, assigns codes a_i = 0, . . ., L_i − 1, and sets h_i^{(y)}(0) = 0 and J_{ij}^{(y)}(a_i, a_j) = 0 whenever a_i = 0 or a_j = 0. We refer to h_i^{(y)}(x) and J_{ij}^{(y)}(x, x') as bias and interaction parameters, respectively.
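For instance, for the Titanic predictors used in Section 3 (Class, Sex, and Age with L_i = 4, 2, 2, i.e., L_i − 1 = c(3, 1, 1)), the count per response group works out as:

R> L <- c(Class = 4, Sex = 2, Age = 2)
R> sum(L - 1) + sum(combn(L - 1, 2, prod))   # bias d.f. plus interaction d.f. per group
[1] 12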
In the special case where predictor levels are binary (x_i ∈ {0, 1}), one may use the spin variables s_i = 2x_i − 1 = ±1, as in the package BoltzMM (Jones et al. 2019a,b). Its distribution (Jones et al. 2019a), parametrized by a bias vector and an interaction matrix for s, is related to Equation 2 by a linear reparametrization of the bias and interaction parameters under the substitution s_i = 2x_i − 1, where parameter superscripts are omitted because no response group is present.

Pseudo-likelihood inference
One option for fitting Equation 4 to data is pseudo-likelihood maximization (Besag 1975):

L_y ≃ Σ_{k∈K_y} w_k ln Π_i P̂(x_{ki} | x_k^{(−i)}, y) = Σ_i Σ_{k∈K_y} w_k ln P̂(x_{ki} | x_k^{(−i)}, y) ≡ Σ_i L_{iy},    (6)

where the effective univariate distribution of x_i is conditional on all other predictor values x^{(−i)}:

P̂(x_i | x^{(−i)}, y) = exp[ h_i^{(y)}(x_i) + Σ_{j≠i} J_{ij}^{(y)}(x_i, x_j) ] / Σ_x exp[ h_i^{(y)}(x) + Σ_{j≠i} J_{ij}^{(y)}(x, x_j) ].    (7)

Including L_2 penalizers (λ_h, λ), L_{iy} in Equation 6 becomes

L_{iy} − λ_h Σ_x [ h_i^{(y)}(x) ]^2 − λ Σ_{j≠i} Σ_{x, x'} [ J_{ij}^{(y)}(x, x') ]^2,

with first derivatives

∂L_{iy}/∂h_i^{(y)}(x) = n_y f_i^{(y)}(x) − Σ_{k∈K_y} w_k P̂(x | x_k^{(−i)}, y) − 2 λ_h h_i^{(y)}(x),    (8)

∂L_{iy}/∂J_{ij}^{(y)}(x, x') = n_y f_{ij}^{(y)}(x, x') − Σ_{k∈K_y} w_k 1(x_{kj} = x') P̂(x | x_k^{(−i)}, y) − 2 λ J_{ij}^{(y)}(x, x'),    (9)

where

f_i^{(y)}(x) = (1/n_y) Σ_{k∈K_y} w_k 1(x_{ki} = x),    f_{ij}^{(y)}(x, x') = (1/n_y) Σ_{k∈K_y} w_k 1(x_{ki} = x) 1(x_{kj} = x')

are the first and second moments of predictor values and 1(·) is the indicator function. In bbl, Equations 8 and 9 are solved in C++ functions using the quasi-Newton optimization function gsl_multimin_fdfminimizer_vector_bfgs2 from the GNU Scientific Library (Galassi et al. 2009). By default, λ_h = 0 and only interaction parameters are penalized. As can be seen from the third equality of Equation 6, the pseudo-likelihood inference decouples into individual predictors, and the inference for each i in bbl is performed sequentially. The resulting interaction parameters, however, do not satisfy the required symmetry

J_{ij}^{(y)}(x, x') = J_{ji}^{(y)}(x', x).

After pseudo-likelihood inference, therefore, the interaction parameters are symmetrized as follows:

J_{ij}^{(y)}(x, x') ← [ J_{ij}^{(y)}(x, x') + J_{ji}^{(y)}(x', x) ] / 2.

In bbl, the input data are filtered such that predictors with only one observed factor level (no variation in the data) are removed. Nevertheless, in cross-validation of the processed data, subdivision into training and validation sets may lead to instances where the factor levels observed for a given predictor x_i in Equations 8 and 9 are only a subset of those in the whole data. It is thus possible that optimization based on Equations 8 and 9 is ill-defined when any of the predictors are constant. In such cases, we augment the training data by an extra instance in which the constant predictors take other factor levels.
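To make the structure of Equations 6-9 concrete, the following is a minimal, self-contained sketch of the penalized pseudo-likelihood objective for a single predictor with unit weights, optimized numerically with optim. It is an illustration only, not bbl's implementation (which uses analytic gradients and the GSL BFGS routine in C++); all variable names and settings are hypothetical.

R> set.seed(1)
R> n <- 200; m <- 4; L <- 3
R> xi <- matrix(sample(0:(L - 1), n * m, replace = TRUE), n, m)
R> i <- 1            # predictor whose conditional likelihood is fit
R> lambda <- 0.1     # L2 penalty on interaction parameters (lambda_h = 0, as in bbl's default)
R> npar_h <- L - 1
R> npar_J <- (m - 1) * (L - 1)^2
R> unpack <- function(theta) {
+    h <- c(0, theta[1:npar_h])              # h_i(0) = 0 at the reference level
+    J <- array(0, c(m, L, L))               # J[j, x + 1, x' + 1]; reference rows/columns stay 0
+    J[-i, -1, -1] <- aperm(array(theta[-(1:npar_h)], c(L - 1, L - 1, m - 1)), c(3, 1, 2))
+    list(h = h, J = J)
+  }
R> negpl <- function(theta) {
+    p <- unpack(theta)
+    ll <- 0
+    for (k in 1:n) {
+      e <- sapply(0:(L - 1), function(x)    # energy of each candidate level x of predictor i
+        p$h[x + 1] + sum(p$J[cbind((1:m)[-i], x + 1, xi[k, -i] + 1)]))
+      ll <- ll + e[xi[k, i] + 1] - log(sum(exp(e)))
+    }
+    -ll + lambda * sum(theta[-(1:npar_h)]^2)  # negative penalized pseudo-log-likelihood
+  }
R> fit <- optim(rep(0, npar_h + npar_J), negpl, method = "BFGS")
R> round(unpack(fit$par)$h, 3)                # fitted bias parameters of predictor i

Since the toy data above are sampled uniformly at random, the fitted parameters should come out close to zero.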

Mean field inference
The other option for inference of the predictor distribution is the mean field approximation. In data-driven inference, the interaction parameters are approximated as (Nguyen et al. 2017)

Ĵ_{ij}^{(y)}(x, x') = −[ (C^{(y)})^{-1} ]_{ij}(x, x'),    (11)

i.e., the negative inverse of the covariance matrix

C_{ij}^{(y)}(x, x') = f_{ij}^{(y)}(x, x') − f_i^{(y)}(x) f_j^{(y)}(x').

Equation 11 can be interpreted as treating the discrete x as if it were multivariate normal: Equation 4 would then be the counterpart of the multivariate normal probability density function, with −J_{ij}^{(y)}(x, x') corresponding to the precision matrix. In real data where n ∼ d.f. or less, the matrix inversion is often ill-behaved. It is regularized by interpolating C^{(y)} between the non-interacting (naive Bayes; ϵ = 0) and fully interacting (ϵ = 1) limits:

C^{(y)} → (1 − ϵ) (Tr C^{(y)} / Tr I) I + ϵ C^{(y)},    (12)

where I is the identity matrix of the same dimension as C^{(y)}. The parameter ϵ serves as a good handle for probing the relative importance of interaction effects.
The bias parameters are given in mean field by an analog of Equation 8,

ĥ_i^{(y)}(x) = h̃_i^{(y)}(x) − Σ_{j≠i} Σ_{x'} Ĵ_{ij}^{(y)}(x, x') f_j^{(y)}(x'),    (13)

where the effective bias h̃_i^{(y)}(x) is obtained from the frequencies,

h̃_i^{(y)}(x) = ln [ f_i^{(y)}(x) / f_i^{(y)}(0) ],    (14)

and f_i^{(y)}(0) is the frequency of the (reference) level of x_i for which the parameters are zero (a_i = 0). Equation 13 expresses the effective bias for predictor x_i (the first term on the right) as the sum of the univariate bias (left-hand side) and the combined mean effects of interactions with other variables (the second term on the right) (Chandler 1987). The effective bias is related to the frequency via Equation 14 because

f_i^{(y)}(x) / f_i^{(y)}(0) = exp[ h̃_i^{(y)}(x) − h̃_i^{(y)}(0) ] = exp[ h̃_i^{(y)}(x) ],

where the fact that h̃_i^{(y)}(0) = 0 was used in the second equality. As in pseudo-likelihood maximization, mean field inference may also encounter non-varying predictors during cross-validation. To apply the same inference scheme using Equations 12, 13, and 14 to such cases, the single-variable frequency f_i^{(y)}(x) and covariance f_{ij}^{(y)}(x, x') are computed using data augmented by a prior count of 1 distributed uniformly among all L_i factor levels for each predictor.
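A compact sketch of the mean field estimates of Equations 11-14 for a single response group, using an indicator expansion of a coded predictor matrix, is given below. It is not bbl's internal implementation (for instance, the prior-count augmentation just described is omitted), and all names and values are illustrative.

R> set.seed(2)
R> n <- 500; m <- 3; L <- 3
R> xi <- matrix(sample(0:(L - 1), n * m, replace = TRUE), n, m)
R> eps <- 0.7
R> X <- do.call(cbind, lapply(1:m, function(i)      # one column per (predictor, non-reference level)
+    sapply(1:(L - 1), function(x) as.numeric(xi[, i] == x))))
R> idx <- rep(1:m, each = L - 1)                    # predictor index of each column
R> f1 <- colMeans(X)                                # first moments f_i(x)
R> f0 <- sapply(1:m, function(i) mean(xi[, i] == 0))  # reference-level frequencies f_i(0)
R> C <- cov(X) * (n - 1) / n                        # covariance matrix C
R> I <- diag(ncol(C))
R> Creg <- (1 - eps) * (sum(diag(C)) / ncol(C)) * I + eps * C   # Equation 12
R> J <- -solve(Creg)                                # Equation 11
R> for (i in 1:m) J[idx == i, idx == i] <- 0        # keep pair terms only (j != i)
R> htilde <- log(f1 / f0[idx])                      # effective bias, Equation 14
R> h <- htilde - as.vector(J %*% f1)                # bias estimates, Equation 13
R> round(h, 3)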

Naive Bayes
When interactions are ignored (J_{ij}^{(y)} = 0), the model can be solved analytically. From Equations 13 and 14,

ĥ_i^{(y)}(x) = ln [ f_i^{(y)}(x) / f_i^{(y)}(0) ],

and the partition function is given by Ẑ_y = Π_i [ f_i^{(y)}(0) ]^{-1} (Woo et al. 2016). The likelihood ratio statistic for each predictor, where the null hypothesis is h_i^{(y)}(x) = h_i(x) with h_i(x) the "pooled" inference parameters (the same values for all response groups), is then

q_i = 2 Σ_y n_y Σ_x f_i^{(y)}(x) ln [ f_i^{(y)}(x) / f_i(x) ],    (17)

where f_i(x) = Σ_y (n_y / n_s) f_i^{(y)}(x) is the pooled frequency. The statistic q_i ∼ χ² with d.f. = (G − 1)(L_i − 1). Another example of a hypothesis that can be tested is h_i^{(y)}(x) = h_i^{(y)}(x') for x, x' ∈ X_A, where X_A is a subset A of predictor values (e.g., in the Titanic model, the effects of Class are the same for 2nd and 3rd Class; see Section 3), for which q_i ∼ χ² with d.f. = (G − 1)(N_i − 1), where N_i is the number of predictor levels with distinct parameter values.
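As an illustration of the naive Bayes solution and the likelihood ratio test (assuming the standard G-test form of Equation 17 written above), the statistic for the Class predictor of the Titanic data can be computed by hand:

R> titanic <- as.data.frame(Titanic)
R> tab <- xtabs(Freq ~ Class + Survived, data = titanic)
R> ny <- colSums(tab)                       # group sizes n_y
R> f <- sweep(tab, 2, ny, "/")              # f_i^{(y)}(x)
R> fp <- rowSums(tab) / sum(tab)            # pooled frequencies f_i(x)
R> h <- log(sweep(f, 2, f[1, ], "/"))       # bias estimates relative to the "1st" level
R> q <- 2 * sum(tab * log(sweep(f, 1, fp, "/")))
R> pchisq(q, df = (ncol(tab) - 1) * (nrow(tab) - 1), lower.tail = FALSE)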

Classification
For prediction, we combine the predictor distributions of all response groups via Bayes' formula:

P(y | x) = p_y P(x | y) / Σ_{y'} p_{y'} P(x | y') = exp[ Φ_y(x) ] / Σ_{y'} exp[ Φ_{y'}(x) ],    (18)

where

Φ_y(x) = Σ_i h_i^{(y)}(x_i) + Σ_{i<j} J_{ij}^{(y)}(x_i, x_j) − ln Z_y + ln p_y.    (19)

For a binary response coded as y ∈ {0, 1}, Equation 19 reduces to

Φ_1(x) − Φ_0(x) = β_0 + Σ_i β_i(x_i) + Σ_{i<j} γ_{ij}(x_i, x_j),    (20)

where

β_0 = ln (p_1/p_0) − ln (Z_1/Z_0),    β_i(x) = h_i^{(1)}(x) − h_i^{(0)}(x),    γ_{ij}(x, x') = J_{ij}^{(1)}(x, x') − J_{ij}^{(0)}(x, x'),    (21)

and Equation 18 takes the form of the logistic regression formula, P(1 | x) = 1 / {1 + exp[ −Φ_1(x) + Φ_0(x) ]}. However, the actual naive Bayes parameter values differ from the logistic regression fit. No expression for P(y|x) simpler than Equation 18 exists for data with more than two groups.
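A toy numerical illustration of Equation 18 in the single-predictor naive Bayes limit, where ĥ^{(y)}(x) = ln[f^{(y)}(x)/f^{(y)}(0)] and ln Ẑ_y = −ln f^{(y)}(0) so that Φ_y(x) = ln f^{(y)}(x) + ln p_y, with made-up numbers:

R> f <- cbind(No = c(0.2, 0.3, 0.5), Yes = c(0.5, 0.3, 0.2))   # f^{(y)}(x) for L = 3 levels
R> p <- c(No = 0.7, Yes = 0.3)                                 # marginal group probabilities p_y
R> x <- 3                                                      # observed level of the predictor
R> phi <- log(f[x, ]) + log(p)                                 # Phi_y(x) up to a common constant
R> exp(phi) / sum(exp(phi))                                    # P(y | x), Equation 18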
In pseudo-likelihood maximization inference, Z_y can be approximated using the single-site normalizers of Equation 7, with a factor of 1/2 multiplying the interaction term in the exponent or, by default, without it. This quantity can be conveniently computed during the optimization process. With the mean field option, the corresponding mean field expression for Z_y is used. For a test data set for which the actual group identities y_k of the data instances are known, the accuracy may be defined as

Accuracy = (1/n_s) Σ_k w_k 1[ ŷ(x_k) = y_k ],    (24)

where

ŷ(x) = arg max_y P(y | x).    (25)

If the response is binary, the accuracy defined by Equation 24 is sensitive to the marginal distributions of the two groups via Equation 20. The area under the curve (AUC) of the receiver operating characteristic is a more robust performance measure, independent of the probability cutoff. In bbl, the accuracy given by Equations 24 and 25 is used in general, with the option to use the AUC for binary responses via the R package pROC (Robin et al. 2011, 2021).
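For a binary response, the two measures can be computed as follows (toy values; the pROC package provides the roc and auc functions used here):

R> library(pROC)
R> yobs <- c(0, 0, 1, 1, 1, 0)                  # observed binary groups y_k
R> p1 <- c(0.2, 0.6, 0.7, 0.9, 0.4, 0.1)        # predicted P(y = 1 | x_k)
R> mean(as.numeric(p1 > 0.5) == yobs)           # accuracy, Equation 24 (w_k = 1, cutoff 0.5)
R> auc(roc(yobs, p1))                           # cutoff-independent AUC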

Logistic regression
To motivate the use of bbl and highlight differences, we first consider logistic regression using glm. We use the base R data set Titanic as an example:

R> titanic <- as.data.frame(Titanic)
R> titanic

We train a glm model with interactions and make predictions on the test data.
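The glm call itself is not reproduced in this excerpt; a plausible version, assuming pairwise interactions among the three predictors and frequency weights, is:

R> gfit <- glm(Survived ~ (Class + Sex + Age)^2, family = binomial,
+    data = titanic, weights = Freq)
R> summary(gfit)

Predictions on a held-out set (dtest below) would then follow via predict(gfit, newdata = dtest, type = "response").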
For comparison with bbl, which by default includes regularization, we also consider a penalized logistic regression fit using glmnet (Friedman, Hastie, and Tibshirani 2010; Friedman, Hastie, Tibshirani, Narasimhan, Simon, and Qian 2021).
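A sketch of such a penalized fit follows; glmnet requires a numeric design matrix, so the factors are expanded with model.matrix, and the expansion to one row per observation and other details are assumptions here:

R> library(glmnet)
R> titanic_long <- titanic[rep(seq_len(nrow(titanic)), titanic$Freq), 1:4]
R> x <- model.matrix(Survived ~ (Class + Sex + Age)^2, data = titanic_long)[, -1]
R> y <- titanic_long$Survived
R> cvfit <- cv.glmnet(x, y, family = "binomial")
R> plot(cvfit)                            # cross-validation curve, cf. Figure 1
R> coef(cvfit, s = "lambda.min")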

Boltzmann Bayes learning
The logistic regression shown in Section 3.1 allowed for inference and significance testing of linear and interaction coefficients in association with the response variable. However, the regression fit did not provide any further information regarding the source of the association: in the examples in Section 3.1, the survival of Titanic passengers was seen to be associated with being Female and not being Crew members. The corresponding linear regression coefficients, which have the same functional form as in Equation 20 (β_i(x) in Equation 21 if interactions are neglected), are measures of the difference in the coefficients h_i^{(y)} between the two response groups (Equation 21). The two terms, h_i^{(1)} and h_i^{(0)}, whose difference yielded the coefficient β_i(x), remained unknown. How were the sub-groups distributed among the survivor and non-survivor groups? Were there very few Female 3rd-class passengers among the survivor group compared to the non-survivor group, or were they found in both groups but more so among non-survivors?
The bbl inference estimates the individual distributions of predictors in the response groups separately and subsequently combines them to make predictions. For a binary response, this inference provides estimates of both sets of coefficients (h_i^{(1)}, h_i^{(0)} for linear effects and J_{ij}^{(1)}, J_{ij}^{(0)} for interactions) entering Equation 20, whose differences correspond to the logistic regression coefficients. More generally, the availability of direct estimates of the predictor distributions in each response group, given by Equation 4, facilitates model interpretation in a way not possible for regression-based models, as we show below in this section and in Section 3.5.
With this comparison in mind, we use the same Titanic data set below to illustrate Boltzmann Bayes inference. As for glm, bbl uses a formula input to train an S3 object of class 'bbl':

R> bfit0 <- bbl(Survived ~ Class + Sex + Age, data = titanic, weights = Freq,
+    prior.count = 0)

which by default triggers a pair of pseudo-likelihood inferences, solving the maximum pseudo-likelihood equations (Equations 8 and 9) first under the alternative hypothesis (individual groups have distinct distributions) and then under the null hypothesis (all samples have the same distribution).
The argument prior.count can be used to add prior counts to the frequencies of occurrence of each predictor level. One may observe that when interaction is neglected, the naive Bayes model involves categorical distributions for each predictor. In this special case, therefore, the prior count can be regarded as the hyperparameter of the conjugate Dirichlet prior, making the overall treatment of the model a fully Bayesian extension.
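As a concrete reading of this interpretation: with a symmetric Dirichlet prior whose per-level hyperparameter equals a prior count a, the posterior-mean frequency estimate for each level becomes (n_x + a)/(n + L_i a); whether bbl adds the count to every level or spreads a single count across the levels is a convention not shown in this excerpt. A toy computation for the Sex predictor among non-survivors:

R> counts <- xtabs(Freq ~ Sex, data = subset(as.data.frame(Titanic), Survived == "No"))
R> a <- 1
R> (counts + a) / (sum(counts) + a * length(counts))   # smoothed frequencies
R> counts / sum(counts)                                # prior.count = 0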
The print method for 'bbl' objects shows the structure of the model and (subsets of) the inferred parameters:

R> bfit0
Internally, the parameters h_i^{(y)} and J_{ij}^{(y)} are stored as lists with argument order (y, i) and (y, i, j), respectively. The innermost elements of the lists are vectors and matrices of dimensions L_i − 1 = c(3, 1, 1) and (L_i − 1, L_j − 1), respectively. The summary method for 'bbl' objects prints out parameters and their significance test outcomes under the naive Bayes approximation (no interactions) as a rough overview of the model under consideration:

R> summary(bfit0)
Call:
bbl(formula = Survived ~ Class + Sex + Age, data = titanic, weights = Freq,
    prior.count = 0)

The test results are those from likelihood ratio tests applied to the naive Bayes result, Equation 17, with the null hypothesis h_i^{(y)}(a) = h_i(a). The tables of bias parameters shown above include those for the two survival status groups. Their signs and magnitudes, along with the computed significance levels, clearly indicate the associations of lower Class status and being Male with non-survivors. There are few children among both survivors and non-survivors, hence the highly negative bias parameters in all groups, although less so in the survivor group, as expected. We note that the summary method displays naive Bayes results, for which simple analytic expressions for test results are available, even for models containing interactions.
One may compare the naive Bayes parameter β_i(x) (Equation 21) with the logistic regression coefficients. A model including pairwise interactions (bfit below) can be trained and examined in the same way.
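The exact call used to train bfit is not reproduced in this excerpt; a plausible version, assuming the same formula interface with pairwise interaction terms, is:

R> bfit <- bbl(Survived ~ (Class + Sex + Age)^2, data = titanic, weights = Freq)
R> bfit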

R> plot(bfit)
The parameters printed include those for interactions. The plot method shows a barplot of bias parameters and a heatmap of interaction parameters (Figure 2). Note that Male members were predominant (bias parameters; top), while Male 3rd-class passengers were under-represented (interactions; bottom left), among non-survivors. In addition, the Male and Child combination had enhanced survival (bottom right).
A major advantage of the bbl fit compared to regression is the availability of the predictor distributions in each response group, P(x|y), given by Equation 4. In addition to using the model to make predictions of response groups, one can also examine the predictor distributions and identify configurations dominant in each response group. Since the total number of configurations x grows exponentially with the number of predictors, Markov chain Monte Carlo (MCMC) sampling is necessary for the exploration of these distributions except in very low dimensions. The function mcSample performs Gibbs sampling of the predictor distributions using the bbl parameters and outputs the most likely configuration in each response group:

R> map <- mcSample(bfit, nstep = 1000, progress.bar = FALSE)
R> map

$xmax
      No      Yes
Class "3rd"   "1st"
Sex   "Male"  "Female"
Age   "Adult" "Adult"

$emax
      No      Yes
3.394166 0.000000

The return value is a list containing the predictor configurations with the highest probability in each response group (columns of map$xmax above) and the corresponding "energy" values, which are the exponents of Equation 4.

Simulated data
We next use simulated data to show the effect of penalizers on bbl inference as well as its usefulness under varying sample sizes.
The utility function randompar generates random parameters for predictors.We have set the total number of predictors as m = 5, each taking values 0, 1, 2 (L i = L = 3).
R> xi <- sample_xi(nsample = 10000, predictors = predictors, h = par$h,
+    J = par$J, code_out = TRUE)
R> head(xi)

The function sample_xi lists all possible predictor states and samples configurations based on the distribution in Equation 4. The total number of states here is L^m = 3^5 = 243, which is amenable to exhaustive enumeration. However, this is possible only for small m and L; if either is even moderately larger, sample_xi will hang.
Because there is only one response group, we call the main inference engine mlestimate directly instead of bbl. In contrast to the bbl function, which fits a model with multiple response groups and predictors given as factors, mlestimate is for a single group and requires an input matrix xi whose elements are integer codes of the factor levels, a_i = 0, . . ., L_i − 1. Figure 4 compares the true and inferred parameters. Here, the sample size was large enough that no regularization was necessary.
As shown in Figure 5a, the prediction AUC is optimized near ϵ = 0.7. The difference between the AUC at ϵ = 0 (the naive Bayes limit) and the maximum is a measure of the overall effect of interactions. We select three values of ϵ and examine the fits. As ϵ increases from 0 to 1, the interaction parameters J grow from zero to large, usually overfit, levels. We verify that bias and variance strike the best balance at ϵ = 0.7 (Figure 5c), as suggested by the cross-validation AUC in Figure 5a.

Genetic code
We consider a different learning task with a much larger space of response groups, namely amino acids: K = 21 groups, comprising the 20 amino acids plus the stop signal ('*'), encoded by DNA sequences (x_i ∈ {a, c, g, t}). In DNA sequences, three nucleotides combine to encode a specific amino acid. We train a model attempting to re-discover this mapping from nucleotide triplets to amino acids. The training data are random instances of triplet codons, sampled uniformly from nt = c("a", "c", "g", "t") (see the sketch at the end of this subsection):

R> head(dat)
  b1 b2 b3
1  t  a  g
2  g  t  c
3  t  a  a
4  c  g  g
5  a  a  c
6  c  t  g

We use the package Biostrings (Pagès, Aboyoun, Gentleman, and DebRoy 2021) to translate the codons into amino acids:

R> aa <- Biostrings::DNAString(paste(t(dat), collapse = ""))
R> aa
6000-letter DNAString object
seq: TAGGTCTAACGGAACCTGGCGATTATACTTG...AGTAAACTCGACAGTGACCGAAGGTACGGGC
R> aa <- strsplit(as.character(Biostrings::translate(aa)), split = "")[[1]]
R> xdat <- cbind(data.frame(aa = aa), dat)
R> head(xdat)
  aa b1 b2 b3
1  *  t  a  g
2  V  g  t  c
3  *  t  a  a
4  R  c  g  g
5  N  a  a  c
6  L  c  t  g

We now cross-validate using bbl:

R> cv <- crossVal(aa ~ .^2, data = xdat, lambda = 10^seq(-3, 1, 0.5),
+    verbose = 0)
R> cv
Optimal lambda = 0.3162278
Max.score: 1
        lambda  score
1  0.001000000 0.9195
2  0.003162278 0.9195
3  0.010000000 0.9875
4  0.031622777 0.9875
5  0.100000000 0.9925
6  0.316227766 1.0000
7  1.000000000 0.9930
8  3.162277660 0.9770
9 10.000000000 0.9770

Note that with a multinomial response, the accuracy defined by Equation 24 is used. The class 'cv.bbl' extends 'bbl' and stores the model with the optimal λ. In contrast to Section 3.2, we do not refit the model under this λ because the accuracy is already at its maximum. Testing can use all possible codon sequences (4^3 = 64 in total):

R> panel <- expand.grid(b1 = nt, b2 = nt, b3 = nt)
R> head(panel)
  b1 b2 b3
1  a  a  a
2  c  a  a
3  g  a  a
4  t  a  a
5  a  c  a
6  c  c  a

The trained model has a perfect accuracy of 1 and will not make mistakes in any translation of DNA sequences.
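The random codon table at the start of this subsection can be generated along the following lines (a sketch; the seed and the exact call used in the original analysis are not shown here):

R> set.seed(561)
R> nt <- c("a", "c", "g", "t")
R> n <- 2000                                  # 2000 codons = 6000 nucleotides
R> dat <- data.frame(b1 = sample(nt, n, replace = TRUE),
+    b2 = sample(nt, n, replace = TRUE),
+    b3 = sample(nt, n, replace = TRUE))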
MNIST digit data

We next apply bbl to the MNIST hand-written digit data, with the ten digits as response groups and pixel-based predictors. The cross-validation run takes a few minutes and yields a prediction score of 0.89. By feeding a vector of ϵ values, one can obtain the profile shown in Figure 6. The jump in performance at ϵ* ∼ 0.05 relative to the ϵ → 0 (naive Bayes) limit gives a measure of interaction effects. The relatively small value of ϵ* at the optimal condition, compared to, e.g., Figure 5a, reflects the sparseness of image data.
We performed similar cross-validation and test analyses of the full MNIST data (training n = 60,000 and test n = 10,000) and obtained an accuracy of 0.915 (classification error rate 8.5%), which compares favorably with other large-scale neural network algorithms (Table 1).
As with the Titanic data set, we leverage the unique advantage of the bbl fit of providing predictor distributions and estimate the dominant configurations of each response group (Figure 7).

Transcription factor binding site data
One of the machine learning tasks of considerable interest in biomedical applications is the detection of transcription factor binding sites within genomic sequences (Wasserman and Sandelin 2004). Transcription factors are proteins that bind to specific DNA sequence segments and regulate gene expression programs. Public databases, such as JASPAR (Khan et al. 2018), host known transcription factors and their binding sequence motifs. Supervised learners allow users to leverage these data sets and search for binding motifs among candidate sequences.
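Such a learner needs negative (control) examples alongside the known binding-site sequences. A sketch of how control sequences can be generated by mutating a fixed number of positions (here 3, as in Figure 8); seqs is assumed to be a character matrix of aligned motif sequences, one row per site:

R> mutate_seq <- function(s, nmut = 3, nt = c("a", "c", "g", "t")) {
+    pos <- sample(length(s), nmut)
+    s[pos] <- vapply(s[pos], function(b) sample(setdiff(nt, b), 1), character(1))
+    s
+  }
R> ## control <- t(apply(seqs, 1, mutate_seq))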
Here, we illustrate such an inference using an example set (MA0014.3) of binding motif sequences from JASPAR (http://jaspar.genereg.net). In both cases, there is an optimal, intermediate range of regularization with maximum AUC (Figure 8). The level of performance attainable with non-interacting models, such as the position frequency matrix (Wasserman and Sandelin 2004), corresponds to the ϵ = 0 limit in Figure 8b. The AUC range obtained above is representative of the sensitivity and specificity levels one would obtain when scanning a genomic segment with a trained model to detect a binding site to within a resolution of ∼ 3 base pairs.

Summary

The bbl package implements inference of the Boltzmann machine distribution accommodating heterogeneous, factor-valued predictors via Equation 4, embedding it in a Bayesian classifier to build supervised learning and prediction models. The basic implementation architecture of bbl follows that of standard base R implementations such as glm.
Compared to more widely applied restricted Boltzmann machine algorithms (Hinton 2012), the Boltzmann Bayes model explicitly infers interaction parameters for all pairs of predictors, making it possible to interpret trained models directly, as illustrated in Figures 2 and 7, the latter using MCMC sampling of the predictor distributions. The bbl inference is especially suited to data types where a moderate number of unordered features (such as nucleotide sequences) combine to determine class identity, as in transcription factor binding motifs (Section 3.6). Of the two inference options, mean field (method = "mf") is faster but can become memory intensive for models with a large number of predictors. Pseudo-likelihood maximization (method = "pseudo") is slower but usually provides better performance, as measured by cross-validation accuracy or AUC.
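For a given data set, the two options can be compared head to head in cross-validation; a sketch, with placeholder response and data names and assuming the method option quoted above is passed through crossVal:

R> cv_pseudo <- crossVal(y ~ .^2, data = dat, method = "pseudo",
+    lambda = 10^seq(-3, 0, 0.5))
R> cv_mf <- crossVal(y ~ .^2, data = dat, method = "mf")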

Computational details
The current version of bbl is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=bbl. Installation of bbl requires the GNU Scientific Library (https://www.gnu.org/software/gsl/) to be installed. The results in this paper were obtained using R 4.1.2 on Ubuntu Linux 20.04 LTS. R itself and all packages used are available from CRAN at https://CRAN.R-project.org/ and Bioconductor at https://bioconductor.org/.

Figure 1: Cross-validation run of glmnet on the Titanic data set.

Figure 2: The plot of a 'bbl' object displays bias (top) and interaction parameters (bottom). All parameters are offset by their pooled (single-group) values.

Figure 3: Cross-validation run of the Titanic data set in bbl.

Figure 4: Comparison of true parameters and those inferred from pseudo-likelihood Boltzmann Bayes inference.See the text for conditions.

Figure 5: Regularized mean field inference using simulated data. (a) Cross-validation AUC with respect to the regularization parameter ϵ. (b-d) Comparison of true and inferred parameters under three ϵ values. The best fit is achieved when the AUC is maximum.

Figure 6: Cross-validation of Boltzmann Bayes inference on MNIST data using mean field option.

Figure 8: Cross-validation of the transcription factor binding motif model using bbl, with control sequences generated by 3 nucleotide mutations. The data set is from Khan et al. (2018) (sample ID MA0014.3; see text). (a) Pseudo-likelihood and (b) mean field inferences.
Although more extensive versions of the Titanic data are available (e.g., in the packages titanic, Hendricks 2015, or stablelearner, Philipp, Rusch, Hornik, and Strobl 2018; Philipp, Strobl, Zeileis, Rusch, and Hornik 2021), the simpler version above, including only factor variables, suffices for our purposes because bbl requires discrete factors as predictors. Input data can either be of the form above, with unique combinations of predictors in each row along with their frequency (input to the weights argument of glm), or in the form of raw data (one observation per row), which we generate using the utility function freq2raw. A comparison of the linear coefficients and significance levels in the two models suggests that interaction plays important roles; in particular, marginal effects at the linear level remained significant only for Sex being Female. To illustrate training and prediction, we divide the sample into training and test sets:

R> set.seed(159)
R> nsample <- NROW(titanic_raw)
R> flag <- rep(TRUE, nsample)
R> flag[sample(nsample, nsample/2)] <- FALSE
R> dtrain <- titanic_raw[flag, ]
R> dtest <- titanic_raw[!flag, ]

Table 1: Performance comparison of BB inference and other models on the MNIST data set. The bbl inferences used the full MNIST training and test data sets (see text). BB, Boltzmann Bayes; NN, neural network; RBM, restricted Boltzmann machine; ER, error rate.