mbonsai: Application Package for Sequence Classification by Tree Methodology

In many applications such as transaction data analysis, the classification of long chains of sequences is required. For example, brand purchase history in customer transaction data takes a form like AABCABAA, where A, B, and C are brands of a consumer product. The decision tree-based package mbonsai is designed to handle sequence data of varying lengths, using one or multiple variables of interest as predictor variables. The software package uses tree growing and pruning strategies adopted from the C4.5 and CART algorithms, and includes new features for handling sequence data and for alphabet indexing for classification purposes. The software uses a simple command line program for the learning and prediction processes, and it can generate user-friendly graphics depicting decision trees. The underlying C++ code is designed to efficiently process large data sets in ASCII files. Two examples from transaction data sets are used to illustrate the application of mbonsai.


Introduction to mbonsai
mbonsai is a tree-based classification program that can delineate patterns in sequence data (an ordered collection of categorical, numerical, or ordinal observations) and provide rules for splitting the sequence variable space such that the classification of individual cases is possible. A prototypical application of mbonsai is to classify customers' purchases of different brands, in sequence order, in terms of their likelihood to churn, i.e., to stop purchasing from the target brand at a later period. To illustrate the key features of mbonsai, consider the following example. A diaper manufacturer is interested in using previous diaper purchasing patterns (e.g., when a customer uses size M) to predict churning (i.e., switching to a different brand when using size L as the baby grows). Table 1 shows the purchase pattern, or sequence, of 9 customers' purchases of 3 brands of size M diapers and the churn status of each customer. Churn = yes indicates that the customer switched to another brand for size L diapers; churn status is determined separately using transaction data on size L diapers. The first record, for example, shows a customer who purchased brand c, then a, and then b. This customer eventually churned.

Table 1: Example of sequence data. Each line corresponds to one customer, with the target variable indicating the churn status of the customer. The alphabetic characters a, b, and c, shown in the "BrandSequence" column, represent the brands purchased by the customer in the order shown.
mbonsai first uses an alphabet index set to map the brands, or alphabets, into a smaller number of indexes, in this case {1, 2}. The mapping is such that brands a and b map to index 1, and brand c maps to index 2. When the number of alphabets is large, alphabet indexing simplifies a model and, when done properly, preserves the information in the sequence patterns that matters for classification. Figure 1 shows decision trees based on the original alphabets and on the index, respectively. When multiple predictor variables are present, it is possible to apply alphabet indexing to all of the predictor variables together or to a set of selected predictors. Details on subsequence and candidate pattern matching are illustrated in Section 3.
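As a minimal illustration of this mapping (a sketch only, not mbonsai's internal representation), an alphabet index can be applied to a sequence as follows:

#include <map>
#include <string>

// Apply an alphabet index f to a brand sequence; with
// f = {(a,1), (b,1), (c,2)} the sequence "cab" becomes "211".
// Illustrative sketch only.
std::string applyIndex(const std::string& seq, const std::map<char, char>& f) {
    std::string indexed;
    for (char brand : seq)
        indexed += f.at(brand);   // replace each alphabet by its index
    return indexed;
}
// Example: applyIndex("cab", {{'a','1'}, {'b','1'}, {'c','2'}}) returns "211".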
The software concept underlying mbonsai was originally initiated as a machine learning system called BONSAI Garden (Shimozono, Shinohara, Shinohara, Miyano, Kuhara, and Arikawa 1994), which was motivated by pattern discovery from amino acid sequences of proteins. The word BONSAI symbolizes knowledge, as a small bonsai tree comprises various regular patterns that are in harmony with alphabet indexing to reduce the size of the tree.
In our implementation of mbonsai, we have extended the algorithms in BONSAI for analyzing large-scale business data, especially for the prediction of consumer behavior (Kawata, Hamuro, Kato, and Yada 2001; Katoh, Yada, and Hamuro 2003; Yada, Hamuro, and Katoh 2007a; Yada, Ip, and Katoh 2007b). The key features in this new implementation are as follows:
• Transform patterns from numerical and categorical sequence data into classification conditions.
• Use alphabet indexing as a data reduction technique for sequence data.
• Process multiple variables of sequence data, using numerical and categorical variables as predictors.
• Allow a cost sensitive learning approach to optimize misclassification costs by the integration of a cost matrix.
• Allow separate training and testing of decision tree models.
• Allow two or more classes of the target variable for classification.
• Allow cross-validation for assessing the performance of the predictive model.
mbonsai is a data mining tool within the NYSOL software package (NYSOL Corporation 2014). NYSOL is an integrated framework for knowledge discovery which contains a collection of command driven tools known as m-commands, designed for large-scale data processing and data mining. The underlying data processing methodology for the set of m-commands was developed by Yasuyuki Matsuda; command names within the NYSOL package, including mbonsai, are preceded by the letter "m" in honor of the developer. The NYSOL package provides wrapper commands, underpinned by existing algorithms and data processing programs, for simple execution of data transformation, data aggregation, data mining, and visualization at the command line. This methodology allows users to integrate and manage all text-based information in one system throughout the knowledge discovery process. Both mbonsai and the NYSOL m-commands were developed in C++ for scalable implementation in the UNIX environment by the JST ERATO Minato Discrete Structure Manipulation System Project hosted by Hokkaido University, Japan, and are distributed under the terms of the GNU Affero General Public License Version 3 (https://www.gnu.org/licenses/agpl-3.0.html). The mbonsai decision tree application can be executed as a simple UNIX command directly on the command line and is customizable with user-defined parameters. mbonsai can be installed as a standalone package on the Linux and Mac OS X platforms and run from a terminal emulator. More details are described in Section 5.
The UNIX-based, command-driven mbonsai was designed for the direct processing of large-scale, text-based data. The default input format is CSV (comma-separated values), which has the advantage of being highly accessible and efficient to process. This paper is organized as follows. Section 2 provides a brief description of the background and algorithms underlying mbonsai. Section 3 describes the strategy of growing a tree using sequence data. Section 4 describes the required data structure and the preparation of input data for the software. Section 5 describes the functions and parameters of mbonsai. The application of mbonsai is then illustrated through two classification examples with real customer purchase transaction data in Section 6. Finally, we provide brief concluding remarks.

Tree algorithms
mbonsai is built upon the C4.5 (Quinlan 1993) and CART (Olshen, Breiman, Friedman, and Stone 1984) algorithms for the classification of sequence data. The core tree-building algorithm in mbonsai is based on information entropy as in the C4.5/C5.0 implementation by Rulequest (2013), whereas the tree-pruning algorithm is similar to that used in CART as implemented by Salford (2015). See Kuhn and Johnson (2013) for a recent review of tree-based classification methods. Briefly, at each node of the tree, mbonsai selects the predictor variable for the split in terms of information entropy. Like CART, mbonsai employs a greedy algorithm for splitting rules, grows a full tree until a terminal node contains a small sample (e.g., n < 5), and then performs cost complexity pruning on the full tree to prevent overfitting. Readers are referred to Quinlan (1993) and Olshen et al. (1984) for technical details of the C4.5 and CART algorithms, respectively. The differences between the C4.5 and CART algorithms are summarized in Wu et al. (2008). Recent developments in tree algorithms and the associated software programs are described below in conjunction with the unique features of mbonsai.
The first recent development is the use of non-greedy algorithms to circumvent possible biases introduced in model selection by a greedy algorithm. One example is the evolutionary method for learning globally optimal trees and the related software (Grubinger, Zeileis, and Pfeiffer 2011). An alternative approach is to split each node into as many child nodes as the number of classes and use F tests to rank the predictor variables, as implemented in CRUISE (Kim and Loh 2001). Because of the potentially large search space over sequences, however, mbonsai relies on a greedy search algorithm.
The second recent development is the emergence of ensemble-based classifiers, another important development for improving the accuracy of predictions of tree-based classifiers. Bagging (Breiman 1996) and boosting (Freund and Schapire 1997) are two noted examples of this class of classifiers. Open source programs for ensemble-based classifiers include randomForest (Liaw and Wiener 2002) for bagged trees, and gbm (Ridgeway 2017) for generalized boosted models including boosted trees. Some implementations of ensemble-based classifiers, such as the R package bst (Wang 2018), allow the plugging in of loss functions. mbonsai also allows the specification of loss functions but does not use ensemble-based methods. Because ensemble-based methods are general and can be applied to almost any type of classifier, it is possible to further enhance the accuracy of mbonsai through ensemble-based methods.
A third recent development is the set of multivariate and longitudinal extensions of the C4.5 and CART algorithms, which is especially relevant to mbonsai. Examples of such extensions include Segal (1992) for longitudinal recursive partitioning, Zhang and Singer (2010) for multivariate binary tree classification, and Yu and Lambert (1999) for a functional curve approach. Unlike Yu and Lambert (1999), mbonsai focuses on response vectors that are categorical, not continuous. Yet another recent approach for analyzing longitudinal data is the RE-EM method of Sela and Simonoff (2012), which treats longitudinal data as repeated measurements and views the temporal information more or less as a nuisance factor. In contrast, mbonsai considers temporal information as central to classification, and such information is embedded in a sequence.
Two other recent works that are most closely related to the approach in mbonsai are the temporal decision tree (Console, Picardi, and Theseider 2003) and the sequence decision tree (Rokach, Romano, and Maimon 2008). Incidentally, both tree-based models were inspired by applications in production and manufacturing. Both approaches use some form of C4.5, but they differ from mbonsai. For example, the sequence tree algorithm proposed by Rokach et al. (2008) treats temporal data as sequences. The sequence of operational settings of a product (e.g., an automobile) is first coded as a string of tokens. For example, the sequence B 1-5-9-3-2 F represents assembly procedures 1, 5, 9, 3, 2 (in that order), where B and F respectively denote begin and finish. The goal of the sequence tree analysis is to delineate the effects of the operation sequence on the quality of the product (e.g., fail/pass). The algorithm consists of several steps, which include (1) using regular expressions to represent relevant patterns in pairs of sequences, e.g., (1, 5), (5, 9), (9, 3), . . . in the above example, and (2) using C4.5 to classify useful patterns derived from the regular expressions. In other words, preprocessing of the data plays an important role in this approach, and the original C4.5 decision tree algorithm is applied to the derived pattern data. The approach is thus different from mbonsai, within which the C4.5 tree algorithm is extended to directly analyze sequence data. Earlier applications of mbonsai to business transaction data with discrete and continuous variables can be found in Yada et al. (2007b).

The mbonsai algorithm
The process flow of mbonsai is shown in Figure 2, and Figure 3 shows the overall algorithm of mbonsai. The algorithm takes a sequence dataset D and a pruning parameter α as input, and it returns a decision tree T with alphabet index f. Alphabet indexing is a notable feature of the algorithm: formally, an alphabet index is a mapping f from Σ to I, where Σ = {a_1, a_2, . . . , a_n} is the alphabet set corresponding to the elements of the sequence and I = {b_1, b_2, . . . , b_m} is an index set. Usually m is set to a much smaller number than n, so indexing acts as a grouping of the elements of a sequence. The algorithm searches for the mapping that minimizes the error (misclassification rate) of the model.
If n is small, it is possible to exhaustively search the entire space of possible partitions for the optimal alphabet indexing. When n is large, or when a relatively large number of indexes m is required, a local search technique is used instead. The local search starts by generating a mapping at random (line 4 in Figure 3). For the exemplary sequence shown in Table 1, with Σ = {a, b, c} and I = {1, 2}, randomized mapping generates a mapping such as f = {(a, 1), (b, 1), (c, 2)}. Afterwards, all mappings similar to f are enumerated and stored in a mapping set F. In the above example, the close mappings of f are f_0 = {(a, 2), (b, 1), (c, 2)}, f_1 = {(a, 1), (b, 2), (c, 2)}, and f_2 = {(a, 1), (b, 1), (c, 1)}; the last mapping is eliminated because it uses only one index (m = 1), so F = {f_0, f_1}. For each element of the mapping set F, the algorithm computes the corresponding tree and retains the tree T and mapping with the lowest error (lines 14-15); the best mapping f is then updated. The process is repeated until no improvement is observed.
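The neighborhood step of this local search can be sketched as follows (hypothetical helper, not the shipped implementation): all mappings that differ from f in exactly one alphabet are enumerated, and mappings that collapse to a single index are discarded.

#include <map>
#include <set>
#include <vector>

// Enumerate all mappings at distance 1 from f over the index set {1, ..., m},
// discarding neighbors that effectively use only one index. Sketch only.
std::vector<std::map<char, int>> neighbors(const std::map<char, int>& f, int m) {
    std::vector<std::map<char, int>> F;
    for (const auto& kv : f) {                     // reassign one alphabet at a time
        for (int b = 1; b <= m; ++b) {
            if (b == kv.second) continue;
            std::map<char, int> g = f;
            g[kv.first] = b;
            std::set<int> used;
            for (const auto& e : g) used.insert(e.second);
            if (used.size() > 1) F.push_back(g);   // drop degenerate mappings
        }
    }
    return F;
}

For f = {(a, 1), (b, 1), (c, 2)} with m = 2, this returns exactly f_0 and f_1 from the example above.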
We describe the generation of a learning dataset (line 11) and the construction of a decision tree (line 12) below.

Generating a learning dataset
First, the original sequence data in dataset D are converted to indexed data based on the given mapping f. For example, the sequence data in Table 1 are converted to those shown in Table 2, based on the mapping f = {(a, 1), (b, 1), (c, 2)}.
Subsequently, a learning dataset for building a decision tree is generated from the indexed sequence data. The input variables are patterns of the sequence, and they take Boolean values of 0 or 1. mbonsai uses "regular patterns" as the patterns.

Regular patterns
"Regular patterns" refer to the collection of generic sequence patterns designed for matching sequences seen in the data. Apparently, the maximum length of a regular pattern of interest equals the maximum length of observed sequence data. However, in practice one requires to limit computation costs by specifying the maximum length of regular patterns (e.g., capped at 5), so the matching procedure will only consider subsequences in the data that have a maximum length of 5. Define the alphabet set Σ as a collection of characters. This could be, for example, a collection of brands as indicated by letters a, b, c and so on. Regular patterns can either be a string or a sequence. In string matching, no other alphabets are allowed between alphabets. For example, if the string aab is to be matched, then the data must appear exactly as aab. However, if aab is treated as sequence, then an observed data of the form acaccb is still considered a match. Thus, as sequence, any data of the form *a*a*b* is considered a match, where * is a wildcard. Note that here we distinguish between the terms "sequence" and "string", whereas the broader term "sequence data" used earlier refers to generic data that contain a chain of alphabets. Formally, define n substrings π 1 , π 2 , . . . , π n on alphabet Σ, and n + 1 substrings x 0 , x 1 , . . . , x n that are used as wild cards. A regular pattern takes the form x 0 π 1 x 1 π 2 x 2 · · · π n x n . In mbonsai, the pattern-matching algorithm allows both string and sequence patterns. For string, mbonsai takes a substring pattern of the form x 0 π 1 x 1 unless otherwise specified (see also the discussion on begin / end match) and subsequence of the form x 0 π 1 x 1 π 2 x 2 . . .. The use of string and sequence patterns can be specified in the second parameter of the mbonsai command at p=. An example of usage is given in Section 5. a b c aa ab cc · · · Churn 1 1 1 0 1 0 yes 1 1 0 0 1 0 yes 1 1 0 0 1 0 yes 1 0 1 0 0 1 · · · yes 1 1 0 1 1 0 yes 1 0 1 1 0 1 no 1 1 1 0 0 0 no 1 1 1 0 1 0 no 1 1 1 0 1 0 no Table 3: Each candidate pattern is extracted from the original sequence data as shown in the "BrandSequence" column from Table 1. Each candidate pattern is matched with the original sequence, and the presence of the matching candidate pattern is converted to 0-1 values. A value of 0 indicates "not matched" and 1 indicates "matched".

Generation of candidate patterns
Candidate patterns are enumerated at each node of the decision tree to improve classification accuracy, and a selected number of candidate patterns (e.g., the top 30) for each variable is considered for splits when growing the tree. The enumeration is based on the following heuristic. First, the regular patterns of length 1 generated from the index are stored in a priority queue, ordered by an entropy measure (defined below). At the next step, the regular pattern with the lowest entropy in the priority queue is selected, and a further index is appended to it, resulting in an updated regular pattern of length 2. The updated regular pattern is stored in the priority queue again and evaluated. The above steps are then repeated: if a regular pattern of length k is selected, an updated regular pattern of length k + 1 is stored in the priority queue. In mbonsai, the default upper limit on the regular pattern length is 5; this limit can be specified in the sixth sub-parameter of p=. The iteration terminates when the number of enumerated regular patterns exceeds the number of candidates, which can be specified in cand= (default cand=30). The following measure of information entropy is used both to evaluate regular patterns and to compute splitting rules at the nodes of the decision tree:

$$\mathrm{ent}(\pi) = -\, q^{m(\pi)} \sum_{i=1}^{c} p_i^{m(\pi)} \log p_i^{m(\pi)} \;-\; q^{u(\pi)} \sum_{i=1}^{c} p_i^{u(\pi)} \log p_i^{u(\pi)} \qquad (1)$$

where c denotes the number of classes, and $p_i^{m(\pi)}$ ($p_i^{u(\pi)}$) represents the relative proportion of class i among the samples that match (do not match) the regular pattern π, with $\sum_{i=1}^{c} p_i^{m(\pi)} = 1$. Additionally, $q^{m(\pi)}$ ($q^{u(\pi)}$) represents the proportion of all samples that match (do not match) the regular pattern π, with $q^{m(\pi)} + q^{u(\pi)} = 1$.
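A minimal sketch of this computation (hypothetical helper, not the actual mbonsai source) evaluates ent(π) from the 0-1 match indicators and the class labels:

#include <cmath>
#include <map>
#include <string>
#include <vector>

// Weighted entropy ent(pi) of a candidate pattern, following Equation 1.
// matched[i] is 1 if sample i matches the pattern, 0 otherwise;
// label[i] is the class of sample i. Illustrative sketch only.
double patternEntropy(const std::vector<int>& matched,
                      const std::vector<std::string>& label) {
    std::map<std::string, double> cntM, cntU;   // class counts per branch
    double nM = 0, nU = 0;
    for (std::size_t i = 0; i < matched.size(); ++i) {
        if (matched[i]) { ++cntM[label[i]]; ++nM; }
        else            { ++cntU[label[i]]; ++nU; }
    }
    const double n = nM + nU;
    double ent = 0.0;
    for (const auto& kv : cntM) {               // q^m term over matched samples
        double p = kv.second / nM;
        ent -= (nM / n) * p * std::log(p);
    }
    for (const auto& kv : cntU) {               // q^u term over unmatched samples
        double p = kv.second / nU;
        ent -= (nU / n) * p * std::log(p);
    }
    return ent;                                 // lower entropy ranks higher
}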
As an example, Table 3 shows candidate patterns, which are matched against the brand sequence data shown in Table 1. A value of 0 indicates "not matched" and 1 indicates "matched".

Selection of splitting rule
Like the implementation in CART, mbonsai adopts a top-down greedy algorithm for splitting the branches of a tree; the splitting rule at a node is determined by the change in information at the node due to the split. Specifically, the branching rule is chosen so that the split maximizes the entropy gain. Given a specific node, the probability of belonging to class i, estimated by the empirical proportion, is denoted by p_i, and the entropy of the node is $\mathrm{ent} = -\sum_{i=1}^{c} p_i \log p_i$. For a given splitting rule at a given node, denote the number of samples classified into class i by n_i; the proportion of samples classified into class i is then n_i/n, where n is the total sample size at the node. Equation 1 gives the entropy ent(π) after splitting on the regular pattern π, and the entropy gain is computed as the difference between ent and ent(π). After the gain is calculated for each splitting point, the splitting point with the maximal information gain among all splitting points is selected. The procedure is repeated until the tree can no longer be grown; mbonsai uses a stopping rule that requires a minimum sample size at a terminal node (e.g., leafSize = 100). The resulting fully grown decision tree is referred to as the maximal tree.
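A compact sketch of this split selection, reusing patternEntropy from the previous sketch (all names hypothetical), computes the node entropy and keeps the candidate pattern with the maximal gain:

#include <cmath>
#include <limits>
#include <map>
#include <string>
#include <vector>

// Node entropy ent = -sum_i p_i log p_i from class counts at the node.
double nodeEntropy(const std::map<std::string, double>& cnt, double n) {
    double ent = 0.0;
    for (const auto& kv : cnt) {
        double p = kv.second / n;
        ent -= p * std::log(p);
    }
    return ent;
}

double patternEntropy(const std::vector<int>&, const std::vector<std::string>&);

// Return the index of the candidate pattern maximizing the entropy gain
// ent(node) - ent(pi); matchCols[k] holds the 0-1 match column of pattern k.
int bestSplit(const std::vector<std::vector<int>>& matchCols,
              const std::vector<std::string>& label, double entNode) {
    int best = -1;
    double bestGain = -std::numeric_limits<double>::infinity();
    for (std::size_t k = 0; k < matchCols.size(); ++k) {
        double gain = entNode - patternEntropy(matchCols[k], label);
        if (gain > bestGain) { bestGain = gain; best = static_cast<int>(k); }
    }
    return best;
}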

Pruning
A maximal tree often overfits the data. In order to avoid spurious branches that "fit to noise," mbonsai adopts a strategy similar to that of CART by iteratively pruning back sections of the maximal tree. Denote a decision tree by T, the maximal tree by T_max = {t_1, t_2, . . . , t_k} (its set of nodes), and the subtree rooted at node t by T_t. The resubstitution misclassification rate of a decision tree T is denoted by R(T). Pruning aims to select the subtree T* that has the lowest misclassification rate on unseen data.
A practical method of controlling the size of a tree is based on the cost-complexity pruning method, which is implemented in CART. In brief, the method penalizes the estimated error based on the subtree size. The degree of penalty is controlled by the pruning parameter α. An efficient search algorithm can be used to compute all the distinct α values that change the tree size, and the parameter is chosen to minimize the error on a holdout sample.
Pruning in mbonsai is carried out by the direct specification of the tuning parameter α, by the use of a holdout test sample, or by cross-validation. If pruning based on a holdout test sample or cross-validation is selected, the subtree with the highest prediction accuracy is reported.
Specifically, the evaluation function, or penalized resubstitution error rate, of the decision tree T, which measures cost complexity, is defined as R_α(T) = R(T) + α|T|, where α (≥ 0) is the tuning parameter and |T| is the number of leaf nodes in T. A subtree is considered a better model if its associated cost is smaller; the equation represents a tradeoff between the misclassification rate R(T) and tree complexity |T|. In general, for a given value of α, we can always find the subtree T(α) that minimizes R_α(T): because there are only a finite number of subtrees of T_max, the minimizing subtree always exists, and R_α(T) yields different values for only a finite number of α's. The pruned tree T* that has the minimum misclassification rate when applied to holdout data, or when evaluated using cross-validation, is selected as the optimal tree. Specific details about the computational procedure for pruning and related theoretical issues can be found in Olshen et al. (1984).
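As a small numerical sketch of this tradeoff (all values hypothetical): with α = 0.01, a 13-leaf subtree with R(T) = 0.10 scores R_α = 0.10 + 0.01 × 13 = 0.23, while a 2-leaf subtree with R(T) = 0.18 scores 0.20, so the smaller tree is preferred. In code:

#include <vector>

// Candidate subtrees summarized by their resubstitution error R and
// number of leaves |T|; pick the one minimizing R + alpha * |T|. Sketch only.
struct Subtree { double R; int leaves; };

int minimizingSubtree(const std::vector<Subtree>& ts, double alpha) {
    int best = 0;
    for (std::size_t k = 1; k < ts.size(); ++k)
        if (ts[k].R + alpha * ts[k].leaves <
            ts[best].R + alpha * ts[best].leaves)
            best = static_cast<int>(k);
    return best;
}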

Pruning parameters
In model building mode, mbonsai controls pruning through the following parameters: test sample ts=, cross-validation cv=, and alpha=. When cv= or ts= is specified, test data are used to assess predictive accuracy, and the model with the minimum misclassification rate is selected. mbonsai also allows the direct specification of the value of the tuning parameter α: the option alpha= can be used to quickly set a specific value of the tuning parameter and create a decision tree for special purposes such as exploration or testing the effect of changes in α. The final tree model is saved in model.txt and model_info.txt.
In prediction mode (-predict), the α value is calculated internally unless the option alpha= is specified. In that case, the specified value of α is used to construct the decision tree for prediction.
In the test sample method, the training data D are partitioned into two sets D_1 and D_2 at a ratio of 1:2. The maximal tree based on the training data D_2 is first constructed. Based on the complexity parameters α_1, α_2, . . . , α_k obtained for the pruned subtrees, D_1 is used as the set of unseen data to estimate the misclassification rate, and the best subtree is selected. If the costs for the test sample exceed the costs for the learning sample, this is an indication of poor model fit.
In the cross-validation method, the training data D are partitioned equally into D_1, D_2, . . . , D_p, and D_1 is treated as unseen data for prediction. This method is similar to the test sample method, where a percentage of the training data is used for prediction. Cycling through the D_j, j = 1, . . . , p, as unseen data, with the other partitions as training data, an average misclassification rate can be obtained. Subsequently, among the complexity parameters α_1, α_2, . . . , α_k for the corresponding decision trees T_1, T_2, . . . , T_k, the optimal decision tree with the lowest average misclassification rate is selected. Additionally, the smallest tree among all subtrees whose estimated mean error rate is within one standard error of the minimum error rate is selected (the "1 SE rule").
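The 1 SE selection can be sketched as follows (hypothetical structure, not the shipped code): among the candidate pruned subtrees with cross-validated mean error rates and standard errors, choose the smallest tree whose mean error stays within one standard error of the minimum.

#include <vector>

// Candidate pruned subtrees with their cross-validated mean error and
// standard error; apply the 1 SE rule. Illustrative sketch only.
struct CvTree { int leaves; double meanErr; double seErr; };

int oneSeRule(const std::vector<CvTree>& ts) {
    std::size_t best = 0;                        // subtree with minimum error
    for (std::size_t k = 1; k < ts.size(); ++k)
        if (ts[k].meanErr < ts[best].meanErr) best = k;
    double bound = ts[best].meanErr + ts[best].seErr;
    std::size_t pick = best;                     // smallest tree within bound
    for (std::size_t k = 0; k < ts.size(); ++k)
        if (ts[k].meanErr <= bound && ts[k].leaves < ts[pick].leaves)
            pick = k;
    return static_cast<int>(pick);
}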

Learning cost considerations
In many applications, the cost of misclassification may not be uniform across different outcomes. Using a two-class (positive and negative) model as an example, the costs for false positives and false negatives could be very different, and cost considerations would affect the construction of a decision tree. The construction of a classification model that takes misclassification costs into account is often called cost sensitive learning. Various methods have been proposed in the literature. For example, the method proposed by Olshen et al. (1984) modifies the probability p_i of class i in the sample by assigning weights based on cost. Specifically, the cost of a case that belongs to class j but is predicted as class i is expressed as c(i|j); thus, the total cost of class i is expressed as Σ_j c(i|j). This weight is assigned to class i, and the probability p_i is updated accordingly.

real      predict   cost
positive  negative  2
negative  positive  5

Table 4: Example of a cost file for a two-class (positive/negative) problem. The column names can be customized by the user, but the columns must follow the order of actual class, predicted class, and cost.
In mbonsai, the cost function is specified by creating a definition file in CSV format. A two-class (positive/negative) example is shown in Table 4, in which the actual class (real), the corresponding predicted class (predict), and the associated cost (cost) are displayed in the same row. When a combination of actual and predicted class is not specified, or when no cost file is defined, all misclassification costs are set to unity.
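A sketch of the altered-prior idea described above (hypothetical types; mbonsai's implementation may differ): each class i is weighted by its total misclassification cost Σ_j c(i|j), with unspecified pairs defaulting to unit cost, and the class proportions are then renormalized.

#include <map>
#include <string>
#include <utility>

// Update class proportions p_i by the total misclassification cost of each
// class, defaulting unspecified costs to 1, then renormalize. Sketch only.
std::map<std::string, double> costWeightedPriors(
        const std::map<std::string, double>& p,                          // p_i
        const std::map<std::pair<std::string, std::string>, double>& c) // c(i|j)
{
    std::map<std::string, double> w;
    double z = 0.0;
    for (const auto& pi : p) {
        double total = 0.0;                      // sum_j c(i|j)
        for (const auto& pj : p) {
            if (pj.first == pi.first) continue;
            auto it = c.find({pi.first, pj.first});
            total += (it != c.end()) ? it->second : 1.0;
        }
        w[pi.first] = pi.second * total;
        z += w[pi.first];
    }
    for (auto& kv : w) kv.second /= z;           // renormalize to proportions
    return w;
}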

Variable types
There are three basic data types accepted by mbonsai: sequences made up of alphanumeric characters, numerical variables, and categorical variables. The target class variable, which is defined in c= (see Section 5), is assumed to be categorical; note that mbonsai accepts a multiclass definition of the target variable. Predictor variables can be sequences or single values. Continuing the customer churn example of Table 1, we use three further predictor variables, a pattern of profit, a total amount of profit, and gender, to illustrate the three data types. Table 5 shows the variables "ProfitPattern", "Profit", and "Gender", which are respectively of type numeric sequence pattern (levels 1-5), numeric, and categorical. The algorithm for branching rules based on single-value numerical and categorical variables is similar to that of the C4.5 algorithm. Within mbonsai, the sequence, numerical, and categorical data types are specified in the p=, n=, and d= parameters, respectively.

Installation
We now describe the installation of the mbonsai standalone package for the Linux and Mac OS X platforms. The zip archive file can be downloaded from the website (http://www.nysol.jp/mbonsai). Note that mbonsai requires Ruby 2.0 (Thomas, Fowler, and Hunt 2013), the gcc and g++ compilers, the boost C++ libraries, and libxml2. Details of the installation instructions for the prerequisite software and mbonsai can be found on the website.
The user can first create a parent directory, named parentdir in this example, save the software archive mbonsai.zip in that directory, and then launch the terminal. The package contains three commands in the mbonsai package subfolders: the mbonsai decision tree command and the mcm classifier command are located in the mbonsai/cmd directory, and the visualization command mdtree.rb is located in the mbonsai/view directory. The input data for this tutorial are located inside the data directory.
The user needs to define the directory path for mbonsai to execute the command. The input data file is specified at i=, and the name of the directory where output files are stored at the O= parameter. The p= parameter accepts the column name of the pattern, while the c= parameter accepts the column name of the class. In this example, from the parent directory parentdir, at which the command prompt is shown as ~/parentdir$, the user can execute the command with the corresponding parameters at the command line as follows:

/parentdir$ mbonsai/cmd/mbonsai p=seq c=cls i=data/input.csv O=output

From this point on, the user can proceed to use the tutorial.

Parameters used in model construction mode
mbonsai is a command-based program. The output model and statistics are saved in text files and PMML format. For learning from data and constructing trees, the command line in model construction mode contains the following parameters:

mbonsai i= [p=] [n=] [d=] c= O= [delim=] [cost=] [seed=] [cand=] [iter=] [cv= | ts=] [leafSize=] [alpha=] [--help]
The functions of the parameters are summarized in Table 6; details of their usage in training models are illustrated in Section 6.

i=        Training data file name.
p=        Column name of pattern (multiple fields can be specified). Users can specify up to six sub-parameters after the column name, each separated by a colon, e.g., p=column_name:is:seq:ordered:head:tail:rs. The details are as follows:
          is: size of the index; if this sub-parameter is not specified, an index is not generated and the original alphabet of the pattern is used instead.
          seq: type of pattern; true: partial sequence pattern; false: partial string pattern (default).
          ordered: alphabetical order arrangement when generating the index (ignored when is is not specified); true: ordered, grouping alphabets above/below a threshold value; false: unordered (default).
          head: match string or numeric characters from the beginning (default: the start of the string is not considered for matching).
          tail: match string or numeric characters from the end (default: the end of the string is not considered for matching).
          rs: upper size limit of the regular pattern (default: 5).
n=        Column name(s) with numerical data (multiple fields can be specified).
d=        Column name(s) with categorical data (multiple fields can be specified).
c=        Column name of the class.
O=        Output directory name (text, PMML model, and model statistics).
delim=    Delimiter character of the pattern (default: empty character; a 1-byte character is regarded as 1 alphabet).
cost=     Name of the cost file.
seed=     Seed of the random number generator (default = −1: time dependent).
cand=     Number of patterns used as predictor variables (default = 30, range: 1-256).
iter=     Number of iterations of the local search (default = 1).
leafSize= Lower limit on the number of samples in one leaf (default: no limit).
alpha=    Degree of pruning; disabled when cv= or ts= is specified (default = 0.01).
ts=       Percentage of test data partitioned using the test sample method (default when the value is omitted: 0.333).
cv=       Number of partitions of the data for the cross-validation method (default when the value is omitted: 10).

Table 6: Parameters used in model construction mode.

If neither ts= nor cv= is specified, the default value alpha=0.01 is applied. Even when alpha=, ts=, or cv= is specified, the pruning degree of the maximal tree is recorded in the PMML file, so the value of α can be changed in prediction mode.

Parameters used in prediction mode
The command for the prediction of new cases follows the format:

mbonsai -predict i= I= o= [alpha=] [delim=] [--help]
-predict  Prediction mode [required for prediction].
i=        Input data [required]. The column names must be the same as the columns that were used for building the model.
I=        Directory path of the output from model building mode [required]. Required files include bonsai.pmml, the PMML file containing the decision model.
o=        Output file name containing the prediction result; the predict column is added to the input data in the output.
alpha=    Pruning complexity parameter. This parameter accepts real numbers greater than 0, as well as the following two symbols, which are effective only when ts= or cv= was specified when building the model:
          min: the α value corresponding to the pruned model that minimizes the estimated misclassification rate;
          1se: the α value corresponding to the pruned model selected by the 1 SE rule.
          Default behavior: if ts= or cv= was specified when building the model, min is used; if alpha= was specified when building the model, that value is applied.
delim=    Delimiter character of the pattern (default: empty character; a 1-byte character is regarded as 1 alphabet).

Table 7: Parameters for predicting new data using the model generated in training mode. The -predict option must be switched on for running mbonsai in prediction mode.

Two applications to illustrate mbonsai
We use real data examples to illustrate the learning algorithm of mbonsai for the construction of a classification tree and the prediction of new cases. The data sets contain the purchase histories of member customers of a drugstore chain in Japan (Hamuro, Katoh, Matsuda, and Yada 1999). The drugstore chain, Pharma, has collected the purchase history of all of its 3,000,000 member customers. Using two subsets of data extracted from the drugstore chain's database, we illustrate how mbonsai builds decision tree models to (1) identify core customers from the drugstore's perspective and classify new customers, and (2) identify loyal customers from a brand's perspective and predict loyal new customers. The first data set contains sequence data for 16,092 members, and the second data set contains the transaction history of 114,069 members.

Application I: Identification of core customers
Core customers are those that have consistently generated a high level of profit and ultimately form a stable basis of income streams for a company. The objective in this example is to identify core customers based on the first 13 weeks of purchase history. mbonsai is used to construct a tree from the transaction data of 16,092 customers using the following variables: average profit per visit ("Profit"), pattern of profit level ("ProfitPattern"), number of visits during the period ("Visit"), and weekly visit pattern ("VisitPattern"). "Profit" and "Visit" are both numerical variables, whereas "ProfitPattern" and "VisitPattern" are sequence data. Each weekly profit value is transformed into a 5-level ordinal variable to form the sequence "ProfitPattern" (5 = high, 1 = low). In contrast, "VisitPattern" is a sequence of 13 binary (0/1) values, where 0 and 1 correspond to non-visit and visit in a specific week over the 13-week period. The target 2-class variable indicates the status of the customer, i.e., whether the customer belongs to the core customer group. Table 8 shows a sample of the data in CSV format.

ProfitPattern  VisitPattern   Profit  Visit  Target
55552342       1100011101011  7969    11     Core
525            1000001100000  5379    3      Core
5              1000000000000  1538    1      Core
3545           1000000001011  2760    8      NonCore
42             1000000010000  566     2      NonCore
. . .

Table 8: Example of input data core.csv. Each line corresponds to one customer, and the target variable indicates the status of the customer. The predictor variables include "ProfitPattern", "VisitPattern", "Profit", and "Visit".

Core customer classification model
Based on the program and file locations explained in the previous section, the following command creates the first classification model, with the output saved in the result_core directory:

/parentdir$ mbonsai/cmd/mbonsai p=ProfitPattern:2::true,VisitPattern \
> n=Profit,Visit c=Target i=data/core.csv O=result_core seed=100

The field names of pattern variables are specified by the p= parameter. The number of alphabet indexes is set at 2 for "ProfitPattern", and since it is an ordered sequence, the parameter is defined as ProfitPattern:2::true: the sub-parameter seq is left blank after ProfitPattern:2 (taking its default), and ordered is set to true. Since "VisitPattern" contains only 0 and 1 in the sequence, customized sub-parameters are not required. The field names of numeric variables are specified by the n= parameter, and the target variable by the c= parameter. The input file is specified by the i= parameter, whereas the output directory is specified by the O= parameter. A random seed can be specified by the seed= parameter; the seed is set at 100 in this example. Using the same random seed ensures that the same set of results is obtained, while a different random seed could lead to slightly different results. The results shown below may differ depending on the random number generator in your system.
The summary results of the model are by default stored in result_core/model.txt, and the results are also displayed interactively in sections.
[alphabet-index] shows the alphabet corresponding to each index: "ProfitPattern" classes 5, 2, and 3 are mapped to index 1, and classes 4 and 1 to index 2. "VisitPattern" is by default indexed as 0 and 1.

[decision tree] shows the pruned decision tree in text format. Information such as the model size (number of leaf nodes) and the number of layers of the deepest leaf is reported as well. The resulting tree contains a single split on whether the profit per visit is less than or equal to 1480.5: customers with profit per visit at most 1480.5 are classified as non-core, and customers with profit per visit above 1480.5 are classified as core.
Prediction using the classification model

mbonsai allows the use of test data to validate the classification model; the option -predict needs to be invoked. The prediction mode reads from the model.pmml file, which was generated in the previous step. The directory of the model output files is specified by the I= parameter, and the test data by the i= parameter. In the test data, the original class label is known and is presented in the Target column. The following command generates predictions for the test data.
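A plausible form of the prediction command, following the parameter descriptions in Section 5 (the test-data file name here is hypothetical), is:

/parentdir$ mbonsai/cmd/mbonsai -predict i=data/core_test.csv \
> I=result_core o=result_core/predict.csv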
The accuracy of the predictions is then evaluated with the mcm command:

/parentdir$ mbonsai/cmd/mcm i=result_core/predict.csv ac=Target pc=predict \
> O=result_core/evalAcc

Three output files are generated in the result_core/evalAcc directory. The file summary.csv contains a summary of the model accuracy information, class.csv contains positive predictive accuracy information, and confMatrix.txt contains the confusion matrix of positive and negative instances of the prediction outcome.
The results from the three output files are shown below.

Visualization of the decision tree
The decision tree can be visualized as an SVG (scalable vector graphics) graph that mbonsai embeds in an HTML file. The input to the SVG file is based on the PMML file, which was created when the decision model was built. The following Ruby command generates the graph:

/parentdir$ ruby mbonsai/view/mdtree.rb i=result_core/model.pmml \
> o=result_core/tree.html

The decision tree is visualized in Figure 4. The number of samples of each class at each node is shown in a pop-out area by placing the mouse cursor over the desired class. The two classes are compared in a pie chart by default; however, the chart can also be shown as a bar graph by adding the option -bar, as sketched after this paragraph. The diagram with the bar chart option and pop-out area is shown in Figure 5.
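Presumably this is the previous invocation with -bar appended (the output file name here is illustrative):

/parentdir$ ruby mbonsai/view/mdtree.rb i=result_core/model.pmml \
> o=result_core/tree_bar.html -bar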

Refine the model
Altering the tuning parameter α and/or the leaf size allows users to adjust the tree. To illustrate the idea, we reran mbonsai using an α value of 0.00001 for pruning, together with a minimum of 1,500 samples in each leaf.
/parentdir$ mbonsai/cmd/mbonsai i=data/core.csv \
> p=ProfitPattern:2::true,VisitPattern n=Profit,Visit c=Target seed=100 \
> alpha=0.00001 leafSize=1500 O=result_core_mod

When an α value different from the default is used, an associated tree with 13 leaves and 9 levels is generated. Note that for the new value of α, the misclassification cost is reduced only slightly, from 4283 to 4265. The decision tree with the new tuning parameters is visualized in Figure 6.

Figure 6: The decision tree generated when α is set to 0.00001 for pruning, with a minimum of 1,500 samples in each leaf.

Application II: Prediction of diaper purchase
In this application, we examine the brand loyalty (to brand A in this application) of customers by using a decision tree to analyze brand purchase sequences. The objective is to predict who will stay loyal to brand A through the continual purchase of brand A diapers, with a possible switch from size M to size L as the baby grows. There are seven major brands included in the data, denoted by A, B, C, D, E, F, and G. Among the 1,838 customers who purchased size M baby diapers from brand A at least four times, those who purchased baby diapers at least five times after switching from M to L are classified as loyal customers; using this definition, 918 customers are labeled as loyal to brand A. Together with these labels, the purchase patterns of M-sized diapers are used as training data to learn the decision tree model. "BrandPattern" is the string of diaper brands purchased in sequential order, and it is used as a predictor variable to predict whether the customer will continue to purchase size L brand A diapers after at least four purchases of size M diapers. A sample of the data set is shown in Table 9.

Create a decision tree model
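A plausible form of the training command, assuming the input file data/brand.csv and the class column "Target" (both hypothetical; the "BrandPattern" column, the two-index encoding at p=, and 5-fold cross-validation at cv= are given in the text below), is:

/parentdir$ mbonsai/cmd/mbonsai p=BrandPattern:2 cv=5 \
> c=Target i=data/brand.csv O=result_brand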
In this application, we encode the 7 brands in "BrandPattern" into a two-index alphabet defined at p=, and we apply 5-fold cross-validation defined at cv=. Running the training command sketched above produces the alphabet index and decision tree, which can be read from result_brand/model_1se.txt. Based on these results, brand A is indexed as 1, and brands C, F, B, G, D, and E are indexed as 2. The model rule states that if "BrandPattern" contains the pattern 11, which corresponds to 2 consecutive purchases of brand A size M diapers, then the customer is likely to be a loyal customer of brand A, i.e., they would continue to use size L diapers from the same brand. The command mdtree.rb generates the decision tree, which is visualized in Figure 7.
In this model, the accuracy is 0.8579. Note that in the result_brand directory, the files model_info.csv, model_1se.csv, model_info_min.csv, and model_min.txt are created in cross-validation mode. The accuracy on the training data can be compared across the four files to inspect the possible variation in the estimates of accuracy. In addition, predict.csv, as well as predict_1se.csv and predict_min.csv, are also generated; the predictive accuracy on the test set can be cross-checked using these three files. When the decision tree model is cross-validated against the test data, the classification appears to be both accurate and consistent across different partitions of the data, suggesting that the model is stable and reliable.
model.pmml: Based on the maximal tree of the decision tree created, the complexity penalty attribute is shown for each node. A branch is pruned if α is greater than the value of the complexity penalty. Because the maximal tree and the pruning information are recorded in the PMML file, different values of α can be used for prediction.

<Node id="0" score="Loyal" recordCount="15212" >
  <Extension extender="KGMOD" name="complexity penalty" value="0.218446"/>
  :

predict.csv: The prediction result is added to the training data in CSV format. The prediction result outputs the class with the highest prediction probability in the column "predict", together with the prediction probability for each class ("Loyal" and "NonLoyal" in this application). When ts= is specified, it returns the prediction results of the test data; when cv= is specified, it returns the prediction results of k-fold cross-validation, where k is user-specified; when alpha= is specified, the prediction result of the training data using the specified α value is returned.

The full set of output files is as follows:

model.pmml           The decision tree model represented by PMML. Records the pruning information for the maximal tree; read in prediction mode (-predict).
alpha_list.csv       Model information for the complexity parameter α; the series of α values corresponding to the series of models.
model_min.txt        Summary of the pruned model with minimum classification prediction error. Created when cv= or ts= is specified.
model_1se.txt        Summary of the pruned model selected by the 1 SE rule. Created when cv= or ts= is specified.
model.txt            Summary of the pruned model for the specified α value.
model_info_min.csv   Information on the pruned model with minimum classification prediction error. Created when cv= or ts= is specified.
model_info_1se.csv   Information on the pruned model selected by the 1 SE rule. Created when cv= or ts= is specified.
model_info.csv       Summary of the pruned model for the specified α, stored in CSV format. The column "nobs" refers to the number of records in the training data, "alpha" to the value of the pruning complexity parameter, and "accuracy" and "totalCost" respectively to the percentage of correct classifications and the total cost.
predict_min.csv      Prediction information of the pruned model with minimum classification prediction error. Created when cv= or ts= is specified.
predict_1se.csv      Prediction information of the pruned model selected by the 1 SE rule. Created when cv= or ts= is specified.
predict.csv          Prediction of the pruned model for the specified α.
param.csv            List of execution parameters; returns the keyword-value pairs for the specified parameters.

Summary and remarks
In this paper, we have presented the implementation of the mbonsai software, which extends the analytical capability of decision trees. The mbonsai software is built upon the existing tree algorithms published by Quinlan (1993) and Olshen et al. (1984). A substantial body of literature on advances in tree-based classification methodology can be found in Wu et al. (2008), Strobl, Malley, and Tutz (2009), Kuhn and Johnson (2013), and Loh (2014). Additionally, Hothorn (2018) provides a recent overview of open source R-based decision tree software programs.
mbonsai extends the decision tree tool C4.5 with advancements shown through several examples. For tree pruning and growing strategies, mbonsai closely follows CART, with the exception of using entropy instead of the Gini index that CART uses. The most innovative feature of mbonsai is its ability to handle multiple variables that contain sequence data. The graphical output from mbonsai is saved as an SVG file providing a tree diagram with the classification distribution at each node, which greatly enhances the presentation of results.
Although we have illustrated mbonsai only with transaction data, the program can be used in other areas where sequence data are available; for example, like its predecessor BONSAI, mbonsai can be applied to genetic data. Finally, as far as we know, this version of mbonsai is a unique attempt to directly analyze sequence data using tree-based methods. We envision future versions to include improvements such as bias correction and ensembles of trees.