gramEvol : Grammatical Evolution in R

We describe an R package which implements grammatical evolution (GE) for automatic program generation. By performing an unconstrained optimisation over a population of R expressions generated via a user-defined grammar, programs which achieve a desired goal can be discovered. The package facilitates the coding and execution of GE programs, and supports parallel execution. In addition, three applications of GE in statistics and machine learning, including hyper-parameter optimisation, classification and feature generation, are studied.


Introduction
Grammatical evolution (GE) (O'Neill and Ryan 2001) generates complete programs, optimised towards performing a certain task, by combining context-free grammars (CFG) (Knuth 1964) and genetic algorithms (GA) (Holland 1992). Specifically, syntactically correct programs are generated from a user-defined grammar, using a binary string to choose grammatical production rules. Through a formulation involving a user-defined cost function, the fitness score of programs for solving a problem can be evaluated, and an optimisation process is used to search through a subspace for the best program. This optimisation's objective function, i.e., mapping a binary string to a program and subsequently to a numeric score, is often non-smooth and non-convex, precluding gradient-based optimisation algorithms and favouring evolutionary optimisation techniques such as GA.
GE is an alternative to genetic programming (GP) (Koza 1992) for generating programs via evolution. While GP directly operates on the actual program's tree structure, GE applies evolutionary operators on binary strings which are subsequently converted to the final program. GP normally requires a custom search strategy to generate syntactically correct programs, whereas GE can utilise an unconstrained evolutionary search, relying on the genotype-to-phenotype mapping to generate correct programs.

Background

Canonical genetic algorithms
Canonical GA (Holland 1992) is an optimisation algorithm which operates on a population of chromosomes, performing evolutionary operations including selection, crossover, and mutation, as illustrated in Figure 1. Inspired by biological evolution, GA has been successfully used in applications with complex fitness landscapes and multiple local optima (Mitchell 1996). In canonical GA, a chromosome is represented by a binary string. Modern GA implementations normally do not operate directly on binary values; instead, bits are grouped into n-bit values, each forming a codon, which is used as a parameter in the optimisation problem. If the problem is made of multiple building blocks, codons related to each block are grouped together as a gene (Figure 2). This arrangement is a logical representation of the data and does not affect the low-level representation of the chromosome.
The initial population is created from randomly generated chromosomes, each representing a solution to the formalised problem. The chromosomes are then evaluated using a given cost function, φ(·). The objective is to minimise the cost function, or equivalently, to maximise the fitness of the chromosomes. The better scoring chromosomes are deemed more desirable, and hence their data is retained. Conversely, the low scoring chromosomes are discarded from the population and replaced with new chromosomes to form a new generation. Elitism favours the highest ranking chromosomes and directly forwards them to the new generation's population. The others are created by recombination of selected chromosomes.
The selection operator is applied to select chromosomes with likelihood proportional to their fitness score. Different selection schemes exist, including roulette wheel selection and tournament selection. In roulette wheel selection, the probability of selecting the ith chromosome is proportional to its fitness score, i.e., p_i = f_i / (f_1 + f_2 + ⋯ + f_N), where f_i is the fitness of the ith chromosome and N is the population size.

The crossover operator is applied on two randomly selected chromosomes. In canonical GA, a single-point crossover is used, where a position in the binary string is chosen at random and the opposing string sections of the two parents are exchanged, creating two new offspring.
The mutation operator randomly flips single bits on a specific chromosome with a predefined mutation probability. Mutation is necessary to maintain genetic diversity from one generation of a population to the next.
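These two operators are straightforward to express in R. The following is a minimal sketch operating on plain 0/1 vectors; the helper names and the default mutation probability are our own, and are not part of gramEvol, which implements its operators internally:

R> crossover <- function(a, b) {
+    point <- sample(length(a) - 1, 1)    # random single-point position
+    list(c(a[1:point], b[-(1:point)]),   # exchange opposing sections
+         c(b[1:point], a[-(1:point)]))
+  }
R> mutate <- function(chrom, prob = 0.01) {
+    flip <- runif(length(chrom)) < prob  # flip each bit with probability prob
+    chrom[flip] <- 1 - chrom[flip]
+    chrom
+  }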
The evolutionary process is repeated until a given termination criterion is satisfied. This criterion may include reaching a predetermined number of generations, finding a chromosome with fitness better than a certain minimum, or lack of improvement in the population fitness despite evolution.
Since its introduction in 1975 (Holland 1975), other techniques and evolutionary algorithms have been proposed to extend canonical GA. For example, to facilitate complex data representation, GA is often implemented with integer or floating point codons, and evolutionary operators are applied directly to the codons instead of the underlying bit string. This method also takes advantage of the architecture of modern processors to speed up computation. For a review of other GA techniques, readers are referred to the survey by Srinivas and Patnaik (1994).

Context-free grammar
A context-free grammar (CFG) is a mechanism to generate patterns and strings using hierarchically organised production rules (Sipser 1997). A CFG is described by the tuple (T, N, R, S), where T is a set of terminal symbols, N is a set of non-terminal symbols with N ∩ T = ∅, and S ∈ N is the start symbol. A non-terminal symbol is one that can be replaced by other non-terminal and/or terminal symbols, while terminal symbols are literals. N and T form the lexical elements used in R, the set of production rules of the CFG. R is defined as a set of relations (also referred to as production rules) of the form x → α with x ∈ N, α ∈ (N ∪ T)*, where * is the Kleene star. For example, if the grammar rules are defined as R = {x → xa, x → ax}, then a is a terminal symbol, since no rule exists to change it.
An example grammar in BNF notation is given in Table 1. In this grammar, the start symbol (S) is <expr>. Each of the non-terminal symbols defined in N, i.e., <expr>, <op>, <coef> and <var>, can be replaced by an appropriate terminal as specified in R. For example, <expr> can either expand to (<expr>)<op>(<expr>) or <coef>×<var>, and <op> can be replaced by one of the +, -, × or ÷ operators.

Table 1: An example grammar in BNF notation. The first three lines define the non-terminal (N), terminal (T) and start (S) symbol sets respectively. The rest of the lines define the production rules (R).

Genotype to phenotype mapping using grammar rules
In evolutionary biology, chromosome data is referred to as the genotype, while an organism's observable characteristics are called the phenotype. Biological organisms use complicated methods to map their genotype to phenotype. Advanced evolutionary algorithms, such as GE, use a similar notion to create complex objects from simple chromosome structures.
In GE, genotype to phenotype mapping is performed according to the production rules of a CFG, selected using the chromosome's codon values. The usual mapping function is the mod rule, defined as (codon integer value) mod (number of rules for the current non-terminal), where mod is the modulus operator. For example, a codon value of 9 facing a non-terminal with four production rules selects rule 9 mod 4 = 1, i.e., the second rule when counting from zero. Mapping begins from the start symbol S, and continues by replacing each non-terminal element according to the production rule chosen by the mapping function. At each step, the resulting expression can contain terminal (i.e., T) or non-terminal elements. The mapping continues until all non-terminal elements are replaced with terminals.
If the chromosome is too short, it may run out of codons with non-terminal elements still remaining. A common approach is to wrap the chromosome and continue the mapping process by reusing the codons from the beginning. However, in cyclic grammars, infinite recursion may occur. This is addressed by introducing a limit on the number of allowed chromosome wrappings and returning a poor fitness score if the limit is reached.
In this section, the grammar of Table 1 is used to map an example chromosome to an expression, which can later be evaluated in different contexts as a numerical value.
Table 2: Production of an expression using the grammar of Table 1 (columns: step, codon, mod operation, selected rule, and current element state). The process starts from the start symbol S, and continues by replacing the first non-terminal symbol with another, selected from the production rules R according to the value of the current codon. In 8 steps, all non-terminal symbols are replaced, yielding the final expression.

GE uses the standard evolutionary operators of canonical GA to evolve the chromosomes and generate new programs. An in-depth explanation of GE can be found in the original GE paper by O'Neill and Ryan (2001).

Software implementations of GE
Several open source software implementations of GE are available for different programming languages. These include GEVA (O'Neill, Hemberg, Gilligan, Bartley, McDermott, and Brabazon 2008) in Java, PonyGE (Hemberg and McDermott 2012) and PyNeurGen (Smiley 2012) in Python, GEM (Hemberg 2011) for MATLAB, and GERET (Suchmann 2013) for Ruby. gramEvol is the first package for R.
The design goal of this package is to evolve programs natively in R. While it is possible to generate and call R code from other languages, a native implementation has the following advantages:

• R's expression objects are used to define a grammar, removing an error-prone text-based BNF interpretation or compilation step, and allowing dynamic grammar manipulation and rapid prototyping.

• Expressions are created directly in R as expression objects, which removes the overhead of calling R from an external program.

• Only R's base packages are used for evolutionary operations and grammar processing, as well as for parsing and running the generated programs. This eliminates the need for third-party libraries and external dependencies.
A disadvantage of gramEvol is its speed compared to compiled GE libraries, such as libGE (Nicolau 2006) or AGE (Nohejl 2011), which are written in C++. We assume that the computational overhead of evaluating the cost function is greater than the overhead of the GE operators; hence any major speed-up will result from moving the cost function's computational bottleneck to C, C++ or Fortran. This is already a common practice in the design and implementation of R packages. Furthermore, packages such as Rcpp (Eddelbuettel and Francois 2011) are available to facilitate porting existing R code to C++.

Package gramEvol
The gramEvol package implements grammatical evolution (GE) for R. It offers facilities for defining, creating, evaluating, and evolving programs based on context-free grammars, which are introduced in this section.

Defining a grammar
In gramEvol, a grammar is defined by passing a list of production rules to the function CreateGrammar. CreateGrammar automatically determines the terminal, non-terminal and start symbols based on the rules. gramEvol supports two types of rules: expression-based rules defined using grule, and character string rules defined using gsrule.
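As a sketch, the grammar of Table 1 could be defined using gsrule, where non-terminal symbols are referenced inside angle brackets; the terminal names c1, c2, v1 and v2 stand in for the coefficients and variables of Table 1, which are not reproduced here:

R> library("gramEvol")
R> ruleDef <- list(expr = gsrule("(<expr>)<op>(<expr>)", "<coef>*<var>"),
+                  op = gsrule("+", "-", "*", "/"),
+                  coef = gsrule("c1", "c2"),
+                  var = gsrule("v1", "v2"))
R> grammarDef <- CreateGrammar(ruleDef)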
Several properties of the defined grammar can be examined:

• The grammar is cyclic, i.e., the non-terminal symbol <expr> expands to more <expr>s. To avoid infinite recursion, the maximum recursion depth is limited to the number of production rules.

• The grammar tree depth is limited to four.

• The maximum number of choices in a production rule is four, given by <op>.

• The maximum length of a chromosome, avoiding wrapping and limiting recursion, is 18.

• Each integer codon in the chromosome has a maximum variation. This value depends on the location of the codon and the grammar, and helps reduce the search space of chromosomes.

• The grammar, with recursion limited to four, can create 18500 different expressions.
GrammarMap maps a sequence of integers (the genotype in evolutionary algorithms) to a symbolic expression (the phenotype). The example below converts a numeric genome to its analytical phenotype, using the verbose argument to show the steps of the mapping.
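A minimal example, using an illustrative chromosome (the codon values of Table 2 are not reproduced here); with the grammar above, the three codons select <coef>*<var>, then c1, then v2:

R> chromosome <- c(1, 0, 1)
R> GrammarMap(chromosome, grammarDef, verbose = TRUE)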

Exhaustive and random search in grammar
Context-free grammars are a general way of describing program structures, not bound to evolutionary optimisation. As a result, gramEvol additionally supports exhaustive and random search.
The first step in any optimisation is defining a cost function. This function receives an expression generated using the grammar, and returns an appropriate score. For example, in order to find the numeric sequence that generates a certain expression, a cost function can return the generalised Levenshtein distance between the current expression and the target, as sketched after this paragraph. The objective is to find a suitable chromosome, and therefore the expression, that minimises the cost function, i.e., the string distance. GrammaticalExhaustiveSearch performs an exhaustive search to find this expression:

R> GrammaticalExhaustiveSearch(grammarDef, evalFunc)
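The evalFunc used above can be sketched as follows, assuming an illustrative target string (the original target expression is not reproduced here); adist, available in base R, computes the generalised Levenshtein distance:

R> targetString <- "(c1*v1) - (c2*v2)"
R> evalFunc <- function(expr) {
+    # string distance between the candidate expression and the target
+    adist(as.character(expr), targetString)
+  }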

Evolving a grammar
GrammaticalEvolution uses evolutionary optimisation to find the minima of evalFunc. Continuing the previous example, the best chromosome is determined by calling GrammaticalEvolution with the grammar and the cost function as parameters. Other details, such as the size of the population and the number of iterations, are automatically chosen by an internal heuristic:

R> result <- GrammaticalEvolution(grammarDef, evalFunc, terminationCost = 0)
R> print(result, show.genome = TRUE)

Grammatical Evolution Search Results:
  No. Generations:  3
  Best Genome:

It is evident that the evolutionary algorithm has quickly converged to the optimisation objective.
GrammaticalEvolution allows monitoring the status of each generation using a callback function. This function, if provided via the parameter monitorFunc, receives an object similar to the return value of GrammaticalEvolution. For example, the following function prints information about the current generation and the best chromosome in the current generation:

R> customMonitorFunc <- function(results) {
+    print(results)
+  }
R> ge <- GrammaticalEvolution(grammarDef, evalFunc, terminationCost = 0,
+    monitorFunc = customMonitorFunc)

Internally, GrammaticalEvolution uses GeneticAlg.int, a GA implementation with integer codons partially based on the genalg package by Willighagen (2014):

• Using the information obtained about the grammar (e.g., the number of possible expressions and the maximum sequence length), GrammaticalEvolution applies a heuristic algorithm based on the work of Deb and Agrawal (1999) to automatically determine suitable values for the popSize (i.e., the population size) and iterations (i.e., the number of iterations) parameters.
• The ordinary cross-over operator is considered destructive when homologous production rules are not aligned, such as for cyclic grammars (O'Neill, Ryan, Keijzer, and Cattolico 2003). Consequently, GrammaticalEvolution automatically changes cross-over parameters depending on the grammar to improve optimisation results.
• Each integer chromosome is mapped using the grammar, and its fitness is assessed by calling evalFunc (i.e., the cost function).
• After reaching a termination criterion, e.g., the maximum number of iterations or the desired terminationCost, the algorithm stops and returns the best expression found so far.
• GrammaticalEvolution also supports multi-gene operations, generating more than one expression per chromosome using the numExpr parameter.

Non-terminal expressions
As demonstrated in Section 3.1, a cyclic grammar allows complex expressions to be derived from a compact description. However, if the chromosome is too short, the expression may still contain non-terminal symbols even after wrapping multiple times. For example:

R> chromosome <- c(0)
R> expr <- GrammarMap(chromosome, grammarDef)
R> expr

R> GrammarIsTerminal(expr)
[1] FALSE

GrammaticalEvolution and other search functions automatically filter non-terminal expressions, and the user does not need to worry about them in practice.

Grammatical evolution for machine learning
In this section, three applications of grammatical evolution in statistics and machine learning are explored. Other applications, such as symbolic regression and regular expression discovery using package rex (Ushey and Hester 2014), are explained in the package's vignette.

Model selection and hyper-parameter optimisation
Selecting the best learning model in a machine learning task is often performed in three steps:

• Feature selection, where different features are selected as inputs to a learning model.

• Model selection, where candidate learning models are compared and one of them is selected.

• Hyper-parameter optimisation, where the hyper-parameters of the model are optimised for the current objective (e.g., the kernel type and parameters for kernel methods).
Due to their importance, dedicated packages such as caret (Kuhn 2014) support feature selection and hyper-parameter optimisation for many machine learning techniques. Extending these packages to support new algorithms, or combining additional steps into their operation, however, requires structural changes to the package's code. In this section, we show how CFGs can offer an easily extensible framework for simultaneous feature selection, model selection and hyper-parameter optimisation.
Here, the ChickWeight dataset (R Core Team 2014) is used to demonstrate these steps. The objective is to learn the weight of a chicken based on the Time passed since its birth and its Diet. The Chick identifier is also included. A grammar describing the valid combinations of features, models and hyper-parameters is then defined; its main components are as follows (a condensed sketch of the grammar is given after this list):
• The <model> can be either an lm, an svm or an nnet, and is wrapped in capture.output to suppress the diagnostic messages printed by nnet, which are not needed here.
• Each learning algorithm has its own set of hyper-parameters: nnet's hidden layer size is determined using <nn.size>, and svm uses <svm.hyperparam> to select its kernel and its associated parameter in one step. Here, .() is used to avoid premature interpretation of the assignment operator and comma (i.e., = and ,) by R.
The remaining rules assign certain ranges of values to the different hyper-parameters, similar to an ordinary grid search.
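A condensed sketch of such a grammar is given below. The rule names follow the description above, but the nnet branch is omitted, learningData is an assumed name for the training subset, and the hyper-parameter ranges are illustrative rather than the original rule set:

R> library("gramEvol")
R> library("e1071")
R> ruleDef <- list(
+    model = grule(lm(weight ~ features, data = learningData),
+                  svm(weight ~ features, data = learningData,
+                      svm.hyperparam)),
+    features = grule(Time, Diet, Time + Diet),
+    svm.hyperparam = grule(.(kernel = "radial", gamma = 10^gamma.pow),
+                           .(kernel = "polynomial", degree = poly.deg)),
+    gamma.pow = grule(-2, -1, 0, 1),
+    poly.deg = grule(2, 3, 4))
R> grammarDef <- CreateGrammar(ruleDef)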
An example of an expression generated by this grammar uses Diet as a feature, and an ANN with four neurons in its hidden layer as its model. To assess each model, a cost function is required. In this example, we define a simple cross-validation test, returning the out-of-sample mean square error (MSE). The eval.chicken function first evaluates the expression to obtain its underlying function. This function is then applied to the training data to obtain a model. If the model is NULL, i.e., some error has occurred during training, a very high cost is returned. Otherwise, the model is used on the testing data, and the MSE of the results is returned.
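A sketch of such a cost function, assuming trainingData and testingData hold the two ChickWeight folds and that the generated expression evaluates to a one-argument training function (all names here are illustrative):

R> eval.chicken <- function(expr) {
+    learner <- eval(expr)                  # obtain the generated learner
+    model <- tryCatch(learner(trainingData), error = function(e) NULL)
+    if (is.null(model))
+      return(Inf)                          # penalise failed training
+    pred <- predict(model, newdata = testingData)
+    mean((testingData$weight - pred)^2)    # out-of-sample MSE
+  }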
To find the best combination of features, model and hyper-parameters, GrammaticalEvolution is run with the appropriate grammar and cost function:
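A minimal sketch of this call (the original invocation is not reproduced here):

R> result <- GrammaticalEvolution(grammarDef, eval.chicken)
R> print(result)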
To compare the performance of GE and exhaustive search, GE was run 100 times, with the termination condition set to reaching the global optimum obtained by the exhaustive search. The error, the number of generations and the duration of execution were measured. The tests were performed on a single thread on a 3.40 GHz Intel Core i7-2600 CPU. To ensure reproducibility, set.seed(0) was executed before running the code. The results are presented in Table 3. Overall, GE's average execution time is better than that of the exhaustive search. It must be noted, however, that as GE is a stochastic optimisation, on some occasions it was unable to find the global minimum before reaching the maximum number of allowed iterations, which in this example was limited to 108 generations, set automatically by GrammaticalEvolution. As a result, the optimisation sometimes terminated prior to reaching the global optimum.
The machine learning approach used in this section was intentionally kept simple. Other learning algorithms can be added as additional rules, each with their own hyper-parameters. Different options, such as scaling or dimension reduction techniques, can also be added to the <learner> function, each described using separate rules.

Classification
In the second example, we use GE for classification. There are many ways that GE can be adapted for classification, e.g., a model selection on classifiers similar to Section 4.1. Here, we directly define a grammar which takes input variables and returns the classification result, with a structure similar to a decision tree.
In this example, the objective is defined as separating Iris versicolor from the other species in the Iris flower dataset. In this grammar, the start symbol, <result>, receives a TRUE/FALSE value and returns either 'versicolor' or 'other'. The TRUE/FALSE value is generated by recursively applying boolean operators to <sub.expr>s. In turn, each <sub.expr> is created by a <comparison> of a <var> from the Iris features with another value created using <func.var>. Here, the data is evaluated from a data frame instead of the program's environment.
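A condensed sketch of such a grammar, with illustrative comparison operators and numeric constants (the original rule set contains more alternatives):

R> ruleDef <- list(
+    result = grule(ifelse(sub.expr, "versicolor", "other")),
+    sub.expr = grule((sub.expr) & (sub.expr), (sub.expr) | (sub.expr),
+                     comparison),
+    comparison = grule(var > func.var, var < func.var),
+    var = grule(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
+    func.var = grule(num, var + num, var - num),
+    num = grule(1, 1.5, 2, 2.5, 3))
R> grammarDef <- CreateGrammar(ruleDef)
R> # candidate expressions are then evaluated with the iris
R> # data frame supplying the variable values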
Table 4 summarises the performance of the GE classifier over 100 executions. As no termination condition was given, all of the runs terminated only after reaching the maximum allowed number of generations. It is evident that, on average, GE is able to find an acceptable expression within this limit.

Symbolic regression and feature generation
Symbolic regression is the process of discovering a function, in analytical form, which fits a given set of data. Commonly, evolutionary algorithms such as GP and GE are used for this purpose.

Feature generation is the process of deriving new features from existing features (Guo, Jack, and Nandi 2005). In this technique, an evolutionary algorithm is used to generate and combine the results of multiple independently discovered expressions, e.g., by using a linear combination of GP results (Keijzer 2004; Costelloe and Ryan 2009), or by using non-linear function estimators applied to GE (de Silva, Noorian, Davis, and Leong 2013). This can be considered a hybrid of machine learning and symbolic regression, as the final learning model is constructed from a combination of simpler features created through a process similar to symbolic regression.
For example, consider learning a sextic polynomial from numeric data. Evolving an expression that matches the observed data to this polynomial would either require a very well crafted grammar, or a successful search over a huge space, both of which are extremely computationally expensive.
However, linear dependencies exist between the components of this function. By designing a multi-gene chromosome, we can generate individual expressions independently and then combine them through a linear regression model to create the final expression. This effectively breaks the search space into several smaller ones, enabling a faster search over the whole space. Figure 4 illustrates the difference between these two approaches. To compare ordinary symbolic regression with feature generation, the two approaches are benchmarked using the same grammar:

R> ruleDef <- list(expr = grule(op(expr, expr), func(expr), var),
+                  func = grule(sin, cos, log, sqrt),
+                  op = grule('+', '-', '*'),
+                  var = grule(X, X^n, n),
+                  n = grule(1, 2, 3, 4))
R> grammarDef <- CreateGrammar(ruleDef)

The grammar can be used to generate many different types of expressions. In the results above, all the elements of the sextic equation are found within five expressions. Four of them (X^2, X^3, X^4 and X^6) are found separately, and the remaining expression, X + X^3 * X * ((X + 2) * 1), contains the linear combination X + 2X^4 + X^5. As X^4 is already present separately, the linear regression can extract and combine all elements with the correct y-intercept. Consequently, the regression model f(X) perfectly matches the original model.
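A sketch of the multi-gene setup is shown below, assuming X and y hold the sampled inputs and responses of the target polynomial (names illustrative) and that, with numExpr set, the cost function receives the list of candidate expressions. Each expression is evaluated as a feature column, the columns are combined by lm, and the residual error is returned:

R> evalFunc <- function(exprs) {
+    # evaluate each evolved expression as a candidate feature column;
+    # 0 * X broadcasts constant expressions to full-length vectors
+    features <- sapply(exprs, function(e) eval(e, list(X = X)) + 0 * X)
+    if (any(!is.finite(features))) return(Inf)
+    fit <- lm(y ~ features)         # linear combination with intercept
+    sqrt(mean(fit$residuals^2))     # root mean square error
+  }
R> result <- GrammaticalEvolution(grammarDef, evalFunc, numExpr = 5)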

Comparison
To test the stochastic performance of GE with single and multiple genes, each method was run 100 times and its error from the target equation was noted. The results are presented in Table 5. They show major improvements in error, from an average of 9.12 × 10^10 for symbolic regression to a worst case of 1.48 for the feature generation approach. In comparison, the average time required by the two approaches was almost equal.

Conclusion
Context-free grammars provide a concise and versatile mechanism for expressing families of programs. Combined with evolutionary optimisation, grammatical evolution creates a powerful framework that allows integration of domain specific knowledge, defined using a grammar, into real-world applications.
The gramEvol package allows creation of native R programs using GE. After specifying a grammar and an evaluation function, users can employ GE techniques with little additional code. Parallel execution is also supported via parallel computing functions within R.
One disadvantage of GE lies in its stochastic nature, as it does not guarantee convergence to the global optimum. The gramEvol package includes an exhaustive search option which can ensure an optimal solution at the expense of computation time.

Figure 1: The evolutionary process and associated operators.

Figure 3: Classification of Iris versicolor using GE.


Table 3: Summary of GE's performance for 100 runs of the model selection example.

Table 4: Summary of GE's performance for 100 runs of the classification example.

Table 5: Performance comparison of symbolic regression and feature generation using GE for 100 runs.