Software for Implementing the Sequential Elimination of Level Combinations Algorithm

Genetic algorithms (GAs) are a popular technology to search for an optimum in a large search space. Using new concepts of forbidden array and weighted mutation, Mandal, Wu, and Johnson (2006) used elements of GAs to introduce a new global optimization technique called sequential elimination of level combinations (SELC), that efficiently finds optimums. A SAS macro, and MATLAB and R functions are developed to implement the SELC algorithm.


Introduction
Locating optimal values in a large search space is a primary goal for many scientific problems. In the pharmaceutical industry, for example, work has primarily focused on synthesizing new compounds that are effective against a particular disease or condition. Historically, chemists have used scientific knowledge to build new compounds that have promising properties for alleviating a condition or disease. While compounds may be designed to be theoretically effective, often their interactions with other parts of the body (e.g., liver, kidney, intestine, etc.) render them ineffective or toxic (Welling 1997). Hence, physically creating one compound at a time has become a less effective drug discovery approach.
As robotic technology has improved, chemists have been able to synthesize and explore a large number of new compounds. This technology, known as combinatorial chemistry, has been widely applied in the pharmaceutical industry and is gaining interest in other areas of chemical manufacturing (Leach and Gillet 2003;Gasteiger and Engel 2003). In short, combinatorial chemistry uses robotics to combine sets of monomers to create thousands of new compounds at a time. This technology has been used to enhance the diversity of compound libraries, to explore specific regions of chemical space (i.e., focused library design), and to optimize one or more pharmaceutical endpoints such as target efficacy or ADMET (absorption, distribution, metabolism, excretion, toxicology) properties (Rouhi 2003).
In a typical combinatorial chemistry problem, a core molecule is identified to which monomers are attached at multiple locations. Each attachment location may have tens or hundreds of potential monomers. Clearly, the full combinatorial library can become dauntingly large for a core molecule with just a few attachment points. Constrained by resources, most combinatorial libraries cannot be fully created. Instead, chemists use their scientific knowledge to identify the most promising monomer combinations and the monomer combinations to avoid. By construction, the SELC is an ideal method for searching for optimal molecules in combinatorial chemistry (Mandal et al. 2006).
The SELC method is also useful in computer experiments. Much research which in the past decades could only be conducted by performing physical experiments, can now be done by computer experiments instead. In a computer experiment, a response, y(x), is calculated for each set of input variables, x, using numerical methods implemented by (complex) computer code (Santner, Williams, and Notz 2003). In such cases, the complex numerical method can be thought of as a "black box", and the SELC method can be used to select the optima efficiently.
We can consider such real-life scenarios as large-dimensional design-of-experiment problems where the main challenge is to identify the optimal design settings. In scientific and engineering research, statistical design and analysis of experiments is an effective and commonly used tool to understand and/or improve a system. Identifying important factors and choosing factor levels are among the first and most fundamental issues for an experimenter. But when there are a large number of important factors, designing an experiment can be difficult. Classical experimental design relies heavily on algebraic properties such as orthogonality (Hedayat, Sloane, and Stufken 1999). However, orthogonality does not allow the flexibility to accommodate all kinds of promising follow-up runs, which, in turn makes finding suitable designs for large-scale problems difficult, particularly when the factors have more than two levels.
The use of high-fidelity computer simulations of physical phenomena (Bates, Buck, Riccomagno, and Wynn 1996) has stimulated new research into ways in which experimental design can be applied to such problems. Greedy algorithms are popular choices for these types of problems (Cormen, Leiserson, Rivest, and Stein 2001). In short, a greedy algorithm identifies the direction of an optimum at each stage and searches toward it. But for a complex response surface, the identified direction of the optima may not be the direction of the global optimal. However assuming some kind of regularity, the global maximum will not be in the vicinity of the local minimum and vice versa. One technique, motivated by design of experiments, was introduced by Wu, Mao, and Ma (1990), and is known as sequential elimination of levels (SEL). The idea of SEL is opposite to that of greedy algorithms; instead of focusing on factor levels that improve the response, SEL focuses on those levels that worsen the response. Based on this idea, SEL eliminates one level of each factor in each sequence of the experiment. When all factors are independent and the response surface is smooth, SEL can accurately locate optimal design points. However, when factors interact or the response surface contains local optimums, SEL does not accurately identify optimal design points. In these more realistic situations, SELC, the modified version of SEL using ideas from GAs, has been shown to be able to be a more effective optimization technique.
GAs have most often been viewed from a biological perspective. The metaphors of natural selection, cross breeding, and mutation have been helpful in providing a structure to explain how and why GAs work. Thus, most practical applications of GAs are rooted in the context where optimization is the primary goal. In order to understand how GAs function as optimizers, Reeves and Wright (1999) considered GAs to be a form of sequential experimental design. Recently, GAs have been successfully applied in solving statistical problems, particularly for searching for near-optimal designs (Hamada, Martz, Reese, and Wilson 2001;Heredia-Langner, Carlyle, Montgomery, Borror, and Runger 2003;Heredia-Langner, Montgomery, Carlyle, and Borror 2004). Because of the desire to explore increasing large experimental spaces in the pharmaceutical and engineering industries, there is an imminent need for software that implements new methodology (Willis 2007).
In this work we develop software across three widely used platforms: SAS (SAS Institute Inc. 2003), R (R Development Core Team 2008), and MATLAB (The MathWorks, Inc. 2007). Each implementation is straightforward and simple to use, and only requires the base platform of each software. Each function is freely available along with this paper at http://jstatsoft. org/v25/i06/. The article is organized as follows: Firstly, we review the GAs and the SELC method in Section 2. In Section 3, we provide the implementation of the algorithm in SAS, MATLAB, and R, and illustrate the algorithm using all three implementations. Finally, we conclude this work in Section 4.

Methodology review 2.1. Genetic algorithms
Genetic algorithms (GAs) are a stochastic optimization tool that were inspired by Darwin's theory of evolution (Holland 1975(Holland , 1992. The basic idea of GAs is to solve optimization problems via an evolutionary process which results in better solutions based on good solutions. The basic steps of a GA are as follows: 1. Solution representation: Each observation must be represented by a chromosome that defines its characteristics. 2. Selection: Identify the best chromosomes based on a fitness criterion (e.g., the larger the response, the better the fitness for a maximization problem).
3. Reproduction: Two chromosomes are randomly chosen, weighted by their fitness values, and the following operations are performed.
(a) Crossover: Split the pair of chromosomes at a same random location, and combine the head part of one with the tail of the other and vice versa. (b) Mutation: Change the level of randomly chosen factor(s) for the newly created offspring.
4. Repeat steps 2 and 3 until some convergence or stopping criterion is met.
To illustrate how a GA works, consider optimizing the function, Based on their fitness scores, x values of 10 and 20 have the highest probability of being selected, and without loss of generality, suppose that these two chromosomes are selected. If the randomly selected crossover location is between the second and third gene from the left, then the new offspring would be:

SELC algorithm
As mentioned in the Introduction, the SELC was proposed as an alternative to the SEL when seeking optimums in a high dimensional space where factors interact or where local optimums exist. The unique features of the SELC that allow it to find optimums in these settings are the forbidden array and weighted mutation scheme and these concepts are summarized below.

Forbidden array
The forbidden array is defined by its strength and order and contains design points that have demonstrated poor fitness values or are a priori known to produce undesirable fitness values. The strength of the array defines the number of design points placed into the array at each iteration of the algorithm. More specifically, a forbidden array with strength s consists of the s worst runs of the experiment at each iteration of the algorithm. The order of the forbidden array defines the factor combinations to be prevented from being constructed in subsequent iterations of the algorithm. A forbidden array with order k, for example, indicates that any combination of k or more levels from any design point in the forbidden array be prevented from being constructed in subsequent iterations of the algorithm. Thus, strength and order define the size of the forbidden array: as strength increases and order decreases, the number of forbidden design points increases.
We shall use an example from combinatorial chemistry to illustrate the construction of the forbidden array. On a core molecule, suppose that three monomers (denoted by 1, 2, and 3) can be added to each of three locations (denoted by A, B, and C). Note that in this example the monomers act as genes and form the chromosomes, which are the compounds to be evaluated. Suppose that 9 of the possible 27 compounds are created and analyzed (y is the fitness criterion): A B C y 1 1 1 10.1 1 2 2 53.6 1 3 3 43.8 2 1 2 13.4 2 2 3 46.9 2 3 1 55.1 3 1 3 5.7 3 2 1 43.6 3 3 2 47.0 In this case, the objective is to find compounds that maximize y. Therefore, the forbidden array will consist of compounds that have the lowest values of y. If the forbidden array has strength = 2 and order = 2, then it is constructed using the compounds A B C 1 1 1 3 1 3 and is defined as A B C 1 1 * 1 * 1 * 1 1 3 1 * 3 * 3 * 1 3 where * is the wildcard which represents any admissible value.

Weighted mutation
The second feature that makes the SELC algorithm unique is its weighted mutation scheme. In a traditional GA, genes at each loci mutate at random with small positive probability. This approach is unbiased toward any level of any factor in the experiment. However, upon collecting data we begin to gain insight about the effect of each factor or factor combinations on the fitness criterion. This information can then be used to allow the algorithm to focus on factors and levels of factors that improve the fitness criterion. In the weighted mutation scheme of the SELC algorithm, we use linear regression to determine significant main effects and pairwise interactions. (Because of the effect hierarchy principal (Wu and Hamada 2000), we do not explore beyond pairwise interactions.) If a factor, F j , has a significant main effect and no significant pairwise interactions, then the mutation probability for each level, l, of the factor is proportional to the average fitness of that level for the data collected thus far in the experiment p j l ∝ȳ(F j = l), for j = 1, 2, . . . , J, and l = 1, 2, . . . , L.
If two factors, F j and F k , have a significant interaction, then the probability of mutation is joint on F j and F k . When either factor is chosen, then the mutation will be weighted jointly with probability p j l km ∝ȳ(F j = l, F k = m), for j, k = 1, 2, . . . , J, and l, m = 1, 2, . . . , L.
If the selected factor does not have a significant main effect or interaction, then its value will be changed to any possible level with equal probability.

The SELC algorithm
The SELC algorithm combines orthogonal arrays, genetic algorithms, forbidden arrays, and weighted mutations as follows: 1. Begin with an initial design. Given no prior information about the design space, the SELC should be initiated with an orthogonal array, which helps to estimate a large number of factor effects efficiently. If there is prior knowledge about the design space or about design points that should initially be included in the forbidden array, then the algorithm should begin at step 3.

Create the selected design points; stop if the stopping criterion is achieved (see below).
3. Construct the forbidden array and calculate the weighted mutation probabilities.
4. Determine the new offspring using a genetic algorithm with weighted mutation probabilities.
5. Check offspring eligibility. An offspring is eligible if it is not prohibited by the forbidden array and has not been previously created. If an offspring is not eligible, then discard it and return to step 4.
6. Repeat steps 4 and 5 until the number of desired new offspring are achieved.
The stopping rule is subjective and depends on the progression of the algorithm and experimental constraints. As runs are added, the experimenter can monitor the progress towards optimization of the fitness criterion. Once a satisfactory level of fitness has been achieved the algorithm can be stopped. Alternatively, resources may limit the number of iterations of the algorithm.

Code description
To make the SELC algorithm available to a wide range of users, we have implemented it in SAS, R, and MATLAB. These are available along with this paper. Each implementation requires a data set that contains the initial design points and measured response. In addition, for all three implementations, the user can specify an a priori forbidden array.

SAS implementation
For the SELC macro, the factors in the initial design data set must have the variable names X1, X2, . . ., Xn, and the response must be named Y. Similarly, the factors in the forbidden array data set must have the variable names X1, X2, . . ., Xn.
Prior to calling the SELC macro, the user must first initialize a sequence of global macro variables that define the number of levels of each factor. For example, if an experiment has four factors, each with ten levels, then the user must specify the following code: %let L1 = 10; %let L2 = 10; %let L3 = 10; %let L4 = 10; %global L1 L2 L3 L4; The user can the call the SELC macro, which has the form: Default value is clock time. When the macro is called, it generates a data set named NEXT_GEN that contains the design points to be performed in the next experiment.

R and MATLAB implementation
For both the R and MATLAB code, the initial design can be provided in a text file. The file should not have a header; instead, the column position indicates the factors X 1 , X 2 , . . . , X n , and the last column should contain the response, y. For R and MATLAB, the user can also specify forbidden array data set, where the column position corresponds to each factor.
The call to the R and MATLAB functions have the same form: SELC(initialdesign, forbiddenarray, strength, order, level_true, direction, number_of_offspring, seed) The arguments to these functions are defined similarly as in the SAS macro.

Example
An example from a combinatorial chemistry problem is used to illustrate how the SELC software works. Onto a core molecule, monomers can be added to two positions ( Figure 1). Five monomers are selected for the core, 34 for location A, and 241 for location B; the objective is to find combination(s) of monomers that produce highly active compounds. In this experiment, we start with an initial design and a priori forbidden array. The initial design and forbidden arrays for this example are available electronically along with this paper.
A core B Figure 1: Hypothetical example from combinatorial chemistry.

SAS example
To generate the next set of five design points using the SELC SAS macro, we must specify the following code: Often no information will be available for the forbidden array. If this is the case, simply create a forbidden array data set that contains the names of the factors, but does not contain any design points. For example: data forbiddenarray; input x1-x3; cards; run;

Summary and concluding remarks
Because experimental design spaces are becoming increasingly large, there is an imminent need for methods to efficiently explore these spaces and for software to implement these methods. Indeed, selecting an optimal design in an extremely large search space is not an easy or straightforward task. For these types of problems, the SELC algorithm has been shown to efficiently and effectively identify optimums. To make the implementation of this method widely available, we have introduced software for implementing the SELC algorithm across three commonly used platforms in both academics and industry: SAS, R, and MATLAB. Each implementation is straightforward and easy to use, requiring only that data sets be prepared in the same format for input to each piece of software.
It is important to note that the software implementations provided in this work have been optimized for large design spaces consisting of a relatively small number of factors each containing a large number of levels. In addition, SELC is not designed to optimize a specific objective function. Instead, it is designed to identify new, promising, feasible design points for follow-up. Lastly, because each software package use different random number generators, the next generation of suggested design points will differ among SAS, R, and MATLAB. However, each implementation will identify key level combinations that are important for predicting the response.