Comp Mat Science Baumes Collet

COMMAT 2788, No. of Pages 14, Model 5G

10 July 2008 Disk Used
ARTICLE IN PRESS
Computational Materials Science xxx (2008) xxx–xxx

Contents lists available at ScienceDirect
Computational Materials Science

journal homepage: www.elsevier.com/locate/commatsci
Examination of genetic programming paradigm for high-throughput experimentation and heterogeneous catalysis

Laurent A. Baumes a,*, Pierre Collet b

a Max-Planck-Institut für Kohlenforschung, Mülheim, Germany and CNRS-IRCELyon, Villeurbanne, France
b LSIIT, (UMR 7005) du CNRS, Université Louis Pasteur, Strasbourg, France

ARTICLE INFO

Article history:
Available online xxxx

Keywords: Genetic programming; Heterogeneous catalysis; High-throughput; Materials; Combinatorial; Representation; Data structure

ABSTRACT

The strong feature dependencies that exist in catalyst description prevent the use of common algorithms without losing crucial information. Data treatments are restricted by the form of the input data, which makes the full use of the experimental information impossible, confines the experimentation studies, and undermines one of the primary goals of high-throughput experimentation (HTE): to enlarge the search space. Consequently, an advanced representation of the catalytic data supporting the intrinsic complexity of heterogeneous catalyst data structure is proposed. Likewise, an optimization strategy that can efficiently manipulate such a data type, permitting a valuable connection between algorithms, high-throughput (HT) apparatus, and databases, is depicted. Such a new methodology enables the integration of domain knowledge through its configuration considering the study to be investigated. For the first time in heterogeneous catalysis, a conceptual examination of genetic programming (GP) is presented.

© 2008 Elsevier B.V. All rights reserved.

* Corresponding author. Address: Instituto de Technologia Quimica, UPV-CSIC, Av. de los Naranjos s/n, E-46022, Valencia, Spain. Tel.: +34 963 877 806; fax: +34 963 877 809.
E-mail address: baumesl@itq.upv.es (L.A. Baumes).
0927-0256/$ - see front matter © 2008 Elsevier B.V. All rights reserved.
doi:10.1016/j.commatsci.2008.03.051
Please cite this article in press as: L.A. Baumes, P. Collet, Comput. Mater. Sci. (2008), doi:10.1016/j.commatsci.2008.03.051

1. Introduction

Materials science is an interdisciplinary field which investigates the relationships between the structure, the properties, and the performance of materials. Here, we focus on heterogeneous catalysts, which provide a surface for the chemical reaction to take place on. Conventional catalyst development relies essentially on fundamental knowledge and know-how. It implies a rational design of each experiment according to the previous ones and other information available in the literature. The main drawback of this approach is that it is a very time-consuming process of trial and error, making and testing one material at a time. Another drawback comes from the relative importance of intuition for the initial choices of development strategy. On the other hand, combinatorial approaches and high-throughput (HT) techniques [1] decrease the time necessary for synthesis and screening of libraries of solids. Such a complementary approach has attracted increasing attention over the last decade, and all major chemical and petrochemical companies have invested millions of dollars. However, some of the main problems remain: (i) the integration of domain knowledge, (ii) the design of experiments (DoE), (iii) the storing of all decisions related to experimental processes, and (iv) the efficient use of the experimental data stored in databases (DBs).

The strong feature dependencies that exist in catalyst description prevent the use of common algorithms without losing crucial information. Data treatments are restricted by the form of the input data, which makes the full use of the experimental information impossible, confines the experimentation studies, and undermines one of the primary goals of HTE: to enlarge the search space. The so-called “attributes-cases” or “individuals-variables” file structure is a bi-dimensional file in which experimental cases are listed in lines (i.e., one line for one case) while the corresponding variables appear in columns. Such a structure, commonly employed as input by most of the algorithms, is inherently limited. It cannot accommodate various categories of solids, or different catalytic reactions at the same time. Consequently, the build-up of powerful DBs appears to be reduced to amassing data while minimizing the loss of information if cross-analysis or a complete exploitation of the contained information is not feasible. In this paper, an advanced representation of the catalytic data supporting the intrinsic complexity of heterogeneous catalyst data structure is proposed. Likewise, an optimization strategy that can efficiently manipulate such a data type, permitting a valuable connection between algorithms, HT apparatus, and databases, is depicted. Such a new methodology enables the integration of domain knowledge through its configuration considering the study to be investigated. For the first time in heterogeneous catalysis, a conceptual examination of genetic programming (GP) is presented.

The organization of the study is as follows. First the problem of catalytic data representation is detailed. Then the most widely
used genetic programming tree-based form is reported. From both catalysis and GP knowledge, an adapted representation is derived for the optimization and discovery of solid catalysts. It is demonstrated how such a personalized GP representation allows searching in open-ended spaces of catalytic structures. The method is shown to be efficient for the discovery of new structural forms of catalysts, while the parametric optimization can be partially transferred to common fast local approaches. Among the different tests, an example is given through a multi-objective mathematical benchmark. This choice is due to the lack of studies tackling such a difficulty, which reflects most real-world optimization problems. Finally, the genetic programming paradigm is compared to the commonly used genetic algorithms (GAs).

2. The problem of data representation

The way the experiments are encoded, i.e., the possible representation spaces according to the features investigated, determines which algorithms can take advantage of such representations to provide the expected form of results. For example, considering pharmaceutical studies, molecular descriptors are the final result of logic and mathematical procedures which transform chemical information encoded within a symbolic representation of a molecule into useful numbers. However, in contrast to molecules, a solid cannot easily be represented in a meaningful way in a computer [2]. If only the composition of a solid is encoded, many important factors are lost. Parameters such as preparation modes and heat treatments greatly influence catalytic structure and properties. Considering a hydrodesulphurization (HDS) study, one must take into account the order in the synthesis sequence since results are different if Ni is impregnated first and then Mo, or vice versa, or even simultaneously [3]. Another example dealing with zeolite activation [4] demonstrates that a fast calcination procedure without careful ion-exchange and drying produces steaming with the corresponding dealumination. On the other hand, when calcinations are carried out in progressive steps, better control of the final product properties is achieved. Consequently, the necessity of an efficient representation in heterogeneous catalysis that can fully handle such parameters, i.e., order of element additions or solid modification in the synthesis sequence, the detailed description of linked thermal programs, the parameters related to the different synthesis methods used during the preparation phase, and the precursors, should be emphasized. Every time there is communication between chemists, HT apparatus, databases, and algorithms (see Fig. 1), an adequate representation of the information to be transferred is indispensable. Each step of the HT process is detailed below, focusing on data representation.

2.1. Databases connection

The management of data, from storage to retrieval, is of great importance in combinatorial studies. Saupe et al. [5] stress the crucial role of informatics in HT experimentation applied to materials science, specifying that “Every parameter during preparation and testing may be a factor crucial for the performance of the material. As a consequence, all experimental parameters should be controlled or at least recorded to be able to identify important correlations.” All companies, Symyx, hte AG, Avantium, and recently, Bayer, DPI, and Dow, claim to have well-developed such systems [6]. On the other hand, only a few academic groups have tackled the problem of experimental data storage and management considering a broad range of catalysts [7]. In StoCat© [8], an underlying structure supporting and organizing experimental data in an interconnected way successfully minimizes the loss of scientific information which might be used for further data treatments. The general DB challenge encountered when dealing with solids is the antagonistic combination of accuracy and flexibility needed to accommodate most of the reactions using diverse materials. Although this challenging task is effectively supported by the StoCat© scheme, after the software was used as a central DB by a consortium of 10 European organizations, including academia and industries (fifth PCRD “Combicat”), it appeared that the graphical user interface (GUI) is split into too many steps (i.e., windows), making the input of data complicated. This problem is due to the
Fig. 1. The iterative procedure for heterogeneous catalyst discovery and optimization involving high-throughput technology, data storage in databases, data mining and
statistical algorithms, and chemical knowledge-interpretation, intuition. hITeQ is the new workflow platform built in ITQ supporting the communication between apparatus
and databases via various formats, among them AniML.
difficulty of supporting the architecture of complex relational schemes through simple grids in interfaces. Such a dilemma is obviously inherent to every database: the DB structure would not be necessary if a simple grid or Excel™ spreadsheet were sufficient. Even if the input of data can be facilitated by the use of “norms” such as the definition of an XML format, difficulties are more troubling when considering the retrieval of data. Let us consider a query which aims at retrieving information about the object O1, for example “catalyst”. The information to be retrieved may belong to another object O2, for example “element”. One instance of O1 can be linked to more than one instance of O2, and vice versa; for example, a given catalyst may contain various elements, and a given element may belong to various catalysts. The corresponding relational DB scheme is made of a so-called “intermediary” table between O1 and O2 (see Supplementary Material). In such a case, the query returns multiple lines for each instance of O1. A given catalyst made of Cu–Fe–Mg is described with three consecutive lines (Table 1). Whatever the DB type and the query language, the retrieval of data usually does not provide a usable file for algorithms. A more disturbing case is faced when the creation of the file is not possible, for example if the desired fields involved in the description of the experiments do not appear for every experiment. Table 1c stresses this problem.

As a result, even if the build-up of powerful DBs does permit the handling of complex catalytic data structures, the input of data by the user may be complicated and time-consuming, and the extraction of the contained information is limited and does not provide a workable file for algorithms.

2.2. Algorithms’ inputs and outputs

Another central and decisive component in combinatorial material research (CMR) is the automatic treatment of experimental data. The different approaches employed for iteratively proposing the new library of experiments to be conducted may be categorized into either “modeling” or “optimization” techniques. Modeling aims at obtaining an estimation of the figure of merit for the search space investigated. Future experiments are selected based on the expected criterion(s) and/or diversity measures [9]. Depending on the study, the selection of the approach may also consider the trade-off between accuracy and understandability/interpretability, or the ability of the model to correctly extrapolate in view of new conditions, or search space zones that are poorly explored, technically difficult, or that require higher investigation cost. Numerous techniques are employed, such as neural networks [10] and hybrid solutions [11], support vector machines [12], regression [13] and classification [14] trees, and traditional DoE [15], whereas long-established statistics is mostly ignored [13]. Considering optimization methods, the new generation to be conducted is defined with regard to one criterion to be optimized, or several if multi-objective optimization is handled, which is rarely observed. Genetic algorithms (GAs) and evolutionary strategies (ESs) [16] are principally exploited due to both the numerous past proofs of efficiency in diverse domains, and their “iterative-population-mechanism” which fits well the combinatorial loop process. Apart from modeling and optimization, very few new algorithms were proposed; for example, in Ref. [17] a new active sampling methodology aims at obtaining a
Table 1
Array resulting from a SQL query

Catalysts   Element   Order
1           Cu        1
1           Fe        1
1           Mg        2
2           Mg        1
2           V         2
2           Fe        3

Catalysts   Element   Precursor   Precursor type
1           Cu        A           1
1           Mg        B           2
2           Mg        C           1
2           Cu        A           2
2           Fe        D           3
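
The many-to-many situation discussed in Section 2.1 can be reproduced with a small sketch. It assumes a hypothetical three-table schema (catalyst, element, and an intermediary catalyst_element table holding the position in the synthesis sequence); the table and column names are illustrative and are not those of StoCat. The inserted values mirror the first array of Table 1.

```python
import sqlite3

def query_catalyst_elements():
    """Return (catalyst, element, step) rows from a many-to-many join."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE catalyst (id INTEGER PRIMARY KEY);
        CREATE TABLE element (symbol TEXT PRIMARY KEY);
        -- hypothetical intermediary table between O1 (catalyst) and O2 (element)
        CREATE TABLE catalyst_element (
            catalyst_id INTEGER REFERENCES catalyst(id),
            symbol TEXT REFERENCES element(symbol),
            step INTEGER  -- position in the synthesis sequence
        );
        INSERT INTO catalyst VALUES (1), (2);
        INSERT INTO element VALUES ('Cu'), ('Fe'), ('Mg'), ('V');
        INSERT INTO catalyst_element VALUES
            (1, 'Cu', 1), (1, 'Fe', 1), (1, 'Mg', 2),
            (2, 'Mg', 1), (2, 'V', 2), (2, 'Fe', 3);
    """)
    rows = con.execute("""
        SELECT c.id, ce.symbol, ce.step
        FROM catalyst AS c
        JOIN catalyst_element AS ce ON ce.catalyst_id = c.id
        ORDER BY c.id, ce.step, ce.symbol
    """).fetchall()
    con.close()
    return rows

def flatten(rows, elements=("Cu", "Fe", "Mg", "V")):
    """Force the rows into one fixed-width line per catalyst (presence flags only)."""
    table = {}
    for catalyst, symbol, _step in rows:
        table.setdefault(catalyst, {e: 0 for e in elements})[symbol] = 1
    return {cat: [flags[e] for e in elements] for cat, flags in table.items()}

rows = query_catalyst_elements()
print(rows)           # several lines per catalyst, as in Table 1
print(flatten(rows))  # one line per catalyst, but the order of addition is gone
```

The query returns three lines per catalyst, which is not the one-line-per-case file that most algorithms expect. The `flatten` helper is one naive way to obtain such a file, and it illustrates the cost: the synthesis order carried by `step` no longer appears in the result.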
“well-organized” distribution of experimental inputs over the search space in order to enhance recognition rates (of outputs) when later using a previously unknown modeling approach.

Even if the techniques mentioned above differ greatly in the way data are manipulated, they all share the restriction on the data format that every case is described by a pre-defined ordered list of variables. Ref. [18] illustrates the weakness of the GA representation and other inherent drawbacks. Here, such data representations (e.g., bi-dimensional arrays, chromosomes, ...) are qualified as “linear”. Such a constraint does not allow the use of information which cannot be encoded in such a way, and thus a great part of catalytic data cannot be handled. Considering trivial examples (Table 1), it can be demonstrated that, whatever the coding, the structure of the data is erased, and information loss is unavoidable if one tries to code such information in a “linear” way. For example, this representation does not handle linkage between parameters such as order or inherited properties. Therefore, it is impossible for the data treatment to produce knowledge on such information if the source is already lost in the input file. The main problem for traditional GA or ES (i.e., fixed-length representation) is that genes are pre-defined and pre-placed onto the chromosome without any representative order. GAs do not permit handling an ordered structure. The ordering of genes is actually an important research topic for GAs, called linkage learning, in which neighbouring genes represent highly correlated features; however, it does not take into account order relations in the description of the object.

Without making an exhaustive list of situations in which GA chromosomes do not permit fully handling catalytic data, an example is given in order to underline the kind of synthesis descriptions for which troubles are inherent to this linear fixed-length representation. The fixed location of genes forces the chromosome designer to specify in advance the total list of variables and to fix a code space for each of them. Each square in Fig. 6 is a gene for which a value will be assigned using a given alphabet. Therefore, for the use of the core and layers concept with a classical GA representation, one has to define a chromosome like the one in Fig. 6. This linear representation makes the chromosome composed of numerous genes and therefore, even with only a few elements in the selection pool, the resulting length becomes huge. On the other hand, in most papers dedicated to combinatorial heterogeneous catalysis the synthesis method description is often the poorest, i.e., only a qualitative variable corresponding to the name of the method is specified. In order to specify the related parameters, the chromosome should be enlarged as shown in Fig. 6 and thus new genes must be pre-defined. However, the number of parameters for different synthesis methods is not fixed and thus, some pre-defined loci will have to code for nothing. Let us now consider thermal programs. For gases, the same problem will appear as a mixture should be allowed, but the maximum number of components within the mix has to be specified in advance. According to the definition of heat treatment in StoCat™, the number of “cycles” is variable and hence the pre-defined representation will again cause trouble. The resulting chromosome will be very long compared to the length which effectively codes for something, because the listed (pre-selected) elements are not all present at the same time. This kind of chromosome is linear in form but the structure that it tries to handle is not. There are dependencies between genes. For example, if an element is not present, the gene that codes for the synthesis method has to be “empty”, i.e., a special code is needed when no synthesis is applied for the linked element. Therefore, these dependencies have to be integrated in crossover and mutation in order not to propose impossible catalysts. In this case, genetic operators must be highly constrained and, above all, these constraints are usually difficult to set up and to manage. For the same reasons, if no synthesis is applied for a given element, the related parameters must carry the code “empty”.

2.3. The knowledge

Since HT started, computers are expected to search, reorganize, and analyze the accumulation of data in order to turn such information into knowledge directly applicable to materials design. However, heterogeneous catalyst development is still largely based on empirical knowledge. The integration of knowledge inside algorithms is lacking due to the difficulty of concretely representing such information. The knowledge may come from either domain experts or DB analysis. Data mining or knowledge discovery in databases (KDD), which refers to a very interdisciplinary field consisting of using methods from several research areas to extract knowledge from real-world data sets, is broadly employed in CMR. On the other hand, very few papers about both combinatorial heterogeneous catalysis research and data mining have established an efficient use of knowledge either from expertise or DBs.

At the beginning of the 90s, the first applications dealing with domain knowledge appeared with programs based on experimental design for solid catalysts. Banares-Alcantara [19] evaluated the Design Expert for Catalyst Development (DECADE), a program that suggests catalysts for the hydrogenation of carbon monoxide, which are directly selected from a database. The expert system Integration of Catalyst Activity Patterns (INCAP) [20] selects catalysts for oxidative hydrogenations by activity indices. At BASF, Speck et al. developed the Expert System for Selection and Optimization of Catalysts [21] (ESKA) for the hydrogenation of organic compounds. Catalyst II9 system uses a theoretical kinetic approach to optimize the performance of catalysts. However, at the end of the 90s, expert systems were no longer considered, and the race toward higher screening throughput seems to have replaced the investigation of knowledge integration. On the contrary, these new tools should have stimulated this research topic. The increase of experimental data and the storage effectiveness of databases are a new source of knowledge waiting to be mined. Despite their evident complementarity, knowledge integration is totally disregarded, while extraction is always limited to single studies (on “cases-variables” arrays), and therefore the kind of knowledge that can be extracted is restricted by the representation. Among the very few studies that examine the knowledge integration theme, Ref. [18] highlights the weakness of the GA representation and proposes an adapted solution based on the hybridization of a GA and a rule-based system. Among other advantages, the methodology enables the use of knowledge as shown for a real case of toluene alkylation [22]. Such a proposition is retained and directly reused [23]. Another attempt [24] deals with knowledge integration and extraction through feature construction. The method used for producing new features applies arithmetic or Boolean functions on inherited properties of catalysts stored in a DB in order to get a direct link connecting constructed variables and the solids. The method is shown to be efficient for the discovery of new descriptors allowing an increase of the recognition rate of the selectivity of untested solids for the oxidation of propylene to propene oxide. Another example might be listed here due to its special use of information [25]. The study stresses the methodology to be employed for the automatic grouping of combinatorial catalytic experiments based on the evolution of a given performance criterion over time. Such information allows analyzing the shapes of the curves, usually related to the mechanistic aspects of the reaction. The proposed solution is shown to be of great interest, making it possible to identify different behaviours beyond coincidences at a specific time for the Heck coupling reaction using HT apparatus. Apart from these few examples, it will be shown how knowledge can be integrated through the configuration–definition of genetic programming constituents.
2.4. Representation and search space

When starting a discovery program in materials science employing HT tools, the initial conception of the search space should leave room for unexpected results. A surprising breakthrough is only possible if the designer of the search space integrates diversity among possible experiments. However, diversity is not only integrated through the number of modalities of the parameters involved in the study (i.e., size of the pool of elements or supports, levels of temperature or concentration, ...), but also through the selection of variables to be explored. Such diversity is always restricted a priori in all studies due to the impossibility of further handling the variables through the representation mode. This obviously reduces tremendously the chance of discovery, the power of the gained knowledge, and finally the interest of using a combinatorial approach. For example, despite the body of theory [26] providing theoretical evidence to complement the empirical proof of the robustness of GAs, the alphabet string representation is not that flexible, as emphasized earlier. Moreover, the capability of GAs and ESs to discover new materials may be discussed. On one hand, Schrage [27] states that “In many respects, evolution is the ultimate prototyping and simulation methodology. Evolution’s power and versatility are inarguable; its ability to innovate and surprise is overwhelming.” On the other hand, one can argue that there is nothing creative involved in the solutions generated in this way since, technically speaking, one is just finding solutions that are already out there waiting to be found. Here, for the first time, a method is proposed which enables working on an open-ended search space while taking into account the complex structure of data.

3. The genetic programming paradigm

Evolutionary algorithm (EA) methodologies are now consistently used not only in research but also for industrial and commercial problem-solving activities, demonstrating that the approach is sound and competitive. EA is an umbrella term used to describe computer-based problem solving systems, which use computational models of some of the known mechanisms of evolution as key elements in their design and implementation (Fig. 2). Over the past thirty years, strong biological metaphors led to the development of several schools of EA, including genetic programming (GP) [28,29]. A simple EA can first be summarized by Fig. 3, where different representation or encoding schemes and operators will define different EAs. Here, we concentrate on GP, which uses genetic tree-like computer programs. The aim of GP is ambitious and this approach has already attracted a great deal of attention from researchers in the field of machine learning (ML). Koza demonstrated the power and generality of GP through the potential of performing genetic-style recombination upon function-tree specification of algorithms. The tree-based GP (TGP) is also named traditional GP. GP differs from other “soft-computing” techniques, which often optimize real numbers or vectors. GP produces and processes symbolic information very efficiently. Despite this unique strength, GP has so far been applied mostly in numerical and Boolean problem domains. See Ref. [30] for non-Boolean related materials applications of GP.

3.1. The GP mechanism

The primary difference between GA and GP is that GP genomes are allowed to vary in depth and size, which can change dynamically during the evolutionary process to give a more flexible representation. Such GP open-endedness is a good way of getting around the inherent limitations of fixed coding. The tree-like representation is a hierarchical representation where the arguments of a function are represented as its descendant nodes. A function of arity n in a parse tree will have n child nodes. These arguments may be constants, variables or other functions. Two different parse trees are presented in Fig. 4. Six preliminary steps can be defined when using GP: (i) the terminals, (ii) the functions, (iii) the fitness function, (iv) the control parameters, (v) the termination criterion, and (vi) the program architecture. Here, it is considered that the reader is already familiar with GAs or ESs, and consequently, only GP specific features are discussed.

Before a problem may be solved, the alphabet from which the program trees are composed must be specified. The alphabet can be split into the function set, i.e., internal nodes of a tree, and the terminal set, which is made up of all the constants and variables for a particular problem, i.e., leaf nodes with no descendants in a tree. The first
Fig. 2. Classification of evolutionary algorithms.
Generate the initial population P(0) usually at random and set i=0
Repeat
    Evaluate the fitness of each individual in P(i)
    Select parents from P(i) based on their fitness in P(i)
    Apply search operators to the parents and get generation P(i+1)
Until the population converges or the time is up
Fig. 3. A simple evolutionary algorithm.
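
The loop of Fig. 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: individuals are fixed-length bit strings, selection is a simple truncation of the ranked population, the best individual is copied unchanged (elitism), and the fitness used in the example (the number of ones, often called OneMax) is a toy objective. None of these choices comes from the paper.

```python
import random

def evolve(fitness, genome_len=20, pop_size=30, max_gen=100, seed=0):
    """Run the loop of Fig. 3 on bit strings and return the best individual found."""
    rng = random.Random(seed)
    # Generate the initial population P(0) at random.
    pop = [[rng.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(max_gen):  # "until the time is up"
        # Evaluate the fitness of each individual in P(i).
        ranked = sorted(pop, key=fitness, reverse=True)
        # Select parents from P(i): here, simple truncation selection.
        parents = ranked[: pop_size // 2]
        # Apply search operators to get P(i+1); the best individual is kept as is.
        nxt = [ranked[0]]
        while len(nxt) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len)   # one-point crossover
            child = mother[:cut] + father[cut:]
            if rng.random() < 0.2:               # occasional bit-flip mutation
                i = rng.randrange(genome_len)
                child[i] = 1 - child[i]
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

best = evolve(sum)  # toy fitness: number of ones in the bit string
print(sum(best), best)
```

Changing the representation (bit string, real vector, tree) and the operators applied inside the loop is what distinguishes one EA school from another; the outer loop itself stays the same.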
Fig. 4. (a) A simple program x2 + ((y3) + z) in tree-like form and (b) another example of genetic programming syntax tree (OR(AND(OR(x1x2))(NAND(x2x3)))(NOR(x1x3))). Note
how the brackets which denote the order of evaluation correspond to the structure of the tree. The functions are respectively arithmetic functions such as (+, ^ and ) and
OO
Boolean functions (AND, OR, NAND and NOR) whereas the terminal sets comprise real values (x, y, z) and, respectively, Boolean input variables (x1, x2, x3).
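
As an illustration of the tree-like form, the Boolean program of Fig. 4(b) can be written as nested tuples and evaluated recursively. The encoding (a tuple whose first item is the function name, and strings for terminals) is an assumption made for this sketch, not the authors' implementation.

```python
# Function set: each entry maps a name to a 2-arity Boolean function.
FUNCTIONS = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "NAND": lambda a, b: not (a and b),
    "NOR": lambda a, b: not (a or b),
}

# (OR (AND (OR x1 x2) (NAND x2 x3)) (NOR x1 x3)), as in Fig. 4(b).
TREE_B = ("OR",
          ("AND", ("OR", "x1", "x2"), ("NAND", "x2", "x3")),
          ("NOR", "x1", "x3"))

def evaluate(node, env):
    """Evaluate a parse tree; terminals are looked up in env."""
    if isinstance(node, str):      # leaf node: a terminal
        return env[node]
    name, *args = node             # internal node: a function and its children
    return FUNCTIONS[name](*(evaluate(arg, env) for arg in args))

def size(node):
    """Number of nodes in the tree."""
    return 1 if isinstance(node, str) else 1 + sum(size(a) for a in node[1:])

def depth(node):
    """Number of nodes on the longest path from the root to a leaf."""
    return 1 if isinstance(node, str) else 1 + max(depth(a) for a in node[1:])

print(evaluate(TREE_B, {"x1": True, "x2": False, "x3": False}))
print(size(TREE_B), depth(TREE_B))
```

A function of arity n has exactly n children in this encoding, and nothing fixes the size or depth of the tree in advance, which is the property that distinguishes GP genomes from fixed-length chromosomes.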
Fig. 5. Standard GP crossover (left) and unviable programs (right). The function ‘if...then...else’ has arity 3: its first argument is Boolean and the two others are real. ‘>’ returns “true” if the first element is greater than the second one. ‘+’ and ‘^’ are the usual arithmetic functions. ‘Max’ is a 1-arity function returning the maximum real value among the connected list. On the right hand side parents 1 and 2 are, respectively, equal to 1 and 8. However, on left hand side, after crossing over the two viable parents 1 and 2, an unfeasible offspring is obtained as the type of arguments for each function is not respected.
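
The type problem illustrated by Fig. 5 can be sketched as follows. The trees, the signature table and the helper names are hypothetical and do not reproduce the exact programs of the figure; the sketch only shows how an untyped subtree exchange can yield an unviable program, and how checking return types before the exchange (the idea behind strong typing) rejects it.

```python
# name -> (argument types, return type); these signatures are illustrative assumptions.
SIGNATURES = {
    "if": (("bool", "real", "real"), "real"),
    ">": (("real", "real"), "bool"),
    "+": (("real", "real"), "real"),
}

def return_type(node):
    """Return the type produced by a tree, or raise TypeError if it is ill-typed."""
    if not isinstance(node, tuple):
        return "real"              # terminals: real constants or variables
    name, *args = node
    arg_types, out_type = SIGNATURES[name]
    if len(args) != len(arg_types):
        raise TypeError(f"{name} expects {len(arg_types)} arguments")
    for arg, expected in zip(args, arg_types):
        if return_type(arg) != expected:
            raise TypeError(f"{name} expects a {expected} argument")
    return out_type

def is_viable(node):
    try:
        return_type(node)
        return True
    except TypeError:
        return False

def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replace_subtree(tree, path, new):
    """Untyped exchange: put `new` at `path` whatever its type."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new),) + tree[i + 1:]

def typed_replace(tree, path, donor):
    """Typed exchange: accept the donor only if it returns the type it replaces."""
    if return_type(get_subtree(tree, path)) != return_type(donor):
        return None
    return replace_subtree(tree, path, donor)

parent = ("if", (">", "x", 1.0), ("+", "x", 2.0), 3.0)
donor = ("+", "y", 1.0)                         # a real-valued subtree
untyped = replace_subtree(parent, (1,), donor)  # replaces the Boolean condition
print(is_viable(parent), is_viable(untyped))
print(typed_replace(parent, (1,), donor))       # rejected: bool slot, real donor
print(typed_replace(parent, (2,), donor))       # accepted: real slot, real donor
```

Rejecting the exchange is only one possible policy; an implementation could instead search the donor for a subtree of the required type.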
requirement of great importance is to ensure that they are capable of expressing the solution to the problem. Both functions and terminals must return a value, and this value should provide a valid argument to each of the functions in the function set. This important constraint is known as closure. However, there are different ways of relaxing this constraint, such as dynamic typing, which enforces closure while allowing multiple data types. A better way to enforce data type constraints is to use strong typing and hence to generate only trees which satisfy these constraints. Closure influences the choice of terminals and functions, and so the problem representation and thus problem difficulty. While both GAs and GP employ operators, their implementation must be tailored to the representation. The right-hand side of Fig. 5 shows how crossover creates unviable programs because of the incompatibility of data types. In order to overcome closure constraint limitations, “strong typing” can be introduced to the algorithm. The type system indirectly and a priori specifies constraints through the types of arguments of the functions and the return types of both the functions and terminals. The constraints can be implicit and come with the type system, or they can be explicit and hand crafted by the user. The type system, therefore, constrains the search space by only allowing a subset of the combinations of symbols from the alphabet to be combined. In addition to the type constraints, strongly typed GP (STGP) allows the use of generic functions and generic data types. STGP lifts the closure requirement by implementing mechanisms that allow only type correct programs to be considered. Generic functions will accept and return a variety of different types so that
Fig. 6. Expanded GA representation.
an individual function does not have to be defined for each type. For example, the addition function is able to accept and return both floating point numbers and integers. Haynes and Schoenefeld [31] extend STGP to allow a hierarchy of types in an object-oriented manner.

3.2. Genetic programming for catalysts design

GP is used to evolve computer programs, whereas heterogeneous catalysts have nothing to do with programming code or mathematical functions. However, the design of catalyst libraries using the HT approach can be practically depicted by sequences of actions which are analogous to program instructions. Therefore, the general idea behind the use of GP for heterogeneous catalyst design is to describe experimental information by trees. For example, this can be done through the creation of functions related to synthesis activities. The mathematical functions in Fig. 4 will be replaced by synthesis actions such as "impregnate". The programs can be seen as the catalyst design, the compilation is similar to the synthesis, the execution is the catalytic test, and the set of fixed reaction conditions are the fitness cases. When heterogeneous catalysts are represented through trees, the narrow view of types restricted to "mathematics" must be enlarged to other types with symbolic notions such as "liquid" or "gas". For example, in Ref. [32] a functional logic language is used and combined with GP in order to predict the carcinogenic activity of chemicals.

Before giving a concrete example of the possible application of GP to heterogeneous catalysis, it must be pointed out that large candidate solutions, i.e., large trees, are undesirable for numerous reasons, among them the fact that they are harder to comprehend. From a catalysis point of view, catalysts containing many elements or synthesis steps can be hard to reproduce and optimize. Therefore, for two different trees, i.e., catalysts, with the same fitness value, the smallest one is preferred. However, trees tend to increase in size during evolution. This phenomenon, known as "bloat", appears to be an inherent feature of search algorithms using variable-length representations, and is common for TGP since two different genomes can produce exactly the same program, i.e., two different trees can code the same catalyst. Koza [28] examines the effect of adding more Boolean functions to the function set on a single Boolean induction problem (the 6-multiplexer). He found that performance progressively deteriorated as the size of the function set increased, and observed that the best performance was reached with the logically complete function set. Therefore, functions must be general enough to handle diverse catalysts but also specific enough not to enlarge the search space to undesirable zones, not to create unrealistic catalysts, and not to increase bloat potential. The GP2HC (genetic programming to heterogeneous catalysis) concept tackles this problem. Different examples defining the GP constituents are examined and discussed. The following section focuses on the structural organization of GP2HC by assuming a domain-dependent architecture, defining functions for heterogeneous catalysis, and adapting and refining operators.

4. GP2HC: an advanced and flexible concept

The choice of the components of the program, i.e., terminals and functions, and of the fitness function largely determines the space which GP searches, and consequently how difficult that search is and ultimately how successful it will be. Here, the chemist's knowledge has an important role to play since the computer scientist responsible for coding the strategy, i.e., the manipulation of
the data, is not able to interpret the chemical meaning of the trees. Due to the high complexity of catalyst design when considering various types of materials, an architecture-defining preparatory step is performed. The proposed structure is not totally fixed and can easily be prototyped to suit the special requirements of a given research program.

4.1. Architecture definition

4.1.1. A multi-tree architecture
Each catalyst within the population is composed of n trees. This multiple-tree architecture is selected and adapted for heterogeneous catalysis so that each tree contains code which evolves for a single purpose (a part of the catalyst) while optimizing the whole structure, i.e., the entire catalyst. The "divide and conquer" technique is often used in problem solving to decompose a difficult problem into more manageable sub-problems. The most general and simplest structure considers a catalyst as a set of different parts. As there are options (marked with lozenges), this architecture is considered as fixed within a certain degree of freedom. Fig. 7 shows the scheme corresponding to the structure, also called architecture. Rectangles with rounded corners are not functions but conceptual objects which will be defined by their special function and terminal sets. Functions are symbolized with "s" (none in Fig. 7) and terminals use different symbols depending on the corresponding type. "h" is a numerical value, and a separate symbol corresponds to a list of values. "d" is a conceptual "and" onto which the argument list and the functions of the concept are plugged for clarity of the scheme. The solid roughly follows the StoCat© DB scheme conception (Fig. 7 on the right hand side, the entire DB scheme is given in
Fig. 7. The catalyst architecture in GP2HC (left) following the concept of "core" and "shell" previously defined in StoCat©.
Fig. 8. GP2HC architecture design flexibility.
Fig. 9. A multi-tree crossover, one tree at a time. Crossover is done between trees which are at the same position (here the trees at the second place from the left).
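The one-tree-at-a-time crossover of Fig. 9 can be sketched as follows. This is only an illustration under stated assumptions: trees are encoded as nested lists `[symbol, child, ...]`, the trees that are not selected are copied from the first parent, and no type checking is performed on the exchanged sub-trees. The function names and the example catalysts are hypothetical.

```python
import copy
import random


def subtree_paths(tree, path=()):
    """Yield the index path of every node of a nested-list tree [symbol, child, ...]."""
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtree_paths(child, path + (i,))


def get_at(tree, path):
    """Return the sub-tree located at the given index path."""
    for i in path:
        tree = tree[i]
    return tree


def set_at(tree, path, new_subtree):
    """Replace the sub-tree at a non-empty index path, in place."""
    for i in path[:-1]:
        tree = tree[i]
    tree[path[-1]] = new_subtree


def multi_tree_crossover(parent1, parent2, rng=random):
    """Return (offspring, k): one offspring and the position of the crossed tree."""
    assert len(parent1) == len(parent2)
    child = copy.deepcopy(parent1)
    k = rng.randrange(len(parent1))  # each tree position has equal probability
    # The root is excluded so the new tree keeps the root of the first parent.
    targets = [p for p in subtree_paths(child[k]) if p]
    if targets:
        donor_path = rng.choice(list(subtree_paths(parent2[k])))
        donor = copy.deepcopy(get_at(parent2[k], donor_path))
        set_at(child[k], rng.choice(targets), donor)
    return child, k


# Two hypothetical three-tree catalysts (core, heat treatment, layer).
p1 = [["core", "Al2O3", 80], ["HT", 500, "air"], ["layer", "Pt", 1.0]]
p2 = [["core", "SiO2", 90], ["HT", 350, "N2"], ["layer", "Pd", 0.5]]
offspring, position = multi_tree_crossover(p1, p2, random.Random(0))
```

Only the tree at `position` differs from the first parent, and a single offspring is produced per mating. A strongly typed variant would additionally restrict `donor_path` to sub-trees whose return type matches the replaced node.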
Fig. 10. Core synthesis methods.

Supplementary Material). It is composed of a core (in green¹), the weight percent of this core (h), some optional combinations (in blue and red) of layer additions with their respective weight percent and heat treatments, and a final heat treatment. An intermediary heat treatment is possible (in red, second one from the left hand side) in case of element addition onto the core (i.e., layers). The architecture respects an order from top to bottom and implicitly from left to right. The regimented syntax used from left to right is implicit since any operator can alter this order. The architecture definition makes it possible both to introduce constraints and to underline important factors that a given research program may investigate. For example, one can separate the main elements from promoters (<10%) and traces (<0.1%) in the architecture as shown in Fig. 8.

¹ For interpretation of color in Figs. 1, 5, 7, 8, 10, 11, 13 and 14 the reader is referred to the web version of this article.

4.1.2. Adaptation of GP operators
The crossover operator has to be suitably adapted so that the main structure is preserved. In addition, in order to evolve syntactically correct trees, crossover may only take place between sub-trees from equivalent branches in different trees. Multi-tree crossover is similar to the crossover operator with branch typing used by Koza. When crossing over, one tree is selected at random with equal probability (i.e., 1/5 in Fig. 9). This tree in the offspring is created by crossover between the trees of the chosen type in each parent, in the normal GP way. The new tree has the same root as the first parent. Each mating produces a single offspring. Concerning mutation, two operators are defined, acting at different levels in the GP architecture: the macro-mutation and the micro-mutation. The macro-mutation takes one function and generates a new one of the same type. The micro-mutation consists in replacing a terminal by a new random one.

4.2. Function and terminal sets

Here, different function and terminal sets are presented, taking into account the architecture defined above.

– The core is simply defined either by picking one commercial support or by carrying out its synthesis (Ss). Therefore, a synthesis method is called. Ms (see Fig. 10) has to be pre-defined correctly
Fig. 11. From left to right: heat treatment (HT), gas, and layer ADFs.
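The heat treatment and gas functions of Fig. 11 can be pictured as typed records. The sketch below is an assumption-laden illustration, not the GP2HC data model: the class names `HT` and `GasMix`, the field names, and the use of alphabetical order as the canonical ordering of a gas list are all hypothetical. It also reads "the previous one" as the final temperature of the preceding cycle, which the source does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

AMBIENT = "ambient"


@dataclass(frozen=True)
class GasMix:
    """Ordered list of (gas, relative flow percent); each gas may appear only once."""
    components: Tuple[Tuple[str, float], ...]

    def __post_init__(self):
        names = [gas for gas, _ in self.components]
        if len(set(names)) != len(names):
            raise ValueError("the same gas may not be used twice")
        # Assumption: alphabetical order is the canonical order, so that two
        # permutations of one mixture cannot coexist as different trees.
        if names != sorted(names):
            raise ValueError("gases must be listed in canonical order")


Gas = Union[str, GasMix]  # either "air" or an explicit mixture


@dataclass(frozen=True)
class HT:
    """One heat-treatment cycle; next_step=None plays the role of the terminal Ø."""
    final_temperature: float
    ramp: float
    dwell: float
    total_flow: float
    gas: Gas
    next_step: Optional["HT"] = None


def temperature_profile(ht: Optional[HT], start=AMBIENT) -> List[tuple]:
    """Return (initial, final) temperatures for each cycle of a chained treatment."""
    profile = []
    current = start
    while ht is not None:
        profile.append((current, ht.final_temperature))
        current = ht.final_temperature  # the next cycle starts where this one ended
        ht = ht.next_step
    return profile


reduction = HT(300, 2, 60, 50, GasMix((("H2", 10.0), ("N2", 90.0))))
calcination = HT(500, 5, 120, 50, "air", next_step=reduction)
```

Chaining `HT` records through `next_step` reproduces the idea that one thermal treatment can be composed of several cycles, and the validation in `GasMix` is one way to keep duplicate or permuted gas lists out of the search space.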
in order to avoid impossible cases such as the use of impregnation, since there is nothing yet to impregnate on. Considering the core synthesis methods, only methods that produce a dispersing agent through an intimate mix of elements are taken into account, in accordance with the core and layers concept. Simultaneous use of multiple elements in a single method is necessary if one wants to produce a multi-component core such as co-precipitated CuZnAl. Thus, the core synthesis Ms calls "+n", as shown in Fig. 10.

– The thermal treatment or heat treatment (noted HT) is defined as follows: the argument list is composed of a final temperature, ramp, dwell, total flow rate of gases, and an initial temperature automatically set as the previous one or "ambient" if none is specified (see Fig. 8 right hand side). Functions such as "HT" and "Ml" can call each other, enforcing a hierarchical arrangement of function calls. Thus, "HT" requires gas (Fig. 10), and either another "HT" or the terminal Ø for stopping, as depicted in Fig. 10 (left branch). Therefore one thermal treatment can be composed of different cycles. If a given research program focuses on thermal treatment, flexibility can be integrated by defining gas. The "gas" function could be either "air" or a coupled list of gases and relative flow percents. This list is of a special type, an ordered list, in order to avoid bloat. We also forbid using the same gas twice because of its consequences for bloat; the related symbol is the DB+ (see the DB symbol in Fig. 10 with a red arrow).

– The optional "layer" concept necessitates "HT" or "Ml" functions. Ml needs a stirring type argument; "Param", which is the list of arguments linked to the chosen method; and "HTm", the thermal treatment associated with the method (for example the thermal program during a precipitation). "HT" and "HTm" are two different functions since they could be designed differently by the user. "Add" is the function that defines which element is added to the catalyst, as well as the precursor type and concentration associated with the element. Such a design allows only one single element to be added per layer. Of course, depending on the catalysts to be searched, an improved version can be adopted as shown for core synthesis. Each synthesis method can be fully described (for example, the number of impregnations will only belong to the impregnation list, and the addition rate of the precipitating agent to co-precipitation). Note that the designer could apply a filter on elements in the "Ml" function so that the same element is not added twice in two different layers. However, StoCat© assumes that achieving two separated additions of the same element should produce two different catalysts (even if the final and total amount of the given element is equal in both cases) (see Fig. 11).

This simple and quick description can be easily modified by the user. Such a structure is flexible enough to accommodate nearly all catalysts prepared at lab scale. In the GP2HC concept, the constraints are explicit and hand crafted by the user. The terminal sets are dependent on the tree branch and functions in which they are employed. In this example of GP2HC conception we have defined: lists (elements, gases, …), ordered lists, real values (flows, temperatures, pH, …), symbols or pre-defined groups (air, Ø), and qualitative values (stirring types, precursor types, commercial supports, …). The creativity of the designer is important as it is directly correlated with GAP performance and with the research problem definition or boundaries. If one wants to take into account mixtures of commercial supports (see Ref. [22] for an example of a real study where the resulting catalyst outperforms the current industrial performance), a function "&" may be defined (see Fig. 12 on the left hand side). However, bloat may occur since the order in which n solids are mixed does not have any influence on the final mixture of solids, and consequently & (NaZSM-5, 30, & (NaX, 70, Ø)) is equivalent to & (NaX, 70, & (NaZSM-5, 30, Ø)).

Fig. 12. "&" Function for solid mixture.

5. Experimental section

Materials science, like numerous design problems from various domains, deals with multiple objectives (MOs) at the same time. However, even though the presence of several conflicting objectives is typical of heterogeneous catalysis research, no paper is available. Therefore it is decided to survey the subject quickly (see Supplementary Material), and to apply the concept of GP to a MO benchmark using Pareto optimality.

5.1. Benchmark and algorithm settings

The benchmark is designed to stress the order within a sequence of selected elements noted x. The same element can be selected twice. This emphasizes the fact that solids containing equal amounts of elements may perform differently depending on the way they are added. The aim is to optimize both C and S, maximized and minimized, respectively.
Fig. 13. View of a single run of GP for the benchmark. R is on the vertical axis, while S is on the horizontal axis. Since the scale of the vertical axis changes from one chart to another, R = 100 is shown by the blue dotted line, and R = 300 is marked separately.
Fig. 14. GAP population (left) and population ranking in ‘‘MOGA” (right).
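The ranking shown on the right of Fig. 14 follows the rule that an individual's rank is one plus the number of individuals dominating it. A minimal sketch is given below, assuming each individual is reduced to a pair (C, S) with C maximized and S minimized as in the benchmark; the function names and the sample population are illustrative only, and the subsequent scaling of ranks into scores is not covered.

```python
def dominates(a, b):
    """True if a = (C, S) dominates b: C is maximised and S is minimised."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def moga_ranks(population):
    """Rank of each individual = 1 + number of individuals that dominate it."""
    return [
        1 + sum(dominates(other, individual) for other in population)
        for individual in population
    ]


# Hypothetical (C, S) values; rank 1 marks the non-dominated (Pareto) individuals.
sample = [(300, 10), (100, 3), (90, 3), (50, 1), (80, 5)]
ranks = moga_ranks(sample)  # [1, 1, 2, 1, 3]
```

In this sample, (90, 3) is dominated only by (100, 3), while (80, 5) is dominated by both (100, 3) and (90, 3), hence ranks 2 and 3.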
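\[
C = S^{a} \sum_{i=2}^{n} f(x_{i-1}, x_i)
\quad \text{with} \quad
\begin{cases}
f(x_j, x_i) = \dfrac{x_j}{1 - o_j} + x_i \left( \mathrm{nb}\{x_i\} - o_i \right) \\[4pt]
\text{if } \mathrm{nb}\{x_i\} \le 3 \text{ then } a = 2, \text{ else } a = 1 \\[4pt]
\mathrm{nb}\{x_i\} = \text{number of } x \ne 0
\end{cases}
\quad \text{and} \quad
x_i \in [0, \ldots, 1], \;\; S \in \{1, \ldots, 10\}
\]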
Multi-objective optimization is handled by using MOGA [33], which ranks each individual according to its degree of dominance. An individual's ranking equals the number of individuals by which it is dominated, plus one. Individuals on the Pareto front have a ranking of one, as they are non-dominated. The rankings are then scaled to score individuals in the population (see Fig. 14). The main architecture is defined by a root and two leaves. The left hand side leaf receives a single value for S from the terminal set {S1, …, S10}, and the second leaf receives either the "+" or the "&" function. The "+" function has three leaves: two terminals waiting for real values for x, and another that gives i (i.e., which xi is selected). The selection of i is linked to a temporary list that removes every i that has been selected with "+" and "&". In the last leaf, it makes a call to the "+" or "&" functions. "&" is defined exactly like "+" since selecting an x twice is restricted to the same i. The difference between "&" and "+" lies in the influence they have on the fitness value. A rank-based selection with tournament and a generational technique using elitism are employed. In order to enhance exploitation, the population is separated into two sets: (i) Pb is the part (b%) of the population that stores the best individuals from the beginning and
(ii) Pr is the rest (r = 1 − b) of the population. Compared to the common elitist method, an elitist selection is done over the entire set of individuals previously explored. The individuals belonging to Pb are not re-evaluated. Pb is set to 33% in the experiment. Micro-mutation consists in replacing a terminal by a new random one. Every mM generations, micro-mutation is applied in order to optimize both the structure and the parameters. Given a pre-defined maximum depth, GP trees are initialised using half–half initialisation: half of the population trees are created with the full method and half with the grow method. This ensures initial tree diversity in terms of both size and structure. All the parameters are listed in Table 2. The population size and tournament size have been set to 150 and 6, respectively, based on previous tests and information from the literature. The stopping criterion is set to 25 generations since we must restrict the total number of tests as we would do for real experiments. The frequencies of the crossover and mutation operators have been set after numerous tests with various benchmarks (not shown here). Tournament selection has become increasingly popular as it performs rank-based selection using only local information. As it does not use the whole population, tournament selection does not require global population statistics. In tournament selection a number of individuals, the tournament size, are chosen at random with reselection from the breeding population. These are compared with each other and the best of them is chosen. As the number of candidates in the tournament is small, the comparisons are not expensive. An element of noise is inherent in tournament selection due to the random selection of candidates.

Table 2
GP parameters

Parameters                              Values
Evolutionary model                      Generational
Population size                         150
Stop criterion                          Stop at 25 generations
Function set                            {+, &}
Terminals                               {S1,...,S10}, real, and integer
Tree generation                         Half and half
Initial depth                           4
Maximum depth                           12
Sub-tree crossover probability          1
Macro-mutation                          0.01
Micro-mutation                          0.1 (applied every mM generations)
Frequency of local optimization (mM)    3
Parent selection                        Tournament size 6

A solution is said to be Pareto optimal if there exists no other solution that is better in all attributes. This implies that, if the solution is Pareto optimal, achieving a better value in one objective causes at least one of the other objectives to deteriorate. Thus, the outcome of a Pareto optimization is not one optimal point, but a set of Pareto optimal solutions that visualize the trade-off between the objectives. Considering a minimization problem and two solution vectors x1 and x2, x1 is said to dominate x2 if $\forall i \in \{1, \ldots, k\} \setminus \{j\}: f_i(x_1) \le f_i(x_2)$ and $\exists j: f_j(x_1) < f_j(x_2)$. The Pareto optimal solutions are known as the Pareto optimal front. If the final solution is selected from the set of Pareto optimal solutions, there does not exist any solution that is better in all attributes. If the solution is not in the Pareto optimal set, it could be improved without degeneration in any of the objectives, and thus it is not a rational choice. This is true as long as the selection is done based on the objectives only. Pareto optimal solutions are also known as non-dominated or efficient solutions.

5.2. Results

The optimization of the benchmark using GAP is done 20 times, each run consisting of 25 generations. The reproducibility of the distribution of performance values over the 20 runs is successful and has been statistically tested with a Chi-square GOF test. Fig. 13 shows the result of a single run.

The first generation is quite poor, and on average only two individuals exhibit R > 100, as shown by the blue dotted line. It can be noted that no relatively high difference in R is visible between the different S, except for S5, S8 and S9. In the third generation, the R figure of merit exceeds 300. One can note the increase of R at the eighth generation. However, while searching to increase R, the GP generates individuals that increase S values. The GP follows the MOGA objective, which consists in providing the best Pareto frontier and maintaining it. The trade-off between R and S is correctly controlled, as the best individuals (considering R) on the lowest S are rapidly found around the eighth generation and GAP then focuses on higher values of S. From the eighth generation onwards, one can note that small values of R for S > 5 are partially removed (see circle on Fig. 13), moving towards the Pareto front.

The outcome of this optimization is a set of Pareto optimal solutions that visualize the trade-off between two figures of merit noted C and S. The advantage of such an approach is that the solutions are independent of the decision-maker's preferences. The analysis has to be performed only once, as the Pareto set does not change as long as the problem description remains unchanged. A disadvantage might be that the decision-maker has too many solutions to choose from. Here, a typical decision-maker would balance the pros and cons of a few results picked from the Pareto front. For example, catalysts made of very few elements (S from 1 to 5 in the benchmark) receive very low C; on the other hand, in order to reach the highest values of C, S is equal to 10. The solution for S = 7 seems a good compromise between the complexity of the catalyst, i.e., the number of element additions, and its performance. Moreover, there is a relatively high increase from S = 6 to 7, while no better catalyst has been found for S = 8. This evaluation on a benchmark shows that managing ordered problems is easily feasible with a GP. However, such a strategy still has to be applied to real cases.

6. Discussion

Possibly the greatest distinction between GAs and GP is that of fixed or variable length. In some cases, the size of the required solution sought may be known beforehand. However, there are many problems where it is difficult to pre-specify the size of the solution. Clearly, if we know the size of the solution we do not need to use a variable-length representation, as this would make the search space larger. The quest for more efficient GP is an important research problem. This is due to the fact that high complexity is among the distinctive features of GP. Evolution in GP is both parametric and structural in nature. Two important features are specific to GP: (i) the fitness of the functional structure depends on the values of local parameters, so that even very fit structures may perform poorly due to inappropriate numeric coefficients, and (ii) the fitness of the individual is highly context sensitive, so that slight changes in structure dramatically influence fitness and may require completely new parameters. Accordingly, there are many ways to introduce local learning into GP. The presence of stochasticity in local learning makes it relatively slow, even though some hybrid algorithms yield overall improvement. Apparently, the full potential of local search optimization is yet to be realized. Moreover, since local learning comes with a price, it must be wisely traded off with genetic search costs.

Storing, organizing and using most of the information in chemistry research is one of the main concerns in developing efficient experimental strategies. Up to now, there is no integral approach able to reproduce exactly each of the decisions around one particular experimental procedure. In catalysis research,
779 reactivity tests, involve a large number of individual steps regard- 6.1.4. Spaces relationship 840
780 ing the selection of substances, compositions, heat treatments, There are many ways to represent GA features, however typi- 841
781 characterized chemical properties and reactivity parameters. How- cally there is a one to one mapping between the search space 842
782 ever, it is significant, but assumed, that much more decisions are and chromosome space. With programs there is a many to one 843
783 always taken along the research. Specially, decisions about the mapping between the representation and the program being repre- 844
784 temporal order of each step, i.e., the order in which several metals sented. This difference has consequences when searching the 845
785 are deposited on a support or the design of heat treatments, are space. If the mapping between the representation and the object 846
786 impossible to be included into a conventional strategy. Genetic being represented is one to one, a uniform sampling of the repre- 847
787 programming represents a flexible architecture to store all the sentation will lead to a uniform sampling of the objects being rep- 848
788 information around experimental procedures, opening the possi- resented. If the mapping between the representation and the 849
789 bility to find unexpected new formulations, impossible to be con- object being represented is a many to one mapping, a uniform 850
790 sidered with other algorithms. Genetic programming allows sampling of the representation will lead to a nonuniform sampling 851
F
791 defining particular blocks for particular experimental operations, of the objects being represented. This is the reason why Landgon 852
792 i.e., core blocks, such as the way to prepare the main part of a cat- pointed bloat as a consequence to the many to one relationship be- 853
OO
793 alyst; heat treatment blocks, in order to modify chemical proper- tween spaces. NFL is valid in the case of a one to one mapping. 854
794 ties of raw materials; or shell blocks, as a way to add promoters However, when there is a nonuniform many to one mapping be- 855
795 and other additional elements, each one with own rules. This kind tween representation and the objects being represented suggested 856
796 of architecture, together with the great flexibility to define and that NFL is not valid. With GA, either situation can occur, however 857
797 adapt the rules for the different blocks, makes genetic program- the second situation is always the case with the representations 858
798 ming a powerful tool to integrate and extract scientific knowledge, used in GP. 859
PR
799 notably improving the research quality.
800 GP has been shown as the first tool that can handle the com-
801 plexity inherent to catalyst structure. Moreover, it becomes very 7. Conclusion 860
802 easy to connect such GP with a database in order to obtain a fully

803 automated workspace with an increased speed of the data flow be- In view of the complexity of catalysis, different search frame- 861
804 cause each part of the catalyst is well-defined (i.e., tagged) like An- works with a structure commonly limited in features dependence 862
805 iML [34] format, the use of data stored in DBs is then possible.
D type have been proposed in the literature. Bearing in mind CHC 863
objectives and priorities, an adapted architecture of data is sug- 864
806 6.1GAP:GP or GA? gested. With genetic programming, it is possible to increase the 865
TE
number of variables to study and this would result in a potentially 866
807 It is clear that GAs and GP are related as they are both inspired rather more powerful final catalyst. Indeed, if this methodology is 867
808 by Darwinian evolution. The main differences between GP and GA properly followed it can be very helpful in the scientific under- 868
809 are listed below. standing of catalysis. GP may be used in order to enlarge the search 869
space with deeper details on synthesis description that cannot be 870
EC
810 6.1.1. Fixed or variables length handled by other methods based on linear representation. A sec- 871
811 Possibly the greatest distinction between GAs and GP is that of ond motivation was based on the statement that high-throughput 872
812 fixed or variable length. In some cases, the size of the required solu- and related data treatment in the domain of heterogeneous catal- 873
813 tion sought may be known beforehand. However, there are many ysis was relatively delayed compared to other chemistry, material 874
814 problems where it is difficult to pre-specify the size of the solution. science and pharmaceutical domains. By considering as much as 875
possible different paradigms without any a priori and determining
RR
876
815 There is no reason why the size of a bit string in GAs cannot vary
816 during the evolution. Both crossover and mutation operators, which which one was the best adapted to the specificity and the numer- 877
817 operate on fixed length structures, can be engineered into operators ous issues of heterogeneous, while being positioned at a high level 878
which produce variable length bit strings. Conversely, with GP there is no reason why fixed size GP cannot be implemented.

6.1.2. Representation
In Ref. [35], an example compares two uses of the chromosome bit string in fitness evaluation. In the first case, the fitness of an individual bit string in the population is given by a cost function which, given a bit string, returns a real value; we are therefore facing a GA. In the second case, the cost function interprets the bit string as a computer program, and the value returned reflects the performance of that program at a particular task; one would then be inclined to say that a GP system is being described. Woodward concludes that the question is not what the representation is, but rather how the representation is interpreted.

6.1.3. Operators
One important potential difference between GA and GP is the effect of crossover. In GA, the crossover operator takes genetic material from either parent and places it at the same location in the child (i.e., the position of a gene in the genotype is not altered by crossover). Thus, crossover does not move the location of a bit within a bit string. The crossover operator in GP typically moves a sub-tree from one parent to a different location in the child. In GP, sub-trees can be interpreted anywhere in the overall tree.

of research from a computer science point of view, it seems that much improvement has been done for putting the combinatorial catalysis in a competitive position as compared to the other leading domains. The third motivation, linked to the previous ones, was to promote innovative ideas from a computer science point of view.

Acknowledgements
Ferdi Schueth from Max-Planck-Institut für Kohlenforschung, Mülheim, Germany is gratefully acknowledged for the discussions dealing with catalyst discovery, which made it possible to elaborate such a conceptual approach. Claude Mirodatos and David Farrusseng are also acknowledged. Avelino Corma is also acknowledged for the suggested examples and references. EU Commission (TOPCOMBI Project) support is gratefully acknowledged for this research. We thank Santiago Jimenez for his technical support on the platform hITeQ.

Appendix A. Supplementary material
Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.commatsci.2008.03.051.
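As an illustration of the crossover difference described in Section 6.1.3, the following minimal Python sketch contrasts a position-preserving GA crossover with a GP sub-tree crossover. It is not taken from the article: the function names, the nested-list encoding of trees (operator at index 0, children after it), and the explicit cut point and paths are assumptions chosen to keep the example deterministic.

```python
from copy import deepcopy


def ga_one_point_crossover(parent_a, parent_b, cut):
    """One-point crossover on equal-length bit strings.

    Every bit of the child keeps the index it had in its parent.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("GA parents must have the same length")
    return parent_a[:cut] + parent_b[cut:]


def get_subtree(tree, path):
    """Follow a sequence of child indices down a nested-list tree."""
    for index in path:
        tree = tree[index]
    return tree


def replace_subtree(tree, path, new_subtree):
    """Return a copy of `tree` with the node at `path` replaced."""
    if not path:
        return deepcopy(new_subtree)
    result = deepcopy(tree)
    node = result
    for index in path[:-1]:
        node = node[index]
    node[path[-1]] = deepcopy(new_subtree)
    return result


def gp_subtree_crossover(receiver, donor, receiver_path, donor_path):
    """Graft a donor sub-tree into the receiver at a possibly different location."""
    return replace_subtree(receiver, receiver_path, get_subtree(donor, donor_path))


# GA: the child takes bits 0-2 from the first parent and bits 3-5 from the second.
child_bits = ga_one_point_crossover("110011", "000000", 3)

# GP: trees are nested lists, hypothetical encoding [operator, child, child].
receiver = ["add", "x", ["mul", "y", "z"]]
donor = ["sub", ["div", "a", "b"], "c"]
# The sub-tree at depth 1 in the donor replaces the leaf "y" at depth 2 in the receiver.
child_tree = gp_subtree_crossover(receiver, donor, (2, 1), (1,))
```

In the GA case each bit of `child_bits` sits at the index it occupied in one of the parents, whereas in the GP case the `div` sub-tree ends up one level deeper in `child_tree` than it was in the donor, and is still a valid expression there.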
References

[1] (a) B. Jandeleit, D.J. Schaefer, T.S. Powers, H.W. Turner, W.H. Weinberg, Angew. Chem. Int. Ed. 38 (17) (1999) 2494–2532;
    (b) S.M. Senkan, Angew. Chem. Int. Ed. 40 (2) (2001) 312–329;
    (c) M.T. Reetz, Angew. Chem. Int. Ed. 40 (2) (2001) 284–310;
    (d) J.M. Newsam, F. Schuth, Biotechnol. Bioeng. 61 (4) (1999) 203–216;
    (e) F. Gennari, P. Seneci, S. Miertus, Catal. Rev. – Sci. Eng. 42 (3) (2000) 385–402;
    (f) M. Moliner, J.M. Serra, A. Corma, E. Argente, S. Valero, V. Botti, Micropor. Mesopor. Mater. 78 (2005) 73–81;
    (g) O.B. Vistad, D.E. Akporiaye, K. Mejland, R. Wendelbo, A. Karlsson, M. Plassen, K.P. Lillerud, Stud. Surf. Sci. Catal. 154 (2004) 731–738;
    (h) A. Cantín, A. Corma, M.J. Diaz-Cabanas, J.L. Jordá, M. Moliner, J. Am. Chem. Soc. 128 (2006) 4216–4217;
    (i) J.R. Hendershot, C.M. Snively, J. Lauterbach, Chem. Eur. J. 11 (2005) 806–814.
[2] (a) L.A. Harmon, A.J. Vayda, S.G. Schlosser, Abstr. Pap. – Am. Chem. Soc. 221 (2001) BTEC-067;
    (b) C. Klanner, D. Farrusseng, L.A. Baumes, C. Mirodatos, F. Schueth, QSAR Comb. Sci. 22 (2003) 729–736.
[3] M.A. Camblor, A. Corma, A. Martinez, V. Martinez-Soria, S.J. Valencia, J. Catal. 179 (2) (1998) 537–547.
[4] G. Garralon, A. Corma, A. Fornes, Zeolites 9 (1) (1989) 84–86.
[5] M. Saupe, R. Fodisch, A. Sundermann, S.A. Schunk, K.E. Finger, QSAR Comb. Sci. 24 (1) (2005) 66–77.
[6] (a) H. Zhang, R. Hoogenboom, M.A.R. Meier, U.S. Schubert, Meas. Sci. Technol. 16 (2005) 203–211;
    (b) M.A.R. Meier, U.S. Schubert, Soft Matter 2 (2006) 371–376;
    (c) Special issue "Materials Informatics: From Data to Knowledge", QSAR Comb. Sci. 24 (1) (2005) 1–196.
[7] (a) W.F. Maier, K. Stöwe, S. Sieg, Angew. Chem. Int. Ed. (2007);
    (b) W.F. Maier, J. Saalfrank, Chem. Eng. Sci. 59 (2004) 4673–4678.
[8] (a) L.A. Baumes, Ph.D. Thesis in Comput. Sci., Univ. Lyon 1 La Doua – France, 2004;
    (b) D. Farrusseng, L.A. Baumes, C. Hayaud, I. Vauthey, P. Denton, C. Mirodatos, Kluwer Academic Publisher, Nato series, in: E. Derouane (Ed.), Proceedings of NATO Advanced Study Institute on Principles and Methods for Accelerated Catalyst Design, Preparation, Testing and Development, Vilamoura, Portugal, 15–28 July, 2001. E. Derouane, V. Parmon, F. Lemos, F. Ribeiro (Eds.), Book Series: NATO Science Series: II: Mathematics, Physics and Chemistry, vol. 69, 101–124, Kluwer Academic Publishers, Dordrecht, Hardbound, ISBN 1-4020-0720-5. July 2002.
[9] L.A. Baumes, A. Corma, ISHHC 12, 18–22 July, 2005, Firenze, Italy.
[10] (a) J.M. Serra, A. Corma, A. Chica, E. Argente, V. Botti, Catal. Today 81 (3) (2003) 393–403;
    (b) Y. Watanabe, T. Umegaki, M. Hashimoto, K. Omata, M. Yamada, Catal. Today 89 (4) (2004) 455–464;
    (c) K. Omata, Y. Watanabe, M. Hashimoto, T. Umegaki, M. Yamada, Ind. Eng. Chem. Res. 43 (13) (2004) 3282–3288.
[11] (a) A. Corma, J.M. Serra, P. Serna, S. Valero, E. Argente, V. Botti, J. Catal. 225 (2005) 513–524;
    (b) L.A. Baumes, D. Farrusseng, M. Lengliz, C. Mirodatos, QSAR Comb. Sci. Nov. 29 (9) (2004) 767–778.
[12] (a) L.A. Baumes, J.M. Serra, P. Serna, A. Corma, J. Comb. Chem. 8 (2006) 583–596;
    (b) J.M. Serra, L.A. Baumes, M. Moliner, P. Serna, A. Corma, Comb. Chem. High Throughput Screen 10 (January 1) (2007) 13–24.
[13] L.A. Baumes, M. Moliner, A. Corma, QSAR Comb. Sci. 26 (2) (2007) 255–272.
[14] A. Corma, M. Moliner, J.M. Serra, P. Serna, M.J. Díaz-Cabañas, L.A. Baumes, Chem. Mater. 18 (2006) 3287–3296.
[15] J.N. Cawse (Ed.), Experimental Design for Combinatorial and High Throughput Materials Development. 2003. ISBN-10: 0-471-20343-2. ISBN-13: 978-0-471-20343-8, John Wiley & Sons.
[16] (a) J.M. Serra, A. Corma, E. Argente, S. Valero, V. Botti, ICEE (2003) 21–25;
    (b) D. Wolf, O.V. Buyevskaya, M. Baerns, Appl. Catal. A 200 (2000) 63;
    (c) G. Grubert, E.V. Kondratenko, S. Kolf, M. Baerns, P. van Geem, R. Parton, Catal. Today 81 (2003) 337–345;
    (d) M. Holena, High-Throughput Screening in Chemical Catalysis, in: A. Hagemayer, P. Strasser, A.F. Volpe (Eds.), Wiley VCH, 2004, pp. 153–172.
[17] L.A. Baumes, J. Comb. Chem. 8 (2006) 304–314.
[18] L.A. Baumes, P. Jouve, D. Farrusseng, M. Lengliz, N. Nicoloyannis, C. Mirodatos. Seventh International Conference on Knowledge-Based Intelligent Information and Engineering Systems (KES'2003), September 3–5, 2003, University of Oxford, UK. Springer-Verlag in Lecture Notes in AI (LNCS/LNAI series), vol. 2773, pp. 265–270, V. Palade, R.J. Howlett, L.C. Jain (Eds.).
[19] (a) R. Bañares-Alcántara, E.I. Ko, A.W. Westerberg, M.D. Rychener, Comput. Chem. Eng. 12 (9/10) (1988) 923–938;
    (b) R. Bañares-Alcántara, A.W. Westerberg, E.I. Ko, M.D. Rychener, Comput. Chem. Eng. 11 (3) (1987) 265–277.
[20] S. Kito, T. Hattori, Y. Murakami, Chem. Eng. Sci. 45 (1990) 2661.
[21] H. Speck, W. Hoelderich, W. Himmel, M. Irgang, G. Koppenhoefer, W.D. Mross, DECHEMA–Monogr. Computer Application in the Chemical Industry. Papers of European Symposium, Weinheim, VCH, Erlangen, April 23–26, 1989, p. 43.
[22] J.M. Serra, A. Corma, D. Farrusseng, L.A. Baumes, C. Mirodatos, C. Flego, C. Perego, Catal. Today 81 (3/30) (2003) 425–436.
[23] F. Clerc, S.R.M. Pereira, M. Lengliz, D. Farrusseng, R. Rakotomalala, C. Mirodatos, Rev. Sci. Instrum. 76 (2005) 062208.
[24] (a) C. Klanner, D. Farrusseng, L.A. Baumes, C. Mirodatos, F. Schüth, QSAR Comb. Sci. 22 (2003) 729–736;
    (b) C. Klanner, D. Farrusseng, L.A. Baumes, M. Lengliz, C. Mirodatos, F. Schüth, Angew. Chem. Int. Ed. 43 (40) (2004) 5347–5349;
    (c) D. Farrusseng, C. Klanner, L.A. Baumes, M. Lengliz, C. Mirodatos, F. Schüth, QSAR Comb. Sci. 24 (2005) 78–93;
    (d) F. Schüth, L.A. Baumes, F. Clerc, D. Demuth, D. Farrusseng, J. Llamas-Galilea, C. Klanner, J. Klein, A. Martinez-Joaristi, J. Procelewska, M. Saupe, S. Schunk, M. Schwickardi, W. Strehlau, T. Zech, Catal. Today 117 (2006) 284–290.
[25] L.A. Baumes, R. Gaudin, P. Serna, N. Nicoloyannis, A. Corma, Comb. Chem. High Throughput Screen (in press).
[26] D.E. Goldberg, Genetic Algorithms in Search Optimization and Machine Learning, Springer, Reading, MA, 1989.
[27] M. Schrage. ISBN-13: 9780875848143. Harvard Business School Press, 2000.
[28] J.R. Koza. ISBN: 0-262-11170-5. MIT Press, 1992.
[29] (a) J.R. Koza. ISBN: 0262111896. MIT Press, 1994;
    (b) J.R. Koza, F.H. Bennett, D. Andre, M.A. Keane. ISBN: 1-55860-543-6. Morgan Kaufmann, 1999;
    (c) J.R. Koza, M.A. Keane, M.J. Streeter, W. Mydlowec, J. Yu, G. Lanza. ISBN: 1-4020-7446-8. Kluwer Academic, 2003.
[30] (a) M. Kovacic, P. Uranick, M. Brezocnik, R. Turk, Mater. Manuf. Process 22 (5-6) (2007) 634–640;
    (b) M. Brezocnik, M. Kovacic, L. Gusel, Mater. Manuf. Process 20 (3) (2005) 497–508.
[31] T. Haynes, D. Schoenefeld, in: J.R. Koza, D.E. Goldberg, D.B. Fogel, R.L. Riolo (Eds.), Genetic Programming 1996: Proceedings of the First Annual Conference, The MIT Press, Cambridge, MA, 1996, p. 426.
[32] C.J. Kennedy. Ph.D. Thesis, University of Bristol, 2000. http://citeseer.ist.psu.edu/kennedy99strongly.html.
[33] C.M. Fonseca, P.J. Fleming, Evol. Comput. 3 (1) (1995) 1–16.
[34] B. Schaefer, L.A. Baumes, A. Corma, LabAutomation 2008 Palm Springs CA, Documenting Catalytic Test Reactions Using the Analytical Information Markup Language (AnIML) 2633 Monday, 01/28/2008 1:00PM–3:00PM, Room MP94.
[35] J.R. Woodward, J.R. Neil, in: Genetic Programming, Proc. EuroGP 2003, Springer-Verlag, Essex, UK, 14–16 April, 2003.
