
Multimodal pattern recognition by modular neural network

Shulin Yang
Kuo-Chu Chang, MEMBER SPIE
George Mason University
School of Information Technology and Engineering
Center of Excellence in Command, Control, Communications, and Intelligence
Fairfax, Virginia 22030
E-mail: kchang@gmu.edu

Abstract. Multilayer perceptrons (MLPs) have been widely applied to pattern recognition. It is found that when the data have a multimodal distribution, a standard MLP is hard to train and a valid neural network classifier is difficult to obtain. We propose a two-phase learning modular (TLM) neural network architecture to tackle this problem. The basic idea is to transform the multimodal distribution into a known and more learnable distribution and then use a standard MLP to classify the new data. The transformation is accomplished by decomposing the input feature space into several subspaces and training several MLPs with samples in the subsets. We verified this idea with a two-class classification example and applied the TLM to inverse synthetic aperture radar (ISAR) automatic target recognition (ATR), comparing its performance with that of the MLP. Experiments show that the MLP is difficult to train: its performance depends strongly on the number of training samples as well as on the architecture parameters. The TLM, on the other hand, is much easier to train and yields better performance. In addition, the TLM's performance is more robust. © 1998 Society of Photo-Optical Instrumentation Engineers. [S0091-3286(98)03702-7]

Subject terms: multilayer perceptron; neural network; modular; multimodal distribution; automatic target recognition; learning; local minimum. Paper 36047 received Apr. 30, 1997; accepted for publication Sep. 8, 1997.

1 Introduction

Over the past two decades, the multilayer perceptron (MLP) has been extensively studied and widely applied to pattern recognition. Although many successful applications have been reported, the MLP, like other neural networks, has some well-known inherent drawbacks. Among these, the local minimum problem is the most serious. Although some local minima* can yield almost the same performance as the global minimum (we call them pseudoglobal minima), most of them cannot perform as well. If the training data have a multimodal (the same class of samples distributed in different regions of the feature space) and overlapped distribution, the learning may become stuck in the worst local minimum, in which the network cannot discriminate even the training samples, and therefore no valid network can be obtained. In our inverse synthetic aperture radar (ISAR) automatic target recognition (ATR) application, for example, we often observe such a worst case (we refer to it as the fatal local minimum). Although the behavior of local minima has been investigated extensively, no generally systematic and effective solution has been found. There are several empirical methods to handle the local minimum problem. The first method is to change the initial weights or to change the number of training samples. The difficulties with this are that there is no reliable way to find a good initial point, and that incremental learning is time-consuming with no guarantee of convergence to the global minimum. The second
*Here "a local minimum" is shorthand for a network that has converged to that local minimum.

method is to increase or to decrease the number of neurons in the hidden layers. Researchers have proposed dynamic methods to automatically increase and decrease the number of neurons in hidden layers.1–3 However, increasing or decreasing the network size does not necessarily decrease the chance of sinking into a local minimum. Moreover, finding such a size is not easy and sometimes not practical, due to the extensive computational requirement. The third method is to modify the learning algorithm to avoid local minima. Baba4 and Baba et al.5 proposed a hybrid algorithm for finding the global minimum of the error function and showed that the algorithm ensures convergence to a global minimum with probability 1 in a compact region of the weight vector space. However, this is true only if the number of iterations approaches infinity. In this paper, we are not going to solve the general local minimum problem. We focus our attention on the fatal local minimum caused by the multimodal distribution and, instead of applying the preceding empirical methods, we propose a more efficient heuristic method to reduce the chance of sinking into the fatal local minimum. In our ISAR ATR problem, since targets may be identified by radar at different angles, the same target may appear differently and different targets may look alike at different observation angles. It can be imagined that the same target may be distributed in multiple regions and that different targets may be distributed in overlapped regions of the feature space. We conjecture that this multimodal distribution is the primary cause of the fatal local minima.



We propose to configure a preprocessor to transform the multimodal distribution into a more learnable form and then use a standard multilayer perceptron to classify the new patterns. Specifically, rather than employ a learning algorithm to directly partition the input space into regions corresponding to each class, we transform the original patterns in the input space into easier patterns in a new space, and then partition the new space. The resulting classifier consists of two stages: a preprocessor and a global classifier. Since the classifier needs two-phase learning and has a modular architecture, we call it a two-phase learning modular (TLM) neural network.

Modular network architectures have been successfully applied to speech recognition, initial consonant recognition in particular,6,7 and to texture processing.8 The modular approach was originally proposed to reduce training time by training each module separately in order to find an initial weight vector for the whole network; after the initial weight vector is obtained, the whole network is fine-tuned by another round of learning. It was shown that this approach could significantly increase learning speed without incurring a performance penalty. Our proposed modular architecture is similar to Waibel's; however, the objectives are quite different. Our basic idea is to make the data more learnable, i.e., less prone to the fatal local minima. We employ a modular structure to map the training data from one space to another, and use another multilayer network to perform classification of the new data. In this sense, our architecture looks more like the cascade of block structures.9 Nevertheless, from the viewpoint of locating a good initial point near the global minimum, our modular architecture can provide a more efficient way than the standard multilayer architecture; in this regard, our purpose is also quite similar to Waibel's.

This paper is organized as follows. Section 2 describes the proposed modular architecture and the corresponding procedure to configure it. Section 3 uses a two-class example to demonstrate the benefit of using the TLM over the MLP. Section 4 compares the performance of the TLM and the MLP in the application to ISAR ATR, and Section 5 summarizes the results and gives some conclusions.

2 System Description

Fig. 1 Architecture of the TLM neural network.

2.1 TLM Neural Network Architecture

The TLM architecture shown in Fig. 1 consists of two stages. The first stage, on the left, is composed of several MLPs, called local nets. All local nets share one input layer; that is, they receive the same inputs.

Each local net is trained individually in the first phase of learning. The second stage, on the right, is composed of a single MLP, called the global net, which is trained in the second phase of learning. The input layer of the global net receives data from the outputs of all the local nets. In other words, all outputs of the local nets are merged to form the input layer of the global net. Therefore, the number of input units in the global net equals the sum of all the local output units. However, the numbers of hidden units in the local nets and in the global net are not necessarily the same. Moreover, each local net may be different.

2.2 Network Training and Configuration

Configuring a TLM involves two sequential steps: first configuring the preprocessor and then configuring the global classifier.

2.2.1 Preprocessor configuration

Configuration of the preprocessor consists of the following three steps.
Optical Engineering, Vol. 37 No. 2, February 1998 651

Step 1: Splitting the training set. In this step, multiple modes are first identified in the input space, and the input space is then decomposed into several subspaces based on the modes. The general method for mode identification is clustering. We can apply the ISODATA algorithm10 or other similar clustering algorithms11 to the entire sample set to find clusters, each corresponding to a mode, and then group the clusters according to a certain rule to form several subsets. If we have some a priori knowledge of the sample distribution, we can decompose the sample set manually and use neural networks to verify the appropriateness of the decomposition. For example, we might know which group of data is more confusable and which group of data is more distinct. We can utilize such knowledge to divide the training set initially into subsets and then use the samples in each subset to train an MLP. If an MLP with an appropriate number of hidden units cannot classify the samples in its subset with satisfactory accuracy, either add some hidden units and retrain it, or find the samples that the net cannot classify correctly and move them to other existing subsets or to a new subset. The basic criterion for this partition is to make the samples in each subspace have a single-regional distribution.

Step 2: Local net training. In this step, an MLP is employed to define a function that maps each subspace decomposed in step 1 to a classification space (i.e., an MLP is trained with the samples in each subset). This is the first phase of learning. Since each class has only one mode in each subset (we assume the modes are correctly identified in step 1), the learning is much easier, and conventional techniques (e.g., adding hidden nodes to enhance discrimination ability) can be used. Assume there are M classes and K subspaces. Let S_i (i = 1, 2, ..., K) denote a subspace and f_i the mapping function of the corresponding MLP. Then each mapping function can be expressed as

Yang and Chang: Multimodal pattern recognition . . .

f_i : C_j ⊂ S_i → D_j = (0, 0, ..., 0, 1, 0, ..., 0) ∈ R^M,   j = 1, 2, ..., M,   (1)

where C_j (j = 1, 2, ..., M) denotes the mode of the jth class in S_i, R^M is an M-dimensional classification space, and D_j (j = 1, 2, ..., M) is a point (or vector) in R^M whose jth component equals 1 and whose remaining components equal 0. Since this is supervised learning, the output coding can be designed arbitrarily. For example, the number of output units may be larger than the number of classes to be discriminated, and two or more output neurons may be stimulated at the same time.
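For concreteness, the following minimal Python sketch illustrates steps 1 and 2. It is not the authors' implementation: ordinary k-means stands in for the ISODATA algorithm, plain gradient descent stands in for the modified Polak-Ribiere method used later in the paper, and all function names, sizes, and hyperparameters are illustrative assumptions.

```python
# Illustrative sketch of preprocessor steps 1 and 2 (not the authors' code):
# k-means stands in for ISODATA, plain SGD stands in for the modified
# Polak-Ribiere algorithm, and all names and hyperparameters are assumptions.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans


def split_training_set(X, y, n_subsets):
    """Step 1: group the samples into subsets, one per identified mode/region."""
    labels = KMeans(n_clusters=n_subsets, n_init=10).fit_predict(X)
    return [(X[labels == k], y[labels == k]) for k in range(n_subsets)]


def make_mlp(sizes):
    """Two-hidden-layer sigmoid MLP, e.g. sizes = (2, 20, 8, 2) for a 2-20-8-2 net."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(n_in, n_out), nn.Sigmoid()]
    return nn.Sequential(*layers)


def train_local_net(net, X, y, n_classes, epochs=2000, lr=0.1):
    """Step 2: train one local net on one subset with one-hot targets D_j of Eq. (1)."""
    X_t = torch.as_tensor(X, dtype=torch.float32)
    D = torch.eye(n_classes)[torch.as_tensor(y, dtype=torch.long)]  # rows are D_j
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # squared-error objective, as in the paper
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(net(X_t), D).backward()
        opt.step()
    return net
```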

Step 3: Preprocessor configuration and global training data generation. In this step, the input layers of all local nets are connected together and all their outputs are combined to form a larger output vector; that is, the individual classification spaces are merged to form a new input space for the global training. The combined net is called the preprocessor or transformer. The task of the preprocessor is to process or filter the original samples to generate the global samples. In the global net training phase, the preprocessor provides global training samples to the global net.
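A minimal sketch of step 3 follows, under the same assumptions as above (the function name and interfaces are illustrative, not taken from the paper): each original sample is passed through every trained local net and the outputs are concatenated to form the corresponding global training sample.

```python
# Minimal sketch of step 3 (illustrative; the function name is an assumption):
# concatenating the outputs of all K local nets maps each original sample into
# the new input space of the global net.
import torch


def generate_global_samples(local_nets, X):
    """Run the preprocessor: pass X through every local net and merge the outputs."""
    X_t = torch.as_tensor(X, dtype=torch.float32)
    with torch.no_grad():
        outs = [net(X_t) for net in local_nets]  # each output has shape (N, M)
    return torch.cat(outs, dim=1)                # shape (N, K*M)
```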

Fig. 2 Multiregional distribution of two-class samples: light gray areas, class A; dark gray areas, class B (overlapped border areas are not shown).


2.2.2 Global net training and TLM classifier configuration

The global classifier is configured by training an MLP with the generated global training data. This is the second phase of learning. Since the distribution of the new training samples has been supervised artificially, this learning is also easier (in fact, when the output pattern of Eq. (1) is used, many global training samples can be discriminated simply by some form of voting), and conventional techniques can be employed to increase the global net's generalization capability. From a mapping point of view, this trained MLP can be considered a function that maps regions in the new input space to the final classification space. After the global classifier is obtained, a TLM classifier is configured by connecting the preprocessor and the global classifier together, i.e., connecting the local nets' outputs to the global net's input layer. In the pattern recognition phase, when an unknown sample is received, the preprocessor maps the data into the new space, and the global classifier then maps the new data to the classification space and makes the final classification decision.

From the configuration procedure we can see that the TLM classifies patterns differently from the MLP. The MLP partitions the input space directly, or maps the input space into some new space and then partitions the new space. When a very complicated distribution is involved, it may be extremely difficult for the MLP to partition the input space directly or to form a suitable interim space indirectly through its hidden layers; the learning is therefore vulnerable and apt to sink into the fatal local minima. The TLM, on the other hand, explicitly employs an interim space to ease the partition. In addition, the interim space is formed by supervised learning and can therefore be designed to make the partition much easier. By splitting one learning into two, the learning difficulty and the chance of sinking into the fatal local minima are greatly reduced. Note that there can be some trade-off between the local learning and the global learning in the TLM.
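The recognition-phase data flow can be sketched as follows (the class and method names, and the argmax decision rule, are assumptions; only the two-stage structure follows the text): the preprocessor maps the unknown sample into the interim space, and the global net then maps it to the classification space.

```python
# Minimal sketch of a configured TLM in the recognition phase (class and method
# names are assumptions; the argmax decision rule is also an assumption).
import torch
import torch.nn as nn


class TLMClassifier(nn.Module):
    def __init__(self, local_nets, global_net):
        super().__init__()
        self.local_nets = nn.ModuleList(local_nets)  # first stage: preprocessor
        self.global_net = global_net                 # second stage: global classifier

    def forward(self, x):
        # Preprocessor: map the sample into the new (interim) space.
        z = torch.cat([net(x) for net in self.local_nets], dim=1)
        # Global classifier: map the new data to the classification space.
        return self.global_net(z)

    def classify(self, x):
        with torch.no_grad():
            return self.forward(x).argmax(dim=1)     # final class decision
```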

3 Two-Class Example with Four-Region Distribution

To verify the effectiveness of the TLM, we conducted a simulation experiment with two classes of samples distributed in four regions.

3.1 Databases

We generated two databases, one consisting of 10,000 samples for training and the other consisting of 100,000 samples for testing. The samples are made up of two classes: class A and class B. Figure 2 shows their distribution, where class A is distributed in two separate ring regions, and so is class B. Specifically,

Prob{class A} = 1,                      0.1 ≤ r < 0.19,
Prob{class A} = Prob{class B} = 0.5,    0.19 ≤ r ≤ 0.21,
Prob{class B} = 1,                      0.21 < r < 0.29,
Prob{class A} = Prob{class B} = 0.5,    0.29 ≤ r ≤ 0.31,
Prob{class A} = 1,                      0.31 < r < 0.39,
Prob{class A} = Prob{class B} = 0.5,    0.39 ≤ r ≤ 0.41,
Prob{class B} = 1,                      0.41 < r ≤ 0.5,

where r denotes the radius of the ring regions. Each sample is generated randomly; its location is determined by two independent random numbers: the radius, which takes values in the range [0.1, 0.5], and the angle, which takes values in the range [0, 2π].

3.2 TLM and MLP Architecture Parameters

We split the input space into two subspaces, i.e., we split the training sample set into two subsets. Subset 1 consists of samples coming from the two internal ring regions, and subset 2 consists of samples coming from the two external ring regions. We configured each local net as an MLP with two hidden layers.


The sizes of the two local nets were determined by trial and error, and the sigmoid function is used as the unit function. It was found that putting 20 units in the first hidden layer and 8 units in the second hidden layer, together with 2 input units and 2 output units, yields a reasonably good training classification rate (TCR) on each subset (we use the abbreviation 2-20-8-2 to denote the net size). We also configured the global net as an MLP of size 4-20-8-2. The global net has 256 weights, so the whole TLM has 512 weights. To compare their performances fairly, we also configured two MLPs, one of size 2-20-10-2 and the other of size 2-30-14-2; their weight counts (260 and 508) are comparable to those of the global net and the whole TLM, respectively.
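The following sketch reconstructs the four-ring data of Fig. 2 and the two-subspace split used here. It is a reconstruction under stated assumptions: the exact treatment of the 0.02-wide overlap bands, the random seed, and all names are illustrative choices, not the authors' code.

```python
# Sketch of the four-ring data of Fig. 2 and the two-subspace split (a
# reconstruction under assumptions: the handling of the 0.02-wide overlap
# bands and the random seed are illustrative choices).
import numpy as np

rng = np.random.default_rng(0)


def make_ring_samples(n):
    r = rng.uniform(0.10, 0.50, size=n)            # radius in [0.1, 0.5]
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)  # angle in [0, 2*pi]
    X = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    ring = np.minimum(((r - 0.10) / 0.10).astype(int), 3)  # ring index 0..3
    y = ring % 2                                   # rings alternate: 0 = A, 1 = B
    # In the three 0.02-wide border bands both classes occur with equal probability.
    overlap = (np.abs(r - 0.20) <= 0.01) | (np.abs(r - 0.30) <= 0.01) | (np.abs(r - 0.40) <= 0.01)
    y[overlap] = rng.integers(0, 2, size=overlap.sum())
    return X, y, r


X_train, y_train, r_train = make_ring_samples(10_000)
# Subset 1: the two internal rings (r < 0.3); subset 2: the two external rings.
inner = r_train < 0.30
subset1 = (X_train[inner], y_train[inner])
subset2 = (X_train[~inner], y_train[~inner])
```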

3.3 Performance Comparison of the TLM versus the MLP

3.3.1 Learning behavior

In all our experiments, we applied the modified Polak-Ribiere (MPR) learning algorithm12 to train the MLP as well as all the local nets and the global net in the TLM. The MPR is a line-search gradient descent algorithm. Its objective is to minimize the squared error between the desired outputs and the actual outputs of the network, and it ensures convergence at least to one of the minima of the error function. In the MPR, the learning process stops if one of the following three criteria is met: (1) the number of iterations exceeds its maximum allowable threshold, (2) the relative weight variation is less than a given value, or (3) the total error is less than a given threshold and the training classification rate (TCR; the percentage of training samples that can be correctly classified during training) is equal to or close to 100%. The performance of the MPR proved to be better than that of the standard backpropagation algorithm.13

At first, we tried to train an MLP of size 2-20-10-2 directly with 10,000 samples, starting with random initial weights. We found that the learning quickly sank into a fatal local minimum. Then we tried incremental training. That is, the learning started at a random initial point with 100 samples, and the number of samples was then gradually increased to 1000, 2000, 3000, 5000, 8000, and finally 10,000. Figure 3(a) shows the final learning behavior of the MLP of size 2-20-10-2 with 10,000 samples. The figure indicates that the converged network did not yield a reasonably high TCR. The reason might be either that the network size is inappropriate or that the initial point is not good. However, after the incremental learning experiment, we believe it is more likely that the network size is not large enough. Therefore, we changed the net size to 2-30-14-2 and tried two more trainings. In the first, we used the 10,000 samples to train the network directly with randomly initialized weights. The learning behavior is shown in Fig. 3(b): the learning clearly sank into a fatal local minimum. In the second training, we first used 1000 samples to train the network with the same random initial weights; after convergence, the weights were used as the initial point to retrain the network with the 10,000 samples. Figure 3(c) shows the final learning behavior. It can be seen that neither the incremental learning nor the random-initial learning resulted in a valid classifier. The reason might still be that either the network size is not large enough or the initial point is not good enough.

Fig. 3 Learning behavior of the MLP for (a) net size of 2-20-10-2, 10,000 samples, and initial weights that are incremental from 100 to 8000 samples; (b) net size of 2-30-14-2, 10,000 samples, and initial weights that are random numbers; and (c) net size of 2-30-14-2, 10,000 samples, and initial weights that are converged weights of 1000 samples.

In general, when the training samples have a multimodal distribution, training the MLP is a difficult task: the global minimum, or even a pseudoglobal minimum, is hard to find, and the best generalization performance cannot be guaranteed. To compare the learning behaviors, we used the same samples to train a TLM of comparable size. It was found that both the local net training and the global net training in the TLM were much easier.


Fig. 5 Comparison of classification performance for an MLP of size 2-20-10-2 and a TLM with a local net size of 2-20-8-2 and a global net size of 4-20-8-2.

For example, Fig. 4(a) shows the learning behavior of local net 1, where random initial weights were given and the 4000 samples in subset 1 were used directly; Fig. 4(b) shows the learning behavior of the global net with 10,000 global samples, started from a random initial point; and Fig. 4(c) shows the learning behavior of the global net with the 10,000 global samples, started from the initial point obtained by training the network with 100 global samples.

Fig. 4 Typical learning behavior of the TLM for (a) the training classification rate of local net 1 with a net size of 2-20-8-2, 4000 samples, and initial weights that are random numbers; (b) the training classification rate of the global net with a net size of 4-20-8-2, 10,000 samples, and initial weights that are random numbers; and (c) the training classification rate of the global net with a net size of 4-20-8-2, 10,000 samples, and initial weights that are converged weights of 100 samples.

It can be seen from Fig. 4 that both the local learning and the global learning can start at a random initial point. The global learning converges very rapidly, no matter what kind of initial weights are given. The local learning converges relatively slowly because of the more complicated distributions involved. However, since the complexity of the sample distribution in each subspace is greatly reduced by the decomposition of the input space, the local learning is much less likely to sink into a local minimum; in fact, no fatal local minimum was observed in the local learning. Comparing the typical learning behaviors, we can see that in this particular example a valid TLM was obtained after three learnings (two local and one global), whereas a valid MLP of the same size was not obtained after a prelearning and two final learnings started at different initial points. Of course, there may be cases in which a valid MLP can be obtained on the first trial with a random initial point, but when a multimodal distribution is involved, the probability that this happens is very low. Therefore, on average, training a TLM is easier than training an MLP if the training samples have a multimodal distribution.

3.3.2 Test performance

Theoretically speaking, if the MLP and the TLM have the same network size and both reach their global minima during learning, their generalization performance should be identical, or at least comparable. In practice, however, the global minimum may not be reached and the learning usually converges to a pseudoglobal minimum, and different pseudoglobal minima may yield different generalization performance. Therefore, we need to find out which network performs better. To compare their classification performance, we also conducted incremental training for the TLM. Figure 5 shows their classification accuracies on the test sample set. Note that for each number of training samples, two test classification rates were calculated for each network.

Here, by converge we mean that the TCR reaches a stable high percentage. Maximizing the TCR and minimizing the error function usually, but not always, happen at the same time.


One was obtained at the highest TCR and the other at the end of training. Shown in the figure is the better of the two classification rates obtained when the specified number of samples was used for training. Note also that, for the TLM, the number of training samples refers to the global training samples. From the figure it can be seen that the best test classification rates of the MLP and the TLM are comparable, and that the TLM yields more robust performance. The reason the MLP has a slightly inferior test classification rate is that it has a smaller size and did not converge to any pseudoglobal minimum at the maximum number of training samples. If we could make the MLP converge to the global minimum with the 10,000 samples, it would yield a better classification rate, perhaps even better than that of the TLM. However, it should not be much better, because in this example there is a 15% overlapped area and the ideal classification rate is only slightly higher than 92.5% (assuming the ideal classifier can perfectly recognize samples in the nonoverlapped area and, in the worst case, can recognize half of the samples in the overlapped area). The best classification rate achieved by the TLM is 92.579%, which is very close to the ideal rate. This result indicates that the TLM's generalization performance is at least comparable to that of the MLP, if not better.

4 ATR Experiments with ISAR Data

Fig. 6 Example of 1 ft × 1 ft ISAR images (HH).

To test the feasibility of the TLM in the real world, we conducted a series of experiments with ISAR data. In this section we describe these experiments and present the results. As a comparison, we also give the MLP results. Note that all MLPs, including those in a TLM, contain two hidden layers, that the number of hidden units is set heuristically in most cases, and that the sigmoid function is used as the unit function.

4.1 Database

The database14,15 we used in our experiments contains two kinds of data: one-dimensional (1-D) data and two-dimensional (2-D) data. The way in which the data were collected and the targets involved are the same for both. There are four types of targets: a Chevrolet Camaro, a Dodge Van, a Dodge pickup Wagon, and an International Bulldozer, as shown in Fig. 6. The rotary-platform millimeter-wave radar data were obtained from the Massachusetts Institute of Technology (MIT) Lincoln Laboratory. They consist of ISAR data with 35 GHz nominal frequency, 5.5-deg depression angle, 1-ft range resolution, and calibrated I and Q channels with HH (horizontal transmit, horizontal receive), HV (horizontal transmit, vertical receive), and VV (vertical transmit, vertical receive) polarizations. The data were originally collected by turning the targets over a complete 360-deg angle at 0.04-deg azimuth intervals. The 1-D data contain 17,657 target samples, and each sample is a range profile vector (32 ft) collected in the HH channel. The 2-D data contain 21,158 target images. The original image dimension is 32 × 20 pixels; in the data, the images were filtered by a polarimetric whitening filter16 (PWF) and compressed by a window-slicing technique to a 15 × 9 dimension.15 In the experiments, both the 1-D and the 2-D data sets were split into two subsets, one for training and the other for testing.

4.2 Decomposition of Input Space and Learning Algorithm

In our experiments, since the data (radar images) were collected by putting the targets (vehicles) on a turntable, each radar image was associated with an azimuth angle, and images of the same target may vary with that angle. With this prior knowledge and after some preliminary experiments, we found that the same target sustains a relatively homogeneous appearance within a 45-deg sector. Therefore, we divided both the 1-D and the 2-D data sets into eight subsets, each covering a 45-deg azimuth interval, and used eight local nets to classify the corresponding samples.

4.3 Experimental Results with 1-D Data

In these experiments, 8000 samples were taken uniformly from the original data set as training samples, and the remaining 9657 samples were used as test samples.

4.3.1 Performance of the MLP on 1-D data

To configure a valid MLP classifier, three learning strategies with different net sizes were tried: (1) 8000 samples with random initial weights, (2) abrupt incremental learning, and (3) gradual incremental learning. With the first strategy, random initial weights were used and all 8000 training samples were input to train an MLP. Five different net sizes were attempted: 32-3-2-4, 32-10-5-4, 32-16-8-4, 32-30-20-4, and 32-50-30-4. The same phenomenon was observed in every case: the training classification rate was only about 25%, and no valid MLP classifier could be obtained with the 8000 training samples. With the second strategy, four samples were chosen randomly to train the five networks with random initial weights; after convergence, the resulting weights were used as the initial weights and the 8000 samples were then input to train the networks. Networks of sizes 32-10-5-4 and 32-16-8-4 were also tried with intermediate trainings of 400 and 800 samples. Unfortunately, the results remained the same: learning sank into a fatal local minimum and no valid classifier was obtained with the 8000 training samples. With the third strategy, the samples were added gradually. Two networks, the smallest and the largest, were tried. For the network of size 32-3-2-4, the number of samples was increased from 4 through 400, 800, and 2000 to 8000. Table 1 lists the learning behavior of this net. In the meantime, before adding samples, the recognition performance of the trained net was evaluated using samples in the test set.


Table 1 Learning behavior of the MLP (32-3-2-4) on 1-D data.

Number of training samples           4        400      800      2000     8000
Iterations tried                     17,721   90,000   50,200   90,000   63,500
Training classification rate (%)     100.0    93.0     84.7     84.1     25.3

Fig. 7 Test performance of the MLP (32-3-2-4) on 1-D data.

Fig. 9 Test performance of the TLM on 1-D data for local nets of size 32-3-2-4 and a global net of size 32-3-2-4.

Fig. 8 Test performance of the MLP (32-50-30-4) on 1-D data.

Fig. 10 Test performance of the TLM on 1-D data for local nets of size 32-16-8-4 and a global net of size 32-16-8-4.

Figure 7 shows the recognition rate versus the number of training samples. The results show that the network could be successfully trained with up to 2000 samples, that the best performance was obtained with 800 samples, and that the network could not be trained with 8000 samples. The reason the performance deteriorated when the number of training samples exceeded 800 may be the multimodal distribution combined with the small network size. For the network of size 32-50-30-4, the number of training samples was increased from 4 through 100, 800, and 2000 to 8000. Figure 8 shows the recognition rate versus the number of training samples. The results indicate that this network can be trained with 8000 samples using the gradual incremental learning strategy, and an accuracy of about 85% can be obtained. Although the final results are somewhat satisfactory, it is obvious that the MLP architecture is vulnerable to local minima, and an appropriate network size is critical.
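The gradual incremental learning strategy used above can be sketched as follows (illustrative only: plain SGD replaces the modified Polak-Ribiere line-search optimizer, and the sample schedule, epoch count, and learning rate are assumptions). The key point is that each stage is warm-started from the weights converged at the previous, smaller stage.

```python
# Sketch of the gradual incremental learning strategy (illustrative only: plain
# SGD replaces the modified Polak-Ribiere line-search optimizer, and the
# schedule, epoch count, and learning rate are assumptions).
import torch
import torch.nn as nn


def incremental_train(net, X, D, schedule=(4, 400, 800, 2000, 8000),
                      epochs_per_stage=1000, lr=0.05):
    """Train on a growing number of samples, warm-starting each stage from the
    weights converged at the previous, smaller stage."""
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for n in schedule:
        Xn, Dn = X[:n], D[:n]  # the first n training samples and their targets
        for _ in range(epochs_per_stage):
            opt.zero_grad()
            loss_fn(net(Xn), Dn).backward()
            opt.step()
        # the converged weights are carried over into the next, larger stage
    return net
```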

4.3.2 Performance of the TLM on 1-D data

The TLM classifier consists of two parts and requires two phases of learning for its configuration. To be comparable, three combinations of network sizes were tried: (1) all local nets 32-3-2-4 and global net 32-3-2-4; (2) all local nets 32-16-8-4 and global net 32-16-8-4; and (3) all local nets 32-50-30-4 and global net 32-50-30-4. Figures 9 through 11 show their respective performance. Figure 9 plots the recognition rate versus the available training samples, obtained by training both the local nets and the global net with the given training samples (e.g., if only 800 samples can be used, use 100 samples to train each local net and all 800 samples to train the global net). In Figs. 10 and 11, the curves of recognition rate versus the number of global training samples are plotted; these were obtained by first training the local nets with the 8000 samples and then training the global net with the given samples.
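A sketch of this 1-D TLM configuration is given below. The helper names (make_mlp, train_local_net, generate_global_samples) refer to the illustrative routines sketched in Section 2, and azimuths_train, X_train, and y_train are hypothetical placeholders for the training data; only the eight-sector decomposition and the 8 × 4 = 32 global inputs follow the text.

```python
# Sketch of the 1-D ISAR TLM configuration (make_mlp, train_local_net, and
# generate_global_samples are the illustrative routines sketched in Sec. 2;
# azimuths_train, X_train, and y_train are hypothetical placeholders).
N_SECTORS = 8   # eight 45-deg azimuth sectors
N_CLASSES = 4   # Camaro, Van, pickup Wagon, Bulldozer


def sector_index(azimuth_deg):
    """Map an azimuth angle in degrees to one of the eight 45-deg subsets."""
    return int(azimuth_deg % 360.0) // 45


# local_nets = [make_mlp((32, 16, 8, 4)) for _ in range(N_SECTORS)]
# for k, net in enumerate(local_nets):
#     in_sector = [sector_index(a) == k for a in azimuths_train]
#     train_local_net(net, X_train[in_sector], y_train[in_sector], N_CLASSES)
# Z_train = generate_global_samples(local_nets, X_train)  # 8 x 4 = 32 global inputs
# global_net = make_mlp((32, 16, 8, 4))                   # then trained on Z_train
```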

Table 2 Local net learning behavior on 1-D data (32-50-30-4).

                        Net 1    Net 2    Net 3    Net 4    Net 5    Net 6    Net 7    Net 8
Iterations              15,600   43,000   12,600   17,200   11,500   24,700   46,700   21,700
Training rate (%)       98.9     98.3     99.9     99.7     99.9     99.4     98.2     99.5
Local recognition (%)   92.2     88.0     92.8     97.1     94.1     92.3     89.5     94.6

From the figures, it can be seen that the TLM performs consistently better than the MLP on these data. That is, the TLM is not only easier to train but also yields better test performance. If all the local nets are considered as a preprocessor, the results in Fig. 9 indicate that, after being processed, the training data become more learnable, i.e., the entire set of global training samples can be handled by a global net of the same size (32-3-2-4), while Fig. 7 indicates that the MLP (32-3-2-4) cannot handle the same data set (e.g., 8000 samples). These results are consistent with those of the two-class example in Section 3. Moreover, since the training set was split into eight subsets and each target in each subset was supposed to have only one mode (assuming each target looks similar within a 45-deg observation sector), local learning with the samples in each subset was, as expected, much easier, and sinking into local minima was no longer observed. However, the local learning behavior depends on whether or not the input space is decomposed properly. If multiple modes still exist in a subset or its samples are not homogeneous, learning associated with such a subset may still be difficult. In fact, our uniform decomposition of the input space in this ATR problem is not optimal. As an example, Table 2 gives the local net learning behavior corresponding to the largest network size, where the local recognition rate refers to the percentage of samples in the corresponding test subset that the local net recognizes correctly. It can be seen that the second and seventh local nets did not perform as well as the other networks. We conjecture that the performance might be improved if the decomposition were done more appropriately. From these comparison experiments, it can be seen that if the TLM is considered as a generalized MLP with a preprocessor, adding such a preprocessor yields better performance.

That is, after being preprocessed, the training data become more learnable, and the globally trained MLP has better generalization ability than the directly trained MLP.

4.4 Experimental Results with 2-D Data

In these experiments, 7040 samples were taken uniformly from the data set to constitute the training set, and the remaining 14,118 samples constituted the test set. The incremental learning procedure was used. Since a 2-D image provides more information, discrimination of different targets is easier. However, for some net sizes the MLP still had the local minimum problem, and the TLM demonstrated its superiority over the MLP.

4.4.1 Performance of the MLP on 2-D data

To evaluate the MLP performance on the 2-D data, we tried four net sizes: 135-10-5-4, 135-100-50-4, 135-200-150-4, and 135-450-225-4. The first size attempted was actually 135-100-50-4, and the number of training samples was increased from 4 through 36, 144, 360, and 1440 to 7040. It was observed that the best recognition rate of 68% was obtained when 36 training samples were used, and after the number of samples was increased to 1440, learning sank into a local minimum at which only about 25% of the training samples were correctly discriminated. When the network size was increased to 135-200-150-4, a similar phenomenon was observed. When the network size was further increased to 135-450-225-4 (which is as large as the TLM described in the next experiment), a local minimum was observed even earlier, i.e., when 144 samples were used; in this case, the best recognition rate was only 49.9%. These results indicate that the MLP is apt to sink into local minima, especially when its network size is chosen incorrectly and the training sample set contains multiple modes. As in the 1-D data experiments, we tried to change the network size to prevent such local minima. It was found that when the size was decreased to 135-10-5-4, no local minimum was observed. Figure 12 shows the recognition performance versus the number of training samples; an accuracy of about 96.0% can be obtained.

4.4.2 Performance of the TLM on 2-D data

To evaluate the TLM behavior on the 2-D data, we chose the following configuration: all local nets 135-100-50-4 and global net 135-100-100-4. Figure 13 shows the recognition rate versus the number of global training samples. It indicates that a 97.5% accuracy can be obtained, which is better than the best performance of the MLP shown in Fig. 12. As expected, after the decomposition no fatal local minimum was observed in either the local learning or the global learning, although the network sizes are as large as those of the MLP.

Fig. 11 Test performance of the TLM on 1-D data for local nets of size 32-50-30-4 and a global net of size 32-50-30-4.


Fig. 12 Test performance of the MLP on 2-D data (135-10-5-4).

Other researchers have also investigated the same ATR problem. For example, Novak14 achieved an average correct recognition rate of 92.2% using a profile-matching algorithm. He used relatively fewer training samples to create the prototypes for each target, so his results cannot be compared directly with ours. However, we believe that even if the same number of training samples were used, the profile-matching classifier might not perform better than the TLM.

5 Summary and Conclusions

In this paper, we have proposed a TLM neural network architecture for the recognition of confusable or multimodal patterns. The basic ideas are to decompose the input space, make the data more learnable, and avoid the fatal local minima caused by the multimodal distribution of samples. The TLM consists of a preprocessor and a global classifier. The configuration of the preprocessor consists of three steps: (1) identify the modes in the sample set; (2) decompose the input data set into subsets, and use the samples in each subset to train a local MLP; and (3) connect the input layers of the individually trained MLPs in parallel and combine their output layers. The global classifier is configured as a single MLP and is trained with the new data obtained through the preprocessor.

We have applied the TLM to a complex two-class classification problem and to radar target recognition, and compared its performance with that of the MLP. We can draw two conclusions from the experiments. First, the TLM is much easier to train. If the samples have a multimodal distribution, MLP learning that uses all samples in the training set and starts from random initial weights will often converge to a local minimum, sometimes a fatal one; for it to work in those cases, many trials must be performed to find a good initial point. In contrast, neither the local nets nor the global net in the TLM needs to be trained incrementally. If the input space is properly decomposed, i.e., there is no multimodal distribution within any subspace, then learning started at a random initial point will rarely become stuck in any fatal local minimum. Second, the TLM performance is better and more robust. The MLP performance depends strongly on its architecture parameters as well as on the number of training samples used, and inappropriate selection of training samples may yield very poor performance or may even fail to produce a valid classifier. A valid TLM classifier can be obtained easily, regardless of the architecture size and the number of training samples used, and its performance does not change dramatically with the network size or the number of training samples. Based on the experiments, it can be concluded that, with the TLM architecture and the configuration procedure, the chance that learning becomes stuck in the fatal local minimum can be greatly reduced. The TLM provides an alternative, and perhaps more efficient, method of handling the fatal local minimum problem caused by the multimodal distribution of the data.

References
1. T. M. Nabhan and A. Y. Zomaya, "Toward generating neural network structures for function approximation," Neural Net. 7(1), 89–99 (1994).
2. E. B. Bartlett, "Dynamic node architecture learning: an information theoretic approach," Neural Net. 7(1), 129–140 (1994).
3. Z. Wang, C. D. Massimo, M. T. Tham, and A. J. Morris, "A procedure for determining the topology of multilayer feedforward neural networks," Neural Net. 7(2), 291–300 (1994).
4. N. Baba, "A new approach for finding the global minimum of error function of neural networks," Neural Net. 2(3), 367–373 (1989).
5. N. Baba, Y. Mogami, M. Kohzaki, Y. Shiraishi, and Y. Yoshida, "A hybrid algorithm for finding the global minimum of error function of neural networks and its applications," Neural Net. 7(8), 1253–1265 (1994).
6. A. Waibel, "Modular construction of time-delay neural networks for speech recognition," Neural Comput. 1(1), 39–46 (1989).
7. A. Waibel and J. Hampshire, "Building blocks for speech," Byte 14(8), 235–242 (1989).
8. M. M. Van Hulle and T. Tollenaere, "A modular artificial neural network for texture processing," Neural Net. 6(1), 7–32 (1993).
9. S. Santini, A. D. Bimbo, and R. Jain, "Block-structured recurrent neural networks," Neural Net. 8(1), 135–147 (1995).
10. G. H. Ball and D. J. Hall, "Isodata: an iterative method of multivariate analysis and pattern classification," in Proc. IFIPS Congr. (1965).
11. S. Yang and Y. Ke, "An unsupervised clustering algorithm for template creations," J. Beijing Inst. Technol. 12(3), 37–42 (1992).
12. S. Yang, Y. Ke, and Z. Wang, "A modified Polak-Ribiere learning algorithm," Adv. Model. Simulat. 31(3), 7–12 (1992).
13. A. H. Kramer et al., "Efficient parallel learning algorithms for neural networks," in Advances in Neural Information Processing Systems, Vol. 1, D. S. Touretzky, Ed., pp. 40–48, Morgan Kaufmann (1989).
14. L. M. Novak, "A comparison of 1-D and 2-D algorithms for radar target classification," in Proc. IEEE Int. Conf. on Systems Engineering, Vol. 1, pp. 6–12 (1991).
15. K. C. Chang and Y. C. Lu, "High resolution polarimetric SAR target classification with neural network," in Proc. Int. Joint Conf. of the 4th IEEE Int. Conf. on Fuzzy Systems and 2nd Int. Fuzzy Engineering Symp., Vol. 3, pp. 1681–1688, Yokohama (1995).
16. L. M. Novak and C. Netishen, "Polarimetric synthetic aperture radar imaging," Int. J. Imaging Syst. Tech. 4, 306–318 (1992).

Fig. 13 Test performance of the TLM on 2-D data (local nets, 135-100-50-4; global net, 135-100-100-4).

Shulin Yang received his BS from the University of Science and Technology of China in 1982, his MS from the Beijing University of Posts and Telecommunications in 1985, and his PhD from the Beijing Institute of Technology in 1991, all in electrical engineering. From 1991 to 1994 he was an associate professor at the Beijing Institute of Technology, where his main research activities were the study of neural network architectures and speech recognition. Currently, he is a research associate professor with the Center of Excellence in Command, Control, Communications, and Intelligence, George Mason University. His main research interests include neural networks, automatic target recognition, array signal processing, speech signal processing, and Bayesian network modeling.

Kuo-Chu Chang received a BS degree in communication engineering from the National Chiao-Tung University, Taiwan, in 1979 and MS and PhD degrees, both in electrical engineering, from the University of Connecticut in 1983 and 1986, respectively. From 1983 to 1992 he was a senior research scientist with the Advanced Decision Systems (ADS) division of Booz-Allen & Hamilton, Mountain View, California. He became an associate professor in the Systems Engineering Department of George Mason University in 1992. His research interests include estimation theory, optimization, signal processing, Bayesian probabilistic inference, and data fusion. He is particularly interested in applying unconventional techniques in conventional decision and control systems. He has published more than 60 papers in the areas of multitarget tracking, distributed sensor fusion, and Bayesian probabilistic inference. He is currently an editor on large-scale systems for IEEE Transactions on Aerospace and Electronic Systems. Dr. Chang is a member of Eta Kappa Nu and Tau Beta Pi.


