A NEURAL NETWORK LEARNING ALGORITHM FOR ADAPTIVE PRINCIPAL COMPONENT EXTRACTION (APEX)
S. Y. Kung and K. I. Diamantaras
Department of Electrical Engineering, Princeton University, Princeton NJ 08544

This research was supported in part by the National Science Foundation under Grant MIP-87-14689, by the Air Force Office of Scientific Research, and by the Innovative Science and Technology Office of the Strategic Defense Initiative Organization, administered through the Office of Naval Research.
ABSTRACT

This paper addresses the problem of the recursive computation of the principal components of a vector stochastic process. Applications of this problem arise in the modeling of control systems, high-resolution spectrum analysis, image data compression, motion estimation, etc. We propose a new algorithm, called APEX, which can recursively compute the principal components using a linear neural network. The algorithm is recursive and adaptive: given the first m-1 principal components, it can produce the m-th component iteratively. The paper also provides the numerical theoretical basis for the fast convergence of the APEX algorithm and demonstrates its computational advantages over previously proposed methods. An extension of APEX to the extraction of constrained principal components is also discussed.

INTRODUCTION

The problems of data compression and pattern classification have attracted a lot of attention from researchers in various fields such as pattern recognition, artificial intelligence, and signal processing. Both problems rely on finding an efficient representation of the input data. This representation should extract the essential information from the original sequence of patterns. For data compression this process is a mapping from a higher-dimensional (input) space to a lower-dimensional (representation) space. The data compression problem is also important in modeling multilayer perceptrons, where the hidden neurons may be regarded as a layer of nodes corresponding to the most effective representation of the input patterns. Their activation patterns can be viewed as the target patterns of the representation transformation, which may facilitate the classification problem tackled by the following layer(s). A powerful mathematical tool for extracting such representations is Principal Component Analysis (PCA), which derives an optimal linear transformation y = Px for a given target space dimension. The optimality criterion is based on the mean square error between the input data and their reconstruction obtained from the components y. Define R as the correlation matrix of the input: R = E{x x^T}. The rows of the optimal matrix P are the eigenvectors of R corresponding to its largest eigenvalues. This result stems from the Karhunen-Loeve theorem, which has been extensively studied in the past.

Recently, new techniques have been reported for the adaptive calculation of this transformation for a given set of random patterns [1]-[3]. In [1] a linear neural network is proposed (Figure 1) where only one output neuron y and n inputs x_1, ..., x_n are used for the most dominant component. The activation of y is just a linear combination of the inputs with the weights q_i, or more compactly y = q^T x, where q and x are the weight and input vectors respectively. The updating rule is

    Δq_i = β(y x_i − y^2 q_i).

Oja proves that the algorithm converges and extracts the first Principal Component (PC) of the input sequence; namely, in the steady state q equals the normalized eigenvector of R corresponding to the largest eigenvalue. To extend Oja's method to extract more than one principal component using multiple output nodes, Sanger [2] proposed a modified method based on the following updating rule

    ΔQ = β( y x^T − LT(y y^T) Q )                                            (1)

where y = Qx, LT(·) denotes the lower triangular part of a matrix (including the diagonal), and x, y are vectors with y having smaller or equal dimension than x. It is claimed [2] that the above algorithm converges to the optimal linear PCA transformation. One disadvantage of this approach is that the training of every neuron uses non-local information, resulting in many redundant computations. (A computational comparison will be given in Section 4.) To avoid this problem, Foldiak [3] proposed another method that combines the Hebbian learning embedded in Oja's training rule with a competitive learning scheme that drives the neurons to extract different eigenvectors. The drawbacks of this approach are that (1) the entire set of weights has to be retrained when one additional component is needed; and (2) the method does not produce the exact principal eigenvectors of R but rather a set of vectors that span the same space as the principal eigenvectors.

None of the previous methods can effectively support a recursive approach for the calculation of the m-th principal component given the first m-1 components. The motivation behind such an approach is the need to extract the principal components of a random vector sequence when the number of required PCs is not known a priori. It is also useful in environments where R may be slowly changing with time (e.g. in motion estimation applications). Then a new PC may be added to compensate for that change without affecting the previously computed PCs. (This is similar to the idea of lattice filtering used extensively in signal processing, where for every increase of the filter order one new lattice section is added to the original structure while all the old sections remain completely intact.) This idea leads to a new neural network called APEX, an acronym standing for Adaptive Principal-component Extractor, proposed in the next section.
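As a concrete illustration of Oja's single-neuron rule reviewed above, the following sketch (assuming NumPy; the synthetic data and variable names are ours, not from the paper) trains the weight vector q on a batch of patterns and compares it with the dominant eigenvector of R.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic zero-mean patterns with one dominant direction (illustrative data only).
    M, n = 200, 5
    X = rng.standard_normal((M, n)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])

    R = X.T @ X / M                      # sample correlation matrix R = E{x x^T}

    # Oja's rule: dq = beta * (y*x - y^2 * q), with y = q^T x.
    q = 0.1 * rng.standard_normal(n)
    beta = 0.01
    for _ in range(100):                 # 100 sweeps over the data
        for x in X:
            y = q @ x
            q += beta * (y * x - y**2 * q)

    # Compare with the principal eigenvector of R (sign is arbitrary).
    eigvecs = np.linalg.eigh(R)[1]
    e1 = eigvecs[:, -1]
    print(abs(q @ e1))                   # should be close to 1 once q has converged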


A NEW NEURAL NETWORK - APEX

The APEX neuron model is depicted in Figure 2. There are n inputs {x_1, ..., x_n} connected to m outputs {y_1, ..., y_m} through the weights {p_{ij}}. Additionally, there are anti-Hebbian weights w_j forming a row vector w that feeds information to output neuron m from all its previous ones. We assume that the input is a stationary stochastic process whose autocorrelation matrix R has n distinct positive eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_n. We also assume that the first m-1 output neurons correspond to the first m-1 normalized principal components of the input sequence. The most important feature of APEX hinges upon the fact that the m-th neuron is able to extract the largest component which is orthogonal to the first m-1 components represented by the already trained m-1 neurons. This will be referred to as the orthogonal learning rule. The activation of each neuron is a linear function of its inputs:

    y = Px                                                                   (2)
    y_m = p x + w y                                                          (3)

where x = [x_1 ... x_n]^T, y = [y_1 ... y_{m-1}]^T, P is the matrix of the p_{ij} weights for the first m-1 neurons, and p is the row vector of the p_{mj} weights of the m-th neuron. The algorithm for the m-th neuron is

    Δp = β (y_m x^T − y_m^2 p)                                               (4)
    Δw = −γ (y_m y^T + y_m^2 w)                                              (5)

where β and γ are two positive (equal or different) learning rate parameters. If we expand the above equations for each individual weight we get the following equalities:

    Δp_{m,j} = β (y_m x_j − y_m^2 p_{m,j}),    j = 1, ..., n
    Δw_j = −γ (y_m y_j + y_m^2 w_j),           j = 1, ..., m-1

Notice that the first equation is the same as Oja's adaptive rule, which is the Hebbian part of the algorithm. We shall show that it has the effect of driving the neuron towards the more dominant components. The second equation represents what we call the orthogonal learning rule. It is basically a reverse Oja rule, i.e. it has a similar form except for the opposite signs of the terms. The w-weights play the role of subtracting the first m-1 components from the m-th neuron. Thus, the m-th output neuron tends to become orthogonal to (rather than correlated with) all the previous components. Hence the orthogonal learning rule constitutes an anti-Hebbian rule. It is hoped that the combination of the two rules produces the m-th principal component. This will be proved by the numerical theoretical analysis in Section 3, and demonstrated by simulations in Section 4. The orthogonal learning rule also has a very important application to the problem of extracting constrained principal components, as briefly discussed in Section 5 and elaborated in a future paper [4].
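To make the update rules concrete, here is a minimal NumPy sketch of one training stage, following eqs.(2)-(5) as reconstructed above; the function name and data layout (rows of X are input patterns) are our own assumptions.

    import numpy as np

    def apex_train_mth_neuron(X, P, beta, gamma, sweeps=100, seed=0):
        """Train the m-th APEX neuron given the (m-1) x n weight matrix P of the
        previously trained neurons (a sketch of eqs.(2)-(5), not the authors' code)."""
        n = X.shape[1]
        rng = np.random.default_rng(seed)
        p = 0.01 * rng.standard_normal(n)           # feedforward weights of neuron m
        w = 0.01 * rng.standard_normal(P.shape[0])  # lateral (anti-Hebbian) weights
        for _ in range(sweeps):
            for x in X:
                y = P @ x                           # eq.(2): outputs of neurons 1..m-1
                ym = p @ x + w @ y                  # eq.(3): output of neuron m
                p += beta * (ym * x - ym**2 * p)    # eq.(4): Hebbian (Oja) part
                w += -gamma * (ym * y + ym**2 * w)  # eq.(5): orthogonal (anti-Hebbian) part
        return p, w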

NUMERICAL ANALYSIS PROOF OF THE ALGORITHM

Assume that neurons 1 through m-1 have converged to the first m-1 principal components, i.e. P = [e_1 ... e_{m-1}]^T where e_1, ..., e_{m-1} are the first m-1 normalized eigenvectors of R, and let p(t) = Σ_{j=1}^{n} θ_j(t) e_j^T, where t is the number of sweeps. (One sweep means one round of the training process involving all the given sample input patterns.) Let M be the number of input patterns. We shall divide the analysis into two parts: Part I for the analysis of the old first m-1 principal modes; Part II for the remaining new (i.e. m-th, ..., n-th) modes.

Part I: By averaging eq.(4) over one sweep (sweep t) and assuming p(t) approximately constant in this period of time, we can derive the following formula

    p(t+1) = p(t) + β̄ [ (p(t) + w(t)P) R − σ(t) p(t) ]                      (6)

where

    σ(t) = E{y_m^2(t)},    β̄ = Mβ                                           (7)

Figure 1. Oja's simplified neuron model.

For simplicity, in the following we will drop the index t from σ(t). By focusing on the eigenmodes we can derive the updating rule for θ_i:
    θ_i(t+1) = [ 1 + β̄(λ_i − σ) ] θ_i(t) + β̄λ_i w_i(t)                      (8)

By the same token, eq.(5) becomes

    w_i(t+1) = −γ̄λ_i θ_i(t) + [ 1 − γ̄(λ_i + σ) ] w_i(t)                     (9)

where γ̄ = Mγ. Refer to [4] for a more detailed derivation of eqs.(8) and (9).

Figure 2. The linear multi-output model. The solid lines denote the weights p_{m,j}, w_j which are trained at the m-th stage. (Note that the {w_j} asymptotically approach zero as the network converges.)

1 p(Ai - U )
-7Aj

pxj

1 - -/(A;

+U)

] [; : ]
(10)

(p = y) the system matrix has a double eigen-

p j ( t ) = 1 - pu(t) (11) which is less than 1 as long as p is a small positive number, Hence all 8; and w;tend asymptotically to 0 with the same
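For completeness, a quick check of the double-eigenvalue claim (our own computation, using the matrix in eq.(10) with β̄ = γ̄):

    det = [1 + β̄(λ_i − σ)][1 − β̄(λ_i + σ)] + β̄^2 λ_i^2 = 1 − 2β̄σ + β̄^2 σ^2 = (1 − β̄σ)^2
    trace = 2(1 − β̄σ)

so the characteristic polynomial is (ρ − (1 − β̄σ))^2, and ρ_i(t) = 1 − β̄σ(t) is indeed a repeated eigenvalue.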


Very importantly, the above relationship between β and γ may be exploited to select a proper learning rate β which guarantees a very fast convergence speed:

    β = γ = 1/(Mσ)                                                          (12)

Note that (12) is an optimal value (see eq.(11)). One way to estimate σ is to take the average of y_m^2 over every sweep. If β ≠ γ then there is extra flexibility in choosing the speed of the decay of every mode i (i = 1, ..., m-1). The control over the decay speed becomes even stronger if we introduce different parameters γ_i for the different w_i (i.e. for the different modes), thus selectively suppressing some modes more rapidly (or slowly) than others.

Part II: Here we consider only the modes i, for i ≥ m. We follow the same steps as above, but now w is removed from eq.(8) since there is no influence from w on those modes. (This is because the old nodes {y_1, y_2, ..., y_{m-1}} contain only the first m-1 components.) Hence, we have a very simple equation for every mode i ≥ m:

    θ_i(t+1) = [ 1 + β̄(λ_i − σ) ] θ_i(t)                                    (13)

According to Part I, θ_i and w_i will eventually converge to 0 (for i = 1, ..., m-1), and in that case we will have

    σ(t) = Σ_{i=m}^{n} λ_i θ_i(t)^2                                          (14)

Therefore eq.(13) cannot diverge, since whenever θ_i becomes so large that σ > λ_i, then 1 + β̄(λ_i − σ) < 1 and θ_i will decrease in magnitude. Assume that θ_m(0) ≠ 0 (with probability 1). For the convenience of the proof, let us also assume that the eigenvalues exhibit a strict inequality relationship, i.e. λ_1 > λ_2 > ... > λ_n. (The general case can be proved along exactly the same lines.) In this case, define the ratios r_i(t) = θ_i(t)/θ_m(t) for i > m; then by eq.(13) r_i(t+1) = [1 + β̄(λ_i − σ)]/[1 + β̄(λ_m − σ)] r_i(t), whose gain is strictly less than 1 since λ_i < λ_m, and hence

    r_i(t) → 0   as t → ∞                                                   (15)

since θ_m will remain bounded according to eqs.(13) and (14). Eq.(15) further implies that θ_i(t) → 0 as t → ∞ for i = m+1, ..., n. Then according to eq.(14), σ becomes λ_m θ_m^2 and eq.(13) for i = m becomes

    θ_m(t+1) = [ 1 + β̄λ_m (1 − θ_m(t)^2) ] θ_m(t)                           (16)

therefore

    θ_m(t) → 1   as t → ∞                                                   (17)

Hence the m-th normalized component will be extracted. (For more details refer to [4].) Note that eqs.(15) and (17) together imply that

    σ(t) → λ_m   as t → ∞                                                   (18)
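As a sanity check of the Part II argument, one can iterate eqs.(13)-(14) directly. The following toy computation (our own, under the reconstruction of the equations above) shows θ_m approaching 1 while the lower modes vanish.

    import numpy as np

    # Iterate eqs.(13)-(14) for the modes i >= m with eigenvalues lam,
    # starting from small random-like initial thetas.
    lam = np.array([1.0, 0.5, 0.25])          # lambda_m > lambda_{m+1} > ...
    theta = np.array([0.05, 0.05, 0.05])
    beta_bar = 0.5                            # sweep-level rate, beta_bar = M*beta
    for _ in range(200):
        sigma = np.sum(lam * theta**2)        # eq.(14)
        theta = (1.0 + beta_bar * (lam - sigma)) * theta   # eq.(13)
    print(theta)                              # first entry -> 1, the rest -> 0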

COMPUTATIONAL EFFICIENCY AND SIMULATION RESULTS

Based on the above theoretical analysis, the APEX algorithm is formally presented below.

4.1 APEX Algorithm

For every neuron m = 1 to N (N ≤ n):

1. Initialize p and w to some small random values.
2. Choose β, γ as in eq.(12) (see Section 4.2).
3. Update p and w according to eqs.(4), (5), until Δp and Δw are below a certain threshold.
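A minimal end-to-end sketch of the procedure above, assuming NumPy. The function name, the fixed sweep budget used in place of a threshold on Δp and Δw, the λ_{m-1}-based rate of Section 4.2, and the crude scaling used for the very first neuron are our own choices, not the paper's.

    import numpy as np

    def apex(X, N, sweeps=300, f=1.0, seed=0):
        """Extract N principal components one neuron at a time (sketch only)."""
        M, n = X.shape
        rng = np.random.default_rng(seed)
        P = np.zeros((0, n))                        # rows: already-extracted components
        lam_prev = np.mean(np.sum(X**2, axis=1))    # crude scale for the very first neuron
        for m in range(N):
            p = 0.01 * rng.standard_normal(n)       # step 1: small random initialization
            w = 0.01 * rng.standard_normal(m)
            beta = gamma = 1.0 / (M * lam_prev * f) # step 2: rate as discussed in Section 4.2
            for _ in range(sweeps):                 # step 3: fixed sweeps instead of a threshold
                for x in X:
                    y = P @ x
                    ym = p @ x + w @ y
                    p += beta * (ym * x - ym**2 * p)
                    w += -gamma * (ym * y + ym**2 * w)
            lam_prev = np.mean((X @ p)**2)          # estimate lambda_m for the next neuron
            P = np.vstack([P, p])
        return P

    # Usage: compare with the eigenvectors of the sample correlation matrix.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((20, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
    P = apex(X, N=3, f=5.0)
    R = X.T @ X / len(X)
    E = np.linalg.eigh(R)[1][:, ::-1][:, :3].T      # true top-3 eigenvectors (as rows)
    print(np.abs(P @ E.T))                          # approximately the identity (up to signs)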

4.2 Comparisons in Computational Efficiency

The APEX algorithm compares very favorably with all the previous ones and amounts to several orders of magnitude of computational saving. The claim can be supported by three different aspects.

1. Efficiency in Recursively Computing New PCs

For the recursive computation of each additional PC, the APEX method requires O(n) multiplications per iteration. In contrast, Foldiak's method would require recomputation of all the PCs (including the previously computed higher components), so it requires O(mn) operations.

2. Efficiency by Using a Local (i.e. Lateral-Node) Algorithm

The APEX algorithm also enjoys one order of magnitude of computational advantage over Sanger's method. More precisely, for the recursive computation of each additional PC, Sanger's method (cf. eq.(1)) requires (m+1)n multiplications per iteration for the m-th neuron, as opposed to 2(m + n − 1) multiplications per iteration in APEX. This significant computational saving stems from the fact that APEX uses only local (i.e. lateral) y nodes, which summarize the useful information for our orthogonal (reverse-Oja) training, thus avoiding the (redundant) repeated multiplications with the synaptic weights {p_{ij}} that arise when nonlocal (i.e. x) nodes are used, as is the case in Sanger's method.

3. Reduction of Iteration Steps by Adopting the Optimal Learning Rate

More importantly, our analysis provides a very powerful tool for estimating the optimal value for β, γ, as given in eq.(12). As another attractive alternative, instead of eq.(12) we propose to set

    β = γ = 1/(Mλ_{m-1})

This is an underestimated value of β as suggested by eq.(12), since λ_{m-1} > λ_m and lim_{t→∞} σ(t) = λ_m. A fairly precise estimate of λ_{m-1} can be achieved by averaging the squared output of the previous neuron over one sweep. This calculation needs to be done only once for every neuron. (In the case of training the very first neuron, a common scaling scheme may be adopted.)
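A small sketch of the rate selection just described (hypothetical names; f ≥ 1 is the optional conservative factor introduced in Section 4.3):

    import numpy as np

    # Estimate lambda_{m-1} by averaging the squared output of neuron m-1
    # over one sweep of the M training patterns.
    def learning_rate(y_prev, f=1.0):
        M = len(y_prev)                   # one sweep = M patterns
        lam_est = np.mean(y_prev**2)      # approx. lambda_{m-1} once neuron m-1 has converged
        return 1.0 / (M * lam_est * f)    # beta = gamma = 1/(M * lambda_{m-1} * f)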

4.3 Simulation Results

In our simulations, the number of random input patterns is M = 20 and the input dimension is 5. For the case β = γ we use the value 1/(Mλ_{m-1}) as discussed in the previous paragraph. Figure 3(a),(b) shows the absolute error between Average(y_m^2) and the actual λ_m, and the squared distance between the calculated vectors and the actual component vectors, as a function of the sweep number. The convergence is extremely fast, as expected from the theoretical analysis. The results are very close to the actual components even for small eigenvalues, and are almost perfectly normalized and orthogonal to each other. Using the above values for β and γ, however, the algorithm converges with a relatively large final error. In Table 1 we show the speed of convergence and the final squared distance between the calculated principal component vectors and the actual components. An obvious solution to this problem is to adopt more conservative values for the learning parameters in the fine-tuning phase (after the PC has already reached a certain close neighborhood of the solution). Table 1 summarizes the results when the values 1/(Mλ_{m-1}f) are used, with f = 1, 5, 10. The convergence is slower but towards a more accurate solution as f gets larger. We can achieve both favorable convergence speed and accuracy by adopting the following compromise: (a) in the initial phase, the algorithm starts with f = 1 for fast (but rough) convergence, and (b) in the fine-tuning phase, f is increased to achieve higher precision until a desired accuracy threshold is reached.

Figure 3. For β = γ = 1/(Mλ_{m-1}) the convergence speed is very fast for both (a) the computation of the eigenvalues and (b) the computation of the eigenvectors.

Table 1. Performance for different learning rates β = γ = 1/(Mλ_{m-1}f).

    factor f   Average sweep number where the squared       Average final squared
               distance is within 5% of the final value     distance (x 10^-3)
    1          21                                           1.23
    5          118                                          0.34
    10         194                                          0.32
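For reference, the two error measures reported in Figure 3 can be computed as in the following sketch (our function names; at convergence the lateral weights have decayed, so the learned vector of neuron m is just p_m):

    import numpy as np

    def eigenvalue_error(X, p_m, lam_m):
        y = X @ p_m                          # outputs of neuron m over one sweep
        return abs(np.mean(y**2) - lam_m)    # |Average(y_m^2) - lambda_m|

    def squared_distance(p_m, e_m):
        # squared distance to the true eigenvector, invariant to the arbitrary sign
        return min(np.sum((p_m - e_m)**2), np.sum((p_m + e_m)**2))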


EXTENDING APEX TO EXTRACT CONSTRAINED PRINCIPAL COMPONENTS



In the above we have presented a new neural network (APEX) for the recursive calculation of the principal components of a stochastic process. The new approach offers a considerable amount of computational advantage over the previous approaches. Some typical application examples include image data compression and SVD applications for modeling, spectral estimation, or high-resolution direction finding [5]. APEX also uniquely introduces a new application domain (which cannot be handled by any of the previously known algorithms) in its ability to extract constrained principal components. The problem is to extract the most significant components within a constrained space. (Such a problem arises in certain anti-jamming sensor array applications and image feature extraction applications.) In this case, the old nodes {y_1, y_2, ..., y_{m-1}} are not necessarily principal components of the input data. More generally, they may represent a set of arbitrary constraining vectors to which the search space is orthogonal. (For example, in the anti-jamming application they represent the directions of the jamming signals.) Here we shall show that in the steady state y_m will be orthogonal to all y_i, i = 1, ..., m-1. Suppose that y is a linear combination of x: y = Ax, where the rows of A are not necessarily principal components, and AA^T = I. Assuming the algorithm converges, the vector corresponding to y_m (i.e. p + wA) is orthogonal to the rows of A. This can be readily seen by setting eqs.(4) and (5) to 0. Multiplying eq.(4) by A^T on the right and adding eq.(5), we obtain the result sought:



    (p + wA) A^T = 0

This justifies the name orthogonal learning rule. The mathematical derivation of the convergence proof follows very closely that of Section 3, and the reader is referred to [4] for more details.
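A rough sketch of the constrained case (our own construction, assuming NumPy): the lateral inputs are the fixed constraint responses y = Ax rather than previously trained PCs, and at convergence the effective vector p + wA should be numerically orthogonal to the rows of A.

    import numpy as np

    rng = np.random.default_rng(1)
    n, M = 5, 200
    X = rng.standard_normal((M, n)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])

    # Constraint matrix A with orthonormal rows (A A^T = I), chosen at random here.
    A = np.linalg.qr(rng.standard_normal((n, 2)))[0].T

    p = 0.01 * rng.standard_normal(n)
    w = 0.01 * rng.standard_normal(A.shape[0])
    beta = gamma = 0.002
    for _ in range(300):
        for x in X:
            y = A @ x                       # fixed constraint responses
            ym = p @ x + w @ y
            p += beta * (ym * x - ym**2 * p)
            w += -gamma * (ym * y + ym**2 * w)

    # Entries should be close to 0 (up to stochastic fluctuation) at convergence.
    print(np.abs((p + w @ A) @ A.T))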

REFERENCES

[1] E. Oja, "A Simplified Neuron Model as a Principal Component Analyzer," J. Math. Biology, vol. 15, pp. 267-273, 1982.

[2] T. D. Sanger, "An Optimality Principle for Unsupervised Learning," in Advances in Neural Information Processing Systems, vol. 1, pp. 11-19 (D. S. Touretzky, ed.).

[3] P. Foldiak, "Adaptive Network for Optimal Linear Feature Extraction," Proc. IJCNN, pp. I-401-I-406, Washington DC, 1989.

[4] S. Y. Kung, "Adaptive Principal Component Analysis via an Orthogonal Learning Network," Proceedings, Int. Symp. on Circuits and Systems, New Orleans, May 1990.

[5] S. Y. Kung, D. V. Bhaskar Rao, and K. S. Arun, "Spectral Estimation: From Conventional Methods to High Resolution Modeling Methods," in VLSI and Modern Signal Processing (S. Y. Kung et al., eds.).
