Department of Informatics

Diploma Thesis

Author: Pavel Petrovič
Advisor: RNDr. Ľubica Beňušková, CSc.

signature
Acknowledgements
I would like to give my appreciation to my advisor RNDr. Ľubica Beňušková,
CSc. She provided me with all necessary information and references on BCM and a
lot of help. She created a very friendly and cooperative atmosphere, and invited me to
take part in scientific seminars held at the Department of Computer Science and
Engineering of the Slovak University of Technology.
Many thanks belong also to her colleague Ing. Peter Tiňo, CSc., who was a
generous source of many ideas, had a strong influence on the direction of our work at
the most crucial moments, and who provided me with Information Theory references
and software utilities.
I am also grateful to my tutor PhDr. Ján Šefránek, DrSc., who organized the
diploma thesis seminar, where we regularly discussed the progress of our work on the
thesis.
Table of Contents

Acknowledgements
Table of Contents
1. Introduction
2. Unsupervised Learning Rules
  2.1. Unsupervised Hebbian Learning
  2.2. Oja's Rule
  2.3. Principal component analysis (PCA)
  2.4. One-Layer Feed-Forward Networks
  2.5. Self-Organizing Feature Extraction with Hebbian learning
  2.6. Unsupervised Competitive Learning
3. BCM Learning Rule
  3.1. Introduction
  3.2. Basic concepts of the BCM Theory
  3.3. Experiments with the BCM Neuron and Time Sequences
    3.3.1. Introduction
    3.3.2. Theoretical background
    3.3.3. Implementation
    3.3.4. Results
Resume
References
Appendix A, Source Code and Examples of Data Files
  Program source code
Appendix B, Entropic Spectra
1. Introduction
One of the first (and often cited) works regarding neural networks, "The
Organization of Behavior" by Donald Hebb [Hebb49], introduces a basic principle for
updating synaptic strengths: "When an axon of cell A ... excite(s) cell B and
repeatedly or persistently takes part in firing it, some growth process or metabolic
change takes place in one or both cells such that A's efficiency, as one of the cells firing
B, is increased." It has not yet been completely revealed which underlying processes are
responsible for these changes in real biological neural networks.
Although detailed mathematical models of neurons based on differential
equations have been created, they were not acceptable in terms of computational
complexity. In addition, they contained too many modifiable parameters. One of the
primary challenges of artificial intelligence today is to develop an efficient
model of the neuron, as clearly stated by Rodney A. Brooks in [Selman96].
Many simplifying models have been developed, and they provided a large area for
formal research. This led to the establishment of a new independent area, artificial
neural networks, which, although originally biologically motivated, has moved far away
from real biological neural networks. Its classification and pattern-recognition methods
approach those of statistics, learning theory and information theory.
The central topic in artificial neural networks is the learning process, which is
driven primarily by the learning rule. The basic learning rules that have been suggested
have been exhaustively analyzed. Most of their limitations have been
uncovered, and it seems that the simplifications inherently contained in these models
imply, in most cases, too strong constraints. That is why it is reasonable to explore
learning rules based on different principles, and this is the motivation of the second
part of this thesis. The hope is that this laborious and iterative process will converge
to a more biologically plausible and efficient rule with either a general or a highly
specific purpose.
Two main approaches to learning rules and network architectures can be
identified: unsupervised and supervised learning. Supervised learning requires a large
set of training data, and it is usually not very adaptable to changing circumstances.
Once the training stage is finished, the network learning ceases. On the
other hand, the outputs of networks with unsupervised learning are of a different kind:
they form a representation F(x_i) of each input vector x_i. The network is dynamic, and
the weights are updated without external intervention; changes are implied only by the
inputs. The network must find the regularities, categories, correlations or patterns in
the input and organize itself to provide useful codes on the output side. Unsupervised
learning is possible only thanks to the redundancy in the input data; otherwise, it
would not be possible to differentiate between input carrying information and random
noise. According to [Hertz91], the output of an unsupervised neural network may
represent different features of the input:
Familiarity - a single continuous output expresses how similar a new input is to
the typical or average pattern seen in the past.
Clustering - the output of the network is a binary vector with only one bit identifying
the category. Each cluster of similar or nearby patterns is classified as a single
output class.
Prototyping - the input data are classified into categories, but for each input the
network outputs the prototype, i.e. a typical representative of the corresponding
category.
Encoding - the output is an encoded version of the input, in fewer bits, keeping as
much relevant information as possible.
Feature Mapping - similar input patterns are projected to nearby output units. A global
organization of the output units evolves.
These cases are not distinct and may be combined. Encoding can be performed
using principal component analysis or by clustering, also called vector quantization.
Principal component analysis can also be used for dimensionality reduction before
performing clustering or feature mapping, avoiding the "curse of dimensionality". Another
use of unsupervised learning is to replace supervised learning when possible, either
because of the computational complexity of supervised networks or to provide a
mechanism for the network to adapt to changes.
Unsupervised neural networks tend:
to be of a simple architecture (complexities lie in the learning rules)
not to have many layers (usually only one)
to contain many fewer output units than inputs (with the exception of Feature
Mapping)
to be more biologically plausible than other kinds of networks.
We can separate unsupervised networks into two main categories: those based on the
modified Hebb rule and those based on competitive learning. We describe them in two
separate subsections.
2.1. Unsupervised Hebbian Learning
Assume that the input vectors x are drawn from a probability distribution P(x). The
single linear output unit (see Figure 2.1) is the simplest example, and the formula
giving the output is:

F(x) = Σ_{j=1..N} w_j x_j = w^T x = x·w    (2.1)

Figure 2.1. A single linear output unit computing F(x) from inputs x_1, ..., x_N
through weights w_1, ..., w_N.

The plain Hebbian rule strengthens each weight in proportion to the product of the
input and the output:

Δw_j = η F(x) x_j    (2.2)
where η is the learning rate. Frequent input patterns strengthen the weights most,
and thus they produce the largest output. The problem with this rule is that the
weights keep growing forever. This can be fixed. Suppose that there exists an
equilibrium point for w. At equilibrium, we want the changes Δw to be 0 on average,
and thus:

0 = ⟨Δw_j⟩ = η ⟨F(x) x_j⟩ = η Σ_k ⟨x_k x_j⟩ w_k = η Σ_k C_jk w_k    (2.3)

or, in matrix form,

C w = 0    (2.4)

where C is the correlation matrix of the inputs:

C_jk = ⟨x_j x_k⟩    (2.5)
If there were such an equilibrium w, it would be an eigenvector of C with eigenvalue
0. But C is a positive semi-definite matrix, and any component of w along an
eigenvector with a positive eigenvalue grows exponentially: the direction of the
eigenvector with the largest eigenvalue λ_max of C becomes dominant, and the weights
diverge.

2.2. Oja's Rule

Oja's rule modifies the plain Hebbian rule with a decay term that keeps the weight
vector bounded:

Δw_j = η F(x) (x_j - F(x) w_j)    (2.6)

At convergence, the weight vector is the principal eigenvector of C,

C w = λ_max w    (2.7)

normalized so that |w| = 1. The main advantage of Oja's rule is that there is an
associated cost function

H(w) = -(1/2) Σ_jk C_jk w_j w_k = -(1/2) w^T C w    (2.8)

so the learning can be seen as a stochastic gradient descent on H(w).
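For illustration, here is a minimal ANSI C sketch of Oja's rule in action. It is our
own toy example (the function names, dimensions and synthetic input are not taken
from the thesis utilities); the weight vector converges to the unit-length principal
direction of the inputs:

#include <stdio.h>
#include <stdlib.h>

#define N   3       /* input dimension */
#define ETA 0.01    /* learning rate eta */

/* One step of Oja's rule (2.6): w_j += eta * F(x) * (x_j - F(x) * w_j).
   The weight vector converges to the unit-length eigenvector of the
   input correlation matrix C with the largest eigenvalue. */
void oja_step(double w[N], const double x[N])
{
    double F = 0.0;
    int j;
    for (j = 0; j < N; j++)     /* output F(x) = w^T x, equation (2.1) */
        F += w[j] * x[j];
    for (j = 0; j < N; j++)
        w[j] += ETA * F * (x[j] - F * w[j]);
}

int main(void)
{
    double w[N] = { 0.1, 0.1, 0.1 };
    double x[N];
    int t, j;
    for (t = 0; t < 100000; t++) {
        /* toy zero-mean input: one dominant direction (1,2,3) plus noise */
        double s = rand() / (double)RAND_MAX - 0.5;
        for (j = 0; j < N; j++)
            x[j] = s * (j + 1) + 0.05 * (rand() / (double)RAND_MAX - 0.5);
        oja_step(w, x);
    }
    printf("w = (%g, %g, %g)\n", w[0], w[1], w[2]);  /* ~ +-(1,2,3)/sqrt(14) */
    return 0;
}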
Oja's rule can be used to find the first principal component of zero-mean data. To
find M principal components, we can use a one-layer feed-forward network designed
by Sanger in 1989 [Sanger89] or by Oja in 1989 [Oja89]. The output F_j(x) is given by
the formula:

F_j(x) = Σ_{k=1..N} w_jk x_k = w_j^T x    (2.9)

where w_j is the weight vector of the jth output. Both Sanger's and Oja's learning
rule is:

Δw_jk = η F_j(x) ( x_k - Σ_{l=1..L} F_l(x) w_lk )    (2.10)
where the upper limit L of the sum is j for Sanger and M for Oja. The main difference
is that Sanger's rule finds exactly the first M principal components, whereas Oja's
rule finds M vectors that span the same subspace as the first M eigenvectors; the
vectors themselves differ and depend on the initial conditions. These rules are not
local, since updating the weight w_jk requires information from nodes other than
input k and output j. There exists a modification of the Sanger rule which is local:
Δw_jk = η F_j(x) ( x_k^(j) - F_j(x) w_jk ),  x_k^(j) = x_k - Σ_{l<j} F_l(x) w_lk    (2.11)

Each output unit j thus effectively applies plain Oja's rule to a deflated input
x^(j), from which the contributions of the previous outputs have been subtracted, so
the update needs only locally available quantities.
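As an illustration, a minimal ANSI C sketch of one step of Sanger's rule (2.10)
follows; the function name and the fixed dimensions are our own choices, not part of
the thesis utilities, and the step plugs into a training loop like the one shown
above for Oja's rule:

#define N 4   /* number of inputs */
#define M 2   /* number of extracted components */

/* One step of Sanger's rule (2.10):
   dw_jk = eta * F_j(x) * (x_k - sum_{l=1..j} F_l(x) * w_lk).
   With the inner sum running to M instead of j, the same code would
   implement Oja's M-unit rule. */
void sanger_step(double w[M][N], const double x[N], double eta)
{
    double F[M];
    int j, k, l;

    for (j = 0; j < M; j++) {            /* outputs F_j(x) = w_j^T x, (2.9) */
        F[j] = 0.0;
        for (k = 0; k < N; k++)
            F[j] += w[j][k] * x[k];
    }
    for (j = 0; j < M; j++)
        for (k = 0; k < N; k++) {
            double recon = 0.0;          /* sum over l = 1..j (Sanger) */
            for (l = 0; l <= j; l++)
                recon += F[l] * w[l][k];
            w[j][k] += eta * F[j] * (x[k] - recon);
        }
}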
Other network architectures are used for principal component analysis too. A
self-supervised back-propagation network with N inputs, N outputs and one hidden
layer of M < N units, trained so that the outputs are as close as possible to the
inputs in the training set, produces the same result as Oja's M-unit rule.
Another architecture, designed by Rubner and Tavan in 1989, contains a one-layer
network with trainable lateral connections between the M output units (lateral
connections exist only "from the left to the right", or between all units in
Földiák's similar approach). The ordinary weights are trained with the plain Hebbian
rule with renormalization to unit length, and the lateral weights u_jk are trained
with anti-Hebbian learning:

Δu_jk = -η F_j(x) F_k(x)    (2.12)

This architecture extracts M principal components like Sanger's rule, and the lateral
connections converge to zero.
2.5. Self-Organizing Feature Extraction with Hebbian learning
In feature extraction the goal is to have many output units, each one most
sensitive to a particular input, with different output units choosing different input
patterns. The number of output units may be larger than the number of input units.
This can be measured by the selectivity of a particular output O_i (as defined by
Bienenstock in 1982):
Selectivity = 1 - ⟨O_i⟩ / max O_i    (2.13)

where the average ⟨O_i⟩ and the maximum are both taken over all possible inputs.
The selectivity is near 1 if the output unit favors a single input (or a narrow range
of inputs), and it is near zero if it responds equally to all inputs.
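As a small illustration (our own helper, not one of the thesis utilities), the
selectivity can be computed in ANSI C from a table of the unit's responses to all P
test inputs, assuming non-negative responses with a positive maximum:

/* Selectivity (2.13) of one output unit: 1 - <O> / max O, where the
   average and the maximum are taken over a table of P responses. */
double selectivity(const double o[], int P)
{
    double sum = 0.0, max = o[0];
    int i;
    for (i = 0; i < P; i++) {
        sum += o[i];
        if (o[i] > max)
            max = o[i];
    }
    return 1.0 - (sum / P) / max;
}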
The aim is to define an architecture and learning rule in which the outputs
converge to high selectivity and in which different output units become sensitive to
different input patterns, with some output unit matched to every input pattern.
Another goal is that similar input patterns should activate nearby output units
arranged in a geometrical structure.
The area of feature mapping is very challenging also because of the experimental
biological evidence from the animal visual cortex.
An interesting model, consisting of three layers A, B and C where units are
connected by feed-forward connections to the neighboring units in the previous layer
(the receptive field), was designed by Linsker in 1986 [Linsker86]. The output O of a
particular unit receiving inputs V_k from K units is:

O = a + Σ_{k=1..K} w_k V_k    (2.14)

where a is an optional threshold. Since the units are linear, the network could be
replaced by a network with just one layer, but the layers are important for the
learning rule with tunable parameters b, c and d:

Δw_i = η ( O V_i + b V_i + c O + d )    (2.15)

The weights are then clipped to the range w- ≤ w_i ≤ w+. In addition, this rule tries
to keep Σ_j w_j = κ, where κ is a constant combination of the constants a-d.
2.6. Unsupervised Competitive Learning
The main principle of competitive learning is that only one output unit, or only
one unit per group, is active at a time. The output units compete for activation, and
this is where the name of this type of learning comes from: winner-takes-all.
The aim of these architectures is to cluster or categorize the input data;
similar input patterns are classified into the same category. The network has to find
the classes based on the correlations in the input data. The possible uses include
any categorization in AI, data encoding and compression, function approximation,
image processing, statistical analysis and combinatorial optimization. There are
several disadvantages of winner-takes-all architectures:
The output code is not very efficient, since one output cell represents one
category, and N units can represent only N categories instead of a possible 2^N.
These architectures are not robust to degradation or failure: if one output unit
fails, the whole category is lost.
The basic winner-takes-all update is sketched below.
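The following minimal ANSI C sketch is our own illustration of the standard
winner-takes-all update (squared Euclidean distance as the competition criterion; the
names and dimensions are assumptions, not from the thesis utilities):

#define N 2   /* input dimension */
#define M 4   /* number of output units (categories) */

/* One step of winner-takes-all competitive learning: the unit whose
   weight vector is closest to the input wins, and only the winner's
   weight vector is moved toward the input. */
void competitive_step(double w[M][N], const double x[N], double eta)
{
    int i, k, winner = 0;
    double best = -1.0;

    for (i = 0; i < M; i++) {         /* find the winning unit */
        double d = 0.0;
        for (k = 0; k < N; k++)       /* squared Euclidean distance */
            d += (x[k] - w[i][k]) * (x[k] - w[i][k]);
        if (best < 0.0 || d < best) {
            best = d;
            winner = i;
        }
    }
    for (k = 0; k < N; k++)           /* move the winner toward the input */
        w[winner][k] += eta * (x[k] - w[winner][k]);
}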
3. BCM Learning Rule

3.1. Introduction
The original motivation came from real experiments with the visual cortex of
animals. In 1975, Nass and Cooper [Nass75] explored a theory in which the
modification of visual cortical synapses was purely Hebbian.
Then a significant extension of the theory was presented by Cooper, Liberman and
Oja in 1979 [Cooper79]. It was presented as a theoretical solution to the problem of
visual cortex plasticity. The main idea was that the sign of weight modification
should be based on whether the postsynaptic response is above or below a threshold:
responses above the threshold should lead to strengthening of the active synapses,
responses below the threshold to weakening of the active synapses.
3.2. Basic concepts of the BCM Theory
Neurons in the primary visual cortex of a normal adult cat are sharply tuned to
the orientation of an elongated slit of light, and most are activated by stimulation
of either eye, as stated by Hubel and Wiesel in 1959 [Hubel59]. Both of these
properties, orientation selectivity and binocularity, depend on the type of visual
environment experienced during a critical period of early postnatal development
[Intrator92]. Several striking effects appear after abnormal postnatal development.
The theoretical solution for the plasticity of the visual cortex was presented by
Cooper, Liberman and Oja in [Cooper79].
The original theory used a modification threshold θ_m that was static. Bienenstock,
Cooper and Munro suggested [Bienenstock82] that this value varies as a nonlinear
function of the average output of the postsynaptic neuron, which is the main concept
of the present BCM model. This provides stability properties and explains several
important effects. The form of synaptic modification is:

dm_j/dt = φ(c, θ_m(t)) d_j    (3.1)

where m_j is the efficacy of the jth Lateral Geniculate Nucleus (LGN) synapse onto a
cortical neuron (i.e. the input weight), d_j is the level of presynaptic activity of
the jth LGN afferent (i.e. the input), c is the level of activation of the
postsynaptic neuron (i.e. the output), which is given (in the linear region) by
c = m·d, and θ_m is a nonlinear function of the cell activity averaged over some
time, which in the original BCM formulation was proposed as
θ_m = ⟨c⟩²    (3.2)

and in the present formulation as

θ_m(t) = ⟨c²(t)⟩ / c_0    (3.3)

where c_0 is a positive scaling constant (originally, ⟨c(t)⟩² was used instead of
⟨c²(t)⟩). The averaged cell activity over some recent past, ⟨c²(t)⟩, is determined
for example by:

⟨c²(t)⟩ = (1/τ) ∫_{t'=-∞}^{t} c²(t') e^{-(t-t')/τ} dt'    (3.4)
where τ is the averaging period. From these relations it follows that when the
postsynaptic activity c(t) is greater than zero but less than the modification
threshold θ_m, all active synapses (i.e. those with d_i(t) > 0) weaken. On the other
hand, when the postsynaptic activity c(t) is greater than θ_m, all active synapses
potentiate. Since c(t) = Σ_i m_i(t) d_i(t), the correlation of excitatory inputs
plays a crucial role in driving the postsynaptic cell activity above the modification
threshold θ_m. The key property of θ_m is that it is not fixed; its current value is
proportional to the postsynaptic response averaged over some recent past.
The shape of the function φ for different θ_m is drawn in Figure 3.1.

Figure 3.1. The function φ for two different values of the threshold θ_m. Usually φ
takes the form of a parabola, i.e. φ = c·(c - θ_m). (Horizontal axis: postsynaptic
activity c; φ < 0 corresponds to depression, φ > 0 to potentiation.)
The BCM model has great biological relevance; details can be found for example in
[Bear87]. The group of researchers working with the BCM theory is currently making a
lot of effort in this area. The model has also been applied to the rat barrel cortex
by Beňušková et al., and the proposed model has been further extended with inhibitory
cells in [Benuskova97].
The important aspect of this work is that the theory is usually compared with
experiment, as in [Clothiaux91], where such processes in the visual network as
monocular deprivation (MD), normal rearing (NR), reverse suture (RS), strabismus
(ST), binocular deprivation (BD) and recovery from deprivation (RE) are explained
using the BCM model. A neural network model of the visual cortex has also been built,
and the stability and fixed points of its synapses were analyzed in [Cooper88]. The
BCM theory has been mathematically analyzed from various points of view by several
authors.
3.3. Experiments with the BCM Neuron and Time Sequences
3.3.1. Introduction
The BCM learning rule has been applied in several different situations. It was
shown, for example, that a modified version can perform an efficient computation of
projection pursuit, and determine bimodality, statistical skewness and kurtosis of
the distribution of the input data.
The original motivation for our experiments was the internal state of the BCM
neuron, its threshold θ_m. We believed it should have a significant impact on the
ability of the BCM neuron (or of some kind of network based on BCM neurons) to
recognize some properties of symbolic time sequences.
The goal of all experiments which we performed with the BCM learning rule was to
study the behavior of a single neuron exposed to symbolic time sequences built from
two symbols, namely 0 and 1. Symbols from the input sequence were fed into the
neuron's weighted input. The weight w and the threshold θ_m were updated after each
input. Figure 3.2 shows a scheme of this circuit. The development of both the weight
and the threshold was observed.
Figure 3.2. The experimental circuit: a symbolic sequence d(t) feeds a single BCM
neuron with weight w(t) and threshold θ(t), producing the output c(t).
We found that for two sequences with different entropies but the same number of
symbols 1 the resulting weight differs, i.e. it is dependent on the entropy of the
input sequence (or at least has a strong relation to it).
We were also interested in the nature of the dynamics of the threshold θ_m. The
weight becomes stabilized after a certain number of iterations; the role of the
threshold and the details of the model are given by the following equations, where
d(t) is the input symbol at time t. The output of the neuron is

c(t) = w(t)·d(t)    (3.5)

for the linear neuron, and

c(t) = σ(w(t)·d(t))    (3.6)

for the nonlinear neuron, where σ is the sigmoid transition function with derivative σ':

σ(x) = 2 / (1 + e^(-2x)) - 1    (3.7)

σ'(x) = 4 e^(-2x) / (1 + e^(-2x))²    (3.8)

The weight is updated after each input,

w(t+1) = w(t) + Δw    (3.9)

with

Δw = η · φ(c(t), θ_m) · d(t)    (3.10)

for the linear neuron and

Δw = η · φ(c(t), θ_m) · σ'(w(t)·d(t)) · d(t)    (3.11)

for the nonlinear neuron, where η is the learning speed and

φ(c, θ_m) = c·(c - θ_m)    (3.12)

The formulas regarding BCM are from [Intrator92], and the information theory
equations are provided in [Tino96].
In the experiments with additive noise, we used d(t) + noise(t) instead of d(t),
where noise(t) is a Bernoulli source, uniformly distributed in the interval [-a, a]
(a is the noise amplitude parameter).
The threshold is defined by (3.4), but the continuous integral was approximated
by a discrete sum, accumulating the weighted average activity over the recent τ
iterations (see the source code). In addition, the scaling parameter c_0 allows
scaling of this average:
θ_m(t) = ⟨c²(t)⟩ / c_0    (3.13)
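To make the update concrete, here is a minimal ANSI C sketch of one simulation step
of the nonlinear BCM neuron, following (3.6)-(3.13). The struct layout and the
discrete leaky-integrator approximation of ⟨c²⟩ are our own assumptions; the exact
windowing used by BCMSNLIN is in the source code in Appendix A.

#include <math.h>

/* State of a single BCM neuron with one input, following (3.6)-(3.13). */
typedef struct {
    double w;        /* synaptic weight w(t) */
    double theta;    /* modification threshold theta_m(t) */
    double avg_c2;   /* running average of c^2 over the recent past */
} bcm_neuron;

static double sigma(double x)    /* (3.7): sigma(x) = 2/(1+e^(-2x)) - 1 */
{
    return 2.0 / (1.0 + exp(-2.0 * x)) - 1.0;
}

static double dsigma(double x)   /* (3.8): sigma'(x) = 4e^(-2x)/(1+e^(-2x))^2 */
{
    double e = exp(-2.0 * x);
    return 4.0 * e / ((1.0 + e) * (1.0 + e));
}

/* One iteration for input symbol d (0 or 1); for the linear neuron,
   use c = w*d and drop the dsigma() factor, cf. (3.5) and (3.10). */
void bcm_step(bcm_neuron *n, double d, double eta, double tau, double c0)
{
    double x   = n->w * d;
    double c   = sigma(x);                /* (3.6): output c(t) */
    double phi = c * (c - n->theta);      /* (3.12): phi(c, theta_m) */

    n->w += eta * phi * dsigma(x) * d;    /* (3.11): weight update */

    /* (3.4)/(3.13): average of c^2 over roughly the last tau steps,
       approximated here by a discrete leaky integrator */
    n->avg_c2 += (c * c - n->avg_c2) / tau;
    n->theta   = n->avg_c2 / c0;
}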
For determining the entropy of symbolic sequences, we used the calculation of
entropies for increasing window lengths. If the probability of some event is P_i, its
information content is given by the formula:

I_i = -log2 P_i    (3.14)

Based on this, Shannon introduced the entropy of the data as the sum of the products
of the probabilities and the information contents of all possible events:

H = -Σ_i P_i log2 P_i    (3.15)

For a window of length n there exist 2^n different words (having only the two symbols
0 and 1). Let the probability of a word w of length n be P_n(w). The entropy of the
sequence for a window of length n is thus:

H_n = -Σ_{w: |w|=n} P_n(w) log2 P_n(w)    (3.16)
The entropy normalized per symbol for window length n is

h_n = H_n / n    (3.17)

and the entropy of the source is the limit

h = lim_{n→∞} h_n    (3.18)
To emphasize different parts of the word histogram, the probabilities can be
transformed with a parameter β (related to a temperature T by β = 1/T):

P̃_n(w) = P_n(w)^β / Σ_{v: |v|=n} P_n(v)^β    (3.19)

This transformation lets the more likely words take over the less frequent words
when β is larger than 1 (i.e. β > 1). The opposite effect happens for negative β
(β < 0), when the least frequent words become dominant. Therefore, entropies computed
for different temperatures uncover more information about the histogram of the
sequences. (A special case, when T → ∞, i.e. β → 0, is called the topological case.
In this case, all words which are present in the input sequence are represented in
the histogram by 1, and all words which are not present in the input sequence by 0.)
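A compact ANSI C sketch of this computation follows. It is our own illustration, not
the listing of the EN/ENTR_SPE utilities; it assumes base-2 logarithms and a binary
sequence stored as a string of '0'/'1' characters. With beta = 1 it returns the plain
block entropy (3.16); with beta = 0 it returns the topological entropy of the words
present in the sequence.

#include <stdlib.h>
#include <math.h>

/* Entropy of a binary sequence for window length n and parameter beta:
   word probabilities are first transformed by (3.19),
   P~(w) = P(w)^beta / sum_v P(v)^beta, and then
   H = - sum_w P~(w) * log2 P~(w), following (3.16). */
double block_entropy(const char *seq, long len, int n, double beta)
{
    long words = 1L << n;                 /* 2^n possible words */
    long *count = calloc(words, sizeof(long));
    long i, total = len - n + 1;          /* number of windows */
    double z = 0.0, h = 0.0;

    for (i = 0; i + n <= len; i++) {      /* histogram of all windows */
        long code = 0;
        int j;
        for (j = 0; j < n; j++)
            code = (code << 1) | (seq[i + j] == '1');
        count[code]++;
    }
    for (i = 0; i < words; i++)           /* normalization sum_v P(v)^beta */
        if (count[i] > 0)
            z += pow((double)count[i] / total, beta);
    for (i = 0; i < words; i++)
        if (count[i] > 0) {
            double p = pow((double)count[i] / total, beta) / z;
            h -= p * log(p) / log(2.0);   /* entropy in bits */
        }
    free(count);
    return h;
}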
Sequences generated by the deterministic state automaton
Each automaton used in our experiments is given by an output (emission)
probability function Prob and a transition function Trans; for a deterministic
automaton we write Out for the output function and Next for the transition function.
The probabilities are normalized so that for every state a:

Σ_{b∈Symbols} Prob(a, b) = 1    (3.20)

Σ_{b∈States} Trans(a, b) = 1    (3.21)

Figure 3.3.a: An example of a state automaton (state diagram).
Figure 3.3.b: An example of the HMM automaton (automaton definition), with parameters:
Prob(state 1, 1) = 0.9
Prob(state 2, 0) = 0.9
Prob(state 2, 1) = 0.1
Prob(state 1, 0) = 0.3
Prob(state 1, 1) = 0.7
3.3.3. Implementation

The following automata were used to generate the experimental sequences.

A1 deterministic automaton:
Out_A1(s1) = 1,  Next_A1(s1) = s2
Out_A1(s2) = 0,  Next_A1(s2) = s3
Out_A1(s3) = 0,  Next_A1(s3) = s1

A2 HMM:
A2 = ({0,1}, {s1, s2}, s1, Prob_A2, Trans_A2)
Prob_A2(s1, 0) = 0.433    Trans_A2(s1, s1) = 0.75
Prob_A2(s1, 1) = 0.567    Trans_A2(s1, s2) = 0.25
Prob_A2(s2, 0) = 0.9      Trans_A2(s2, s1) = 0.25
Prob_A2(s2, 1) = 0.1      Trans_A2(s2, s2) = 0.75
B1 deterministic automaton:
Out_B1(s1) = Out_B1(s2) = Out_B1(s3) = Out_B1(s4) = Out_B1(s5) = 1
Out_B1(s6) = Out_B1(s7) = Out_B1(s8) = Out_B1(s9) = Out_B1(s10) = 0
Next_B1(si) = s((i mod 10) + 1)
B2 HMM:
B2 = ({0,1}, {s1, ..., s10}, s1, Prob_B2, Trans_B2)
Prob_B2(s1, 0) = 0.05,  Prob_B2(s1, 1) = 0.95
Prob_B2(s2, 0) = 0.15,  Prob_B2(s2, 1) = 0.85
Prob_B2(s3, 0) = 0.25,  Prob_B2(s3, 1) = 0.75
Prob_B2(s4, 0) = 0.35,  Prob_B2(s4, 1) = 0.65
Prob_B2(s5, 0) = 0.45,  Prob_B2(s5, 1) = 0.55
Prob_B2(s6, 0) = 0.55,  Prob_B2(s6, 1) = 0.45
Prob_B2(s7, 0) = 0.65,  Prob_B2(s7, 1) = 0.35
Prob_B2(s8, 0) = 0.75,  Prob_B2(s8, 1) = 0.25
Prob_B2(s9, 0) = 0.85,  Prob_B2(s9, 1) = 0.15
Prob_B2(s10, 0) = 0.95, Prob_B2(s10, 1) = 0.05
Trans_B2(si, sj) = 0.25 for j = i,
Trans_B2(si, sj) = 0.75 for j = (i mod 10) + 1,
Trans_B2(si, sj) = 0.0 otherwise.
C1 deterministic automaton:
Out_C1(s1) = 0, Out_C1(s2) = 1, Out_C1(s3) = 1, Out_C1(s4) = 0
Next_C1(si) = s((i mod 4) + 1)

C2 HMM:
C2 = ({0,1}, {s1, s2, s3, s4}, s1, Prob_C2, Trans_C2)
Prob_C2(s1, 0) = 0.6502,  Prob_C2(s1, 1) = 0.3498
Prob_C2(s2, 0) = 0.0,     Prob_C2(s2, 1) = 1.0
Prob_C2(s3, 0) = 0.0,     Prob_C2(s3, 1) = 1.0
Prob_C2(s4, 0) = 1.0,     Prob_C2(s4, 1) = 0.0
Trans_C2(s1, s1) = 0.7, Trans_C2(s1, s2) = 0.3, Trans_C2(s1, s3) = Trans_C2(s1, s4) = 0.0
Trans_C2(si, sj) = 1.0 for i > 1 and j = (i mod 4) + 1,
Trans_C2(si, sj) = 0.0 otherwise.
D1 deterministic automaton:
Out_D1(s1) = 0, Out_D1(s2) = 1, Out_D1(s3) = 1,
Out_D1(s4) = 0, Out_D1(s5) = 0, Out_D1(s6) = 0
Next_D1(si) = s((i mod 6) + 1)

D2 HMM:
D2 = ({0,1}, {s1, ..., s6}, s1, Prob_D2, Trans_D2)
Prob_D2(s1, 0) = 0.9,  Prob_D2(s1, 1) = 0.1
Prob_D2(s2, 0) = 0.1,  Prob_D2(s2, 1) = 0.9
Prob_D2(s3, 0) = 0.1,  Prob_D2(s3, 1) = 0.9
Prob_D2(s4, 0) = 0.9,  Prob_D2(s4, 1) = 0.1
Prob_D2(s5, 0) = 0.1,  Prob_D2(s5, 1) = 0.9
Prob_D2(s6, 0) = 0.9,  Prob_D2(s6, 1) = 0.1
Several supporting utilities were built for our experiments. The diagram in
Figure 3.4 shows the flow of data between the individual pieces. Listings of the
programs and sample data files can be found in Appendix A. A detailed account of each
utility follows.
GENS_HMM name_of_task
is a general HMM automaton. It first reads the automaton parameters from the text
file name_of_task.hmm. It starts in the initial state and produces an HMM sequence of
the expected length. The sequence contains one symbol per line and is saved into the
file name_of_task.trs. This automaton was used also for generating the deterministic
sequences (by setting the HMM transition function Trans(si, sj) = 1.0 if
Next(si) = sj, and Trans(si, sj) = 0.0 otherwise).
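For illustration, the sampling loop at the core of such a generator can be sketched
in ANSI C as follows; this is our own minimal version with the two-state automaton A2
hard-coded instead of read from a .hmm file:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Emission and transition tables of the two-state HMM A2 from
   section 3.3.3; row = current state, column = symbol / next state. */
static const double emit[2][2]  = { {0.433, 0.567},   /* s1: P(0), P(1) */
                                    {0.9,   0.1  } }; /* s2: P(0), P(1) */
static const double trans[2][2] = { {0.75, 0.25},     /* s1 -> s1, s2 */
                                    {0.25, 0.75} };   /* s2 -> s1, s2 */

static int pick(const double p[2])   /* sample index 0 or 1 from p[] */
{
    return (rand() / (RAND_MAX + 1.0)) < p[0] ? 0 : 1;
}

int main(void)
{
    int state = 0, t;                /* start in the initial state s1 */
    srand((unsigned)time(NULL));
    for (t = 0; t < 1000; t++) {     /* one symbol per line, as GENS_HMM does */
        printf("%d\n", pick(emit[state]));
        state = pick(trans[state]);
    }
    return 0;
}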
ONES name_of_file
counts the number of symbols 1 in the input sequence. After reaching the end of the
file, the overall probability of 1s in the sequence is printed.
BCMSNLIN project_name
is the main BCM neuron simulation utility. It reads the parameters of the model from
the project_name.bcm file, initializes all neurons and feeds the neurons' inputs with
symbols from project_name.trs. It periodically stores the actual weights and
thresholds into the output file. Before running the simulation, several parameters of
the BCM neuron have to be specified. Our model (always with only one input) had the
following parameters:
the speed of learning (η)
the time window constant (τ) and the scaling constant (c_0) of the threshold
the bounds of the sigmoid (cell_sig_min, cell_sig_max)
the output logging: how often the actual values of the weight and θ are stored
into the output file (output_step), and whether the average or the actual values
should be used
Figure 3.4. The flow of data between the individual utilities. An HMM automaton
description (.hmm file) is read by gens_hmm, which generates a symbolic sequence
(.trs file); gens_rnd generates random sequences. The utility ones determines the
probability of symbols 1 in a sequence, en computes the entropies for different
window lengths, and entr_spe computes the entropic spectrum, i.e. the entropies for a
list of temperatures (.tem file). The BCM neuron simulation bcmsnlin reads the
simulation parameters (.bcm file) and the sequence (.trs file) and records a log
(.out file) of the development of θ and the weight; plot displays this development
(with settings for painting the graph in a .plt file), compare compares the actual
values in the log with the average, and average computes the averages of all columns
(.av file), which were further processed in MS Excel. The legend of the diagram
distinguishes program names, file extensions and their purposes.
3.3.4. Results
First, we discuss the results acquired from the plot program. We ran the
experiments for all input sequences A1, A2, A3, ..., D1, D2, D3. The entropies of the
sequences were different (see Figure 3.5); we hoped that the weight of the neuron
would follow these differences.
Figure 3.5.a. Entropies of sequences A1 (DET), A2 (HMM) and A3 (RND) computed by the
EN utility for window lengths N = 1, ..., 12. H(N) is the entropy for window length N
and h(N) is the same value normalized per symbol. The deterministic sequence has low
entropy (h(N) close to 0), whereas the random sequence has high entropy (h(N) close
to 1). The HMM automaton A2 is similar to a random sequence, and therefore the
difference in entropy between A2 and A3 is low.
Figure 3.5.b. Entropies of sequences B1 (DET), B2 (HMM) and B3 (RND) computed by the
EN utility. In this case, the difference between the entropies of B2 and B3 is small,
although larger than for the A set.
Figure 3.5.c. Entropies of sequences C1 (DET), C2 (HMM) and C3 (RND) computed by the
EN utility. The difference between the entropies of C2 and C3 is larger than for the
B set.
Figure 3.5.d. Entropies of sequences D1 (DET), D2 (HMM) and D3 (RND) computed by the
EN utility. The difference between the entropies of D2 and D3 is large, because the
HMM automaton for D2 contains deterministic parts.
However, the experiments with the linear neuron (i.e. without the sigmoid
function, using the output (3.5)) showed that the neuron is not very sensitive to
differences in the entropy of the input sequences (see Figures 3.6 and 3.7), and
tuning the parameters did not make a difference. Therefore, we switched to the
nonlinear neuron with the sigmoid transition function (3.6), which improved the
situation: in most cases the neuron did recognize the differences in the input
sequences, and the resulting weight (after the convergence process) separated the
sequences with different entropies (see Figures 3.8 and 3.9).
Figure 3.6. Development of the weight (y-axis) with time constant τ = 10 for the
linear neuron, for symbolic time sequences with 200,000 symbols (x-axis). We can see
that the weight does not clearly separate the sequences. We used the following
parameters: learning speed η = 0.001, c_0 = 0.85, cell_sig_min = -2,
cell_sig_max = 2, length of sequence: 200,000 symbols. The chart was obtained by a
shareware screen-grabbing utility, and a dilate filter was applied to the resulting
image in order to improve the visibility of the curves.
Figure 3.7. Development of the weight with time constant τ = 20 for the linear
neuron. Here the weights for different sequences are even less distinguishable.
Figure 3.8. Development of the weight with time constant τ = 10 for the nonlinear
neuron. All other parameters remain the same as in the linear case. The weight of the
input separates the sequences.
Figure 3.9. Development of the weight with time constant τ = 10 for the nonlinear
neuron.
An interesting effect appears for τ = 2, when the weight difference is the
largest (see Figure 3.10). This should be compared with the fact that for τ = 2 the
entropy spectrum of the threshold time sequence has the highest correlation with the
entropy spectrum of the original sequence, see below.
To test the robustness of the model, we introduced a noise parameter. Randomly
distributed additive noise (not Gaussian) on the input had only a quantitative, not a
qualitative impact on the model. The weight separation was not as clear as in the
case without noise, which can be explained by the changed properties of the input
sequence: an HMM-generated sequence is already quite random, and adding noise results
in a sequence which cannot be represented by the original HMM automaton and therefore
has different properties. Because the noise did not influence the quality of the
model, we removed it from the subsequent experiments.
Changing the meaning of the symbols 0 and 1 had the following effect: if we
replaced 0 with 0.25 and 1 with 0.75, the adaptation process slowed down, but the
final result of the weight separation did not change. This is an important property
of our model, because the weight change is exactly 0 for a pure 0 in the input (see
(3.10) and (3.11)).
Figure 3.10. Development of the weight with time constant τ = 2 for the nonlinear
neuron (a similar effect appears also for the linear neuron). All other parameters
remain the same as in the former case. The weight of the input separates the
sequences.
Dynamic changes in the input are compensated by the dynamic threshold (see
Figures 3.11 and 3.12). Increased potentiation of the weight results in an increase
of the average activity and thus of θ. A low probability of symbols 1 in the input
results in a decrease of the average neuron activity and also of the threshold. In
this way, the dynamic threshold adapts to the input. We should also note that
although the probability of 1 is an important factor for the resulting converged
weight, it is not the only one: all sequences in one set contained the same number of
symbols 1, but the resulting weights are different for the nonlinear neuron. To find
other properties of the input time sequences which might be significant, we employed
entropy spectra.
Figure 3.12. If the onset of θ is too rapid, it might begin to oscillate. This is
usual for the linear neuron and a very large time window constant τ (here τ = 1000;
the other parameters are the same as in the former case).
The entropies of the input sequences in Figure 3.5 demonstrate that sequences
with a more complicated internal structure have higher entropy. The transformed
entropies for high temperatures show the structure with respect to the very likely
words; entropies for low temperatures show the structure with respect to the very
unlikely words. The special case for T → ∞, called the topological entropy,
determines how many different words are present in the sequence.
Deterministic sequences have constant and low entropies for all temperatures,
because they consist of a periodic subsequence repeated many times. On the other
hand, random sequences contain all possible words for a given window length,
distributed not periodically but randomly; therefore their entropy is high. We
computed the entropy spectra up to a window length of 12, due to the computational
complexity and the huge amount of data. It can also be observed for completely random
sequences (a Bernoulli source) that the entropy decreases for high temperatures; this
is caused only by the bounded length of the sequence (the number of words for a large
window is high), as we worked with sequences of 200,000 symbols.
The second part of the results discusses the entropy spectra of sequences
created by comparing the actual value of the threshold with its long-term average.
They are included in Appendix B.
The first type of figures consists of entropy spectra computed for a fixed
window length. Different figures show these entropy spectra for different values of
the τ parameter of the BCM model. Our goal was to find some kind of correlation
between the structure of the sequence, the length of the window used for computing
the entropy, and the τ parameter.
The second type of figures combines all figures of the first type for a
particular sequence into one chart, where the sum of the differences between the
entropies of the individual sequences of changes (summed over all measured
temperatures, i.e. with the values around 0 sampled more densely) and the entropy of
the original sequence (also summed over all measured temperatures) is plotted against
the window length used for computing the entropy. Each chart contains data for
several different τ parameters, which allows, for example, comparing which sequence
is the best model of the original sequence.
All figures reveal that the sequence of changes is most similar to the original
sequence when the parameter τ equals 2 (except the case τ = 1, in which the sequence
of changes is almost identical to the original sequence). This supports the previous
finding that the weight of the input connection separates the sequences best in the
case τ = 2.
When exploring the figures, the original sequence should be examined first. If
the curve of the entropy spectrum turns down quickly, the sequence contains a few
words which are much more probable than the others. If it declines slowly (as in the
case of C3), most words present in the sequence are equally frequent. The negative
temperatures exhibit the situation for the less frequent words: again, if the decline
is rapid, the sequence contains a few words with very rare occurrence, while
sequences with a moderate slope in the negative part of the spectrum curve contain
many words of low frequency.
We have included all the entropy spectra for the sequences C2 and C3 because of
the special nature of the C2 HMM automaton: once state s2 is reached, the automaton
passes deterministically through s3 and s4 and returns to s1, emitting the fixed
subsequence 110.
Resume
The first part of the thesis contains an overview of unsupervised artificial
neural network architectures; their principles, advantages and disadvantages are
described. The second part focuses on a specific example of Hebbian learning, the BCM
learning rule. In addition to a summary of the BCM bibliography, we used the present
variant of this rule for experiments with time sequences, for both linear and
nonlinear transition functions. We have shown that the cell activity after some
initial period depends on the internal structure of the input sequence. All sequences
in one set, i.e. deterministic, HMM and random, contained the same number of symbols
1, but they differed in the complexity of their internal structure. Our measure of
this complexity was the value of the entropy per symbol. We found that for two
sequences with different entropies and the same number of symbols 1, the resulting
weight of the BCM neuron differs, i.e. it depends on the internal structure of the
input sequence. We supported this statement by exploring the properties of both the
input symbolic sequences and the symbolic sequences produced by comparing the dynamic
threshold to its long-term average: we computed the entropies of the input sequences
and the entropy spectra of the sequences. We found that both the nonlinear and the
linear BCM neuron with the shortest internal memory, i.e. τ = 2, model the input
sequence most closely. We discussed the role of the parameters of the model and
demonstrated how the nonlinearity can improve the characteristics of the development
of the converged input weight. Thanks to the entropy spectra, a better view of the
internal structure of sequences can be obtained. We have demonstrated an example of
the application of this technique to the study of symbolic time sequences.
References

[Bachman94] Bachmann, C. M., Musman, S. A., Luong, D., and Shultz, A., Unsupervised
BCM projection pursuit algorithms for classification of simulated radar
presentations, Neural Networks, Vol. 7, 1994.

[Bear87] Bear, M. F., Cooper, L. N., and Ebner, F. F., A physiological basis for a
theory of synapse modification, Science, Vol. 237, p. 42-48, 1987.

[Benuskova97] Beňušková, Ľ., et al., ... induced by one spared whisker, in:
Artificial Neural Networks: 7th International Conference, Proceedings (Lecture Notes
in Computer Science, Vol. 1327), 1997.

[Bienenstock82] Bienenstock, E. L., Cooper, L. N., and Munro, P. W., Theory for the
development of neuron selectivity: orientation specificity and binocular interaction
in visual cortex, Journal of Neuroscience, Vol. 2, p. 32-48, 1982.

[Clothiaux91] Clothiaux, E. E., Bear, M. F., and Cooper, L. N., Synaptic plasticity
in visual cortex: comparison of theory with experiment, Journal of Neurophysiology,
Vol. 66, 1991.

[Cooper79] Cooper, L. N., Liberman, F., and Oja, E., A theory for the acquisition and
loss of neuron specificity in visual cortex, Biol. Cybern., Vol. 33, p. 9-28, 1979.

[Cooper88] Cooper, L. N., and Scofield, C. L., Mean-field theory of a neural network,
Proc. Natl. Acad. Sci. USA, Vol. 85, p. 1973-1977, 1988.

[Cooper97] Shouval, H., Intrator, N., and Cooper, L. N., BCM network develops
exploratory projection pursuit, Neural Networks, Vol. 10, No. 2, p. 255-262, 1997.

[Hebb49] Hebb, D. O., The Organization of Behavior, Wiley, New York, 1949.

[Hertz91] Hertz, J., Krogh, A., and Palmer, R., Introduction to the Theory of Neural
Computation, Addison-Wesley, 1991.

[Hubel59] Hubel, D. H., and Wiesel, T. N., Integrative action in the cat's ..., 1959.

[Intrator92] Intrator, N., and Cooper, L. N., Objective function formulation of the
BCM theory of visual cortical plasticity: statistical connections, stability
conditions, Neural Networks, Vol. 5, p. 3-17, 1992.

[Linsker86] Linsker, R., From basic network principles to neural architecture,
Proceedings of the National Academy of Sciences of the USA, Vol. 83, p. 7508-7512,
1986.

[Nass75] Nass, M. N., and Cooper, L. N., A theory for the development of feature
detecting cells in visual cortex, Biol. Cybern., Vol. 19, p. 1-18, 1975.

[Oja89] Oja, E., Neural networks, principal components, and subspaces, International
Journal of Neural Systems, Vol. 1, p. 61-68, 1989.

[Sanger89] Sanger, T. D., Optimal unsupervised learning in a single-layer linear
feedforward neural network, Neural Networks, Vol. 2, p. 459-473, 1989.

[Selman96] Selman, B., Brooks, R. A., Dean, T., Horvitz, E., Mitchell, T. M., and
Nilsson, N. J., Challenges for artificial intelligence, Proceedings of AAAI-96, 1996.

[Tino96] Tiňo, P., et al., ... with neural and hybrid neural based systems, submitted
to IEEE Transactions on Neural Networks, 1996.
Appendix A, Source Code and Examples of Data Files

All programs were written in ANSI C. We used a DOS port of the free GNU C compiler
(GCC) by DJ Delorie and a makefile to build all the programs. The first part of this
appendix contains the source code of all the utilities we built for the experiments;
the second part contains examples of data files.