Department of Informatics

Diploma Thesis

Author: Pavel Petrovič
Advisor: RNDr. Ľubica Beňušková, CSc.

signature
Acknowledgements
I would like to give my appreciation to my advisor RNDr. Ľubica Beňušková,
CSc. She provided me with all necessary information and references on BCM and a
lot of help. She created a very friendly and cooperative atmosphere, and invited me to
take part in scientific seminars held at the Department of Computer Science and
Engineering of the Slovak University of Technology.
Many thanks belong also to her colleague Ing. Peter Tiňo, CSc., who was a
generous source of many ideas, had a strong influence on the direction of our work at
the most crucial moments, and who provided me with Information Theory references
and software utilities.
I am also grateful to my tutor PhDr. Ján Šefránek, DrSc., who organized the
diploma thesis seminar, where we regularly discussed the progress of our work on the
thesis.
Table of Contents

Acknowledgements
Table of Contents
1. Introduction
2. Unsupervised Learning Rules
  2.1. Unsupervised Hebbian Learning
  2.2. Oja's Rule
  2.3. Principal component analysis (PCA)
  2.4. One-Layer Feed-Forward Networks
  2.5. Self-Organizing Feature Extraction with Hebbian learning
  2.6. Unsupervised Competitive Learning
3. BCM Learning Rule
  3.1. Introduction
  3.2. Basic concepts of the BCM Theory
  3.3. Experiments with the BCM Neuron and Time Sequences
    3.3.1. Introduction
    3.3.2. Theoretical background
    3.3.3. Implementation
    3.3.4. Results
Resume
References
Appendix A, Source Code and Examples of Data Files
  Program source code
Appendix B, Entropic Spectra
1. Introduction
One of the first (and often cited) works regarding neural networks, "The
Organization of Behavior" by Donald Hebb [Hebb49], introduces a basic principle for
updating synaptic strengths: "When an axon of cell A ... excite(s) cell B and
repeatedly or persistently takes part in firing it, some growth process or metabolic
change takes place in one or both cells such that A's efficiency, as one of the cells firing
B, is increased." It has not yet been completely revealed which underlying processes are
responsible for these changes in real biological neural networks.
Although detailed mathematical models of neurons based on differential
equations have been created, they were not acceptable in terms of computational
complexity. In addition, they contained too many modifiable parameters. One of the
primary challenges of artificial intelligence today is to develop an efficient
model of the neuron, as clearly stated by Rodney A. Brooks in [Selman96].
Many simplifying models have been developed, and they provided a large area for
formal research. This led to the establishment of a new independent area, artificial
neural networks, which, although originally biologically motivated, has moved far away
from real biological neural networks. Its classification and pattern-recognition methods
approach those of statistics, learning theory and information theory.
The central topic in artificial neural networks is the learning process, which is
driven primarily by the learning rule. The basic learning rules that have been suggested
have been exhaustively analyzed. Most of their limitations have been
uncovered, and it seems that the simplifications inherently contained in these models
imply, in most cases, too strong constraints. That is why it is reasonable to explore
learning rules based on different principles, and this is the motivation of the second
part of this thesis. The hope is that this laborious and iterative process will converge
to a more biologically plausible and efficient rule with either a general or a highly
specific purpose.
Two main approaches to learning rules and network architectures can be
identified: unsupervised and supervised learning. Supervised learning requires a large
set of training data, and it is usually not very adaptable to changing circumstances.
Once the training stage is finished, the network learning ceases. On the
other hand, the outputs of networks with unsupervised learning are of a different kind:
they form a representation F(x_i) of each input vector x_i. The network is dynamic, and
the weights are updated without external intervention; changes are implied only by the
inputs. The network must find the regularities, categories, correlations or patterns in
the input and organize itself to provide useful codes on the output side. Unsupervised
learning is possible only thanks to the redundancy in the input data; otherwise, it
would not be possible to differentiate between input carrying information and random
noise. According to [Hertz91], the output of an unsupervised neural network may
represent different features of the input:
Familiarity - a single continuous output expresses how similar a new input is to
the typical or average pattern seen in the past.
Clustering - the output of the network is a binary vector with only one bit identifying
the category. Each cluster of similar or nearby patterns is classified as a single
output class.
Prototyping - the input data are classified into categories, but for each input the
network outputs the prototype, i.e. a typical representative of the corresponding
category.
Encoding - the output is an encoded version of the input, in fewer bits, keeping as
much relevant information as possible.
Feature Mapping - similar input patterns are projected to nearby output units. A global
organization of the output units evolves.
These cases are not distinct and may be combined. Encoding can be performed
using principal component analysis or by clustering, also called vector quantization.
Principal component analysis can also be used for dimensionality reduction before
performing clustering or feature mapping, avoiding the "curse of dimensionality". Another
use of unsupervised learning is to replace supervised learning when possible, either
because of the computational complexity of supervised networks or to provide a
mechanism for the network to adapt to changes.
Unsupervised neural networks tend:
to be of a simple architecture (complexities lie in the learning rules)
not to have many layers (usually only one)
to contain many fewer output units than inputs (with the exception of Feature
Mapping)
to be more biologically plausible than other kinds of networks.
We can separate unsupervised networks into two main categories: those based on the
modified Hebb rule and those based on competitive learning. We describe them in two
separate subsections.
2.1. Unsupervised Hebbian Learning
Assume that the input vectors x are drawn from a probability distribution P(x). The
single linear output unit (see Figure 2.1) is the simplest example, and the formula
giving the output is:

F(x) = Σ_{j=1..N} w_j x_j = w^T x = x·w    (2.1)

Figure 2.1. A single linear output unit computing F(x) from inputs x_1, ..., x_N
through weights w_1, ..., w_N.

The plain Hebbian rule strengthens each weight in proportion to the product of the
input and the output:

Δw_j = η F(x) x_j    (2.2)
where η is the learning rate. Frequent input patterns strengthen the weights most,
and thus they produce the largest output. The problem with this rule is that the
weights keep growing forever. This can be fixed. Suppose that there exists an
equilibrium point for w. At equilibrium, we want the changes Δw to be 0 on average,
and thus:

0 = ⟨Δw_j⟩ = η ⟨F(x) x_j⟩ = η Σ_k ⟨x_k x_j⟩ w_k = η Σ_k C_jk w_k    (2.3)

or, in matrix form,

C w = 0    (2.4)

where C is the correlation matrix of the inputs:

C_jk = ⟨x_j x_k⟩    (2.5)
If there were such an equilibrium w, it would be an eigenvector of C with eigenvalue
0. But C is a positive semi-definite matrix, and any component of w along an
eigenvector with a positive eigenvalue grows exponentially: the direction of the
eigenvector with the largest eigenvalue λ_max of C becomes dominant, and the weights
diverge.

2.2. Oja's Rule

Oja's rule modifies the plain Hebbian rule with a decay term that keeps the weight
vector bounded:

Δw_j = η F(x) (x_j - F(x) w_j)    (2.6)

At convergence, the weight vector is the principal eigenvector of C,

C w = λ_max w    (2.7)

normalized so that |w| = 1. The main advantage of Oja's rule is that there is an
associated cost function

H(w) = -(1/2) Σ_jk C_jk w_j w_k = -(1/2) w^T C w    (2.8)

so the learning can be seen as a stochastic gradient descent on H(w).
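For illustration, here is a minimal ANSI C sketch of Oja's rule in action. It is our
own toy example (the function names, dimensions and synthetic input are not taken
from the thesis utilities); the weight vector converges to the unit-length principal
direction of the inputs:

#include <stdio.h>
#include <stdlib.h>

#define N   3       /* input dimension */
#define ETA 0.01    /* learning rate eta */

/* One step of Oja's rule (2.6): w_j += eta * F(x) * (x_j - F(x) * w_j).
   The weight vector converges to the unit-length eigenvector of the
   input correlation matrix C with the largest eigenvalue. */
void oja_step(double w[N], const double x[N])
{
    double F = 0.0;
    int j;
    for (j = 0; j < N; j++)     /* output F(x) = w^T x, equation (2.1) */
        F += w[j] * x[j];
    for (j = 0; j < N; j++)
        w[j] += ETA * F * (x[j] - F * w[j]);
}

int main(void)
{
    double w[N] = { 0.1, 0.1, 0.1 };
    double x[N];
    int t, j;
    for (t = 0; t < 100000; t++) {
        /* toy zero-mean input: one dominant direction (1,2,3) plus noise */
        double s = rand() / (double)RAND_MAX - 0.5;
        for (j = 0; j < N; j++)
            x[j] = s * (j + 1) + 0.05 * (rand() / (double)RAND_MAX - 0.5);
        oja_step(w, x);
    }
    printf("w = (%g, %g, %g)\n", w[0], w[1], w[2]);  /* ~ +-(1,2,3)/sqrt(14) */
    return 0;
}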
Oja's rule can be used to find the first principal component of zero-mean data. To
find M principal components, we can use a one-layer feed-forward network designed
by Sanger in 1989 [Sanger89] or by Oja in 1989 [Oja89]. The output F_j(x) is given by
the formula:

F_j(x) = Σ_{k=1..N} w_jk x_k = w_j^T x    (2.9)

where w_j is the weight vector of the jth output. Both Sanger's and Oja's learning
rule is:

Δw_jk = η F_j(x) ( x_k - Σ_{l=1..L} F_l(x) w_lk )    (2.10)
where the upper limit L of the sum is j for Sanger and M for Oja. The main difference
is that Sanger's rule finds exactly the first M principal components, whereas Oja's
rule finds M vectors that span the same subspace as the first M eigenvectors; the
vectors themselves differ and depend on the initial conditions. These rules are not
local, since updating the weight w_jk requires information from nodes other than
input k and output j. There exists a modification of the Sanger rule which is local:
Δw_jk = η F_j(x) ( x_k^(j) - F_j(x) w_jk ),  x_k^(j) = x_k - Σ_{l<j} F_l(x) w_lk    (2.11)

Each output unit j thus effectively applies plain Oja's rule to a deflated input
x^(j), from which the contributions of the previous outputs have been subtracted, so
the update needs only locally available quantities.
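As an illustration, a minimal ANSI C sketch of one step of Sanger's rule (2.10)
follows; the function name and the fixed dimensions are our own choices, not part of
the thesis utilities, and the step plugs into a training loop like the one shown
above for Oja's rule:

#define N 4   /* number of inputs */
#define M 2   /* number of extracted components */

/* One step of Sanger's rule (2.10):
   dw_jk = eta * F_j(x) * (x_k - sum_{l=1..j} F_l(x) * w_lk).
   With the inner sum running to M instead of j, the same code would
   implement Oja's M-unit rule. */
void sanger_step(double w[M][N], const double x[N], double eta)
{
    double F[M];
    int j, k, l;

    for (j = 0; j < M; j++) {            /* outputs F_j(x) = w_j^T x, (2.9) */
        F[j] = 0.0;
        for (k = 0; k < N; k++)
            F[j] += w[j][k] * x[k];
    }
    for (j = 0; j < M; j++)
        for (k = 0; k < N; k++) {
            double recon = 0.0;          /* sum over l = 1..j (Sanger) */
            for (l = 0; l <= j; l++)
                recon += F[l] * w[l][k];
            w[j][k] += eta * F[j] * (x[k] - recon);
        }
}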
Other network architectures are used for principal component analysis too. A
self-supervised back-propagation network with N inputs, N outputs and one hidden
layer of M < N units, trained so that the outputs are as close as possible to the
inputs in the training set, produces the same result as Oja's M-unit rule.
Another architecture, designed by Rubner and Tavan in 1989, contains a one-layer
network with trainable lateral connections between the M output units (lateral
connections exist only "from the left to the right", or between all units in
Földiák's similar approach). The ordinary weights are trained with the plain Hebbian
rule with renormalization to unit length, and the lateral weights u_jk are trained
with anti-Hebbian learning:

Δu_jk = -η F_j(x) F_k(x)    (2.12)

This architecture extracts M principal components like Sanger's rule, and the lateral
connections converge to zero.
2.5. Self-Organizing Feature Extraction with Hebbian learning
In feature extraction the goal is to have many output units, each one most
sensitive to a particular input, with different output units choosing different input
patterns. The number of output units may be larger than the number of input units.
This can be measured by the selectivity of a particular output O_i (as defined by
Bienenstock in 1982):
Selectivity = 1 - ⟨O_i⟩ / max O_i    (2.13)

where the average ⟨O_i⟩ and the maximum are both taken over all possible inputs.
The selectivity is near 1 if the output unit favors a single input (or a narrow range
of inputs), and it is near zero if it responds equally to all inputs.
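As a small illustration (our own helper, not one of the thesis utilities), the
selectivity can be computed in ANSI C from a table of the unit's responses to all P
test inputs, assuming non-negative responses with a positive maximum:

/* Selectivity (2.13) of one output unit: 1 - <O> / max O, where the
   average and the maximum are taken over a table of P responses. */
double selectivity(const double o[], int P)
{
    double sum = 0.0, max = o[0];
    int i;
    for (i = 0; i < P; i++) {
        sum += o[i];
        if (o[i] > max)
            max = o[i];
    }
    return 1.0 - (sum / P) / max;
}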
The aim is to define an architecture and learning rule in which the outputs
converge to high selectivity and in which different output units become sensitive to
different input patterns, with some output unit matched to every input pattern.
Another goal is that similar input patterns should activate nearby output units
arranged in a geometrical structure.
The area of feature mapping is very challenging also because of the experimental
biological evidence from the animal visual cortex.
An interesting model, consisting of three layers A, B and C where units are
connected by feed-forward connections to the neighboring units in the previous layer
(the receptive field), was designed by Linsker in 1986 [Linsker86]. The output O of a
particular unit receiving inputs V_k from K units is:

O = a + Σ_{k=1..K} w_k V_k    (2.14)

where a is an optional threshold. Since the units are linear, the network could be
replaced by a network with just one layer, but the layers are important for the
learning rule with tunable parameters b, c and d:

Δw_i = η ( O V_i + b V_i + c O + d )    (2.15)

The weights are then clipped to the range w- ≤ w_i ≤ w+. In addition, this rule tries
to keep Σ_j w_j = κ, where κ is a constant combination of the constants a-d.
2.6. Unsupervised Competitive Learning
The main principle of competitive learning is that only one output unit, or only
one unit per group, is active at a time. The output units compete for activation, and
this is where the name of this type of learning comes from: winner-takes-all.
The aim of these architectures is to cluster or categorize the input data;
similar input patterns are classified into the same category. The network has to find
the classes based on the correlations in the input data. The possible uses include
any categorization in AI, data encoding and compression, function approximation,
image processing, statistical analysis and combinatorial optimization. There are
several disadvantages of winner-takes-all architectures:
The output code is not very efficient, since one output cell represents one
category, and N units can represent only N categories instead of a possible 2^N.
These architectures are not robust to degradation or failure: if one output unit
fails, the whole category is lost.
The basic winner-takes-all update is sketched below.
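The following minimal ANSI C sketch is our own illustration of the standard
winner-takes-all update (squared Euclidean distance as the competition criterion; the
names and dimensions are assumptions, not from the thesis utilities):

#define N 2   /* input dimension */
#define M 4   /* number of output units (categories) */

/* One step of winner-takes-all competitive learning: the unit whose
   weight vector is closest to the input wins, and only the winner's
   weight vector is moved toward the input. */
void competitive_step(double w[M][N], const double x[N], double eta)
{
    int i, k, winner = 0;
    double best = -1.0;

    for (i = 0; i < M; i++) {         /* find the winning unit */
        double d = 0.0;
        for (k = 0; k < N; k++)       /* squared Euclidean distance */
            d += (x[k] - w[i][k]) * (x[k] - w[i][k]);
        if (best < 0.0 || d < best) {
            best = d;
            winner = i;
        }
    }
    for (k = 0; k < N; k++)           /* move the winner toward the input */
        w[winner][k] += eta * (x[k] - w[winner][k]);
}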
3. BCM Learning Rule

3.1. Introduction
The original motivation came from real experiments with the visual cortex of
animals. In 1975, Nass and Cooper [Nass75] explored a theory in which the
modification of visual cortical synapses was purely Hebbian.
Then a significant extension of the theory was presented by Cooper, Liberman and
Oja in 1979 [Cooper79]. It was presented as a theoretical solution to the problem of
visual cortex plasticity. The main idea was that the sign of weight modification
should be based on whether the postsynaptic response is above or below a threshold:
responses above the threshold should lead to strengthening of the active synapses,
responses below the threshold to weakening of the active synapses.
3.2. Basic concepts of the BCM Theory
Neurons in the primary visual cortex of a normal adult cat are sharply tuned to
the orientation of an elongated slit of light, and most are activated by stimulation
of either eye, as stated by Hubel and Wiesel in 1959 [Hubel59]. Both of these
properties, orientation selectivity and binocularity, depend on the type of visual
environment experienced during a critical period of early postnatal development
[Intrator92]. Several striking effects appear after abnormal postnatal development.
The theoretical solution for the plasticity of the visual cortex was presented by
Cooper, Liberman and Oja in [Cooper79].
The original theory used a modification threshold θ_m that was static. Bienenstock,
Cooper and Munro suggested [Bienenstock82] that this value varies as a nonlinear
function of the average output of the postsynaptic neuron, which is the main concept
of the present BCM model. This provides stability properties and explains several
important effects. The form of synaptic modification is:

dm_j/dt = φ(c, θ_m(t)) d_j    (3.1)

where m_j is the efficacy of the jth Lateral Geniculate Nucleus (LGN) synapse onto a
cortical neuron (i.e. the input weight), d_j is the level of presynaptic activity of
the jth LGN afferent (i.e. the input), c is the level of activation of the
postsynaptic neuron (i.e. the output), which is given (in the linear region) by
c = m·d, and θ_m is a nonlinear function of the cell activity averaged over some
time, which in the original BCM formulation was proposed as
θ_m = ⟨c⟩²    (3.2)

and in the present formulation as

θ_m(t) = ⟨c²(t)⟩ / c_0    (3.3)

where c_0 is a positive scaling constant (originally, ⟨c(t)⟩² was used instead of
⟨c²(t)⟩). The averaged cell activity over some recent past, ⟨c²(t)⟩, is determined
for example by:

⟨c²(t)⟩ = (1/τ) ∫_{t'=-∞}^{t} c²(t') e^{-(t-t')/τ} dt'    (3.4)
where τ is the averaging period. From these relations it follows that when the
postsynaptic activity c(t) is greater than zero but less than the modification
threshold θ_m, all active synapses (i.e. those with d_i(t) > 0) weaken. On the other
hand, when the postsynaptic activity c(t) is greater than θ_m, all active synapses
potentiate. Since c(t) = Σ_i m_i(t) d_i(t), the correlation of excitatory inputs
plays a crucial role in driving the postsynaptic cell activity above the modification
threshold θ_m. The key property of θ_m is that it is not fixed; its current value is
proportional to the postsynaptic response averaged over some recent past.
The shape of the function φ for different θ_m is drawn in Figure 3.1.

Figure 3.1. The function φ for two different values of the threshold θ_m. Usually φ
takes the form of a parabola, i.e. φ = c·(c - θ_m). (Horizontal axis: postsynaptic
activity c; φ < 0 corresponds to depression, φ > 0 to potentiation.)
The BCM model has great biological relevance; details can be found for example in
[Bear87]. The group of researchers working with the BCM theory is currently making a
lot of effort in this area. The model has also been applied to the rat barrel cortex
by Beňušková et al., and the proposed model has been further extended with inhibitory
cells in [Benuskova97].
The important aspect of this work is that the theory is usually compared with
experiment, as in [Clothiaux91], where such processes in the visual network as
monocular deprivation (MD), normal rearing (NR), reverse suture (RS), strabismus
(ST), binocular deprivation (BD) and recovery from deprivation (RE) are explained
using the BCM model. A neural network model of the visual cortex has also been built,
and the stability and fixed points of its synapses were analyzed in [Cooper88]. The
BCM theory has been mathematically analyzed from various points of view by several
authors.
3.3. Experiments with the BCM Neuron and Time Sequences
3.3.1. Introduction
The BCM learning rule has been applied in several different situations. It was
shown, for example, that a modified version can perform an efficient computation of
projection pursuit, and determine bimodality, statistical skewness and kurtosis of
the distribution of the input data.
The original motivation for our experiments was the internal state of the BCM
neuron, its threshold θ_m. We believed it should have a significant impact on the
ability of the BCM neuron (or of some kind of network based on BCM neurons) to
recognize some properties of symbolic time sequences.
The goal of all experiments which we performed with the BCM learning rule was to
study the behavior of a single neuron exposed to symbolic time sequences built from
two symbols, namely 0 and 1. Symbols from the input sequence were fed into the
neuron's weighted input. The weight w and the threshold θ_m were updated after each
input. Figure 3.2 shows a scheme of this circuit. The development of both the weight
and the threshold was observed.
Figure 3.2. The experimental circuit: a symbolic sequence d(t) feeds a single BCM
neuron with weight w(t) and threshold θ(t), producing the output c(t).
We found that for two sequences with different entropies but the same number of
symbols 1 the resulting weight differs, i.e. it is dependent on the entropy of the
input sequence (or at least has a strong relation to it).
We were also interested in the nature of the dynamics of the threshold θ_m. The
weight becomes stabilized after a certain number of iterations; the role of the
threshold and the details of the model are given by the following equations, where
d(t) is the input symbol at time t. The output of the neuron is

c(t) = w(t)·d(t)    (3.5)

for the linear neuron, and

c(t) = σ(w(t)·d(t))    (3.6)

for the nonlinear neuron, where σ is the sigmoid transition function with derivative σ':

σ(x) = 2 / (1 + e^(-2x)) - 1    (3.7)

σ'(x) = 4 e^(-2x) / (1 + e^(-2x))²    (3.8)

The weight is updated after each input,

w(t+1) = w(t) + Δw    (3.9)

with

Δw = η · φ(c(t), θ_m) · d(t)    (3.10)

for the linear neuron and

Δw = η · φ(c(t), θ_m) · σ'(w(t)·d(t)) · d(t)    (3.11)

for the nonlinear neuron, where η is the learning speed and

φ(c, θ_m) = c·(c - θ_m)    (3.12)

The formulas regarding BCM are from [Intrator92], and the information theory
equations are provided in [Tino96].
In the experiments with additive noise, we used d(t) + noise(t) instead of d(t),
where noise(t) is a Bernoulli source, uniformly distributed in the interval [-a, a]
(a is the noise amplitude parameter).
The threshold is defined by (3.4), but the continuous integral was approximated
by a discrete sum, accumulating the weighted average activity over the recent τ
iterations (see the source code). In addition, the scaling parameter c_0 allows
scaling of this average:
θ_m(t) = ⟨c²(t)⟩ / c_0    (3.13)
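To make the update concrete, here is a minimal ANSI C sketch of one simulation step
of the nonlinear BCM neuron, following (3.6)-(3.13). The struct layout and the
discrete leaky-integrator approximation of ⟨c²⟩ are our own assumptions; the exact
windowing used by BCMSNLIN is in the source code in Appendix A.

#include <math.h>

/* State of a single BCM neuron with one input, following (3.6)-(3.13). */
typedef struct {
    double w;        /* synaptic weight w(t) */
    double theta;    /* modification threshold theta_m(t) */
    double avg_c2;   /* running average of c^2 over the recent past */
} bcm_neuron;

static double sigma(double x)    /* (3.7): sigma(x) = 2/(1+e^(-2x)) - 1 */
{
    return 2.0 / (1.0 + exp(-2.0 * x)) - 1.0;
}

static double dsigma(double x)   /* (3.8): sigma'(x) = 4e^(-2x)/(1+e^(-2x))^2 */
{
    double e = exp(-2.0 * x);
    return 4.0 * e / ((1.0 + e) * (1.0 + e));
}

/* One iteration for input symbol d (0 or 1); for the linear neuron,
   use c = w*d and drop the dsigma() factor, cf. (3.5) and (3.10). */
void bcm_step(bcm_neuron *n, double d, double eta, double tau, double c0)
{
    double x   = n->w * d;
    double c   = sigma(x);                /* (3.6): output c(t) */
    double phi = c * (c - n->theta);      /* (3.12): phi(c, theta_m) */

    n->w += eta * phi * dsigma(x) * d;    /* (3.11): weight update */

    /* (3.4)/(3.13): average of c^2 over roughly the last tau steps,
       approximated here by a discrete leaky integrator */
    n->avg_c2 += (c * c - n->avg_c2) / tau;
    n->theta   = n->avg_c2 / c0;
}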
For determining the entropy of symbolic sequences, we used the calculation of
entropies for increasing window lengths. If the probability of some event is P_i, its
information content is given by the formula:

I_i = -log2 P_i    (3.14)

Based on this, Shannon introduced the entropy of the data as the sum of the products
of the probabilities and the information contents of all possible events:

H = -Σ_i P_i log2 P_i    (3.15)

For a window of length n there exist 2^n different words (having only the two symbols
0 and 1). Let the probability of a word w of length n be P_n(w). The entropy of the
sequence for a window of length n is thus:

H_n = -Σ_{w: |w|=n} P_n(w) log2 P_n(w)    (3.16)
The entropy normalized per symbol for window length n is

h_n = H_n / n    (3.17)

and the entropy of the source is the limit

h = lim_{n→∞} h_n    (3.18)
To emphasize different parts of the word histogram, the probabilities can be
transformed with a parameter β (related to a temperature T by β = 1/T):

P̃_n(w) = P_n(w)^β / Σ_{v: |v|=n} P_n(v)^β    (3.19)

This transformation lets the more likely words take over the less frequent words
when β is larger than 1 (i.e. β > 1). The opposite effect happens for negative β
(β < 0), when the least frequent words become dominant. Therefore, entropies computed
for different temperatures uncover more information about the histogram of the
sequences. (A special case, when T → ∞, i.e. β → 0, is called the topological case.
In this case, all words which are present in the input sequence are represented in
the histogram by 1, and all words which are not present in the input sequence by 0.)
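A compact ANSI C sketch of this computation follows. It is our own illustration, not
the listing of the EN/ENTR_SPE utilities; it assumes base-2 logarithms and a binary
sequence stored as a string of '0'/'1' characters. With beta = 1 it returns the plain
block entropy (3.16); with beta = 0 it returns the topological entropy of the words
present in the sequence.

#include <stdlib.h>
#include <math.h>

/* Entropy of a binary sequence for window length n and parameter beta:
   word probabilities are first transformed by (3.19),
   P~(w) = P(w)^beta / sum_v P(v)^beta, and then
   H = - sum_w P~(w) * log2 P~(w), following (3.16). */
double block_entropy(const char *seq, long len, int n, double beta)
{
    long words = 1L << n;                 /* 2^n possible words */
    long *count = calloc(words, sizeof(long));
    long i, total = len - n + 1;          /* number of windows */
    double z = 0.0, h = 0.0;

    for (i = 0; i + n <= len; i++) {      /* histogram of all windows */
        long code = 0;
        int j;
        for (j = 0; j < n; j++)
            code = (code << 1) | (seq[i + j] == '1');
        count[code]++;
    }
    for (i = 0; i < words; i++)           /* normalization sum_v P(v)^beta */
        if (count[i] > 0)
            z += pow((double)count[i] / total, beta);
    for (i = 0; i < words; i++)
        if (count[i] > 0) {
            double p = pow((double)count[i] / total, beta) / z;
            h -= p * log(p) / log(2.0);   /* entropy in bits */
        }
    free(count);
    return h;
}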
Sequences generated by the deterministic state automaton
Each automaton used in our experiments is given by an output (emission)
probability function Prob and a transition function Trans; for a deterministic
automaton we write Out for the output function and Next for the transition function.
The probabilities are normalized so that for every state a:

Σ_{b∈Symbols} Prob(a, b) = 1    (3.20)

Σ_{b∈States} Trans(a, b) = 1    (3.21)

Figure 3.3.a: An example of a state automaton (state diagram).
Figure 3.3.b: An example of the HMM automaton (automaton definition), with parameters:
Prob(state 1, 1) = 0.9
Prob(state 2, 0) = 0.9
Prob(state 2, 1) = 0.1
Prob(state 1, 0) = 0.3
Prob(state 1, 1) = 0.7
3.3.3. Implementation

The following automata were used to generate the experimental sequences.

A1 deterministic automaton:
Out_A1(s1) = 1,  Next_A1(s1) = s2
Out_A1(s2) = 0,  Next_A1(s2) = s3
Out_A1(s3) = 0,  Next_A1(s3) = s1

A2 HMM:
A2 = ({0,1}, {s1, s2}, s1, Prob_A2, Trans_A2)
Prob_A2(s1, 0) = 0.433    Trans_A2(s1, s1) = 0.75
Prob_A2(s1, 1) = 0.567    Trans_A2(s1, s2) = 0.25
Prob_A2(s2, 0) = 0.9      Trans_A2(s2, s1) = 0.25
Prob_A2(s2, 1) = 0.1      Trans_A2(s2, s2) = 0.75
B1 deterministic automaton:
Out_B1(s1) = Out_B1(s2) = Out_B1(s3) = Out_B1(s4) = Out_B1(s5) = 1
Out_B1(s6) = Out_B1(s7) = Out_B1(s8) = Out_B1(s9) = Out_B1(s10) = 0
Next_B1(si) = s((i mod 10) + 1)
B2 HMM:
B2 = ({0,1}, {s1, ..., s10}, s1, Prob_B2, Trans_B2)
Prob_B2(s1, 0) = 0.05,  Prob_B2(s1, 1) = 0.95
Prob_B2(s2, 0) = 0.15,  Prob_B2(s2, 1) = 0.85
Prob_B2(s3, 0) = 0.25,  Prob_B2(s3, 1) = 0.75
Prob_B2(s4, 0) = 0.35,  Prob_B2(s4, 1) = 0.65
Prob_B2(s5, 0) = 0.45,  Prob_B2(s5, 1) = 0.55
Prob_B2(s6, 0) = 0.55,  Prob_B2(s6, 1) = 0.45
Prob_B2(s7, 0) = 0.65,  Prob_B2(s7, 1) = 0.35
Prob_B2(s8, 0) = 0.75,  Prob_B2(s8, 1) = 0.25
Prob_B2(s9, 0) = 0.85,  Prob_B2(s9, 1) = 0.15
Prob_B2(s10, 0) = 0.95, Prob_B2(s10, 1) = 0.05
Trans_B2(si, sj) = 0.25 for j = i,
Trans_B2(si, sj) = 0.75 for j = (i mod 10) + 1,
Trans_B2(si, sj) = 0.0 otherwise.
C1 deterministic automaton:
Out_C1(s1) = 0, Out_C1(s2) = 1, Out_C1(s3) = 1, Out_C1(s4) = 0
Next_C1(si) = s((i mod 4) + 1)

C2 HMM:
C2 = ({0,1}, {s1, s2, s3, s4}, s1, Prob_C2, Trans_C2)
Prob_C2(s1, 0) = 0.6502,  Prob_C2(s1, 1) = 0.3498
Prob_C2(s2, 0) = 0.0,     Prob_C2(s2, 1) = 1.0
Prob_C2(s3, 0) = 0.0,     Prob_C2(s3, 1) = 1.0
Prob_C2(s4, 0) = 1.0,     Prob_C2(s4, 1) = 0.0
Trans_C2(s1, s1) = 0.7, Trans_C2(s1, s2) = 0.3, Trans_C2(s1, s3) = Trans_C2(s1, s4) = 0.0
Trans_C2(si, sj) = 1.0 for i > 1 and j = (i mod 4) + 1,
Trans_C2(si, sj) = 0.0 otherwise.
D1 deterministic automaton:
Out_D1(s1) = 0, Out_D1(s2) = 1, Out_D1(s3) = 1,
Out_D1(s4) = 0, Out_D1(s5) = 0, Out_D1(s6) = 0
Next_D1(si) = s((i mod 6) + 1)

D2 HMM:
D2 = ({0,1}, {s1, ..., s6}, s1, Prob_D2, Trans_D2)
Prob_D2(s1, 0) = 0.9,  Prob_D2(s1, 1) = 0.1
Prob_D2(s2, 0) = 0.1,  Prob_D2(s2, 1) = 0.9
Prob_D2(s3, 0) = 0.1,  Prob_D2(s3, 1) = 0.9
Prob_D2(s4, 0) = 0.9,  Prob_D2(s4, 1) = 0.1
Prob_D2(s5, 0) = 0.1,  Prob_D2(s5, 1) = 0.9
Prob_D2(s6, 0) = 0.9,  Prob_D2(s6, 1) = 0.1
Several supporting utilities were built for our experiments. The diagram in
Figure 3.4 shows the flow of data between the individual pieces. Listings of the
programs and sample data files can be found in Appendix A. A detailed account of each
utility follows.
GENS_HMM name_of_task
is a general HMM automaton. It first reads the automaton parameters from the text
file name_of_task.hmm. It starts in the initial state and produces an HMM sequence of
the expected length. The sequence contains one symbol per line and is saved into the
file name_of_task.trs. This automaton was used also for generating the deterministic
sequences (by setting the HMM transition function Trans(si, sj) = 1.0 if
Next(si) = sj, and Trans(si, sj) = 0.0 otherwise).
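For illustration, the sampling loop at the core of such a generator can be sketched
in ANSI C as follows; this is our own minimal version with the two-state automaton A2
hard-coded instead of read from a .hmm file:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Emission and transition tables of the two-state HMM A2 from
   section 3.3.3; row = current state, column = symbol / next state. */
static const double emit[2][2]  = { {0.433, 0.567},   /* s1: P(0), P(1) */
                                    {0.9,   0.1  } }; /* s2: P(0), P(1) */
static const double trans[2][2] = { {0.75, 0.25},     /* s1 -> s1, s2 */
                                    {0.25, 0.75} };   /* s2 -> s1, s2 */

static int pick(const double p[2])   /* sample index 0 or 1 from p[] */
{
    return (rand() / (RAND_MAX + 1.0)) < p[0] ? 0 : 1;
}

int main(void)
{
    int state = 0, t;                /* start in the initial state s1 */
    srand((unsigned)time(NULL));
    for (t = 0; t < 1000; t++) {     /* one symbol per line, as GENS_HMM does */
        printf("%d\n", pick(emit[state]));
        state = pick(trans[state]);
    }
    return 0;
}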
ONES name_of_file
counts the number of symbols 1 in the input sequence. After reaching the end of the
file, the overall probability of 1s in the sequence is printed.
BCMSNLIN project_name
is the main BCM neuron simulation utility. It reads the parameters of the model from
the project_name.bcm file, initializes all neurons and feeds the neurons' inputs with
symbols from project_name.trs. It periodically stores the actual weights and
thresholds into the output file. Before running the simulation, several parameters of
the BCM neuron have to be specified. Our model (always with only one input) had the
following parameters:
the speed of learning (η)
the time window constant (τ) and the scaling constant (c_0) of the threshold
the bounds of the sigmoid (cell_sig_min, cell_sig_max)
the output logging: how often the actual values of the weight and θ are stored
into the output file (output_step), and whether the average or the actual values
should be used
Figure 3.4. The flow of data between the individual utilities. An HMM automaton
description (.hmm file) is read by gens_hmm, which generates a symbolic sequence
(.trs file); gens_rnd generates random sequences. The utility ones determines the
probability of symbols 1 in a sequence, en computes the entropies for different
window lengths, and entr_spe computes the entropic spectrum, i.e. the entropies for a
list of temperatures (.tem file). The BCM neuron simulation bcmsnlin reads the
simulation parameters (.bcm file) and the sequence (.trs file) and records a log
(.out file) of the development of θ and the weight; plot displays this development
(with settings for painting the graph in a .plt file), compare compares the actual
values in the log with the average, and average computes the averages of all columns
(.av file), which were further processed in MS Excel. The legend of the diagram
distinguishes program names, file extensions and their purposes.
3.3.4. Results
First, we discuss the results acquired from the plot program. We ran the
experiments for all input sequences A1, A2, A3, ..., D1, D2, D3. The entropies of the
sequences were different (see Figure 3.5); we hoped that the weight of the neuron
would follow these differences.
Figure 3.5.a. Entropies of sequences A1 (DET), A2 (HMM) and A3 (RND) computed by the
EN utility for window lengths N = 1, ..., 12. H(N) is the entropy for window length N
and h(N) is the same value normalized per symbol. The deterministic sequence has low
entropy (h(N) close to 0), whereas the random sequence has high entropy (h(N) close
to 1). The HMM automaton A2 is similar to a random sequence, and therefore the
difference in entropy between A2 and A3 is low.
Figure 3.5.b. Entropies of sequences B1 (DET), B2 (HMM) and B3 (RND) computed by the
EN utility. In this case, the difference between the entropies of B2 and B3 is small,
although larger than for the A set.
Figure 3.5.c. Entropies of sequences C1 (DET), C2 (HMM) and C3 (RND) computed by the
EN utility. The difference between the entropies of C2 and C3 is larger than for the
B set.
Figure 3.5.d. Entropies of sequences D1 (DET), D2 (HMM) and D3 (RND) computed by the
EN utility. The difference between the entropies of D2 and D3 is large, because the
HMM automaton for D2 contains deterministic parts.
However, the experiments with the linear neuron (i.e. without the sigmoid
function, using the output (3.5)) showed that the neuron is not very sensitive to
differences in the entropy of the input sequences (see Figures 3.6 and 3.7), and
tuning the parameters did not make a difference. Therefore, we switched to the
nonlinear neuron with the sigmoid transition function (3.6), which improved the
situation: in most cases the neuron did recognize the differences in the input
sequences, and the resulting weight (after the convergence process) separated the
sequences with different entropies (see Figures 3.8 and 3.9).
Figure 3.6. Development of the weight (y-axis) with time constant τ = 10 for the
linear neuron, for symbolic time sequences with 200,000 symbols (x-axis). We can see
that the weight does not clearly separate the sequences. We used the following
parameters: learning speed η = 0.001, c_0 = 0.85, cell_sig_min = -2,
cell_sig_max = 2, length of sequence: 200,000 symbols. The chart was obtained by a
shareware screen-grabbing utility, and a dilate filter was applied to the resulting
image in order to improve the visibility of the curves.
Figure 3.7. Development of the weight with time constant τ = 20 for the linear
neuron. Here the weights for different sequences are even less distinguishable.
Figure 3.8. Development of the weight with time constant τ = 10 for the nonlinear
neuron. All other parameters remain the same as in the linear case. The weight of the
input separates the sequences.
Figure 3.9. Development of the weight with time constant τ = 10 for the nonlinear
neuron.
An interesting effect appears for τ = 2, when the weight difference is the
largest (see Figure 3.10). This should be compared with the fact that for τ = 2 the
entropy spectrum of the threshold time sequence has the highest correlation with the
entropy spectrum of the original sequence, see below.
To test the robustness of the model, we introduced a noise parameter. Randomly
distributed additive noise (not Gaussian) on the input had only a quantitative, not a
qualitative impact on the model. The weight separation was not as clear as in the
case without noise, which can be explained by the changed properties of the input
sequence: an HMM-generated sequence is already quite random, and adding noise results
in a sequence which cannot be represented by the original HMM automaton and therefore
has different properties. Because the noise did not influence the quality of the
model, we removed it from the subsequent experiments.
Changing the meaning of the symbols 0 and 1 had the following effect: if we
replaced 0 with 0.25 and 1 with 0.75, the adaptation process slowed down, but the
final result of the weight separation did not change. This is an important property
of our model, because the weight change is exactly 0 for a pure 0 in the input (see
(3.10) and (3.11)).
Figure 3.10. Development of the weight with time constant τ = 2 for the nonlinear
neuron (a similar effect appears also for the linear neuron). All other parameters
remain the same as in the former case. The weight of the input separates the
sequences.
Dynamic changes in the input are compensated by the dynamic threshold (see
Figures 3.11 and 3.12). Increased potentiation of the weight results in an increase
of the average activity and thus of θ. A low probability of symbols 1 in the input
results in a decrease of the average neuron activity and also of the threshold. In
this way, the dynamic threshold adapts to the input. We should also note that
although the probability of 1 is an important factor for the resulting converged
weight, it is not the only one: all sequences in one set contained the same number of
symbols 1, but the resulting weights are different for the nonlinear neuron. To find
other properties of the input time sequences which might be significant, we employed
entropy spectra.
Figure 3.12. If the onset of θ is too rapid, it might begin to oscillate. This is
usual for the linear neuron and a very large time window constant τ (here τ = 1000;
the other parameters are the same as in the former case).
The entropies of the input sequences in Figure 3.5 demonstrate that sequences
with a more complicated internal structure have higher entropy. The transformed
entropies for high temperatures show the structure with respect to the very likely
words; entropies for low temperatures show the structure with respect to the very
unlikely words. The special case for T → ∞, called the topological entropy,
determines how many different words are present in the sequence.
Deterministic sequences have constant and low entropies for all temperatures,
because they consist of a periodic subsequence repeated many times. On the other
hand, random sequences contain all possible words for a given window length,
distributed not periodically but randomly; therefore their entropy is high. We
computed the entropy spectra up to a window length of 12, due to the computational
complexity and the huge amount of data. It can also be observed for completely random
sequences (a Bernoulli source) that the entropy decreases for high temperatures; this
is caused only by the bounded length of the sequence (the number of words for a large
window is high), as we worked with sequences of 200,000 symbols.
The second part of the results discusses the entropy spectra of sequences
created by comparing the actual value of the threshold with its long-term average.
They are included in Appendix B.
The first type of figures consists of entropy spectra computed for a fixed
window length. Different figures show these entropy spectra for different values of
the τ parameter of the BCM model. Our goal was to find some kind of correlation
between the structure of the sequence, the length of the window used for computing
the entropy, and the τ parameter.
The second type of figures combines all figures of the first type for a
particular sequence into one chart, where the sum of the differences between the
entropies of the individual sequences of changes (summed over all measured
temperatures, i.e. with the values around 0 sampled more densely) and the entropy of
the original sequence (also summed over all measured temperatures) is plotted against
the window length used for computing the entropy. Each chart contains data for
several different τ parameters, which allows, for example, comparing which sequence
is the best model of the original sequence.
All figures reveal that the sequence of changes is most similar to the original
sequence when the parameter τ equals 2 (except the case τ = 1, in which the sequence
of changes is almost identical to the original sequence). This supports the previous
finding that the weight of the input connection separates the sequences best in the
case τ = 2.
When exploring the figures, the original sequence should be examined first. If
the curve of the entropy spectrum turns down quickly, the sequence contains a few
words which are much more probable than the others. If it declines slowly (as in the
case of C3), most words present in the sequence are equally frequent. The negative
temperatures exhibit the situation for the less frequent words: again, if the decline
is rapid, the sequence contains a few words with very rare occurrence, while
sequences with a moderate slope in the negative part of the spectrum curve contain
many words of low frequency.
We have included all the entropy spectra for the sequences C2 and C3 because of
the special nature of the C2 HMM automaton: once state s2 is reached, the automaton
passes deterministically through s3 and s4 and returns to s1, emitting the fixed
subsequence 110.
Resume
The first part of the thesis contains an overview of unsupervised artificial
neural network architectures; their principles, advantages and disadvantages are
described. The second part focuses on a specific example of Hebbian learning, the BCM
learning rule. In addition to a summary of the BCM bibliography, we used the present
variant of this rule for experiments with time sequences, for both linear and
nonlinear transition functions. We have shown that the cell activity after some
initial period depends on the internal structure of the input sequence. All sequences
in one set, i.e. deterministic, HMM and random, contained the same number of symbols
1, but they differed in the complexity of their internal structure. Our measure of
this complexity was the value of the entropy per symbol. We found that for two
sequences with different entropies and the same number of symbols 1, the resulting
weight of the BCM neuron differs, i.e. it depends on the internal structure of the
input sequence. We supported this statement by exploring the properties of both the
input symbolic sequences and the symbolic sequences produced by comparing the dynamic
threshold to its long-term average: we computed the entropies of the input sequences
and the entropy spectra of the sequences. We found that both the nonlinear and the
linear BCM neuron with the shortest internal memory, i.e. τ = 2, model the input
sequence most closely. We discussed the role of the parameters of the model and
demonstrated how the nonlinearity can improve the characteristics of the development
of the converged input weight. Thanks to the entropy spectra, a better view of the
internal structure of sequences can be obtained. We have demonstrated an example of
the application of this technique to the study of symbolic time sequences.
References

[Bachman94] Bachmann, C. M., Musman, S. A., Luong, D., and Shultz, A., Unsupervised
BCM projection pursuit algorithms for classification of simulated radar
presentations, Neural Networks, Vol. 7, 1994.

[Bear87] Bear, M. F., Cooper, L. N., and Ebner, F. F., A physiological basis for a
theory of synapse modification, Science, Vol. 237, p. 42-48, 1987.

[Benuskova97] Beňušková, Ľ., et al., ... induced by one spared whisker, in:
Artificial Neural Networks: 7th International Conference, Proceedings (Lecture Notes
in Computer Science, Vol. 1327), 1997.

[Bienenstock82] Bienenstock, E. L., Cooper, L. N., and Munro, P. W., Theory for the
development of neuron selectivity: orientation specificity and binocular interaction
in visual cortex, Journal of Neuroscience, Vol. 2, p. 32-48, 1982.

[Clothiaux91] Clothiaux, E. E., Bear, M. F., and Cooper, L. N., Synaptic plasticity
in visual cortex: comparison of theory with experiment, Journal of Neurophysiology,
Vol. 66, 1991.

[Cooper79] Cooper, L. N., Liberman, F., and Oja, E., A theory for the acquisition and
loss of neuron specificity in visual cortex, Biol. Cybern., Vol. 33, p. 9-28, 1979.

[Cooper88] Cooper, L. N., and Scofield, C. L., Mean-field theory of a neural network,
Proc. Natl. Acad. Sci. USA, Vol. 85, p. 1973-1977, 1988.

[Cooper97] Shouval, H., Intrator, N., and Cooper, L. N., BCM network develops
exploratory projection pursuit, Neural Networks, Vol. 10, No. 2, p. 255-262, 1997.

[Hebb49] Hebb, D. O., The Organization of Behavior, Wiley, New York, 1949.

[Hertz91] Hertz, J., Krogh, A., and Palmer, R., Introduction to the Theory of Neural
Computation, Addison-Wesley, 1991.

[Hubel59] Hubel, D. H., and Wiesel, T. N., Integrative action in the cat's ..., 1959.

[Intrator92] Intrator, N., and Cooper, L. N., Objective function formulation of the
BCM theory of visual cortical plasticity: statistical connections, stability
conditions, Neural Networks, Vol. 5, p. 3-17, 1992.

[Linsker86] Linsker, R., From basic network principles to neural architecture,
Proceedings of the National Academy of Sciences of the USA, Vol. 83, p. 7508-7512,
1986.

[Nass75] Nass, M. N., and Cooper, L. N., A theory for the development of feature
detecting cells in visual cortex, Biol. Cybern., Vol. 19, p. 1-18, 1975.

[Oja89] Oja, E., Neural networks, principal components, and subspaces, International
Journal of Neural Systems, Vol. 1, p. 61-68, 1989.

[Sanger89] Sanger, T. D., Optimal unsupervised learning in a single-layer linear
feedforward neural network, Neural Networks, Vol. 2, p. 459-473, 1989.

[Selman96] Selman, B., Brooks, R. A., Dean, T., Horvitz, E., Mitchell, T. M., and
Nilsson, N. J., Challenges for artificial intelligence, Proceedings of AAAI-96, 1996.

[Tino96] Tiňo, P., et al., ... with neural and hybrid neural based systems, submitted
to IEEE Transactions on Neural Networks, 1996.
Appendix A, Source Code and Examples of Data Files

All programs were written in ANSI C. We used a DOS port of the free GNU C compiler
(GCC) by DJ Delorie and a makefile to build all the programs. The first part of this
appendix contains the source code of all the utilities we built for the experiments;
the second part contains examples of data files.