Data-Driven Modelling (Part 3)
Dimitri P. Solomatine
www.ihe.nl/hi/sol sol@ihe.nl
Clustering
Classification is aimed at identifying a mapping (function) that maps any given input xi to a nominal variable (class) yi. Finding the groups (clusters) in an input data set is clustering.
Clustering is often the preparation phase for classification:
the identified clusters can be labelled as classes; each input instance can then be associated with an output value (class), and the instance set {xi, yi} can be built
[Figure: example of a data set grouped into three clusters (Cluster 1, Cluster 2, Cluster 3)]
labelling large data sets can be very costly; clustering may actually give an insight into the data and help discover classes which are not known in advance; clustering may find features that can be used for categorization.
Voronoi diagrams
Main approaches: partition-based clustering (K-means, fuzzy C-means, based on Euclidean distance); hierarchical clustering (agglomerative hierarchical clustering, nearest-neighbour algorithm); feature extraction methods: principal component analysis (PCA), self-organizing feature (SOF) maps (also referred to as Kohonen neural networks).
K-means clustering
Find the best division of N samples into K clusters Ci such that the total distance between the clustered samples and their respective centres (that is, the total variance) is minimized:
J = \sum_{i=1}^{K} \sum_{n \in C_i} \| x_n - \mu_i \|^2

where \mu_i is the centre of cluster C_i.
1 Choose K initial cluster centres (e.g. at random).
2 Assign each instance to the nearest centre.
3 Reassign the instances to the nearest cluster centres.
4 Recalculate each centre as the mean of the instances currently assigned to it:
\mu_i = \frac{1}{N_i} \sum_{n \in C_i} x_n
5 Reassign the instances to the new centres.
Repeat steps 2-5 until the total variance J stops decreasing (or the centres stop moving).
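A minimal NumPy sketch of the loop above; the toy data, the random initialisation and the convergence tolerance are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Minimal K-means: X is an (N, M) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), K, replace=False)]   # step 1: initial centres
    prev_J = np.inf
    for _ in range(n_iter):
        # steps 2-3: assign every instance to its nearest centre
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # step 4: recalculate each centre as the mean of its instances
        centres = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                            else centres[i] for i in range(K)])
        # total variance J; stop when it no longer decreases
        J = sum(((X[labels == i] - centres[i]) ** 2).sum() for i in range(K))
        if prev_J - J < 1e-9:
            break
        prev_J = J
    return labels, centres, J

# illustrative data: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centres, J = k_means(X, K=2)
```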
[Figure: the data set represented as an N x M table, instances 1 ... N in rows and attributes x1, x2, ..., xM in columns (panels a and b)]
1 While the stopping condition is false, do iteration t (steps 2-8): 2 For each input vector x = {x1, ..., xN} do steps 3-8:
3 For each output node k calculate the similarity measure (in this case the Euclidean distance) between the input and the weight vector:
D(k) = \sum_{i=1}^{N} (w_{ik} - x_i)^2
4 Find the index kmax for which D(k) is minimal; this refers to the winning node. 5 Update the weights of the node kmax and of all nodes k within a specified neighbourhood radius r from kmax, moving them towards the current input (with a learning rate α that typically decreases over the iterations):
w_{ik}^{new} = w_{ik}^{old} + \alpha\,(x_i - w_{ik}^{old})
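A compact sketch of this training loop for a one-dimensional chain of output nodes; the map size, learning rate α, neighbourhood radius r, and the toy data are illustrative assumptions.

```python
import numpy as np

def sofm_train(X, n_nodes=10, n_iter=50, alpha=0.5, radius=2, seed=0):
    """Train a 1-D self-organizing feature map on data X of shape (S, N)."""
    rng = np.random.default_rng(seed)
    W = rng.random((n_nodes, X.shape[1]))          # one weight vector per output node
    for t in range(n_iter):
        for x in X:                                 # step 2: each input vector
            D = ((W - x) ** 2).sum(axis=1)          # step 3: distance to every node
            k_win = D.argmin()                      # step 4: winning node
            for k in range(n_nodes):                # step 5: update winner + neighbours
                if abs(k - k_win) <= radius:
                    W[k] += alpha * (x - W[k])
        alpha *= 0.98                               # slowly decrease the learning rate
    return W

# illustrative input: points concentrated around two locations in 2-D
X = np.vstack([np.random.randn(100, 2) * 0.1, np.random.randn(100, 2) * 0.1 + 1])
W = sofm_train(X)
```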
SOFM: example
Input set: points sampled randomly in a square (the probability of sampling a point in the central square region was 20 times greater than elsewhere in the square). The target space is discrete and includes 100 output nodes arranged in 2 dimensions. The SOFM is able to find the cluster in the area of point concentration.
A commonly used visualization method is a plot showing, for each output node, the number of times it was the winning node; it can be interpolated into colour shading as well. Another visualization is the distance matrix (of size K x K), whose elements are the Euclidean distances of each output unit to its immediate neighbouring units.
Eager learning:
first the ML (data-driven) model is built, then it is tested and used
Lazy learning
no ML model is built in advance (hence "lazy"); when new examples come, the output is generated immediately on the basis of the training examples. Other names for lazy learning:
instance-based; exemplar-based; case-based; experience-based; edited k-nearest neighbour
instances are points in 2-dimensional space, the output is boolean (+ or -); a new instance xq is classified according to the proximity of the nearest training instances:
to class + (if 1 neighbor is considered) to class - (if 4 neighbors are considered)
Notations
An instance x is described as {a1(x), ..., an(x)}, where ar(x) denotes the value of the r-th attribute of instance x. The distance between two instances xi and xj is defined to be d(xi, xj), where
d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} (a_r(x_i) - a_r(x_j))^2}
Training
Build the set of training examples D.
Classification
Given a query instance xq to be classified, let x1, ..., xk denote the k instances from D that are nearest to xq. Return
F(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))
where δ(a, b) = 1 if a = b and δ(a, b) = 0 otherwise; V = {v1, ..., vs} is the set of possible output values.
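A direct sketch of this classification rule; the Euclidean distance and the majority vote follow the formulas above, while the toy data set is an illustrative assumption.

```python
import numpy as np
from collections import Counter

def knn_classify(D_X, D_y, x_q, k=3):
    """Return the majority class among the k training instances nearest to x_q."""
    d = np.sqrt(((D_X - x_q) ** 2).sum(axis=1))     # Euclidean distance to all instances
    nearest = np.argsort(d)[:k]                      # indices of the k nearest neighbours
    votes = Counter(D_y[i] for i in nearest)         # sum of delta(v, f(x_i)) per class v
    return votes.most_common(1)[0][0]

# illustrative training set: two classes around (0, 0) and (3, 3)
D_X = np.array([[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]])
D_y = np.array(['-', '-', '+', '+'])
print(knn_classify(D_X, D_y, np.array([2.5, 2.5]), k=3))   # -> '+'
```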
k-NN for real-valued (continuous) outputs: the output of a new instance is determined from the output values of the nearest training instances (the average of the k nearest neighbours is taken, or the weighted average) and from their proximity; alternatively, a locally weighted regression model is built and used to predict the value of the new instance.
In this case the final line of the k-NN algorithm should be replaced by:
F(x_q) = \frac{\sum_{i=1}^{k} f(x_i)}{k}
Distance-weighted k-NN (classification): the contribution of each of the k neighbours is weighted according to its distance from xq:
F(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} w_i\,\delta(v, f(x_i))
where
w_i = \frac{1}{d(x_q, x_i)^2}
Distance-weighted k-NN (regression):
F(x_q) = \frac{\sum_{i=1}^{k} w_i\, f(x_i)}{\sum_{i=1}^{k} w_i}
where the weight is
w_i = \frac{1}{d(x_q, x_i)^2}
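A sketch of distance-weighted k-NN for a real-valued output using the weights w_i = 1/d(x_q, x_i)^2 from the formula above; the toy 1-D data are an illustrative assumption.

```python
import numpy as np

def knn_regress_weighted(D_X, D_f, x_q, k=3, eps=1e-12):
    """Weighted average of the k nearest outputs, with weights 1/d^2."""
    d = np.sqrt(((D_X - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    if d[nearest[0]] < eps:                 # query coincides with a training point
        return float(D_f[nearest[0]])
    w = 1.0 / d[nearest] ** 2
    return float((w * D_f[nearest]).sum() / w.sum())

# illustrative 1-D example: f(x) roughly equal to 2x
D_X = np.array([[0.0], [1.0], [2.0], [3.0]])
D_f = np.array([0.1, 2.0, 4.1, 5.9])
print(knn_regress_weighted(D_X, D_f, np.array([1.4]), k=2))
```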
If all training instances (not only the k nearest) are taken into account:
for classification:
F(x_q) = \arg\max_{v \in V} \sum_{\text{all instances } i} w_i\,\delta(v, f(x_i))
for regression:
F(x_q) = \frac{\sum_{\text{all instances } i} w_i\, f(x_i)}{\sum_{\text{all instances } i} w_i}
k-NN creates a local model in the proximity of the new instance, instead of a global model of all training instances; it is robust to noisy training data; it requires a considerable amount of data; the distance between instances is calculated based on all attributes (and not on one attribute at a time, as in decision trees). Possible problem:
imagine instances described by 20 attributes, of which only 2 are relevant to the target function; curse of dimensionality: the nearest-neighbour method is easily misled when X is high-dimensional; solution: stretch the j-th axis by a weight zj chosen to minimize the prediction error.
Construct an explicit approximation F(x) of the target function f(x) over a local region surrounding the new query point xq. If F(x) is linear, this is called locally weighted linear regression:
F(x) = w_0 + w_1 a_1(x) + \ldots + w_n a_n(x)
Instead of minimizing the global error E, here the local error E(xq) has to be minimized
Minimize the squared error over just the k nearest neighbours of xq:
E_1(x_q) = \frac{1}{2} \sum_{x \in k\ \text{nearest nbrs of}\ x_q} (f(x) - F(x))^2
Minimize the squared error over the entire set D of training examples, while weighting the error of each training example by some decreasing function K of its distance from xq:
E_2(x_q) = \frac{1}{2} \sum_{x \in D} (f(x) - F(x))^2\, K(d(x_q, x))
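A sketch of locally weighted linear regression minimizing an E2-type criterion: each training example is weighted by a decreasing kernel K of its distance to x_q, and a weighted least-squares fit F(x) = w0 + w1*a1(x) + ... is solved. The Gaussian kernel and the bandwidth are illustrative assumptions.

```python
import numpy as np

def lwlr_predict(D_X, D_f, x_q, bandwidth=1.0):
    """Fit F(x) = w0 + w1*x1 + ... around x_q with distance-decaying weights."""
    d = np.sqrt(((D_X - x_q) ** 2).sum(axis=1))
    K = np.exp(-(d / bandwidth) ** 2)                 # decreasing kernel K(d(x_q, x))
    A = np.hstack([np.ones((len(D_X), 1)), D_X])      # design matrix with intercept
    W = np.diag(K)
    # weighted least squares: minimize sum K * (f - F)^2
    w = np.linalg.solve(A.T @ W @ A, A.T @ W @ D_f)
    return float(np.r_[1.0, x_q] @ w)

# illustrative example: noisy samples of f(x) = x^2, prediction near x = 1.5
D_X = np.linspace(0, 3, 20).reshape(-1, 1)
D_f = (D_X[:, 0] ** 2) + np.random.default_rng(0).normal(0, 0.1, 20)
print(lwlr_predict(D_X, D_f, np.array([1.5]), bandwidth=0.5))
```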
Case-based reasoning (CBR): instance-based learning, but the output is not real-valued; it is represented by symbolic descriptions. The methods used to retrieve similar instances are more elaborate (not just Euclidean distance). Applications:
conceptual design of mechanical devices based on a stored library of previous designs (Sycara, 1992); new legal cases based on previous rulings (Ashley, 1990); selection of an appropriate hydrological model based on previous experience (Kukuric, 1997, PhD at IHE).
Lazy methods: k-NN, locally weighted regression, CBR. Eager learners are "eager": before they observe the testing instance xq they have already built a global approximation of the target function. Lazy learners:
defer the decision of how to generalize beyond the training data until each new instance is encountered; when new examples come, the output is generated immediately on the basis of the nearest training examples.
Lazy learners have a richer set of hypotheses: they effectively select an appropriate hypothesis (e.g. a linear function) for each new instance, so lazy methods are better suited to customizing to unknown future instances.
Fuzzy logic
Introduced in 1965 by Lotfi Zadeh (University of California, Berkeley). Boolean logic is two-valued (False, True); fuzzy logic is multivalued (False ... AlmostFalse ... AlmostTrue ... True). Fuzzy set theory deals with the degree of truth that the outcome belongs to a certain category (partial truth). A fuzzy set A on a universe U: for any u ∈ U there is a corresponding real number μA(u) ∈ [0, 1] called the grade of membership of u belonging to A; the mapping μA : U → [0, 1] is called the membership function of A.
[Figure: typical membership function shapes: b) bell-shaped function, c) dome-shaped function]
[Figure: support and kernel of a fuzzy set]
Alpha-cut
Fuzzy numbers
Special cases of fuzzy sets are fuzzy numbers. A fuzzy subset A of the set of real numbers is called a fuzzy number if: there is at least one z such that μA(z) = 1 (normality assumption); for all real numbers a, b, c with a < c < b, μA(c) ≥ min(μA(a), μA(b))
(convexity assumption, meaning that the membership function of a fuzzy number consists of an increasing and decreasing part, and possibly flat parts)
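A small sketch of a triangular fuzzy number (a-, a1, a+)_T as a membership function, normal at the kernel a1 and convex as required above; the particular numbers are illustrative.

```python
def triangular_mf(a_minus, a_kernel, a_plus):
    """Membership function of a triangular fuzzy number (a-, a1, a+)_T."""
    def mu(u):
        if u <= a_minus or u >= a_plus:
            return 0.0
        if u <= a_kernel:                               # increasing part
            return (u - a_minus) / (a_kernel - a_minus)
        return (a_plus - u) / (a_plus - a_kernel)       # decreasing part
    return mu

# illustrative fuzzy number "about 20"
about_20 = triangular_mf(15.0, 20.0, 25.0)
print(about_20(18.0), about_20(20.0), about_20(24.0))   # 0.6, 1.0, 0.2
```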
Linguistic variable can take linguistic values (like low, high, navigable) associated with fuzzy subsets M of the universe U (here U = [0,50])
[Figure: the linguistic variable WATER LEVEL with fuzzy values (fuzzy restrictions) "Enough Volume for flood detention", "Environmentally Friendly", "Navigable", defined over the base variable (water level, roughly 10-35)]
Fuzzy rules
Fuzzy rules are linguistic constructs of the type IF A THEN B where A and B are collections of propositions containing linguistic variables (i.e. variables with linguistic values). A is called a premise and B is the consequence of the rule. If there are K premises in a system, the i-th rule has the form:
If a1 is Ai,1 ⊗ a2 is Ai,2 ⊗ ... ⊗ aK is Ai,K then Bi
where a is a crisp input, A and B are linguistic variables, and ⊗ is one of the operators AND, OR, XOR.
Fuzzy rule-based systems: they use linguistic variables based on fuzzy logic; they are based on encoding relationships between variables in the form of rules; the rules are generated through the analysis of large data samples; such rules are used to produce the values of the output variables given new input values.
[Figure: example membership functions: the linguistic variable TEMPERATURE (°C, roughly 15-35) with values COLD, WARM, HOT, and an output (motor speed, 0-100) with values STOP, SLOW, MEDIUM]
The degree of fulfilment (DOF) expresses to what extent the premise (left) part of a fuzzy rule is satisfied. The means of combining the memberships of the inputs to the corresponding fuzzy sets into a DOF is called inference. Product inference for rule i is defined as:
\beta_i = DOF(A_i) = \prod_{k=1}^{K} \mu_{A_{i,k}}(a_k)
(the rule is sensitive to the change in the amount of truth contained in each premise)
Minimum inference for rule i is defined as:
\beta_i = DOF(A_i) = \min_{k=1,\ldots,K} \mu_{A_{i,k}}(a_k)
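A tiny sketch of the two inference operators computing a rule's DOF from its premise memberships; the membership values here are illustrative (in practice they would come from functions like the triangular one sketched earlier).

```python
from math import prod

def dof_product(memberships):
    """Product inference: DOF = product of the premise memberships."""
    return prod(memberships)

def dof_minimum(memberships):
    """Minimum inference: DOF = smallest premise membership."""
    return min(memberships)

# illustrative premise memberships mu_Ai,k(a_k) for one rule with 3 premises
mu = [0.8, 0.5, 0.9]
print(dof_product(mu), dof_minimum(mu))   # 0.36, 0.5
```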
[Figure: a fuzzy rule base shown as a table indexed by the fuzzy values of Input 1 and Input 2, with response entries L, M, H]
[Figure: the weighted sum combination of rule responses (weights 0.0-1.0)]
If there are I rules, each having a response fuzzy set Bi with DOF βi, the combined membership function is:
\mu_B(x) = \frac{\sum_{i=1}^{I} \beta_i\,\mu_{B_i}(x)}{\max_x \sum_{i=1}^{I} \beta_i\,\mu_{B_i}(x)}
In the crested (clipped) weighted sum combination method, each response membership function is clipped off at a height corresponding to the rule's degree of fulfilment.
[Figure: the crested weighted sum combination method (membership 0.0-1.0 vs. AIR MOTOR SPEED 0-100)]
If there are I rules, each having a response fuzzy set Bi with DOF βi, the combined membership function is:
\mu_B(x) = \frac{\sum_{i=1}^{I} \min(\beta_i,\ \mu_{B_i}(x))}{\max_x \sum_{i=1}^{I} \min(\beta_i,\ \mu_{B_i}(x))}
Defuzzification is a mapping from the fuzzy consequence (the combination of the consequences Bi) to a crisp consequence; this is actually the identification of the fuzzy mean.
the most widely used method is:
find the centroid (center of gravity) of the area below the membership function and take its abscissa coordinate as the crisp output.
[Figure: centroid defuzzification of the combined membership function (AIR MOTOR SPEED 0-100)]
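A sketch of the last two steps on a discretized output axis: the clipped (Min) combination of the response fuzzy sets using their DOFs, followed by centroid defuzzification. The two triangular response sets, their DOFs and the output range are illustrative assumptions.

```python
import numpy as np

def combine_clipped(x, responses, dofs):
    """Sum of response memberships clipped at each rule's DOF, normalized to max 1."""
    combined = np.zeros_like(x)
    for mu_B, beta in zip(responses, dofs):
        combined += np.minimum(beta, mu_B(x))
    m = combined.max()
    return combined / m if m > 0 else combined

def centroid(x, mu):
    """Crisp output = abscissa of the centre of gravity of the area under mu."""
    return float((x * mu).sum() / mu.sum())

# illustrative: output axis "air motor speed" 0..100 with two triangular responses
x = np.linspace(0, 100, 501)
slow = lambda u: np.clip(1 - np.abs(u - 20) / 20, 0, None)
fast = lambda u: np.clip(1 - np.abs(u - 70) / 20, 0, None)
mu = combine_clipped(x, [slow, fast], dofs=[0.2, 0.8])
print(centroid(x, mu))    # crisp speed, pulled towards the "fast" response
```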
In the previous example the rules were given. But how to build them from data?
the following is given/assumed:
the rule structure is known, that is: the number of premises in each rule, the shapes of the membership functions, and the number of rules; the training set T is given: a set of S observed input (a) and output (b) real-valued vectors:
T = \{(a_1(s), \ldots, a_K(s), b(s));\ s = 1, \ldots, S\}
It is assumed that we are training I rules with K premises in a system, where the i-th rule has the following form:
If a1 is Ai,1 AND a2 is Ai,2 AND ... AND aK is Ai,K then Bi
where a is a crisp input, and A and B are triangular fuzzy numbers; the parameters of A and B (supports and kernels) are to be found.
The kernel a¹i,k of the premise fuzzy number Ai,k is computed as the mean of the k-th input over the set Ri of the Ni training examples associated with rule i:
a^{1}_{i,k} = \frac{1}{N_i} \sum_{s \in R_i} a_k(s)
3 Calculate the DOFs βi(s) for each premise vector (a1(s), ..., aK(s)) of the training set T and each rule i whose premises were determined in step 1.
4 Select a threshold ε > 0 such that only responses with DOF βi(s) > ε will be considered in the construction of the rule response. The corresponding response is assumed to be also a triangular fuzzy number (b⁻i, b¹i, b⁺i)T defined by:
b^{1}_{i} = \frac{\sum_{\beta_i(s) > \epsilon} \beta_i(s)\, b(s)}{\sum_{\beta_i(s) > \epsilon} \beta_i(s)}, \qquad b^{-}_{i} = \min_{\beta_i(s) > \epsilon} b(s), \qquad b^{+}_{i} = \max_{\beta_i(s) > \epsilon} b(s)
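A sketch of step 4: estimating the triangular response of one rule from the training outputs b(s) whose DOF exceeds the threshold ε. The DOF values, outputs and threshold are illustrative assumptions.

```python
import numpy as np

def rule_response(dof, b, eps=0.1):
    """Triangular response (b-, b1, b+) of one rule from outputs with DOF > eps."""
    mask = dof > eps
    if not mask.any():
        return None                       # no training examples activate this rule
    b_sel, w = b[mask], dof[mask]
    b_kernel = float((w * b_sel).sum() / w.sum())   # DOF-weighted mean of outputs
    return float(b_sel.min()), b_kernel, float(b_sel.max())

# illustrative DOFs beta_i(s) and observed outputs b(s) for one rule
dof = np.array([0.05, 0.4, 0.9, 0.7, 0.02])
b   = np.array([3.0,  5.0, 6.0, 5.5, 9.0])
print(rule_response(dof, b, eps=0.1))     # (5.0, ~5.6, 6.0)
```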
[Figure: structure of a fuzzy rule-based system: FUZZIFIER -> RULES -> DEFUZZIFIER; the rules are obtained by TRAINING on HISTORICAL DATA and from EXPERT JUDGEMENTS]
Modelling spatial rainfall distribution using a fuzzy rule-based system: filling missing data in past records; estimating rainfall depth at the station Caprile (based on data for Arabba and Andraz) in case of a sudden equipment failure.
Problem formulation
Daily precipitation at three stations in 1985-91; data split for training and verification; daily precipitation at Andraz and Arabba used to determine the daily precipitation at Caprile. Performance indices:
mean square error (MSE) between modelled and observed data; percentage of predictions within a predefined tolerance target (5% is used).
Problems:
missing records in the training data; non-uniform distribution of data.
Methods considered
Too many rules lead to overfitting and a higher error on verification.
Effect of the Number of Rules
[Chart: mean square error vs. number of rules (4, 9, 16, 25, 36) for the training period 1988-91 (T) and the verification period 1985-87 (V)]
[Figure: scatter plots against observed precipitation (0-80)]
Veneto case study: comparison of fuzzy rules, neural network and the normal ratio method
The FRBS was more accurate than the ANN and the normal ratio method, and its training is faster than that of the ANN. Issues to pay attention to:
curse of dimensionality: more than 5 inputs is very difficult to handle; too many rules may cause overfitting; non-uniformly distributed data lead to empty areas where rules cannot be trained.
Case study Delfland: training ANN or Fuzzy controller on data obtained from an optimal controller in water level control
a data-driven controller (ANN or fuzzy rule-based system) is trained on data generated by the optimal controller, and can then replace it
[Figure: training scheme: the Aquarius optimal controller receives the desired level y(t)d and the actual level y(t) and produces the pumping rate u(t); these signals form the training data for the data-driven controller]
[Figures: pump status time series]
Bayesian learning
Bayes theorem
We are interested in determining the best hypothesis h from some space H, given the observed data D. Some notation:
P(h) = prior probability that hypothesis h holds
P(D) = prior probability that the training data D will be observed (without knowledge of which hypothesis holds)
P(D/h) = probability of observing data D given that h holds
P(h/D) = probability that h holds given the observed data D
Bayes theorem:
P(h/D) = \frac{P(D/h)\, P(h)}{P(D)}
Learning in the Bayesian sense means selecting the most probable hypothesis (the maximum a posteriori hypothesis, MAP):
h_{MAP} = \arg\max_{h \in H} P(h/D) = \arg\max_{h \in H} P(D/h)\, P(h)
P(D/h) is called the likelihood of the data D given h; if all hypotheses are a priori equally probable, this reduces to the maximum likelihood (ML) hypothesis:
h_{ML} = \arg\max_{h \in H} P(D/h)
hypothesis h = "patient has cancer", alternative = "no cancer"; prior knowledge (without data): P(h) = 0.008; data that can be observed: a test with 2 outcomes (+ or -):
correct results: P(+/cancer) = 0.98, P(-/nocancer) = 0.97
errors: P(+/nocancer) = 0.03, P(-/cancer) = 0.02
Suppose data is observed: a patient is tested and the result is +. Is the hypothesis then correct? Choose the hypothesis with MAP, that is, the hypothesis for which P(D/h)P(h) is maximal:
P(+/cancer) P(cancer) = 0.98 * 0.008 = 0.0078 P(+/nocancer) P(nocancer) = 0.03 * 0.992 = 0.0298 --> hypothesis "no cancer" wins
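The same MAP computation written as a short script; the numbers come from the example above, and the normalized posterior at the end is an extra check, not part of the slide.

```python
# prior probabilities and test characteristics from the example above
P_cancer, P_nocancer = 0.008, 0.992
P_pos_given_cancer, P_pos_given_nocancer = 0.98, 0.03

# un-normalized posteriors P(D/h)P(h) for the observed data D = "test is +"
score_cancer   = P_pos_given_cancer   * P_cancer      # ~0.0078
score_nocancer = P_pos_given_nocancer * P_nocancer    # ~0.0298

h_map = "cancer" if score_cancer > score_nocancer else "no cancer"
print(h_map)                                           # "no cancer" wins
# true posterior P(cancer/+) after normalization is about 0.21
print(score_cancer / (score_cancer + score_nocancer))
```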
Assume that each instance x of the data set is characterized by several attributes {a1, ..., an}; the target function F(x) can take on any value from a finite set V; a set of training examples {xi} is provided. When a new instance <a1, ..., an> is presented, the classifier should identify the most probable target value vMAP:
v_{MAP} = \arg\max_{v_j \in V} P(v_j / a_1, \ldots, a_n) = \arg\max_{v_j \in V} P(a_1, \ldots, a_n / v_j)\, P(v_j)
P(vj) can be estimated simply by counting the frequency with which each target value vj occurs in data
The terms P(a1, ..., an / vj) can be estimated by counting in a similar way; however, the total number of these terms is equal to the number of possible instances times the number of possible target values, so this is difficult. The solution is a simplifying assumption that the attribute values a1, ..., an are conditionally independent given the target value. In this case P(a1, ..., an / vj) = Π_i P(ai / vj), and estimating P(ai / vj) is much easier, also by counting frequencies. This gives the rule of the naïve Bayes classifier:
v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i / v_j)
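A minimal sketch of this rule with all probabilities estimated by frequency counting; the tiny categorical data set and the absence of smoothing for zero counts are illustrative simplifications.

```python
from collections import Counter

def naive_bayes(train_X, train_y, query):
    """Return the class v maximizing P(v) * prod_i P(a_i / v), estimated by counting."""
    classes = Counter(train_y)
    n = len(train_y)
    best_v, best_score = None, -1.0
    for v, count_v in classes.items():
        score = count_v / n                                 # P(v)
        for i, a in enumerate(query):
            n_match = sum(1 for x, y in zip(train_X, train_y)
                          if y == v and x[i] == a)
            score *= n_match / count_v                      # P(a_i / v)
        if score > best_score:
            best_v, best_score = v, score
    return best_v

# illustrative data: attributes (outlook, wind) -> play?
train_X = [("sunny", "weak"), ("sunny", "strong"), ("rain", "weak"), ("rain", "strong")]
train_y = ["yes", "no", "yes", "no"]
print(naive_bayes(train_X, train_y, ("sunny", "weak")))    # -> "yes"
```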
Past records are grouped into classes. A new record (hydrometeorological condition) is to be attributed to one (or several) classes, and the corresponding models will be run.
soft split: the data are split according to how well a given model is trained with them, and then the other models are trained as well. Example: boosting:
present the original training data set (N examples) to machine 1; assign higher probability to samples that are badly classified; sample N examples from the training set based on the new distribution; train machine 2; continue, ending with n machines.
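A compact sketch of the resampling idea listed above: after each machine is trained, badly predicted examples get a higher probability of being drawn into the next machine's training set. The simple threshold "stump" base learner, the probability-doubling update and the 1-D toy data are illustrative assumptions.

```python
import numpy as np

def train_stump(X, y):
    """Trivial base learner: threshold on a single feature, best split by error."""
    best = None
    for thr in np.unique(X):
        pred = np.where(X >= thr, 1, -1)
        err = np.mean(pred != y)
        if best is None or err < best[1]:
            best = (thr, err)
    thr, _ = best
    return lambda X_new: np.where(X_new >= thr, 1, -1)

def boost(X, y, n_machines=3, seed=0):
    """Boosting-style committee: resample, upweighting badly classified examples."""
    rng = np.random.default_rng(seed)
    N = len(X)
    p = np.full(N, 1.0 / N)                  # initially uniform sampling distribution
    machines = []
    for _ in range(n_machines):
        idx = rng.choice(N, size=N, p=p)     # sample N examples from the distribution
        m = train_stump(X[idx], y[idx])
        machines.append(m)
        wrong = m(X) != y
        p = np.where(wrong, p * 2.0, p)      # badly classified examples: higher probability
        p /= p.sum()
    return machines

# illustrative 1-D data; the committee votes by simple majority
X = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7])
y = np.array([-1, -1, -1, 1, 1, 1])
machines = boost(X, y)
print(np.sign(sum(m(X) for m in machines)))
```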
Committee machine with hard split, expert (specialised) models trained on subsets
[Figure: the input x is passed to a splitting (gating) machine that routes it to Machine 1, Machine 2, ..., Machine n, producing outputs y1, y2, ..., yn]
Committee machine with no split (ensemble), all models are trained on the same set
[Figure: the input x is passed, without splitting, to Machine 1, Machine 2, ..., Machine n, whose outputs y1, y2, ..., yn are combined]
Committee machine with boosting: [Figure: Machines 1 ... n receive the input x and produce outputs y1 ... yn; each subsequent machine is trained on N training examples sampled from the distribution in which badly predicted examples are given higher probability]
Using a mixture of experts (models): each model is for a particular hydrological condition.
[Figure: the input is routed according to hydrological conditions (e.g. Condition 1: Qx,t-1 > 1000; Condition 2: Qx,t-1 <= 1000 and Qx,t > 200; Condition 3 based on antecedent precipitation indices) to specialised Modules 1, 2, 3, ..., x, each implemented as an M5 model tree or an ANN]
[Figure: error-correction scheme: the input data are fed to the model; the model errors (observed output minus model output) are used to train a data-driven error forecasting model; the forecasted errors are combined with the model output to give an improved output]
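A minimal sketch of the error-correction idea in this diagram: a second data-driven model is trained on the primary model's errors, and its forecasted error is added to the primary model output. The synthetic series and the trivial "mean of recent errors" forecaster are purely illustrative stand-ins.

```python
import numpy as np

# illustrative observed series and an (imperfect) primary model output
t = np.arange(100)
observed = np.sin(t / 5.0) + 0.3
model_output = np.sin(t / 5.0)                 # primary model misses the 0.3 bias

errors = observed - model_output               # model errors used for training

# data-driven error forecasting model: here simply the mean of recent errors
def forecast_error(past_errors, window=10):
    return past_errors[-window:].mean()

forecasted_error = forecast_error(errors[:90])       # "train" on the first 90 steps
improved = model_output[90:] + forecasted_error      # improved output for steps 90..99

print(np.abs(observed[90:] - model_output[90:]).mean(),   # error before correction
      np.abs(observed[90:] - improved).mean())            # error after correction
```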
End of Part 3