Data-Driven Modelling (Part 3)
Dimitri P. Solomatine
www.ihe.nl/hi/sol sol@ihe.nl
Clustering
Classification is aimed at identifying a mapping (function) that maps any given input xi to a nominal variable (class) yi. Finding the groups (clusters) in an input data set is clustering.
Clustering is often the preparation phase for classification:
the identified clusters can be labelled as classes; each input instance can then be associated with an output value (class), and the instance set {xi, yi} can be built
[Figure: example of a data set grouped into three clusters (Cluster 1, Cluster 2, Cluster 3)]
labelling large data sets can be very costly; clustering may actually give an insight into the data and help discover classes which are not known in advance; clustering may find features that can be used for categorization.
Voronoi diagrams
Main approaches: partition-based clustering (K-means, fuzzy C-means, based on Euclidean distance); hierarchical clustering (agglomerative hierarchical clustering, nearest-neighbour algorithm); feature extraction methods: principal component analysis (PCA), self-organizing feature (SOF) maps (also referred to as Kohonen neural networks).
K-means clustering
Find the best division of N samples into K clusters Ci such that the total distance between the clustered samples and their respective centres (that is, the total variance) is minimized:
J = \sum_{i=1}^{K} \sum_{n \in C_i} \| x_n - \mu_i \|^2

where \mu_i is the centre of cluster C_i.
1 Choose K initial cluster centres (e.g. at random).
2 Assign each instance to the nearest centre.
3 Reassign the instances to the nearest cluster centres.
4 Recalculate each centre as the mean of the instances currently assigned to it:
\mu_i = \frac{1}{N_i} \sum_{n \in C_i} x_n
5 Reassign the instances to the new centres.
Repeat steps 2-5 until the total variance J stops decreasing (or the centres stop moving).
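A minimal NumPy sketch of the loop above; the toy data, the random initialisation and the convergence tolerance are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Minimal K-means: X is an (N, M) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), K, replace=False)]   # step 1: initial centres
    prev_J = np.inf
    for _ in range(n_iter):
        # steps 2-3: assign every instance to its nearest centre
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # step 4: recalculate each centre as the mean of its instances
        centres = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                            else centres[i] for i in range(K)])
        # total variance J; stop when it no longer decreases
        J = sum(((X[labels == i] - centres[i]) ** 2).sum() for i in range(K))
        if prev_J - J < 1e-9:
            break
        prev_J = J
    return labels, centres, J

# illustrative data: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centres, J = k_means(X, K=2)
```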
[Figure: the data set represented as an N x M table, instances 1 ... N in rows and attributes x1, x2, ..., xM in columns (panels a and b)]
1 While the stopping condition is false, do iteration t (steps 2-8): 2 For each input vector x = {x1, ..., xN} do steps 3-8:
3 For each output node k calculate the similarity measure (in this case the Euclidean distance) between the input and the weight vector:
D(k) = \sum_{i=1}^{N} (w_{ik} - x_i)^2
4 Find the index kmax for which D(k) is minimal; this refers to the winning node. 5 Update the weights of the node kmax and of all nodes k within a specified neighbourhood radius r from kmax, moving them towards the current input (with a learning rate α that typically decreases over the iterations):
w_{ik}^{new} = w_{ik}^{old} + \alpha\,(x_i - w_{ik}^{old})
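A compact sketch of this training loop for a one-dimensional chain of output nodes; the map size, learning rate α, neighbourhood radius r, and the toy data are illustrative assumptions.

```python
import numpy as np

def sofm_train(X, n_nodes=10, n_iter=50, alpha=0.5, radius=2, seed=0):
    """Train a 1-D self-organizing feature map on data X of shape (S, N)."""
    rng = np.random.default_rng(seed)
    W = rng.random((n_nodes, X.shape[1]))          # one weight vector per output node
    for t in range(n_iter):
        for x in X:                                 # step 2: each input vector
            D = ((W - x) ** 2).sum(axis=1)          # step 3: distance to every node
            k_win = D.argmin()                      # step 4: winning node
            for k in range(n_nodes):                # step 5: update winner + neighbours
                if abs(k - k_win) <= radius:
                    W[k] += alpha * (x - W[k])
        alpha *= 0.98                               # slowly decrease the learning rate
    return W

# illustrative input: points concentrated around two locations in 2-D
X = np.vstack([np.random.randn(100, 2) * 0.1, np.random.randn(100, 2) * 0.1 + 1])
W = sofm_train(X)
```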
SOFM: example
Input set: points sampled randomly in a square (the probability of sampling a point in the central square region was 20 times greater than elsewhere in the square). The target space is discrete and includes 100 output nodes arranged in 2 dimensions. The SOFM is able to find the cluster in the area of point concentration.
A commonly used visualization method is a plot showing, for each output node, the number of times it was the winning node; it can be interpolated into colour shading as well. Another visualization is the distance matrix (of size K x K), whose elements are the Euclidean distances of each output unit to its immediate neighbouring units.
Eager learning:
first the ML (data-driven) model is built, then it is tested and used
Lazy learning
no ML model is built in advance (hence "lazy"); when new examples come, the output is generated immediately on the basis of the training examples. Other names for lazy learning:
instance-based; exemplar-based; case-based; experience-based; edited k-nearest neighbour
instances are points in 2-dimensional space, the output is boolean (+ or -); a new instance xq is classified according to the proximity of the nearest training instances:
to class + (if 1 neighbor is considered) to class - (if 4 neighbors are considered)
Notations
An instance x is described as {a1(x), ..., an(x)}, where ar(x) denotes the value of the r-th attribute of instance x. The distance between two instances xi and xj is defined to be d(xi, xj), where
d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} (a_r(x_i) - a_r(x_j))^2}
Training
Build the set of training examples D.
Classification
Given a query instance xq to be classified, let x1, ..., xk denote the k instances from D that are nearest to xq. Return
F(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))
where δ(a, b) = 1 if a = b and δ(a, b) = 0 otherwise; V = {v1, ..., vs} is the set of possible output values.
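A direct sketch of this classification rule; the Euclidean distance and the majority vote follow the formulas above, while the toy data set is an illustrative assumption.

```python
import numpy as np
from collections import Counter

def knn_classify(D_X, D_y, x_q, k=3):
    """Return the majority class among the k training instances nearest to x_q."""
    d = np.sqrt(((D_X - x_q) ** 2).sum(axis=1))     # Euclidean distance to all instances
    nearest = np.argsort(d)[:k]                      # indices of the k nearest neighbours
    votes = Counter(D_y[i] for i in nearest)         # sum of delta(v, f(x_i)) per class v
    return votes.most_common(1)[0][0]

# illustrative training set: two classes around (0, 0) and (3, 3)
D_X = np.array([[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]])
D_y = np.array(['-', '-', '+', '+'])
print(knn_classify(D_X, D_y, np.array([2.5, 2.5]), k=3))   # -> '+'
```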
k-NN for real-valued (continuous) outputs: the output of a new instance is determined from the output values of the nearest training instances (the average of the k nearest neighbours is taken, or the weighted average) and from their proximity; alternatively, a locally weighted regression model is built and used to predict the value of the new instance.
In this case the final line of the k-NN algorithm should be replaced by:
F(x_q) = \frac{\sum_{i=1}^{k} f(x_i)}{k}
Distance-weighted k-NN (classification): the contribution of each of the k neighbours is weighted according to its distance from xq:
F(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} w_i\,\delta(v, f(x_i))
where
w_i = \frac{1}{d(x_q, x_i)^2}
Distance-weighted k-NN (regression):
F(x_q) = \frac{\sum_{i=1}^{k} w_i\, f(x_i)}{\sum_{i=1}^{k} w_i}
where the weight is
w_i = \frac{1}{d(x_q, x_i)^2}
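A sketch of distance-weighted k-NN for a real-valued output using the weights w_i = 1/d(x_q, x_i)^2 from the formula above; the toy 1-D data are an illustrative assumption.

```python
import numpy as np

def knn_regress_weighted(D_X, D_f, x_q, k=3, eps=1e-12):
    """Weighted average of the k nearest outputs, with weights 1/d^2."""
    d = np.sqrt(((D_X - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    if d[nearest[0]] < eps:                 # query coincides with a training point
        return float(D_f[nearest[0]])
    w = 1.0 / d[nearest] ** 2
    return float((w * D_f[nearest]).sum() / w.sum())

# illustrative 1-D example: f(x) roughly equal to 2x
D_X = np.array([[0.0], [1.0], [2.0], [3.0]])
D_f = np.array([0.1, 2.0, 4.1, 5.9])
print(knn_regress_weighted(D_X, D_f, np.array([1.4]), k=2))
```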
If all training instances (not only the k nearest) are taken into account:
for classification:
F(x_q) = \arg\max_{v \in V} \sum_{\text{all instances } i} w_i\,\delta(v, f(x_i))
for regression:
F(x_q) = \frac{\sum_{\text{all instances } i} w_i\, f(x_i)}{\sum_{\text{all instances } i} w_i}
k-NN creates a local model in the proximity of the new instance, instead of a global model of all training instances; it is robust to noisy training data; it requires a considerable amount of data; the distance between instances is calculated based on all attributes (and not on one attribute at a time, as in decision trees). Possible problem:
imagine instances described by 20 attributes, of which only 2 are relevant to the target function; curse of dimensionality: the nearest-neighbour method is easily misled when X is high-dimensional; solution: stretch the j-th axis by a weight zj chosen to minimize the prediction error.
Construct an explicit approximation F(x) of the target function f(x) over a local region surrounding the new query point xq. If F(x) is linear, this is called locally weighted linear regression:
F(x) = w_0 + w_1 a_1(x) + \ldots + w_n a_n(x)
Instead of minimizing the global error E, here the local error E(xq) has to be minimized
Minimize the squared error over just the k nearest neighbours of xq:
E_1(x_q) = \frac{1}{2} \sum_{x \in k\ \text{nearest nbrs of}\ x_q} (f(x) - F(x))^2
Minimize the squared error over the entire set D of training examples, while weighting the error of each training example by some decreasing function K of its distance from xq:
E_2(x_q) = \frac{1}{2} \sum_{x \in D} (f(x) - F(x))^2\, K(d(x_q, x))
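A sketch of locally weighted linear regression minimizing an E2-type criterion: each training example is weighted by a decreasing kernel K of its distance to x_q, and a weighted least-squares fit F(x) = w0 + w1*a1(x) + ... is solved. The Gaussian kernel and the bandwidth are illustrative assumptions.

```python
import numpy as np

def lwlr_predict(D_X, D_f, x_q, bandwidth=1.0):
    """Fit F(x) = w0 + w1*x1 + ... around x_q with distance-decaying weights."""
    d = np.sqrt(((D_X - x_q) ** 2).sum(axis=1))
    K = np.exp(-(d / bandwidth) ** 2)                 # decreasing kernel K(d(x_q, x))
    A = np.hstack([np.ones((len(D_X), 1)), D_X])      # design matrix with intercept
    W = np.diag(K)
    # weighted least squares: minimize sum K * (f - F)^2
    w = np.linalg.solve(A.T @ W @ A, A.T @ W @ D_f)
    return float(np.r_[1.0, x_q] @ w)

# illustrative example: noisy samples of f(x) = x^2, prediction near x = 1.5
D_X = np.linspace(0, 3, 20).reshape(-1, 1)
D_f = (D_X[:, 0] ** 2) + np.random.default_rng(0).normal(0, 0.1, 20)
print(lwlr_predict(D_X, D_f, np.array([1.5]), bandwidth=0.5))
```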
Case-based reasoning (CBR): instance-based learning, but the output is not real-valued; it is represented by symbolic descriptions. The methods used to retrieve similar instances are more elaborate (not just Euclidean distance). Applications:
conceptual design of mechanical devices based on a stored library of previous designs (Sycara, 1992); new legal cases based on previous rulings (Ashley, 1990); selection of an appropriate hydrological model based on previous experience (Kukuric, 1997, PhD at IHE).
Lazy methods: k-NN, locally weighted regression, CBR. Eager learners are "eager": before they observe the testing instance xq they have already built a global approximation of the target function. Lazy learners:
defer the decision of how to generalize beyond the training data until each new instance is encountered; when new examples come, the output is generated immediately on the basis of the nearest training examples.
Lazy learners have a richer set of hypotheses: they effectively select an appropriate hypothesis (e.g. a linear function) for each new instance, so lazy methods are better suited to customizing to unknown future instances.
Fuzzy logic
Introduced in 1965 by Lotfi Zadeh (University of California, Berkeley). Boolean logic is two-valued (False, True); fuzzy logic is multivalued (False ... AlmostFalse ... AlmostTrue ... True). Fuzzy set theory deals with the degree of truth that the outcome belongs to a certain category (partial truth). A fuzzy set A on a universe U: for any u ∈ U there is a corresponding real number μA(u) ∈ [0, 1] called the grade of membership of u belonging to A; the mapping μA : U → [0, 1] is called the membership function of A.
[Figure: typical membership function shapes: b) bell-shaped function, c) dome-shaped function]
[Figure: support and kernel of a fuzzy set]
Alpha-cut
Fuzzy numbers
Special cases of fuzzy sets are fuzzy numbers. A fuzzy subset A of the set of real numbers is called a fuzzy number if: there is at least one z such that μA(z) = 1 (normality assumption); for all real numbers a, b, c with a < c < b, μA(c) ≥ min(μA(a), μA(b))
(convexity assumption, meaning that the membership function of a fuzzy number consists of an increasing and decreasing part, and possibly flat parts)
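A small sketch of a triangular fuzzy number (a-, a1, a+)_T as a membership function, normal at the kernel a1 and convex as required above; the particular numbers are illustrative.

```python
def triangular_mf(a_minus, a_kernel, a_plus):
    """Membership function of a triangular fuzzy number (a-, a1, a+)_T."""
    def mu(u):
        if u <= a_minus or u >= a_plus:
            return 0.0
        if u <= a_kernel:                               # increasing part
            return (u - a_minus) / (a_kernel - a_minus)
        return (a_plus - u) / (a_plus - a_kernel)       # decreasing part
    return mu

# illustrative fuzzy number "about 20"
about_20 = triangular_mf(15.0, 20.0, 25.0)
print(about_20(18.0), about_20(20.0), about_20(24.0))   # 0.6, 1.0, 0.2
```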
Linguistic variable can take linguistic values (like low, high, navigable) associated with fuzzy subsets M of the universe U (here U = [0,50])
[Figure: the linguistic variable WATER LEVEL with fuzzy values (fuzzy restrictions) "Enough Volume for flood detention", "Environmentally Friendly", "Navigable", defined over the base variable (water level, roughly 10-35)]
Fuzzy rules
Fuzzy rules are linguistic constructs of the type IF A THEN B where A and B are collections of propositions containing linguistic variables (i.e. variables with linguistic values). A is called a premise and B is the consequence of the rule. If there are K premises in a system, the i-th rule has the form:
If a1 is Ai,1 ⊗ a2 is Ai,2 ⊗ ... ⊗ aK is Ai,K then Bi
where a is a crisp input, A and B are linguistic variables, and ⊗ is one of the operators AND, OR, XOR.
Fuzzy rule-based systems: they use linguistic variables based on fuzzy logic; they are based on encoding relationships between variables in the form of rules; the rules are generated through the analysis of large data samples; such rules are used to produce the values of the output variables given new input values.
[Figure: example membership functions: the linguistic variable TEMPERATURE (°C, roughly 15-35) with values COLD, WARM, HOT, and an output (motor speed, 0-100) with values STOP, SLOW, MEDIUM]
The degree of fulfilment (DOF) expresses to what extent the premise (left) part of a fuzzy rule is satisfied. The means of combining the memberships of the inputs to the corresponding fuzzy sets into a DOF is called inference. Product inference for rule i is defined as:
\beta_i = DOF(A_i) = \prod_{k=1}^{K} \mu_{A_{i,k}}(a_k)
(the rule is sensitive to the change in the amount of truth contained in each premise)
Minimum inference for rule i is defined as:
\beta_i = DOF(A_i) = \min_{k=1,\ldots,K} \mu_{A_{i,k}}(a_k)
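A tiny sketch of the two inference operators computing a rule's DOF from its premise memberships; the membership values here are illustrative (in practice they would come from functions like the triangular one sketched earlier).

```python
from math import prod

def dof_product(memberships):
    """Product inference: DOF = product of the premise memberships."""
    return prod(memberships)

def dof_minimum(memberships):
    """Minimum inference: DOF = smallest premise membership."""
    return min(memberships)

# illustrative premise memberships mu_Ai,k(a_k) for one rule with 3 premises
mu = [0.8, 0.5, 0.9]
print(dof_product(mu), dof_minimum(mu))   # 0.36, 0.5
```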
[Figure: a fuzzy rule base shown as a table indexed by the fuzzy values of Input 1 and Input 2, with response entries L, M, H]
[Figure: the weighted sum combination of rule responses (weights 0.0-1.0)]
If there are I rules, each having a response fuzzy set Bi with DOF βi, the combined membership function is:
\mu_B(x) = \frac{\sum_{i=1}^{I} \beta_i\,\mu_{B_i}(x)}{\max_x \sum_{i=1}^{I} \beta_i\,\mu_{B_i}(x)}
In the crested (clipped) weighted sum combination method, each response membership function is clipped off at a height corresponding to the rule's degree of fulfilment.
[Figure: the crested weighted sum combination method (membership 0.0-1.0 vs. AIR MOTOR SPEED 0-100)]
If there are I rules, each having a response fuzzy set Bi with DOF βi, the combined membership function is:
\mu_B(x) = \frac{\sum_{i=1}^{I} \min(\beta_i,\ \mu_{B_i}(x))}{\max_x \sum_{i=1}^{I} \min(\beta_i,\ \mu_{B_i}(x))}
Defuzzification is a mapping from the fuzzy consequence (the combination of the consequences Bi) to a crisp consequence; this is actually the identification of the fuzzy mean.
the most widely used method is:
find the centroid (center of gravity) of the area below the membership function and take its abscissa coordinate as the crisp output.
[Figure: centroid defuzzification of the combined membership function (AIR MOTOR SPEED 0-100)]
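A sketch of the last two steps on a discretized output axis: the clipped (Min) combination of the response fuzzy sets using their DOFs, followed by centroid defuzzification. The two triangular response sets, their DOFs and the output range are illustrative assumptions.

```python
import numpy as np

def combine_clipped(x, responses, dofs):
    """Sum of response memberships clipped at each rule's DOF, normalized to max 1."""
    combined = np.zeros_like(x)
    for mu_B, beta in zip(responses, dofs):
        combined += np.minimum(beta, mu_B(x))
    m = combined.max()
    return combined / m if m > 0 else combined

def centroid(x, mu):
    """Crisp output = abscissa of the centre of gravity of the area under mu."""
    return float((x * mu).sum() / mu.sum())

# illustrative: output axis "air motor speed" 0..100 with two triangular responses
x = np.linspace(0, 100, 501)
slow = lambda u: np.clip(1 - np.abs(u - 20) / 20, 0, None)
fast = lambda u: np.clip(1 - np.abs(u - 70) / 20, 0, None)
mu = combine_clipped(x, [slow, fast], dofs=[0.2, 0.8])
print(centroid(x, mu))    # crisp speed, pulled towards the "fast" response
```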
In the previous example the rules were given. But how to build them from data?
the following is given/assumed:
the rule structure is known, that is: the number of premises in each rule, the shapes of the membership functions, and the number of rules; the training set T is given: a set of S observed input (a) and output (b) real-valued vectors:
T = \{(a_1(s), \ldots, a_K(s), b(s));\ s = 1, \ldots, S\}
It is assumed that we are training I rules with K premises in a system, where the i-th rule has the following form:
If a1 is Ai,1 AND a2 is Ai,2 AND ... AND aK is Ai,K then Bi
where a is a crisp input, and A and B are triangular fuzzy numbers; the parameters of A and B (supports and kernels) are to be found.
The kernel a¹i,k of the premise fuzzy number Ai,k is computed as the mean of the k-th input over the set Ri of the Ni training examples associated with rule i:
a^{1}_{i,k} = \frac{1}{N_i} \sum_{s \in R_i} a_k(s)
3 Calculate the DOFs βi(s) for each premise vector (a1(s), ..., aK(s)) of the training set T and each rule i whose premises were determined in step 1.
4 Select a threshold ε > 0 such that only responses with DOF βi(s) > ε will be considered in the construction of the rule response. The corresponding response is assumed to be also a triangular fuzzy number (b⁻i, b¹i, b⁺i)T defined by:
b^{1}_{i} = \frac{\sum_{\beta_i(s) > \epsilon} \beta_i(s)\, b(s)}{\sum_{\beta_i(s) > \epsilon} \beta_i(s)}, \qquad b^{-}_{i} = \min_{\beta_i(s) > \epsilon} b(s), \qquad b^{+}_{i} = \max_{\beta_i(s) > \epsilon} b(s)
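A sketch of step 4: estimating the triangular response of one rule from the training outputs b(s) whose DOF exceeds the threshold ε. The DOF values, outputs and threshold are illustrative assumptions.

```python
import numpy as np

def rule_response(dof, b, eps=0.1):
    """Triangular response (b-, b1, b+) of one rule from outputs with DOF > eps."""
    mask = dof > eps
    if not mask.any():
        return None                       # no training examples activate this rule
    b_sel, w = b[mask], dof[mask]
    b_kernel = float((w * b_sel).sum() / w.sum())   # DOF-weighted mean of outputs
    return float(b_sel.min()), b_kernel, float(b_sel.max())

# illustrative DOFs beta_i(s) and observed outputs b(s) for one rule
dof = np.array([0.05, 0.4, 0.9, 0.7, 0.02])
b   = np.array([3.0,  5.0, 6.0, 5.5, 9.0])
print(rule_response(dof, b, eps=0.1))     # (5.0, ~5.6, 6.0)
```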
[Figure: structure of a fuzzy rule-based system: FUZZIFIER -> RULES -> DEFUZZIFIER; the rules are obtained by TRAINING on HISTORICAL DATA and from EXPERT JUDGEMENTS]
Modelling spatial rainfall distribution using a fuzzy rule-based system: filling missing data in past records; estimating rainfall depth at the station Caprile (based on data for Arabba and Andraz) in case of a sudden equipment failure.
Problem formulation
Daily precipitation at three stations in 1985-91; data split for training and verification; daily precipitation at Andraz and Arabba used to determine the daily precipitation at Caprile. Performance indices:
mean square error (MSE) between modelled and observed data; percentage of predictions within a predefined tolerance target (5% is used).
Problems:
missing records in the training data; non-uniform distribution of data.
Methods considered
Too many rules lead to overfitting and a higher error on verification.
Effect of the Number of Rules
[Chart: mean square error vs. number of rules (4, 9, 16, 25, 36) for the training period 1988-91 (T) and the verification period 1985-87 (V)]
[Figure: scatter plots against observed precipitation (0-80)]
Veneto case study: comparison of fuzzy rules, neural network and the normal ratio method
The FRBS was more accurate than the ANN and the normal ratio method, and its training is faster than that of the ANN. Issues to pay attention to:
curse of dimensionality: more than 5 inputs is very difficult to handle; too many rules may cause overfitting; non-uniformly distributed data lead to empty areas where rules cannot be trained.
Case study Delfland: training ANN or Fuzzy controller on data obtained from an optimal controller in water level control
a data-driven controller (ANN or fuzzy rule-based system) is trained on data generated by the optimal controller, and can then replace it
[Figure: training scheme: the Aquarius optimal controller receives the desired level y(t)d and the actual level y(t) and produces the pumping rate u(t); these signals form the training data for the data-driven controller]
[Figures: pump status time series]
Bayesian learning
Bayes theorem
We are interested in determining the best hypothesis h from some space H, given the observed data D. Some notation:
P(h) = prior probability that hypothesis h holds
P(D) = prior probability that the training data D will be observed (without knowledge of which hypothesis holds)
P(D/h) = probability of observing data D given that h holds
P(h/D) = probability that h holds given the observed data D
Bayes theorem:
P(h/D) = \frac{P(D/h)\, P(h)}{P(D)}
Learning in the Bayesian sense means selecting the most probable hypothesis (the maximum a posteriori hypothesis, MAP):
h_{MAP} = \arg\max_{h \in H} P(h/D) = \arg\max_{h \in H} P(D/h)\, P(h)
P(D/h) is called the likelihood of the data D given h; if all hypotheses are a priori equally probable, this reduces to the maximum likelihood (ML) hypothesis:
h_{ML} = \arg\max_{h \in H} P(D/h)
hypothesis h = "patient has cancer", alternative = "no cancer"; prior knowledge (without data): P(h) = 0.008; data that can be observed: a test with 2 outcomes (+ or -):
correct results: P(+/cancer) = 0.98, P(-/nocancer) = 0.97
errors: P(+/nocancer) = 0.03, P(-/cancer) = 0.02
Suppose data is observed: a patient is tested and the result is +. Is the hypothesis then correct? Choose the hypothesis with MAP, that is, the hypothesis for which P(D/h)P(h) is maximal:
P(+/cancer) P(cancer) = 0.98 * 0.008 = 0.0078 P(+/nocancer) P(nocancer) = 0.03 * 0.992 = 0.0298 --> hypothesis "no cancer" wins
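The same MAP computation written as a short script; the numbers come from the example above, and the normalized posterior at the end is an extra check, not part of the slide.

```python
# prior probabilities and test characteristics from the example above
P_cancer, P_nocancer = 0.008, 0.992
P_pos_given_cancer, P_pos_given_nocancer = 0.98, 0.03

# un-normalized posteriors P(D/h)P(h) for the observed data D = "test is +"
score_cancer   = P_pos_given_cancer   * P_cancer      # ~0.0078
score_nocancer = P_pos_given_nocancer * P_nocancer    # ~0.0298

h_map = "cancer" if score_cancer > score_nocancer else "no cancer"
print(h_map)                                           # "no cancer" wins
# true posterior P(cancer/+) after normalization is about 0.21
print(score_cancer / (score_cancer + score_nocancer))
```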
Assume that each instance x of the data set is characterized by several attributes {a1, ..., an}; the target function F(x) can take on any value from a finite set V; a set of training examples {xi} is provided. When a new instance <a1, ..., an> is presented, the classifier should identify the most probable target value vMAP:
v_{MAP} = \arg\max_{v_j \in V} P(v_j / a_1, \ldots, a_n) = \arg\max_{v_j \in V} P(a_1, \ldots, a_n / v_j)\, P(v_j)
P(vj) can be estimated simply by counting the frequency with which each target value vj occurs in data
The terms P(a1, ..., an / vj) can be estimated by counting in a similar way; however, the total number of these terms is equal to the number of possible instances times the number of possible target values, so this is difficult. The solution is a simplifying assumption that the attribute values a1, ..., an are conditionally independent given the target value. In this case P(a1, ..., an / vj) = Π_i P(ai / vj), and estimating P(ai / vj) is much easier, also by counting frequencies. This gives the rule of the naïve Bayes classifier:
v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i / v_j)
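A minimal sketch of this rule with all probabilities estimated by frequency counting; the tiny categorical data set and the absence of smoothing for zero counts are illustrative simplifications.

```python
from collections import Counter

def naive_bayes(train_X, train_y, query):
    """Return the class v maximizing P(v) * prod_i P(a_i / v), estimated by counting."""
    classes = Counter(train_y)
    n = len(train_y)
    best_v, best_score = None, -1.0
    for v, count_v in classes.items():
        score = count_v / n                                 # P(v)
        for i, a in enumerate(query):
            n_match = sum(1 for x, y in zip(train_X, train_y)
                          if y == v and x[i] == a)
            score *= n_match / count_v                      # P(a_i / v)
        if score > best_score:
            best_v, best_score = v, score
    return best_v

# illustrative data: attributes (outlook, wind) -> play?
train_X = [("sunny", "weak"), ("sunny", "strong"), ("rain", "weak"), ("rain", "strong")]
train_y = ["yes", "no", "yes", "no"]
print(naive_bayes(train_X, train_y, ("sunny", "weak")))    # -> "yes"
```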
Past records are grouped into classes. A new record (hydrometeorological condition) is to be attributed to one (or several) classes, and the corresponding models will be run.
soft split: the data are split according to how well a given model is trained with them, and then the other models are trained as well. Example: boosting:
present the original training data set (N examples) to machine 1; assign higher probability to samples that are badly classified; sample N examples from the training set based on the new distribution; train machine 2; continue, ending with n machines.
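A compact sketch of the resampling idea listed above: after each machine is trained, badly predicted examples get a higher probability of being drawn into the next machine's training set. The simple threshold "stump" base learner, the probability-doubling update and the 1-D toy data are illustrative assumptions.

```python
import numpy as np

def train_stump(X, y):
    """Trivial base learner: threshold on a single feature, best split by error."""
    best = None
    for thr in np.unique(X):
        pred = np.where(X >= thr, 1, -1)
        err = np.mean(pred != y)
        if best is None or err < best[1]:
            best = (thr, err)
    thr, _ = best
    return lambda X_new: np.where(X_new >= thr, 1, -1)

def boost(X, y, n_machines=3, seed=0):
    """Boosting-style committee: resample, upweighting badly classified examples."""
    rng = np.random.default_rng(seed)
    N = len(X)
    p = np.full(N, 1.0 / N)                  # initially uniform sampling distribution
    machines = []
    for _ in range(n_machines):
        idx = rng.choice(N, size=N, p=p)     # sample N examples from the distribution
        m = train_stump(X[idx], y[idx])
        machines.append(m)
        wrong = m(X) != y
        p = np.where(wrong, p * 2.0, p)      # badly classified examples: higher probability
        p /= p.sum()
    return machines

# illustrative 1-D data; the committee votes by simple majority
X = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7])
y = np.array([-1, -1, -1, 1, 1, 1])
machines = boost(X, y)
print(np.sign(sum(m(X) for m in machines)))
```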
Committee machine with hard split, expert (specialised) models trained on subsets
[Figure: the input x is passed to a splitting (gating) machine that routes it to Machine 1, Machine 2, ..., Machine n, producing outputs y1, y2, ..., yn]
Committee machine with no split (ensemble), all models are trained on the same set
[Figure: the input x is passed, without splitting, to Machine 1, Machine 2, ..., Machine n, whose outputs y1, y2, ..., yn are combined]
Committee machine with boosting: [Figure: Machines 1 ... n receive the input x and produce outputs y1 ... yn; each subsequent machine is trained on N training examples sampled from the distribution in which badly predicted examples are given higher probability]
Using a mixture of experts (models): each model is for a particular hydrological condition.
[Figure: the input is routed according to hydrological conditions (e.g. Condition 1: Qx,t-1 > 1000; Condition 2: Qx,t-1 <= 1000 and Qx,t > 200; Condition 3 based on antecedent precipitation indices) to specialised Modules 1, 2, 3, ..., x, each implemented as an M5 model tree or an ANN]
[Figure: error-correction scheme: the input data are fed to the model; the model errors (observed output minus model output) are used to train a data-driven error forecasting model; the forecasted errors are combined with the model output to give an improved output]
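A minimal sketch of the error-correction idea in this diagram: a second data-driven model is trained on the primary model's errors, and its forecasted error is added to the primary model output. The synthetic series and the trivial "mean of recent errors" forecaster are purely illustrative stand-ins.

```python
import numpy as np

# illustrative observed series and an (imperfect) primary model output
t = np.arange(100)
observed = np.sin(t / 5.0) + 0.3
model_output = np.sin(t / 5.0)                 # primary model misses the 0.3 bias

errors = observed - model_output               # model errors used for training

# data-driven error forecasting model: here simply the mean of recent errors
def forecast_error(past_errors, window=10):
    return past_errors[-window:].mean()

forecasted_error = forecast_error(errors[:90])       # "train" on the first 90 steps
improved = model_output[90:] + forecasted_error      # improved output for steps 90..99

print(np.abs(observed[90:] - model_output[90:]).mean(),   # error before correction
      np.abs(observed[90:] - improved).mean())            # error after correction
```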
End of Part 3