
Data-driven modelling in water-related problems.

PART 3
Dimitri P. Solomatine
www.ihe.nl/hi/sol sol@ihe.nl

UNESCO-IHE Institute for Water Education


Hydroinformatics Chair

Finding groups (clusters) in data (unsupervised learning)


Clustering

Classification aims at identifying a mapping (function) that maps any given input xi to a nominal variable (class) yi; finding the groups (clusters) in an input data set is clustering.
Clustering is often the preparation phase for classification:
the identified clusters can be labelled as classes; each input instance can then be associated with an output value (class), and the instance set {xi, yi} can be built.
(Figure: example data set with three identified clusters, shown in panels a and b.)

Reasons to use clustering

- labelling large data sets can be very costly
- clustering may actually give an insight into the data and help discover classes which are not known in advance
- clustering may find features that can be used for categorization


Voronoi diagrams


Methods for clustering

- partition-based clustering (K-means, fuzzy C-means), based on Euclidean distance
- hierarchical clustering (agglomerative hierarchical clustering, nearest-neighbour algorithm)
- feature extraction methods: principal component analysis (PCA), self-organizing feature (SOF) maps (also referred to as Kohonen neural networks)


k-means clustering

find the best division of N samples by K clusters Ci such that the total distance between the clustered samples and their respective centers (that is, the total variance) is minimized:

$J = \sum_{i=1}^{K} \sum_{n \in C_i} \| x_n - \mu_i \|^2$

where $\mu_i$ is the center of cluster i.


k-means clustering: algorithm

1. Randomly assign the instances to the K clusters.
2. Compute the cluster centers:

$\mu_i = \frac{1}{N_i} \sum_{n \in C_i} x_n$

3. Reassign the instances to the nearest cluster center.
4. Recalculate the centers.
5. Reassign the instances to the new centers.
Repeat steps 2-5 until the total variance J stops decreasing (or the centers stop moving).

k-means clustering: illustration


Kohonen network (Self-organizing feature map - SOFM)


SOFM: main idea

(Figure: a) the inputs x1, x2, ..., xM are fully connected through weights w11 ... wNM to the N output nodes of the map; b) the topological neighbourhood of a winning node j shrinks over time, from j(0) to j(t1) to j(t2).)


SOFM: algorithm (1)

0 Initialize weights, normally with small random values.


Set topological neighborhood parameters. Set learning rate parameters. Iteration number t = 1.

1 While the stopping condition is false, do iteration t (steps 2-8):
2 For each input vector x = {x1, ..., xN} do steps 3-8:
3 For each output node k calculate the similarity measure (here the Euclidean distance) between the input and the weight vector:

$D(k) = \sum_{i=1}^{N} (w_{ik} - x_i)^2$


SOFM: algorithm (2)

4 Find the index kmax for which D(k) is minimal; this is the winning node.
5 Update the weights for node kmax and for all nodes k within a specified neighborhood radius r of kmax:

$w_{ik}(t+1) = w_{ik}(t) + \alpha(t)\, N(r, t)\, [x_i - w_{ik}(t)]$

6 Update the learning rate $\alpha(t)$.
7 Reduce the radius r used in the neighborhood function N (this can be done less frequently than at each iteration).
8 Test the stopping condition.
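
A compact NumPy sketch of steps 0-8 (the grid size and the exponential learning-rate and radius schedules are illustrative assumptions; the slides leave these choices open):

```python
import numpy as np

def train_sofm(X, grid_shape=(10, 10), n_iter=1000, lr0=0.5, r0=5.0, seed=0):
    """Minimal SOFM training sketch for input data X of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    n_nodes, n_features = rows * cols, X.shape[1]
    W = rng.normal(scale=0.1, size=(n_nodes, n_features))       # step 0: random weights
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                              # step 2: pick an input
        d = ((W - x) ** 2).sum(axis=1)                           # step 3: distances D(k)
        k_max = d.argmin()                                       # step 4: winning node
        lr = lr0 * np.exp(-t / n_iter)                           # step 6: learning rate
        radius = r0 * np.exp(-t / n_iter)                        # step 7: shrink radius
        grid_dist = np.linalg.norm(coords - coords[k_max], axis=1)
        N = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))        # neighborhood function
        W += lr * N[:, None] * (x - W)                           # step 5: weight update
    return W.reshape(rows, cols, n_features)
```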


SOFM: example

Input set: points sampled randomly in a square (the probability of sampling a point in the central square region was 20 times greater than elsewhere in the square). The target space is discrete and includes 100 output nodes arranged in 2 dimensions. The SOFM is able to find the cluster: the area where the points are concentrated.


SOFM: visualisation and interpretation

- count maps, the simplest and most commonly used method: a plot showing, for each output node, the number of times it was the winning one; it can also be interpolated into colour shading
- distance matrix (of size K x K), whose elements are the Euclidean distances of each output unit to its immediate neighbouring units


SOFM: visualization and interpretation

vector position or cluster maps:
- colours are coded according to their similarity in the input space
- each dot corresponds to one output map unit
- each map unit is connected to its neighbours by a line


SOFM: visualization and interpretation

vector position or cluster maps, in 3D


Instance-based learning (lazy learning)


Lazy and eager learning

Eager learning:
- first the ML (data-driven) model is built
- then it is tested and used

Lazy learning:
- no ML model is built in advance (hence "lazy"); when new examples come, the output is generated immediately on the basis of the training examples
Other names for lazy learning: instance-based, exemplar-based, case-based, experience-based, edited k-nearest neighbor.

k-Nearest neighbors method: classification

instances are points in 2-dimensional space, the output is boolean (+ or -); a new instance xq is classified with respect to the proximity of the nearest training instances:
- to class + (if 1 neighbor is considered)
- to class - (if 4 neighbors are considered)

for discrete-valued outputs, assign the most common value among the neighbors


Voronoi diagram for 1-Nearest neighbor


Notations

an instance x is described as {a1(x), ..., an(x)}, where ar(x) denotes the value of the r-th attribute of instance x; the distance between two instances xi and xj is defined as d(xi, xj), where

$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2}$


k-Nearest neighbor algorithm

Training
Build the set of training examples D.

Classification
Given a query instance xq to be classified:
- let x1, ..., xk denote the k instances from D that are nearest to xq
- return

$F(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$

where $\delta(a, b) = 1$ if a = b and $\delta(a, b) = 0$ otherwise; V = {v1, ..., vs} is the set of possible output values.
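
A minimal sketch of this procedure (assuming NumPy arrays for the training set; names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_q, k=3):
    """Minimal k-NN classifier sketch: majority vote among the k nearest
    training instances, using Euclidean distance over all attributes."""
    d = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))   # distances d(x_q, x_i)
    nearest = np.argsort(d)[:k]                       # indices of the k nearest
    votes = Counter(y_train[i] for i in nearest)      # counts of delta(v, f(x_i))
    return votes.most_common(1)[0][0]                 # most common class value
```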


k-Nearest neighbors: regression (target function is real-valued )


model a real-valued target function $F: \mathbb{R}^n \rightarrow \mathbb{R}$
instances are points in n-dimensional space, the output is a real number; a new instance xq is valued with respect to:
- the values of the nearest training instances (the average, or weighted average, of the k instances is taken)
- the values and proximity of the nearest training instances (a locally weighted regression model is built and used to predict the value of the new instance)

In this case the final line of the k-NN algorithm should be replaced by

$F(x_q) = \frac{1}{k} \sum_{i=1}^{k} f(x_i)$

Distance weighted k-NN algorithm (classification)


weigh the contribution of each of the k neighbors according to their distance to the query point xq, giving greater weight wi to closer neighbors. This can be accomplished by replacing the final line in the algorithm by

$F(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} w_i\, \delta(v, f(x_i))$

where the weight is

$w_i = \frac{1}{d(x_q, x_i)^2}$


Distance weighted k-NN algorithm (numerical prediction)


for real-valued outputs this is accomplished by replacing the final line in the algorithm by

$F(x_q) = \frac{\sum_{i=1}^{k} w_i\, f(x_i)}{\sum_{i=1}^{k} w_i}$

where the weight is

$w_i = \frac{1}{d(x_q, x_i)^2}$
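
A minimal sketch of the distance-weighted k-NN prediction (the handling of a zero distance is an added practical detail, not from the slides):

```python
import numpy as np

def knn_weighted_regression(X_train, y_train, x_q, k=5, eps=1e-12):
    """Distance-weighted k-NN regression sketch: the prediction is the
    weighted average of the k nearest targets, with weights w_i = 1/d^2."""
    d = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    if d[nearest[0]] < eps:              # query coincides with a training point
        return y_train[nearest[0]]
    w = 1.0 / d[nearest] ** 2            # w_i = 1 / d(x_q, x_i)^2
    return np.sum(w * y_train[nearest]) / np.sum(w)
```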


k-Nearest neighbors: using all examples

for classification:

$F(x_q) = \arg\max_{v \in V} \sum_{i=1}^{\text{all instances}} w_i\, \delta(v, f(x_i))$

for regression:
$F(x_q) = \frac{\sum_{i=1}^{\text{all instances}} w_i\, f(x_i)}{\sum_{i=1}^{\text{all instances}} w_i}$


k-Nearest neighbors: comments

k-NN creates a local model in the proximity of the new instance, instead of a global model of all training instances
- robust to noisy training data
- requires a considerable amount of data
- the distance between instances is calculated based on all attributes (and not on one attribute at a time, as in decision trees). Possible problem:
  - imagine instances described by 20 attributes, of which only 2 are relevant to the target function
  - curse of dimensionality: the nearest neighbor method is easily misled when X is high-dimensional
  - solution: stretch the j-th axis by a weight zj chosen to minimize the prediction error
- as the number of training instances grows to infinity, k-NN approaches Bayesian optimal classification


Locally weighted regression (1)

construct an explicit approximation F(x) of the target function f(x) over a local region surrounding the new query point xq. If F(x) is linear, this is called locally weighted linear regression:

$F(x) = w_0 + w_1 a_1(x) + \dots + w_n a_n(x)$

Instead of minimizing the global error E, here the local error E(xq) has to be minimized


Locally weighted regression (2)

Various approaches to minimizing error E(xq):

Minimize the squared error over just k nearest neighbors

$E_1(x_q) = \frac{1}{2} \sum_{x \in \text{k nearest nbrs of } x_q} \left( f(x) - F(x) \right)^2$

Minimize the squared error over entire set D of training examples, while weighting the error of each training example by some decreasing function K of its distance from xq:

$E_2(x_q) = \frac{1}{2} \sum_{x \in D} \left( f(x) - F(x) \right)^2 K(d(x_q, x))$

Combine 1 and 2 (to reduce computational costs):


$E_3(x_q) = \frac{1}{2} \sum_{x \in \text{k nearest nbrs of } x_q} \left( f(x) - F(x) \right)^2 K(d(x_q, x))$
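
A minimal sketch following option 3 (a linear F(x) fitted by weighted least squares over the k nearest neighbours; the Gaussian kernel K and the parameter names are assumptions, not from the slides):

```python
import numpy as np

def locally_weighted_regression(X_train, y_train, x_q, k=20, tau=1.0):
    """Fit a local linear model F(x) = w0 + w1*a1(x) + ... + wn*an(x) around x_q
    by minimizing sum K(d(x_q, x)) * (f(x) - F(x))^2 over the k nearest neighbours."""
    d = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    Xk, yk, dk = X_train[nearest], y_train[nearest], d[nearest]
    K = np.exp(-dk ** 2 / (2 * tau ** 2))            # decreasing kernel K(d)
    A = np.hstack([np.ones((len(Xk), 1)), Xk])       # design matrix with intercept w0
    sw = np.sqrt(K)[:, None]
    # weighted least squares solved as ordinary least squares on rescaled rows
    w, *_ = np.linalg.lstsq(A * sw, yk * sw.ravel(), rcond=None)
    return w[0] + x_q @ w[1:]                        # F(x_q)
```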


Case-based reasoning (CBR)

instance-based learning, but the output is not real-valued; it is represented by symbolic descriptions
the methods used to retrieve similar instances are more elaborate (not just Euclidean distance)
Applications:
- conceptual design of mechanical devices based on a stored library of previous designs (Sycara 1992)
- new legal cases based on previous rulings (Ashley 1990)
- selection of an appropriate hydrological model based on previous experience (Kukuric 1997, PhD of IHE)


Remarks on Lazy and Eager learning

Lazy methods: k-NN, locally weighted regression, CBR
Eager learners are "eager": before they observe the test instance xq they have already built a global approximation of the target function.
Lazy learners:
- defer the decision of how to generalize beyond the training data until each new instance is encountered
- when new examples come, the output is generated immediately on the basis of the nearest training examples

Lazy learners have a richer set of hypotheses: they select an appropriate hypothesis (e.g. a linear function) for each new instance. So lazy methods are better suited to customizing the model to unknown future instances.

Fuzzy rule-based systems

D.P. Solomatine. Data-driven modelling (part 3).

32

16

Fuzzy logic

introduced in 1965 by Lotfi Zadeh, University of California, Berkeley
Boolean logic is two-valued (False, True); fuzzy logic is multi-valued (False ... AlmostFalse ... AlmostTrue ... True)
fuzzy set theory deals with the degree of truth that an outcome belongs to a certain category (partial truth)
a fuzzy set A on a universe U: for any u in U there is a corresponding real number $\mu_A(u) \in [0, 1]$ called the grade of membership of u in A; the mapping $\mu_A: U \rightarrow [0, 1]$ is called the membership function of A

Example of an ordinary and a fuzzy set "tall people"


Various shapes of membership functions

$[\alpha^-, \alpha^+]$ is the support of the fuzzy set, $\alpha^1$ is its kernel
(Figure: a) triangular membership function; b) bell-shaped function; c) dome-shaped function; d) inverted cycloid function.)
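
A minimal sketch of the triangular case, using the support/kernel notation above (an illustration, not code from the course material):

```python
def triangular_mf(u, a_minus, a_kernel, a_plus):
    """Membership grade of u for a triangular fuzzy set with support
    [a_minus, a_plus] and kernel (membership 1) at a_kernel."""
    if u <= a_minus or u >= a_plus:
        return 0.0
    if u <= a_kernel:
        return (u - a_minus) / (a_kernel - a_minus)
    return (a_plus - u) / (a_plus - a_kernel)

# example: a fuzzy set "around 5" with support [3, 8]
print(triangular_mf(4.0, 3.0, 5.0, 8.0))   # 0.5
```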


Example of a membership function "appropriate water level in the reservoir"

(Figure: membership function of "appropriate water level in the reservoir", with its support and kernel indicated.)


Alpha-cut

an alpha-cut of a fuzzy set is the (crisp) set of elements whose membership grade is at least alpha; in the example, the 0.5-cut = [4.5, 7.0]



Fuzzy numbers

Special cases of fuzzy sets are fuzzy numbers. A fuzzy subset A of the set of real numbers is called a fuzzy number if:
- there is at least one z such that $\mu_A(z) = 1$ (normality assumption)
- for every real numbers a, b, c with a < c < b, $\mu_A(c) \geq \min(\mu_A(a), \mu_A(b))$
(convexity assumption, meaning that the membership function of a fuzzy number consists of an increasing part and a decreasing part, and possibly flat parts)


Linguistic variable: example

A linguistic variable can take linguistic values (like low, high, navigable) associated with fuzzy subsets M of the universe U (here U = [0, 50])
(Figure: the linguistic variable WATER LEVEL takes fuzzy values such as "enough volume for flood detention", "environmentally friendly" and "navigable"; compatibility links connect these values, through their membership grades, to the base variable water level, 0-50 m.)

Operations on fuzzy sets


Fuzzy rules

Fuzzy rules are linguistic constructs of the type IF A THEN B, where A and B are collections of propositions containing linguistic variables (i.e. variables with linguistic values). A is called the premise and B the consequence of the rule. If there are K premises in a system, the i-th rule has the form:

If $a_1$ is $A_{i,1}$ ⊗ $a_2$ is $A_{i,2}$ ⊗ . . . ⊗ $a_K$ is $A_{i,K}$ then $B_i$

where a is a crisp input, A and B are linguistic variables, and ⊗ is one of the operators AND, OR, XOR.


Additive model of combining rules


Fuzzy rule-based systems (FS)

- use linguistic variables based on fuzzy logic
- are based on encoding relationships between variables in the form of rules
- the rules are generated through the analysis of large data samples
- such rules are used to produce the values of the output variables given new input values


Example: Fuzzy rules in control


rules like: IF Temperature is Cool THEN AirMotorSpeed := Slow
The rule base: If Cold, then Stop; If Cool, then Slow; If Right, then Medium; If Warm, then Fast; If Hot, then Blast.

Input: Temperature = 22 °C. What will be the AirMotorSpeed?


Temperature is RIGHT with degree of fulfillment (DOF) = 0.6 and WARM with DOF = 0.2, so two rules are fired.

(Figures: membership functions for TEMPERATURE (Cold, Cool, Right, Warm, Hot, 0-35 °C) and for AIR MOTOR SPEED (Stop, Slow, Medium, Fast, Blast, 0-100); the consequences of the fired rules are combined by the weighted sum method or the crested weighted sum method, and the result is defuzzified using the centroid of the area.)


Combining premises in a rule

Degree of fulfillment (DOF) is the extent to which the premise (left) part of a fuzzy rule is satisfied. The means of combining the memberships of the inputs to the corresponding fuzzy sets into a DOF is called inference.
Product inference for rule i is defined as:

$\beta_i = DOF(A_i) = \prod_{k=1}^{K} \mu_{A_{i,k}}(a_k)$

(the rule is sensitive to a change in the amount of truth contained in each premise)
Minimum inference for rule i is defined as:

$\beta_i = DOF(A_i) = \min_{k=1..K} \mu_{A_{i,k}}(a_k)$
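
A minimal sketch of the two inference operators (the membership values in the example call are assumed, not computed from actual membership functions):

```python
import numpy as np

def product_inference(memberships):
    """DOF of a rule under product inference: memberships holds the values
    mu_{A_ik}(a_k) of the crisp inputs for the rule's K premises."""
    return float(np.prod(memberships))

def minimum_inference(memberships):
    """DOF of a rule under minimum inference."""
    return float(np.min(memberships))

mu = [0.6, 0.8]                       # premise memberships of one rule
print(product_inference(mu))          # 0.48 - sensitive to every premise
print(minimum_inference(mu))          # 0.6  - limited by the weakest premise
```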


Combining rules: example for 2 inputs


(Figure: a rule table for two inputs; each combination of the fuzzy values (L, M, H) of Input 1 and Input 2 maps to a fuzzy value of the Output.)


Combining rules: weighted sum combination

The weighted sum combination uses the DOF of each rule as a weight.
(Figure: the weighted sum combination method applied to the AIR MOTOR SPEED output sets.)
If there are I rules, each having a response fuzzy set $B_i$ with DOF $\beta_i$, the combined membership function is

$\mu_B(x) = \frac{\sum_{i=1}^{I} \beta_i\, \mu_{B_i}(x)}{\max_u \sum_{i=1}^{I} \beta_i\, \mu_{B_i}(u)}$


Combining rules: crested weighted sum combination

In the crested weighted sum combination each output membership function is clipped off at a height corresponding to the rule's degree of fulfillment.
(Figure: the crested weighted sum combination method applied to the AIR MOTOR SPEED output sets.)


If there are I rules, each having a response fuzzy set $B_i$ with DOF $\beta_i$, the combined membership function is

$\mu_B(x) = \frac{\sum_{i=1}^{I} \min(\beta_i, \mu_{B_i}(x))}{\max_u \sum_{i=1}^{I} \min(\beta_i, \mu_{B_i}(u))}$


Combining rules: defuzzification

Defuzzification is a mapping from the fuzzy consequence (the combination of the consequences $B_i$) to a crisp consequence; this is essentially the identification of the fuzzy mean.
The most widely used method is:
find the centroid (center of gravity) of the area below the membership function and take its abscissa as the crisp output.
(Figure: defuzzification using the centroid of the area under the combined AIR MOTOR SPEED membership function.)
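
A minimal sketch of the whole output stage for the air-motor example (the triangular output sets and the DOFs of the two fired rules are assumed for illustration; the combined set is built with a crested sum and defuzzified by the centroid):

```python
import numpy as np

def centroid_defuzzify(x, mu):
    """Crisp output: abscissa of the centroid of the area under mu(x),
    computed on a uniform grid x over the output universe."""
    return float((mu * x).sum() / mu.sum())

def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and kernel at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

x = np.linspace(0.0, 100.0, 501)                 # AIR MOTOR SPEED universe
medium, fast = tri(x, 20, 50, 80), tri(x, 50, 80, 100)
dof_medium, dof_fast = 0.6, 0.2                  # DOFs of the two fired rules
mu = np.minimum(dof_medium, medium) + np.minimum(dof_fast, fast)  # clipped and summed
print(centroid_defuzzify(x, mu))                 # crisp AirMotorSpeed
```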


In the previous example the rules were given. But how to build them from data?
the following is given/assumed:
- the rule structure is known, i.e. the number of premises in each rule
- the shapes of the membership functions
- the number of rules
- a training set T is given: a set of S observed input (a) and output (b) real-valued vectors:

$T = \{(a_1(s), . . . , a_K(s), b(s));\; s = 1, . . . , S\}$

It is assumed that we are training I rules with K premises each, where the i-th rule has the following form:

If $a_1$ is $A_{i,1}$ AND $a_2$ is $A_{i,2}$ AND . . . AND $a_K$ is $A_{i,K}$ then $B_i$

where a is a crisp input, and A and B are triangular fuzzy numbers. The parameters of A and B (supports and kernels) are to be found.

Building rules from data: weighted counting algorithm (1)


Building rules from data: weighted counting algorithm (2)


uses the subset of the training set that satisfies the premises of a rule at least to a degree of fulfilment ε (a threshold) to construct the shape of the corresponding consequence. It is accomplished with the following steps (i is the rule number, k is the premise number):


Building rules from data: weighted counting algorithm (3)


1 Define the support $(\alpha^-_{i,k}, \alpha^+_{i,k})$ of the i-th rule's premise $A_{i,k}$.
2 $A_{i,k}$ is assumed to be a triangular fuzzy number $(\alpha^-_{i,k}, \alpha^1_{i,k}, \alpha^+_{i,k})_T$, where $\alpha^1_{i,k}$ is the mean of all $a_k(s)$ values which fulfil the i-th rule at least partially:

$\alpha^1_{i,k} = \frac{1}{N_i} \sum_{s \in R_i} a_k(s)$

3 Calculate the DOFs $\beta_i(s)$ for each premise vector $(a_1(s), ..., a_K(s))$ of the training set T and each rule i whose premises were determined in step 1.
4 Select a threshold $\varepsilon > 0$ such that only responses with DOF > $\varepsilon$ are considered in the construction of the rule response. The corresponding response is assumed to be also a triangular fuzzy number $(\beta^-_i, \beta^1_i, \beta^+_i)_T$ defined by:

$\beta^1_i = \frac{\sum_{\beta_i(s) > \varepsilon} \beta_i(s)\, b(s)}{\sum_{\beta_i(s) > \varepsilon} \beta_i(s)} \qquad \beta^-_i = \min_{\beta_i(s) > \varepsilon} b(s) \qquad \beta^+_i = \max_{\beta_i(s) > \varepsilon} b(s)$
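
A minimal sketch of steps 3-4 for one rule (array names are illustrative):

```python
import numpy as np

def triangular_consequence(dof, b, eps=0.1):
    """Given the DOFs beta_i(s) of one rule for every training sample s and
    the observed outputs b(s), build the triangular consequence (b-, b1, b+)."""
    dof, b = np.asarray(dof), np.asarray(b)
    mask = dof > eps                       # keep samples with DOF above the threshold
    b_minus = b[mask].min()                # left end of the support
    b_plus = b[mask].max()                 # right end of the support
    b_one = (dof[mask] * b[mask]).sum() / dof[mask].sum()   # kernel: DOF-weighted mean
    return b_minus, b_one, b_plus
```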

Fuzzy rule-based system: learning rules from data

(Figure: crisp input X → fuzzifier → fuzzy inference engine → defuzzifier → crisp output Y; the rules used by the inference engine are obtained by training on historical data; expert judgements are not considered here.)


Case study: catchment in the Veneto region, Italy (stations Andraz, Arabba and Caprile)

Modelling the spatial rainfall distribution using a fuzzy rule-based system:
- filling missing data in past records
- estimating the rainfall depth at the station Caprile (based on data from Arabba and Andraz) in case of a sudden equipment failure


Problem formulation

- Daily precipitation at three stations in 1985-91
- Data split for training and verification
- Daily precipitation at Andraz and Arabba is used to determine the daily precipitation at Caprile
- Performance indices:
  - mean square error (MSE) between modelled and observed data
  - percentage of predictions within a predefined tolerance (5% is used)

Problems:
- missing records in the training data
- non-uniform distribution of the data

Methods considered

Traditional Normal ratio method


$P_X = \frac{1}{3}\left( \frac{N_X}{N_A} P_A + \frac{N_X}{N_B} P_B + \frac{N_X}{N_C} P_C \right)$

Neural network
Fuzzy rule-based system


How many rules to use?

Too many rules lead to overfitting and a higher error on verification.
(Figure: effect of the number of rules (4, 9, 16, 25, 36) on the mean square error, for the training data 1988-91 (T) and the verification data 1985-87 (V).)


Results: best performance


(Figures: scatter plots of simulated vs. observed precipitation for the training period (1989-91) and the verification period (1985-88), and the precipitation at Caprile for the first 120 days of 1987.)


Veneto case study: comparison of fuzzy rules, neural network and the normal ratio method

(Figure: performance comparison (Case 1) of FRBS, ANN and the traditional normal ratio method, in terms of the mean square error and the percentage of predictions within the 5% tolerance, for the training and verification periods 1985-88.)


Veneto case study: conclusions

FRBS was more accurate than the ANN and the normal ratio method; its training is faster than that of the ANN. Issues to pay attention to:
- curse of dimensionality: more than 5 inputs is very difficult to handle
- too many rules may cause overfitting
- non-uniformly distributed data lead to empty areas where rules cannot be trained


Case study Delfland: training an ANN or a fuzzy controller on data obtained from an optimal controller for water level control
The data-driven controller (ANN or fuzzy rule-based system) is trained on data generated by the optimal controller, and can then replace it.
(Figure: training scheme - the Aquarius optimal controller produces the pumping rate u(t) from the target water level y(t)d and the observed water level y(t); the ANN or FRBS model is trained on the error in the control signal to reproduce u(t); the hydrological processes in the polders return the water level y(t).)


Case study: Delfland


Replicating the controller by an ANN (output: pump status at time t)


Input variables in Local control:
- water level at time t-1
- water level at time t
- pump status at time t-1

Input variables in Centralised dynamic control:
- precipitation at times t-2, t-1 and t
- water level at times t-1 and t
- groundwater level at time t
- pump status at time t-1


Performance of the Neural network reproducing behaviour of an optimal controller


Fuzzy rules reproducing optimal control of water level in Delfland


Bayesian learning


Bayesian theorem

we are interested in determining the best hypothesis h from some space H, given the observed data D
Some notation:
- P(h) = prior probability that hypothesis h holds
- P(D) = prior probability that the training data D will be observed (without knowledge of which hypothesis holds)
- P(D/h) = probability of observing data D given that h holds
- P(h/D) = probability that h holds given the observed data D

Bayes theorem:

$P(h/D) = \frac{P(D/h)\, P(h)}{P(D)}$


Selecting "best" hypothesis using Bayes theorem

learning in the Bayesian sense: selecting the most probable hypothesis (the maximum a posteriori hypothesis, MAP)

$h_{MAP} \equiv \arg\max_{h \in H} P(h/D) = \arg\max_{h \in H} \frac{P(D/h)\, P(h)}{P(D)} = \arg\max_{h \in H} P(D/h)\, P(h)$

P(D/h) is called the likelihood of the data D given h; if all hypotheses are a priori equally probable, then the maximum likelihood (ML) hypothesis is:

$h_{ML} = \arg\max_{h \in H} P(D/h)$


Bayesian learning: example

hypothesis h = "patient has cancer", alternative = "no cancer"
prior knowledge (without data): P(h) = 0.008
data that can be observed: a test with 2 outcomes (+ or -):
- correct results: P(+/cancer) = 0.98, P(-/nocancer) = 0.97
- errors: P(+/nocancer) = 0.03, P(-/cancer) = 0.02

suppose data is observed: a patient is tested and the result is +. Is the hypothesis correct? Choose the hypothesis with MAP, i.e. the hypothesis for which P(D/h) P(h) is maximal:
P(+/cancer) P(cancer) = 0.98 * 0.008 = 0.0078
P(+/nocancer) P(nocancer) = 0.03 * 0.992 = 0.0298
--> the hypothesis "no cancer" wins
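
A minimal check of this MAP comparison in code (values taken from the example above):

```python
p_cancer, p_nocancer = 0.008, 0.992            # priors P(h)
p_pos_cancer, p_pos_nocancer = 0.98, 0.03      # likelihoods P(+/h)
scores = {"cancer": p_pos_cancer * p_cancer,          # 0.0078
          "no cancer": p_pos_nocancer * p_nocancer}   # 0.0298
print(max(scores, key=scores.get))             # -> "no cancer" (the MAP hypothesis)
```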

Naive Bayes classifier

- assume that each instance x of the data set is characterized by several attributes {a1, ..., an}
- the target function F(x) can take any value from a finite set V
- a set of training examples {xi} is provided
- when a new instance <a1, ..., an> is presented, the classifier should identify the most probable target value vMAP


Naive Bayes classifier (2)

This condition can be written as

$v_{MAP} = \arg\max_{v_j \in V} P(v_j / a_1, ..., a_n)$

or, by applying Bayes theorem:

$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, ..., a_n / v_j)\, P(v_j)}{P(a_1, ..., a_n)} = \arg\max_{v_j \in V} P(a_1, ..., a_n / v_j)\, P(v_j)$

P(vj) can be estimated simply by counting the frequency with which each target value vj occurs in data


Naive Bayes classifier (3)

the terms P(a1, ..., an / vj) could be estimated by counting in a similar way; however, the total number of these terms equals the number of possible instances times the number of possible target values, so this is difficult. The solution is a simplifying assumption: the attribute values a1, ..., an are conditionally independent given the target value. In this case P(a1, ..., an / vj) = Π_i P(ai / vj), and estimating P(ai / vj) by counting frequencies is much easier. This gives the rule of the naive Bayes classifier:

$v_{NaiveBayes} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i / v_j)$
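
A minimal frequency-counting sketch of this classifier for nominal attributes (no smoothing of zero counts; names are illustrative):

```python
from collections import defaultdict, Counter

def train_naive_bayes(X, y):
    """Estimate P(v_j) and P(a_i / v_j) by counting frequencies in the data.
    X is a list of attribute tuples, y the list of class values."""
    class_counts = Counter(y)
    cond_counts = defaultdict(Counter)           # (class, attribute index) -> value counts
    for xi, vi in zip(X, y):
        for i, ai in enumerate(xi):
            cond_counts[(vi, i)][ai] += 1
    return class_counts, cond_counts, len(y)

def naive_bayes_classify(x, class_counts, cond_counts, n):
    """Return arg max_v P(v) * prod_i P(a_i / v)."""
    best_v, best_p = None, -1.0
    for v, cv in class_counts.items():
        p = cv / n                               # P(v_j)
        for i, ai in enumerate(x):
            p *= cond_counts[(v, i)][ai] / cv    # P(a_i / v_j)
        if p > best_p:
            best_v, best_p = v, p
    return best_v
```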


Modular models: committee machines, ensembles, mixtures of experts, boosting


Committee machine (modular model)


Instead of building one model, several models are built, each responsible for a particular situation.
Consider a forecasting model Q(t+1) = f(R(t-2), R(t-3), Q(t-1)); separate models are built for high, medium and low flows.
(Figure: the past records plotted in the space of Rainfall(t-2), Rainfall(t-3) and Flow Q(t); a new record (hydrometeorological condition) is attributed to one or several classes, and the corresponding models are run.)


Committee machine (modular model)


Committee machines (modular model)


input data is split into subsets and separate data-driven models are trained:
- hard split: sort according to the position in the input space (low vs. high rainfall); this allows physical insight to be brought in
- no split: do not sort, but train several models on the same data and then combine the results by some voting scheme (committee machine): voting by majority, weighted majority, or averaging
- soft split: split according to how well a given model was trained with the data, and then also train other models. Example: boosting (a sketch follows below):
  - present the original training set (N examples) to machine 1
  - assign higher probability to samples that are badly classified
  - sample N examples from the training set based on the new distribution and train machine 2
  - continue, ending with n machines
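
A minimal sketch of this boosting-style soft split (the weight-doubling rule and the majority-vote combiner are simplified illustrations, not a specific published algorithm such as AdaBoost; `train_machine` is a user-supplied learner):

```python
import numpy as np

def boosting_committee(X, y, train_machine, n_machines=5, seed=0):
    """Soft-split committee: resample the training set, giving badly predicted
    examples a higher probability, and train the next machine on the sample.
    Assumes integer class labels; train_machine(X, y) returns a predict function."""
    rng = np.random.default_rng(seed)
    n = len(X)
    p = np.full(n, 1.0 / n)                        # initial sampling distribution
    machines = []
    for _ in range(n_machines):
        idx = rng.choice(n, size=n, p=p)           # sample N examples
        machines.append(train_machine(X[idx], y[idx]))
        wrong = machines[-1](X) != y               # badly classified examples
        p = np.where(wrong, p * 2.0, p)            # raise their probability (simplified)
        p /= p.sum()
    def committee(Xq):                             # combine by majority vote
        votes = np.array([m(Xq) for m in machines])
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    return committee
```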

Committee machine with hard split, expert (specialised) models trained on subsets
(Figure: the input x goes to a splitting (gating) machine, which routes it to one of the expert machines 1..n producing outputs y1..yn.)


Committee machine with no split (ensemble), all models are trained on the same set
(Figure: the input x is fed, without splitting, to all machines 1..n; their outputs y1..yn are combined by an averaging scheme into y.)



Committee machine with soft split of data. Boosting


(Figure: machine 1 is trained on the original training set; for each subsequent machine, N training examples are sampled from a redistributed set in which badly predicted examples are given a higher probability; the outputs y1..yn are combined by a weighted averaging scheme into y.)



Using mixture of experts (models): each model is for particular hydrological condition
(Figure: the input is routed by conditions on the hydrological variables (e.g. thresholds on Qt-1, Qt and antecedent precipitation) to separate modules, each being an M5 model tree or an ANN, one module per hydrological condition.)

Combining physically-based and data-driven models. Complementary use of a Data-driven model


(Figure: input data drive both the physical system and the hydrologic forecasting model; the model errors (the difference between the observed output and the model output) are forecast by a data-driven error-forecasting model, and the forecasted errors are used to produce an improved output.)


End of Part 3
