
Data Mining

Frank Klawonn
f.klawonn@fh-wolfenbuettel.de

Department of Computer Science, University of Applied Sciences Braunschweig/Wolfenbuettel, http://public.rz.fh-wolfenbuettel.de/~klawonn

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.1/39

Basic references
M.H. Dunham: Data Mining: Introductory and Advanced Topics. Prentice Hall, Upper Saddle River, NJ (2002).
J. Han, M. Kamber: Data Mining (2nd ed.). Morgan Kaufmann, San Mateo, CA (2005).
D. Hand, H. Mannila, P. Smyth: Principles of Data Mining. MIT Press, Cambridge, MA (2001).
D.T. Larose: Data Mining Methods and Models. Wiley, Chichester (2006).
D.T. Larose: Discovering Knowledge in Data: An Introduction to Data Mining. Wiley, Chichester (2006).
I.H. Witten, E. Frank: Data Mining (2nd ed.). Morgan Kaufmann, San Mateo, CA (2005).
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.2/39

Web references
http://www.cs.waikato.ac.nz/~ml/
Website for the book by Witten/Frank with Java software, example data sets, slides for teaching.

http://www.ics.uci.edu/~mlearn/MLRepository.html
(UCI repository.) A collection of example data sets with various properties.

http://www.r-project.org/
A powerful statistics software.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.3/39

What is data mining?

Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. (Hand/Mannila/Smyth)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.4/39

Data Mining

In the context of data mining, it is quite usual that

the data to be examined were not necessarily collected for the specific purpose of analysis, in contrast to experimental data where a suitable experiment is designed to collect the data.

Example: Money transactions within a bank must be documented for safety reasons. Nevertheless, requirements and customer preferences can be deduced from such data, for instance, how much money might be needed daily for a new ATM.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.5/39

Data Mining

The data mining process should produce a result


that can be represented in a form that is understandable for the user of the data, not for a statistician. In the ideal case, the data mining methods are directly applied by the user of the data. In most cases, this does not work. This is why we are here!

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.6/39

Data Mining

Finding new or interesting relationships or patterns in data is not a trivial task.

Example: Applying a simple algorithm to hospital data might find the true, but completely uninteresting rule: If pregnant, then woman.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.7/39

Data Mining

Data sets can be large.

This might have severe consequences for the choice of suitable methods. Quadratic complexity in the number of data is very often not acceptable.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.8/39

Characteristics of data

refer to single instances (single objects, persons, events, points in time etc.)
describe individual properties
are often available in huge amounts (databases, archives)
are usually easy to collect or to obtain (e.g. cash registers with scanners in supermarkets, Internet)
do not allow us to make predictions

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.9/39

Characteristics of knowledge

refers to classes of instances (sets of objects, persons, points in time etc.)
describes general patterns, structures, laws, principles etc.
consists of as few statements as possible (this is an objective!)
is usually difficult to find or to obtain (e.g. natural laws, education)
allows us to make predictions

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.10/39

Data vs. knowledge

(Customer) data

ID     Name    Age   Sex      Income
...    ...     ...   ...      ...
2448   Miller  35    Male     6000
2449   Smith   39    Female   7000
...    ...     ...   ...      ...
Knowledge:

80% of our customers are between 30 and 40 years old and earn between 5000$ and 9000$ per month.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.11/39

Criteria to assess knowledge

Not all statements are equally important, equally substantial, equally useful. Knowledge must be assessed. Assessment Criteria

Correctness (probability, success in tests)
Generality (range of validity, conditions of validity)
Usefulness (relevance, predictive power)
Comprehensibility (simplicity, clarity, parsimony)
Novelty (previously unknown, unexpected)
Data Mining p.12/39

University of Applied Sciences Braunschweig/Wolfenbuettel

Examples

A bank has decades of experience in giving loans to customers. Large amounts of data of customers who have or have not returned their loans are available. Is it possible to derive an algorithm from the data that decides for a new customer desiring a loan, whether the loan should be granted or not?

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.13/39

Examples

Problems/questions

Are the data complete and correct?
Is the complete available customer information (age, income, address, ...) needed for the prediction?
Is all the necessary information about the customer available?
Are the data representative?
Can an unambiguous prediction/decision be made?
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.14/39

Examples

A producer of porcelain wants to install an automatic quality check device that sorts out broken parts. The produced parts are stimulated by an acoustic signal and a frequency spectrum is measured. The parts are manually classified as broken or OK. Is it possible to derive an algorithm from the data that classifies a new part as broken or OK, given the measured frequency spectrum?

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.15/39

Examples
A bank provides credit cards for customers and wants to detect as many fraud transactions as possible.
Problems/questions

Are the data complete and correct?
Is the complete available customer information (amount, location of the transaction, customer address, previous customer history, ...) needed for the prediction?
Is all the necessary information about the customer available?
Are the data representative?
Can an unambiguous prediction/decision be made?
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.16/39

Examples

Churn detection: Given customer data and history, find the possible candidates for churners.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.17/39

Examples

Various properties of garbage particles are measured (colour, weight, magnetism, ...). Is it possible to sort the particles automatically into groups like paper, glass, metal, plastic?

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.18/39

(Supervised) Classification

Common property of all previous examples: Based on certain measurements/information, an assignment of an object (customer, porcelain part, garbage particle) to one group of a finite number of classes (grant loan yes/no, broken/OK, churner/non-churner, paper/glass/metal/plastic) is needed. Such tasks are called (supervised) classification problems.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.19/39

Examples

Given (selected) stock prices and indices of today, predict a specific stock price or index value for tomorrow. Given today's weather conditions, predict the (local) temperature for tomorrow. In both cases, historical data for decades are available.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.20/39

Regression

These problems are also supervised learning problems. Supervised refers to the fact that the outcome/prediction for the given/historical data is known. In contrast to classification problems, the variable to be predicted (stock price, temperature) is continuous. It can take arbitrary real values. Such tasks are called regression problems.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.21/39

Examples

Given a customer database. Are the properties of the customers just randomly distributed or can they be grouped into typical customer segments? (poor young students, rich yuppies, OAPs with medium capital)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.22/39

Examples

Given the train tickets bought in Germany. Are there main connections that are typically used (at certain times/days)? (like from region Berlin to region Hamburg, from region Frankfurt to region Munich, ...)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.23/39

(Unsupervised) Classification
In the two previous examples, the data should be grouped into classes. Similar data should be put into the same class. The classes are not known in advance. Such tasks are called unsupervised classification problems, usually solved by cluster analysis.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.24/39

Examples
Market basket analysis: Are there typical combinations of items that customers tend to buy together? For instance: Customers who buy wallpaper and paint also buy wallpaper paste in 90% of the cases.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.25/39

Examples

Analysis of molecular structures (for instance for the design of drugs):

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.26/39

Frequent item set mining

Such tasks are called frequent item set mining and association rule mining. One is only interested in substructures of the data set, not in describing or covering the whole data set.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.27/39

Subgroup discovery

Subgroup discovery aims at finding subsets where a given class (e.g. customers who tend to buy certain products) is significantly over- or underrepresented.

Subgroup discovery can be viewed as a partial classification problem. Not the whole data set needs to be classified. It is sufficient to find subsets with good classification rates.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.28/39

Change detection

Are there significant changes in the data over time? E.g. is the company losing a customer group or winning a new one? Is the quality of certain materials or products changing?

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.29/39

Online learning/data mining

Incremental learning: analysing the data online while they arrive in a continuous data stream. It is assumed that the source/model generating the data does not change over time.

Evolving systems: changes in the model parameters over time might be possible. The information derived from old data might not be applicable to new data.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.30/39

Examples: Complex data structures

molecular structures (2D/3D, graphs)
images (How to retrieve images from an image database?)
time series (Given the past development of stock market indices, which trend will they follow in the near future?)
(text) documents (How can documents be clustered into groups automatically?)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.31/39

CRISP

CRoss-Industry Standard Process for Data Mining


University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.32/39

Business understanding

This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.33/39

Data understanding

The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.34/39

Data preparation
The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modelling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modelling tools.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.35/39

Modelling

In this phase, various modelling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.36/39

Evaluation
At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.37/39

Deployment
Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organised and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. However, even if the analyst will not carry out the deployment effort it is important for the customer to understand up front what actions will need to be carried out in order to actually make use of the created models.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.38/39

Selected CRISP phases


Data understanding: exploratory data analysis techniques, outlier detection, visualisation

Data preparation: feature extraction, transformation, treatment of missing values

Modelling: choice of the model (classifier, regression model, cluster analysis technique, ...) and estimation of the model parameters

Evaluation: model validation and selection. Is the derived model suitable to make reliable predictions for new data?

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.39/39

Statistical reasoning

Definition. Let A and B be two events with P(A) > 0 and P(B) > 0.

A supports B, if P(B | A) > P(B) holds.

A speaks against B, if P(B | A) < P(B) holds.

A is irrelevant for B (independent of B), if P(B | A) = P(B) holds.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.1/106

Statistical reasoning

The support relation should not be interpreted as some kind of probabilistic implication!


Theorem.

(a) (b) (c) (d)


Data Mining p.2/106

University of Applied Sciences Braunschweig/Wolfenbuettel

Statistical reasoning

" ":

Proof.

(only for (a) as an example)

" ":

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.3/106

Statistical reasoning

Theorem.

(a) (b) (c) (d)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.4/106

Statistical reasoning

Theorem.

(a) (b) (c)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.5/106

Statistical reasoning
,

Proof.

Choose

for

(a)

, ,

, ,

(b) analogously (c)


, , , , ,

University of Applied Sciences Braunschweig/Wolfenbuettel


Data Mining p.6/106

Statistical reasoning

Theorem.

Corresponding theorems are valid for the relations "speaks against" and "is irrelevant for".


(a) (b) (c) (d)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.7/106

Statistical reasoning

Proof.

(only of (d) as an example)


,

for

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.8/106

Statistical reasoning

Denition.

The event A supports the event B under the general condition (event) C, if P(B | A ∩ C) > P(B | C) holds.

Theorem.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.9/106

Statistical reasoning

Proof.

, ,

for

Within :

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.10/106

Probabilistic scissors, rock & paper

Player one chooses one of the following wheels, then player two will choose a wheel. Then both turn their wheel. The one with the higher number is the winner.

[Figure: the wheels, labelled with numbers and the sector probabilities 0.5 and 0.25]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.11/106

Probabilistic scissors, rock & paper

How to choose the wheels?


[Figure: the wheels compared pairwise: blue vs. red, red vs. green, green vs. blue]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.12/106

Causality
"A statistical survey has shown that students receiving a grant perform better in their exams."

Are the grants the reason that the students perform better, since they do not have to earn money for their studies and can spend more time on learning? Or do only students with better results in school or early university years receive a grant?

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.13/106

Causality

Two balls are drawn without replacement from a bag with two white and two black balls. What is the probability that the second ball will be white, when the first ball that was drawn is white? Answer: 1/3.


University of Applied Sciences Braunschweig/Wolfenbuettel


Data Mining p.14/106

Causality

The first ball was drawn, but hidden from the observer. What is the probability that the first ball was white, when the second ball that was drawn is white? Wrong answer: 0.5, since the second ball cannot have any influence on the first ball.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.15/106

Causality

Correct answer: 1/3.

[Probability tree: the first ball is W or B with probability 1/2 each; given W first, the second ball is W with probability 1/3 and B with probability 2/3; given B first, the second ball is W with probability 2/3 and B with probability 1/3.]


Data Mining p.16/106

University of Applied Sciences Braunschweig/Wolfenbuettel

Causality
Another reason why the answer 0.5 is wrong: Assume that there are two black balls and one white ball in the bag in the beginning. What is the probability that the first ball is white? Answer: 1/3. What is the probability that the first ball was white, when the second ball is white? Answer: 0.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.17/106

Hidden variables
The University of Fantasia provides grants for students based on an individual selection scheme.

          No. of applications   No. of grants
female    3000                  373
male      3000                  1304
total     6000                  1677

Further investigations have shown that the female students applying for grants have more or less the same school marks as the male applicants. Does this statistic prove that the selection scheme favours male students?
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.18/106

Hidden variables
Looking at the distributions over the subjects, it turns out that natural sciences and engineering students are favoured compared to social sciences students.

No. of applications   Natural Sci.   Engineering   Social Sci.
female                400            300           2300
male                  1400           1200          400

No. of grants         Natural Sci.   Engineering   Social Sci.
female                200            150           23
male                  700            600           4

University of Applied Sciences Braunschweig/Wolfenbuettel

Simpsons paradox

                Hospital A                Hospital B
Therapy         Old         New           Old         New
Patients        250         1050          1050        250
No effect       70          420           630         180
Cured           180 (72%)   630 (60%)     420 (40%)   70 (28%)

(Efficiency = percentage of cured patients)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.20/106

Simpsons paradox

Therapy         Old         New
Patients        1300        1300
No effect       700         600
Cured           600 (46%)   700 (54%)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.21/106

Simpsons paradox

Market              Europe                      Asia
                    Competitor    We            Competitor    We
Sales (total)       250           1050          1050          250
Year 2005           70            420           630           180
     2006           180 (+157%)   630 (+50%)    420 (-33%)    70 (-61%)

The competitor has performed better than our company.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.22/106

Simpsons paradox

Worldwide           Competitor    We
Sales (total)       1300          1300
Year 2005           700           600
     2006           600 (-14%)    700 (+17%)

We are better than the competitor!?

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.23/106

Example data set: iris data


collected by E. Anderson in 1935; contains measurements of four real-valued variables: sepal length, sepal width, petal length and petal width of 150 iris flowers of the types Iris setosa, Iris versicolor and Iris virginica (50 each). The fifth attribute is the name of the flower type.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.24/106

Example data set: iris data

iris setosa, iris versicolor and iris virginica

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.25/106

Example data set: iris data

slength   swidth   plength   pwidth   species
5.1       3.5      1.4       0.2      Iris-setosa
...       ...      ...       ...      ...
5.0       3.3      1.4       0.2      Iris-setosa
7.0       3.2      4.7       1.4      Iris-versicolor
...       ...      ...       ...      ...
5.1       2.5      3.0       1.1      Iris-versicolor
5.7       2.8      4.1       1.3      Iris-versicolor
...       ...      ...       ...      ...
5.9       3.0      5.1       1.8      Iris-virginica


Data Mining p.26/106

University of Applied Sciences Braunschweig/Wolfenbuettel

Example data set: iris data

The header (first line) specifies names for the attributes. The following lines contain the data with the values for the defined attributes separated by blanks or tabs. Therefore, each of the lines with the data must contain 5 values.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.27/106

Statistics tool R
R uses a type-free command language. Assignments are written in the form
x <- y y is assigned to x.

The object y must be dened (generated), before it can be assigned to x. Declaration of x is not required.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.28/106

R: Reading a le

iris<-read.table(file.choose(),header=T)

opens a file chooser. The chosen file is assigned to the object named iris.

header=T means that the chosen file will contain a header.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.29/106

R: Accessing a single variable

pw <- iris$pwidth

assigns the column named pwidth of the data set contained in the object iris to the object pw. The command
print(pw)

prints the corresponding column.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.30/106

R: Printing on the screen

[R console output: the 150 values of pw, printed in rows, with the index of the first value of each row shown in brackets ([1], [19], [37], ...)]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.31/106

Empirical mean

In the following, we always consider a sample x_1, ..., x_n with x_i ∈ ℝ for all i.

Definition. The sample mean or empirical mean, denoted by x̄, is given by

    x̄ = (1/n) · (x_1 + ... + x_n)

Note the difference between the mean of a random variable and the empirical mean of a sample.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.32/106

(Empirical) median

x_(1) ≤ x_(2) ≤ ... ≤ x_(n) denotes the sample in ascending order.

Definition. The (sample or empirical) median, denoted by x̃, is given by

    x̃ = x_((n+1)/2)                        if n is odd
    x̃ = (x_(n/2) + x_(n/2+1)) / 2          if n is even

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.33/106

(Empirical) median
[Figure: position of the median in the ordered sample for odd and even n]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.34/106

R: Empirical mean & median


The mean and median can be computed in R by the functions mean() and median(), respectively.
> mean(pw) [1] 1.198667 > median(pw) [1] 1.3

The mean and median can also be applied to data objects consisting of more than one (numerical) column, yielding a vector of mean/median values.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.35/106

Empirical variance
Definition. The sample variance or empirical variance is defined by

    s² = 1/(n-1) · Σ_{i=1}^{n} (x_i − x̄)²

s = √s² is called sample standard deviation or empirical standard deviation.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.36/106

R: Empirical variance

The function var() yields the empirical variance in R.


> var(pw) [1] 0.5824143

The function sd() yields the empirical standard deviation.


> sd(pw) [1] 0.7631607

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.37/106

R: min and max


The functions min() and max() compute the minimum and the maximum in a data set.
> min(pw) [1] 0.1 > max(pw) [1] 2.5

The difference max − min is called span.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.38/106

Interquartile range
The empirical variance and the standard deviation as well as the span are measures of dispersion. The span is extremely sensitive to outliers, but the empirical variance is also sensitive to outliers. Therefore, the interquartile range

    IQR = (75% quantile) − (25% quantile)

is often used as a measure of dispersion. Interquartile range in R (for the data in pw): IQR(pw)
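A small sketch in R (using the pw object from above) showing that the interquartile range is just the difference of the two quartiles returned by quantile():

q <- quantile(pw, c(0.25, 0.75))   # 25% and 75% quantiles of the petal width
q[2] - q[1]                        # same value as IQR(pw)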

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.39/106

Visualisation

John W. Tukey: There is no excuse for failing to plot and look.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.40/106

Bar charts
A bar chart shows the relative or absolute frequencies for the values of an attribute with a finite domain. In R, the package lattice is required.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.41/106

Installation of R-packages

Once the computer is connected to the Internet, packages (e.g. lattice) can be downloaded and installed by the command
install.packages()

After choosing the mirror site for download, available packages are shown in alphabetical order. (Packages need to be installed only once, but they must be loaded again in each R session, e.g. with library(lattice).)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.42/106

Bar charts
example:
cl <- iris$species
barchart(cl)

[Bar chart: absolute frequencies (Freq, 0 to 50) of Iris-setosa, Iris-versicolor and Iris-virginica]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.43/106

Histograms

A histogram shows the absolute number of data or the relative frequency of data in different classes. For numerical samples, bins (intervals) representing the classes must be dened. In most cases, intervals of the same length are chosen. The area of each rectangle is proportional to the number of data in the corresponding range.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.44/106

R: Histograms
The function
hist(pw,breaks=6,prob=T,main="petal width")

generates a histogram for the (numerical) data in pw, partitioning the domain of pw using breaks=6 (5 intervals of the same length), showing relative frequencies (prob=T) with the caption "petal width". Using the command postscript("outputfile.eps") the generated graphics will not be shown. It is stored in the PostScript file "outputfile.eps".
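A minimal sketch of this workflow (the file name is just an example); the graphics device has to be closed with dev.off() so that the file is actually written:

postscript("outputfile.eps")                     # redirect graphics into an EPS file
hist(pw, breaks=6, prob=T, main="petal width")   # the histogram is written to the file
dev.off()                                        # close the file device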
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.45/106

R: Histograms

[Histogram "petal width": x-axis pw from 0.0 to 2.5, y-axis Density from 0.0 to 0.6]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.46/106

Empirical cdf
Definition. The empirical cumulative distribution function is given by

    F̂(x) = (number of sample values x_i with x_i ≤ x) / n

plot.ecdf(pw,main="petal width")

generates the empirical cdf for the data set pw, including the title "petal width" in the graphics.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.47/106

Empirical cdf

[Empirical cdf "petal width": x-axis x from 0.0 to 2.5, y-axis Fn(x) from 0.0 to 1.0]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.48/106

Boxplots

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.49/106

Boxplots

[Boxplot figure: the thick line marks the median]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.50/106

Boxplots

[Boxplot figure: the box spans the interquartile range, i.e. about 50% of the data]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.51/106

Boxplots

[Boxplot figure: the whiskers extend by at most 1.5 times the interquartile range]
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.52/106

Boxplots

[Boxplot figure: points outside the whiskers are marked as outliers]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.53/106

Boxplots
Boxplots or box and whiskers plots summarise important characteristics of a sample.

A boxplot for the data set stored in irisslbyclass can be generated by


bxp(boxplot(irisslbyclass))

Usually, boxplots for different samples are compared. (Here: The sepal length for setosa, versicolor and virginica.)
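The object irisslbyclass is not constructed on the slide; one possible way to build it (a sketch, assuming the iris data frame read above) is to split the sepal length by species:

irisslbyclass <- split(iris$slength, iris$species)   # list of three samples, one per species
stats <- boxplot(irisslbyclass, plot=FALSE)          # compute only the boxplot statistics
bxp(stats)                                           # draw the three boxplots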
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.54/106

Boxplots
1. Determine the median. Draw a thick line at the position of the median.
2. Determine the 25%- and the 75%-quantiles q_0.25 and q_0.75 for the sample. Draw a box limited by these two quartiles. The other dimension of the box can be chosen arbitrarily.
3. iqr is the interquartile range. The inner fence is defined by the two values q_0.25 − 1.5·iqr and q_0.75 + 1.5·iqr.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.55/106

Boxplots
4. Find the smallest data point greater than the lower inner fence and the largest data point smaller than the upper inner fence. Add "whiskers" to the box extending to these two data points.
5. Data points lying outside the box and the whiskers are called outliers. Enter these data points in the diagram, for instance by circles.
6. Sometimes, extreme outliers (outside the outer fence defined by q_0.25 − 3·iqr and q_0.75 + 3·iqr) are drawn in a different way than mild outliers outside the whiskers, but within the inner fence.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.56/106

Scatterplots with R
plot(iris)
[Scatterplot matrix of slength, swidth, plength and pwidth produced by plot(iris)]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.57/106

Scatterplots with R
species <- which(colnames(iris)=="species")
super.sym <- trellis.par.get("superpose.symbol")
splom(iris[1:4], groups = species, data = iris,
      panel = panel.superpose)

Scatterplots of the numerical attributes using different symbols for the classes.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.58/106

Scatterplots with R
[Scatter plot matrix of the four numerical iris attributes (slength, swidth, plength, pwidth) with different symbols for the three species]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.59/106

Scatterplots with R
splom(iris[1:4], groups = species, data = iris,
      panel = panel.superpose,
      key = list(title = "Three Varieties of Iris", columns = 3,
                 points = list(pch = super.sym$pch[1:3],
                               col = super.sym$col[1:3]),
                 text = list(c("Setosa", "Versicolor", "Virginica"))))

Note: This needs the additional R package lattice which has to be installed and loaded first.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.60/106

Scatterplots with R

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.61/106

3D scatterplots with R

scatterplot3d(iris2$pw,iris2$pl,iris2$sw, pch=iris2$species)

[3D scatterplot of iris2$pw, iris2$pl and iris2$sw with a different plot symbol for each species]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.62/106

Enriched 3D scatterplots

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.63/106

PCA
Principal component analysis (PCA) is a technique for dimension reduction, which can also be used for visualisation.

PCA aims at finding a linear mapping of the data to a lower-dimensional space that maintains as much of the original variance of the data as possible, without stretching during the projection.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.64/106

PCA

A projection of the two-dimensional data to the (one-dimensional) line (principal component) leads to a dimension reduction with little loss of variance.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.65/106

PCA
Projection of m-dimensional data to a q-dimensional space where q < m. First, the data are centred in the origin of the coordinate system, i.e. the mean vector x̄ is subtracted from every data point. Then a (q × m)-matrix W is needed for the projection:

    y_i = W · (x_i − x̄)

x̄ denotes the mean value: x̄ = (1/n) · Σ_{i=1}^{n} x_i

Data Mining p.66/106

University of Applied Sciences Braunschweig/Wolfenbuettel

Covariance
The empirical covariance of two attributes x and y is defined as

    s_xy = 1/(n-1) · Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ)

In case x and y are independent, s_xy ≈ 0 should hold.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.67/106

Covariance
The empirical correlation coefficient of x and y is defined as the normalised covariance:

    r_xy = s_xy / (s_x · s_y)

|r_xy| ≤ 1 always holds. The larger |r_xy|, the more the attributes x and y correlate.
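In R, the empirical covariance and correlation can be computed directly; a small sketch on two columns of the iris data read above:

cov(iris$plength, iris$pwidth)   # empirical covariance of petal length and petal width
cor(iris$plength, iris$pwidth)   # empirical correlation coefficient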

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.68/106

PCA
The projection matrix for PCA is given by

    W = (v_1, ..., v_q)^T

where v_1, ..., v_q are the normalised eigenvectors of the covariance matrix of the data for the q largest eigenvalues λ_1 ≥ ... ≥ λ_q.

Data Mining p.69/106

University of Applied Sciences Braunschweig/Wolfenbuettel

PCA
The sum of the variances of the projected data is the sum of these eigenvalues:

    λ_1 + ... + λ_q

When PCA is used for dimension reduction, it is important to know how much of the original variance is covered by the projection to q dimensions.
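A small sketch in R (using the object iris_pca computed with prcomp() on the next slide; the standard deviations returned by prcomp() are the square roots of the eigenvalues):

q <- 2
sum(iris_pca$sdev[1:q]^2) / sum(iris_pca$sdev^2)   # proportion of variance covered by the first q components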


Data Mining p.70/106

University of Applied Sciences Braunschweig/Wolfenbuettel

PCA with R
species <- which(colnames(iris)=="species")
iris_pca <- prcomp(iris[,-species], center=T, scale=T)

The nominal attribute species must be omitted for PCA; this is what the first line takes care of. If all attributes of a data set dset are numerical, then

dset_pca <- prcomp(dset,center=T,scale=T)

is sufficient.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.71/106

PCA with R
center: a logical value indicating whether the variables should be shifted to be zero centred. Alternately, a vector of length equal the number of columns of the data set can be supplied. The value is passed to scale. scale: a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place. The default is FALSE, but in general scaling is advisable. Alternatively, a vector of length equal the number of columns of the data set can be supplied.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.72/106

PCA with R
> print(iris_pca)
Standard deviations:
[1] 1.7061120 0.9598025 0.3838662 0.1435538

Rotation:
                PC1          PC2         PC3         PC4
slength   0.5223716  -0.37231836   0.7210168   0.2619956
swidth   -0.2633549  -0.92555649  -0.2420329  -0.1241348
plength   0.5812540  -0.02109478  -0.1408923  -0.8011543
pwidth    0.5656110  -0.06541577  -0.6338014   0.5235463

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.73/106

PCA with R
> summary(iris_pca)
Importance of components:
                         PC1    PC2     PC3      PC4
Standard deviation     1.706  0.960  0.3839  0.14355
Proportion of Variance 0.728  0.230  0.0368  0.00515
Cumulative Proportion  0.728  0.958  0.9949  1.00000

> iris_pca$center
 slength   swidth  plength   pwidth
5.843333 3.054000 3.758667 1.198667

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.74/106

PCA with R
plot(predict(iris_pca)[])

[Scatterplot of the data projected onto the first two principal components (PC1 vs. PC2)]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.75/106

Digression: Regression

Given: data (for example: two real-valued attributes x and y).

Choose a model (for example: a line y = a·x + b).

Define an error or goodness of fit function (for example: the mean squared error E = (1/n) · Σ_{i=1}^{n} (y_i − (a·x_i + b))²).

Find the optimum of the fitting function (for example: standard least squares regression).
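A minimal sketch of a least squares fit in R (the choice of the two iris columns is just illustrative):

fit <- lm(pwidth ~ plength, data = iris)   # fit a line pwidth = a*plength + b by least squares
coef(fit)                                  # estimated intercept and slope
mean(residuals(fit)^2)                     # mean squared error on the data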

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.76/106

Digression: Regression

[Scatterplot of the data with the fitted regression function; x ranges from 0.0 to 3.0, y from 0.00 to 0.20]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.77/106

Digression: Regression

[3D plot of the error function over the two model parameters]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.78/106

Multidimensional scaling (MDS)

Multidimensional scaling tries to map the high-dimensional data to a low-(2- or 3-)dimensional space with the aim to preserve the distances between the data as much as possible. The definition of a suitable error measure is required.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.79/106

MDS
Typical error measures:


Data Mining p.80/106

University of Applied Sciences Braunschweig/Wolfenbuettel

MDS
The error is often called stress. Optimisation w.r.t. one of these error measures is called the Sammon method. For none of these error functions an analytical solution for the minimum is known. Therefore a gradient method is applied for minimising the error function.

To minimise or maximise a function in multiple variables, gradient techniques are one possibility.

University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.81/106

Gradient method
The gradient (the vector given by the partial derivatives w.r.t. the individual variables) points in the direction of the steepest ascent of the function. In order to maximise a function, one starts in an arbitrary point, computes the gradient, goes into the direction of the gradient, computes the gradient in the new point and continues until convergence. In order to minimise a function, one simply goes into the opposite direction of the gradient. A gradient method can only find local extrema at best!
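A minimal gradient descent sketch in R for an illustrative function of two variables (the function, step width and number of steps are made up for the example):

f      <- function(p) (p[1] - 1)^2 + 2 * (p[2] + 0.5)^2    # function to be minimised
grad_f <- function(p) c(2 * (p[1] - 1), 4 * (p[2] + 0.5))  # its gradient

p   <- c(0, 0)    # arbitrary starting point
eta <- 0.1        # step width
for (step in 1:100) {
  p <- p - eta * grad_f(p)   # go against the gradient to decrease f
}
p                 # close to the minimum (1, -0.5)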

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.82/106

MDS
Determining the gradient for error function :

if otherwise

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.83/106

MDS algorithm
1. Given: data set, projection dimension q, step width η, threshold value ε, maximum number of steps t_max.
2. Compute the pairwise distances of the data points.
3. Initialise the projected points randomly (or with a PCA projection).
4. Compute the pairwise distances of the projected points.
5. Compute the gradient of the error function.
6. Update the projected points: new position = old position − η · gradient.
7. Repeat from step 4, if the error still changes by more than ε and the maximum number of iteration steps is not reached.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.84/106

R: MDS (Sammon mapping)

> d.iris <- dist(iris[,-species])
> iris.sammon <- sammon(d.iris, k=2)
Initial stress        : 0.00691
stress after 10 iters : 0.00492, magic = 0.092
stress after 20 iters : 0.00447, magic = 0.213
stress after 30 iters : 0.00408, magic = 0.030
stress after 40 iters : 0.00405, magic = 0.500
> plot(iris.sammon)

Note that duplicate tuples must be removed in advance. Otherwise zero distances lead to division by zero.
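The sammon() function used above is not part of base R; it comes from the MASS package, which has to be loaded first:

library(MASS)   # provides the sammon() function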
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.85/106

MDS
[Plot of the Sammon projection: iris.sammon$points[,1] vs. iris.sammon$points[,2]]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.86/106

Visualisation of high-dimens. data

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.87/106

Scatterplots

[Scatterplot matrix of the attributes X1, X2 and X3 of the example data set]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.88/106

PCA

[Projection of the example data set onto the first two principal components (PC1 vs. PC2)]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.89/106

MDS

[Sammon projection of the example data set: sampo[,1] vs. sampo[,2]]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.90/106

Parallel coordinates
parallel(iris)
[Parallel coordinates plot of the iris data with axes slength, swidth, plength, pwidth and species, each scaled from Min to Max]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.91/106

Properties of data sets


ID    Age   Sex      Married   Education     Income
248   54    Male     Yes       High school   10000
249   ?     Female   Yes       High school   12000
250   29    Male     Yes       College       23000
251   9     Male     No        Child         0
252   85    Female   No        High school   19798
253   40    Male     Yes       High school   40100
254   38    Female   No        Ph.D.         2691
255   7     Male     ?         Child         0
256   49    Male     Yes       College       30000
257   76    Male     Yes       Ph.D.         30686
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.92/106

Properties of data sets


Such a data table is called a data set. The columns are called attributes, variables or features. A single line refers to an individual, instance, case, object, datum, record or entity.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.93/106

Types of attributes

Nominal (categorical) attributes have a finite domain.

Examples:

Sex (male/female) (binary attribute)
Major subjects (statistics, databases, ...)
Nationality (German/English/French/...)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.94/106

Types of attributes

Ordinal attributes have a finite domain endowed with a linear ordering.

Examples:

German school types (Hauptschule/Realschule/Fachgymnasium/Gymnasium)
Employee status (worker/department head/CEO)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.95/106

Types of attributes
Interval attributes are numbers measured in the same unit. However, there is no absolute definition of zero. Differences can be calculated, but sums or products are meaningless.

Examples:

Date (year) (arbitrary definition of the year 0)
Temperature

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.96/106

Types of attributes

Ratio attributes have a well-defined location of zero, in contrast to interval attributes. Quotients and sums are meaningful.

Examples:

Distance
Number of children

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.97/106

Types of attributes
can be interval or ratio attributes.

Integer attributes

Examples:

Date (year) (interval attribute) Number of children (ratio attribute)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.98/106

Types of attributes
Continuous attributes

can be interval or ratio attributes.

Examples:

Temperature (interval attribute) Distance (Ratio attribute) There are special attributes like angles that behave only locally as an ordinal or interval attribute.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.99/106

Missing Values

For some instances values of single attributes might be missing. Causes for missing values:

broken sensors
refusal to answer a question
irrelevant attribute for the corresponding object (pregnant (yes/no) for men)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.100/106

Treatment of missing values

When there are only few missing values: remove objects with missing values.
Imputation of missing values (mean value, median, most frequent value, or estimation based on other attributes).
Treatment of missing values before or during the application of data mining algorithms (depending on the problem and the algorithm).
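A minimal sketch in R of the simplest imputation strategy, replacing missing values of a numeric attribute by the mean of the observed values (the data vector is made up for the example):

x <- c(5.1, NA, 4.9, 6.3, NA, 5.8)     # toy attribute with missing values (NA)
x[is.na(x)] <- mean(x, na.rm = TRUE)   # impute the mean of the observed values
x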

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.101/106

Types of missing values

Consider an attribute X; its observed value is denoted by X_obs. A missing value is denoted by ?.

X is the true value of the considered attribute, i.e. we have X_obs = X if X_obs ≠ ?.

Let Y be the (multivariate) (random) variable denoting the other attributes apart from X.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.102/106

Types of missing values


Missing completely at random (MCAR):

The probability that a value for X is missing depends neither on the true value of X nor on other variables:

    P(X_obs = ? | X, Y) = P(X_obs = ?)

Example: The maintenance staff sometimes forgets to change the batteries of a sensor, so that the sensor sometimes does not provide any measurements. MCAR is also called Observed At Random (OAR).
University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.103/106

Types of missing values

Missing at random (MAR):

The probability that a value for X is missing does not depend on the true value of X (but it may depend on the other attributes Y):

    P(X_obs = ? | X, Y) = P(X_obs = ? | Y)

Example: The maintenance staff does not change the batteries of a sensor when it is raining, so that the sensor does not always provide measurements when it is raining.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.104/106

Types of missing values

Nonignorable:

The probability that a value for X is missing depends on the true value of X.

Example: A sensor for the temperature will not work when there is frost. In the cases of MCAR and MAR, the missing values can be estimated, at least in principle, when the data set is large enough, based on the values of the other attributes. (The cause for the missing values is ignorable.) In the extreme case of the temperature sensor, it is impossible to provide any statement concerning temperatures below 0°C.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.105/106

Missing values

In the case of MCAR, it can be assumed that the missing values follow the same distribution as the observed values of X. In the case of MAR, the missing values might not follow the distribution of X. But by taking the other attributes into account, it is possible to derive reasonable imputations for the missing values. In the case of nonignorable missing values it is impossible to provide sensible estimations for the missing values.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.106/106

A classification problem

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.1/12

A classification problem

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.2/12

A classification problem

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.3/12

A classification problem

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.4/12

Darts results

professional, hobby player, beginner

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.5/12

Darts results

professional, hobby player, beginner

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.6/12

1D darts distributions

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.7/12

Objectives for classication

Minimise the misclassification rate or, more generally, minimise the expected loss.

Cost matrix for misclassifications: an entry of the matrix specifies the resulting costs when predicting one class instead of the correct class.

General assumption: predicting the correct class causes no costs (the diagonal entries of the cost matrix are 0).

Data Mining p.8/12

University of Applied Sciences Braunschweig/Wolfenbuettel

Cost matrix example

Production of items (cups, ...).

Broken items should be sorted out. Cost matrix:

                 predicted class
true class       OK        broken
OK               0         ...
broken           ...       0


University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.9/12

Cost matrix example

Possible misclassification costs: the production costs for one item; the posting costs for resending the item, loss of reputation, payment for compensation, ...

Cost matrix for the misclassification rate (every misclassification costs 1, a correct prediction costs 0):

                 predicted class
true class       c_1    c_2    ...    c_k
c_1              0      1      ...    1
c_2              1      0      ...    1
...              ...    ...    ...    ...
c_k              1      1      ...    0
Data Mining p.10/12

University of Applied Sciences Braunschweig/Wolfenbuettel

Expected loss

The expected loss, given evidence x and predicting class c, is

    loss(c | x) = Σ_{c'} L(c, c') · P(c' | x)

where L(c, c') denotes the cost of predicting class c when c' is the correct class. The predicted class is the one minimising the expected loss:

    predicted class = argmin_c loss(c | x)
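A minimal sketch in R of this decision rule for the two-class example above (the cost values and class probabilities are made up for illustration):

L <- matrix(c(0, 1,      # L[i, j]: cost of predicting class i when class j is the true class
              5, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(predicted = c("OK", "broken"), true = c("OK", "broken")))
p <- c(OK = 0.85, broken = 0.15)        # P(true class | evidence x)
expected_loss <- L %*% p                # one expected loss per possible prediction
rownames(L)[which.min(expected_loss)]   # class with minimal expected loss, here "OK"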

Data Mining p.11/12

University of Applied Sciences Braunschweig/Wolfenbuettel

Expected loss

P(x) is a constant factor independent of the class c. It is sufficient to consider the likelihoods (unnormalised probabilities) P(x | c') · P(c') and the relative expected losses

    rloss(c | x) = Σ_{c'} L(c, c') · P(x | c') · P(c')

Data Mining p.12/12

University of Applied Sciences Braunschweig/Wolfenbuettel

A very simple decision tree

Assignment of a drug to a patient:

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.1/79

Classification with decision trees


Recursive Descent:

Start at the root node.
If the current node is a leaf node: return the class assigned to the node.
If the current node is an inner node: test the attribute associated with the node, follow the branch labeled with the outcome of the test, and apply the algorithm recursively.

Intuitively: Follow the path corresponding to the case to be classified.

University of Applied Sciences Braunschweig/Wolfenbuettel


Data Mining p.2/79

Classification with decision trees


Assignment of a drug to a 30 year old patient with normal blood pressure and:

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.3/79

Classification with decision trees


Assignment of a drug to a 30 year old patient with normal blood pressure and:

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.4/79

Classification with decision trees


Assignment of a drug to a 30 year old patient with normal blood pressure and:

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.5/79

Induction of decision trees


Top-down approach: build the decision tree from top to bottom (from the root to the leaves).

Greedy selection of a test attribute: compute an evaluation measure for all attributes and select the attribute with the best evaluation.

Divide and conquer / recursive descent: divide the example cases according to the values of the test attribute and apply the procedure recursively to the subsets. Terminate the recursion if
all cases belong to the same class, or
no more test attributes are available.
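As a quick preview, decision tree induction is also available in R through the add-on package rpart (not used elsewhere on these slides); a minimal sketch on the iris data read above:

library(rpart)                                                     # add-on package for decision trees
tree <- rpart(factor(species) ~ ., data = iris, method = "class")  # grow a tree for the species
print(tree)                                                        # text representation of the induced tree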

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.6/79

Decision tree induction: Example


Patient database

No    Sex      Age   Blood pr.   Drug
1     male     20    normal      A
2     female   73    normal      B
3     female   37    high        A
4     male     33    low         B
5     female   48    high        A
6     male     29    normal      A
7     female   52    normal      B
8     male     42    low         B
9     male     61    normal      B
10    female   30    normal      A
11    female   26    low         B
12    male     54    high        A

No: example cases; Sex, Age, Blood pr.: descriptive attributes; Drug: class attribute.

Assignment of drug (without patient attributes): always drug A or always drug B, 50% correct (in 6 of 12 cases).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.7/79

Decision tree induction: Example

Sex of the patient

No    Sex      Drug
1     male     A
6     male     A
12    male     A
4     male     B
8     male     B
9     male     B
3     female   A
5     female   A
10    female   A
2     female   B
7     female   B
11    female   B

Division w.r.t. Sex: male/female.

Assignment of drug:
male:    A or B, 50% correct (in 3 of 6 cases)
female:  A or B, 50% correct (in 3 of 6 cases)
total:   50% correct (in 6 of 12 cases)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.8/79

Decision tree induction: Example


Age of the patient

No    Age   Drug
1     20    A
11    26    B
6     29    A
10    30    A
4     33    B
3     37    A
8     42    B
5     48    A
7     52    B
12    54    A
9     61    B
2     73    B

Sort according to age. Find the best age split, here: ca. 40 years.

Assignment of drug:
Age ≤ 40:  A, 67% correct (in 4 of 6 cases)
Age > 40:  B, 67% correct (in 4 of 6 cases)
total:     67% correct (in 8 of 12 cases)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.9/79

Decision tree induction: Example


Blood pressure of the patient

No    Blood pr.   Drug
3     high        A
5     high        A
12    high        A
1     normal      A
6     normal      A
10    normal      A
2     normal      B
7     normal      B
9     normal      B
4     low         B
8     low         B
11    low         B

Division w.r.t. blood pressure: high/normal/low.

Assignment of drug:
high:    A, 100% correct (in 3 of 3 cases)
normal:  50% correct (in 3 of 6 cases)
low:     B, 100% correct (in 3 of 3 cases)
total:   75% correct (in 9 of 12 cases)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.10/79

Decision tree induction: Example

Current decision tree:

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.11/79

Decision tree induction: Example


Blood pressure and sex

Only patients with normal blood pressure are considered. Division w.r.t. Sex: male/female.

No    Blood pr.   Sex      Drug
1     normal      male     A
6     normal      male     A
9     normal      male     B
2     normal      female   B
7     normal      female   B
10    normal      female   A

Assignment of drug:
male:    A, 67% correct (2 of 3)
female:  B, 67% correct (2 of 3)
total:   67% correct (4 of 6)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.12/79

Decision tree induction: Example


Blood pressure and age

Only patients with normal blood pressure are considered. Sort according to age. Find the best age split, here: ca. 40 years.

No    Blood pr.   Age   Drug
1     normal      20    A
6     normal      29    A
10    normal      30    A
7     normal      52    B
9     normal      61    B
2     normal      73    B

Assignment of drug:
Age ≤ 40:  A, 100% correct (3 of 3)
Age > 40:  B, 100% correct (3 of 3)
total:     100% correct (6 of 6)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.13/79

Decision tree induction: Example

Resulting decision tree:
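A text rendering of the tree that results from the splits derived above (blood pressure at the root, an age split for the normal blood pressure branch):

Blood pressure?
  high    -> Drug A
  low     -> Drug B
  normal  -> Age <= 40 -> Drug A
             Age >  40 -> Drug B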

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.14/79

Decision tree induction: Notation

S:      a set of case or object descriptions
C:      the class attribute, with values c_1, ..., c_{n_C} (n_C: number of classes)
A:      one of the other attributes, with values a_1, ..., a_{n_A} (n_A: number of attribute values; the attribute index is dropped in the following)

N:      total number of case or object descriptions, i.e. N = |S|
N_i.:   absolute frequency of the class c_i
N_.j:   absolute frequency of the attribute value a_j
N_ij:   absolute frequency of the combination of the class c_i and the attribute value a_j
        (N_i. = Σ_j N_ij and N_.j = Σ_i N_ij)

p_i. = N_i./N:       relative frequency of the class c_i
p_.j = N_.j/N:       relative frequency of the attribute value a_j
p_ij = N_ij/N:       relative frequency of the combination of class c_i and attribute value a_j
p_i|j = N_ij/N_.j:   relative frequency of the class c_i in cases having attribute value a_j

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.15/79

Principle of decision tree induction

function grow_tree(S : set of cases) : node;
begin
  best_v := WORTHLESS;
  for all untested attributes A do
    compute frequencies N_ij, N_i., N_.j for 1 <= i <= n_C and 1 <= j <= n_A;
    compute the value v of an evaluation measure using N_ij, N_i., N_.j;
    if v > best_v then best_v := v; best_A := A; end;
  end;
  if best_v = WORTHLESS
  then create leaf node x;
       assign majority class of S to x;
  else create test node x;
       assign test on attribute best_A to x;
       for all a in dom(best_A) do
         x.child[a] := grow_tree(S restricted to best_A = a);
       end;
  end;
  return x;
end;

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.16/79

Evaluation Measures

Evaluation measure used in the above example: rate of correctly classified example cases. Advantage: simple to compute, easy to understand. Disadvantage: works well only for two classes. If there are more than two classes, the rate of misclassified example cases neglects a significant amount of the available information. Only the majority class, that is, the class occurring most often in (a subset of) the example cases, is really considered. The distribution of the other classes has no influence. However, a good choice here can be important for deeper levels of the decision tree.
Data Mining p.17/79

University of Applied Sciences Braunschweig/Wolfenbuettel

Evaluation measures
Therefore: study also other evaluation measures.

Here: information gain and its various normalisations, and the χ² measure (well-known in statistics).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.18/79

An Information-theoretic Evaluation Measure


Information Gain

(Kullback/Leibler 1951, Quinlan 1986)

Based on Shannon entropy  H = − Σ_i p_i log₂ p_i

H(C):                          entropy of the class distribution (C: class attribute)
H(C|A):                        expected entropy of the class distribution if the value of the attribute A becomes known
I_gain(C, A) = H(C) − H(C|A):  expected entropy reduction or information gain
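A small sketch in R of these quantities, evaluated on the drug/blood pressure columns of the patient example (vectors typed in directly; the helper function names are made up):

entropy <- function(x) {                    # Shannon entropy of a vector of labels
  p <- table(x) / length(x)
  -sum(p * log2(p))
}
info_gain <- function(class, attr) {        # H(C) - H(C | A)
  h_cond <- sum(sapply(split(class, attr),
                       function(s) length(s) / length(class) * entropy(s)))
  entropy(class) - h_cond
}
drug <- c("A","B","A","B","A","A","B","B","B","A","B","A")
bp   <- c("normal","normal","high","low","high","normal",
          "normal","low","normal","normal","low","high")
info_gain(drug, bp)   # 0.5 bits: knowing the blood pressure halves the class entropy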
Data Mining p.19/79

University of Applied Sciences Braunschweig/Wolfenbuettel

Example
ID    Height   Weight   Long hair   Sex
1     m        n        n           m
2     s        l        y           f
3     t        h        n           m
4     s        n        y           f
5     t        n        y           f
6     s        l        n           f
7     s        h        n           m
8     m        n        n           f
9     m        l        y           f
10    t        n        n           m

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.20/79

Example
Evaluation of the attribute Height (data table as above).



University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.21/79

Example
Evaluation of the attribute Height, continued (data table as above).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.22/79

Example
Evaluation of the attribute Height, continued (data table as above).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.23/79

Example

Height

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.24/79

Example
(data table as above)

Weight

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.25/79

Example
ID 1 2 3 4 5 6 7 8 9 10 Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m

Weight

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.26/79

Example
ID 1 2 3 4 5 6 7 8 9 10 Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m

Weight

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.27/79

Example

Weight

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.28/79

Example
ID 1 2 3 4 5 6 7 8 9 10 Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m

long_hair

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.29/79

Example
ID 1 2 3 4 5 6 7 8 9 10 Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m

long_hair

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.30/79

Example

long_hair

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.31/79

Example

The attribute Weight yields the largest reduction of entropy.

Weight

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.32/79
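A minimal sketch in R of the computation behind these example slides, using the 10-case data set reconstructed from the table above (Weight should obtain the largest information gain at the root node):

d <- data.frame(
  Height    = c("m","s","t","s","t","s","s","m","m","t"),
  Weight    = c("n","l","h","n","n","l","h","n","l","n"),
  Long.hair = c("n","y","n","y","y","n","n","n","y","n"),
  Sex       = c("m","f","m","f","f","f","m","f","f","m"),
  stringsAsFactors = FALSE)

entropy <- function(x) {                 # Shannon entropy of a distribution
  p <- prop.table(table(x)); p <- p[p > 0]
  -sum(p * log2(p))
}
info.gain <- function(attr, class) {     # H(C) - sum_i p(a_i) H(C | a_i)
  w <- prop.table(table(attr))
  entropy(class) - sum(w * tapply(class, attr, entropy))
}
sapply(d[c("Height", "Weight", "Long.hair")],
       function(a) info.gain(a, d$Sex))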

Example
The remaining data table to be considered in the node Weight = n:

ID   Height   Long hair   Sex
 1     m          n        m
 4     s          y        f
 5     t          y        f
 8     m          n        f
10     t          n        m

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.33/79

Example

ID 1 4 5 8 10

Height

m s t m t

Long hair Sex n m y f y f n f n m

Height

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.34/79

Example

ID 1 4 5 8 10

Height

m s t m t

Long hair Sex n m y f y f n f n m

Height

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.35/79

Example

ID 1 4 5 8 10

Height

m s t m t

Long hair Sex n m y f y f n f n m

Height

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.36/79

Example

ID 1 4 5 8 10

Height Long hair Sex m n m s y f t y f m n f t n m

long_hair

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.37/79

Example

ID 1 4 5 8 10

Height Long hair Sex m n m s y f t y f m n f t n m

long_hair

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.38/79

Example

The attribute long hair yields the largest reduction of entropy.

Weight

Long hair

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.39/79

Example

For the remaining node, only the attribute Height is left, with the remaining data table:

ID   Height   Sex
 1     m       m
 8     m       f
10     t       m

Therefore, the resulting decision tree is:

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.40/79

Example

Weight

Long hair

Height

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.41/79

A complex tree

[Figure: a more complex decision tree for the same data, with Long hair at the root and further tests on Height and Weight; some leaves remain undetermined (?).]


Data Mining p.42/79

University of Applied Sciences Braunschweig/Wolfenbuettel

Interpretation of Shannon entropy

Let S = {s_1, ..., s_n} be a finite set of alternatives having positive probabilities P(s_i), i = 1, ..., n, satisfying sum_{i=1}^{n} P(s_i) = 1.

Shannon Entropy:
H(S) = - sum_{i=1}^{n} P(s_i) log_2 P(s_i)

Intuitively: Expected number of yes/no questions that have to be asked in order to determine the obtaining alternative.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.43/79

Interpretation of Shannon entropy

Suppose there is an oracle which knows the obtaining alternative, but responds only if the question can be answered with yes or no. A better question scheme than asking for one alternative after the other can easily be found: Divide the set into two subsets of about equal size. Ask for containment in an arbitrarily chosen subset. Apply this scheme recursively. The number of questions is then bounded by the ceiling of log_2 n.
Data Mining p.44/79

University of Applied Sciences Braunschweig/Wolfenbuettel

Question/Coding Schemes

Shannon entropy: 2.15 bit/symbol

Probabilities of the five alternatives: 0.10, 0.15, 0.16, 0.19, 0.40

Linear Traversal                            Equal Size Subsets
[question tree: ask for one alternative     [question tree: split recursively into
 after the other]                            subsets of about equal size]

Code length: 3.24 bit/symbol                Code length: 2.59 bit/symbol
Code efficiency: 0.664                      Code efficiency: 0.830

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.45/79
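A quick check of these numbers in R (the code lengths 3.24 and 2.59 bit/symbol are taken from the slide; an efficiency is the ratio of the Shannon entropy to the code length):

p <- c(0.10, 0.15, 0.16, 0.19, 0.40)
H <- -sum(p * log2(p))   # Shannon entropy, about 2.15 bit/symbol
H / 3.24                 # efficiency of the linear traversal scheme  (about 0.664)
H / 2.59                 # efficiency of the equal-size subset scheme (about 0.830)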

Question/Coding Schemes

Splitting into subsets of about equal size can lead to a bad arrangement of the alternatives into subsets high expected number of questions. Good question schemes take the probability of the alternatives into account. Shannon-Fano Coding (1948)

Build the question/coding scheme top-down. Sort the alternatives w.r.t. their probabilities. Split the set so that the subsets have about equal probability (splits must respect the probability order of the alternatives).
Data Mining p.46/79

University of Applied Sciences Braunschweig/Wolfenbuettel

Question/Coding Schemes
Huffman Coding

(1952)

Build the question/coding scheme bottom-up. Start with one element sets. Always combine those two sets that have the smallest probabilities.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.47/79

Question/Coding Schemes

Shannon entropy: 2.15 bit/symbol

Shannon-Fano Coding (1948)                  Huffman Coding (1952)
[question tree built top-down by            [question tree built bottom-up by
 splitting off subsets of about              repeatedly combining the two sets
 equal probability]                          with the smallest probabilities]

Code length: 2.25 bit/symbol                Code length: 2.20 bit/symbol
Code efficiency: 0.955                      Code efficiency: 0.977

University of Applied Sciences Braunschweig/Wolfenbuettel


Data Mining p.48/79

Question/Coding Schemes

It can be shown that Huffman coding is optimal if we have to determine the obtaining alternative in a single instance. (No question/coding scheme has a smaller expected number of questions.) Only if the obtaining alternative has to be determined in a sequence of (independent) situations, this scheme can be improved. Idea: Process the sequence not instance by instance, but combine two, three or more consecutive instances and ask directly for the obtaining combination of alternatives.
Data Mining p.49/79

University of Applied Sciences Braunschweig/Wolfenbuettel

Question/Coding Schemes

Although this enlarges the question/coding scheme, the expected number of questions per identification is reduced (because each interrogation identifies the obtaining alternative for several situations). However, the expected number of questions per identification cannot be made arbitrarily small. Shannon showed that there is a lower bound, namely the Shannon entropy.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.50/79

Interpretation of Shannon Entropy

Shannon entropy: 1.875 bit/symbol

If the probability distribution allows for a perfect Huffman code (code efficiency 1), the Shannon entropy can easily be interpreted as follows:

Perfect Question Scheme
[question tree in which each alternative's path length equals -log_2 of its occurrence probability]

Code length: 1.875 bit/symbol
Code efficiency: 1

Expected number of needed yes/no questions.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.51/79

Other evaluation measures from information theory


Normalized Information Gain

Information gain is biased towards many-valued attributes. Normalization removes / reduces this bias.

Information Gain Ratio (Quinlan 1986 / 1993)

Symmetric Information Gain Ratio (López de Mántaras 1991)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.52/79

Bias of information gain


Information gain is biased towards many-valued attributes,

i.e., of two attributes having about the same information content it tends to select the one having more values.

The reasons are quantization effects caused by the finite number of example cases (due to which only a finite number of different probabilities can result in estimations), in connection with the following theorem:

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.53/79

Bias of information gain

Theorem: Let A, B, and C be three attributes with finite domains and let their joint probability distribution be strictly positive, i.e., P(A = a, B = b, C = c) > 0 for all a, b, c. Then

I_gain(C, A B) >= I_gain(C, B),

with equality holding only if the attributes A and C are conditionally independent given B, i.e., if P(C | A, B) = P(C | B).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.54/79

The χ² measure

Compares the actual joint distribution with a hypothetical independent distribution. Uses absolute comparison. Can be interpreted as a difference measure.

Side remark: Information gain can also be interpreted as a difference measure.


Data Mining p.55/79

University of Applied Sciences Braunschweig/Wolfenbuettel

Contingency tables

[Table: generic contingency table with cell counts n_ij, row sums n_i. (marginal of X) and column sums n_.j (marginal of Y).]

The random variable X can take the values x_1, ..., x_r, the random variable Y the values y_1, ..., y_c.

n_ij is the (absolute) frequency of occurrences of the observation (x_i, y_j).


University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.56/79

Contingency tables

n_i. and n_.j are the marginal (absolute) frequencies. If X and Y are independent, then the expected absolute frequencies are

e_ij = n_i. * n_.j / n   for all i and all j.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.57/79

χ² independence test

Example. 1000 people were asked which political party they voted for, in order to find out whether the choice of the party and the sex of the voter are independent.

pol. party   female   male    sum
SPD            200     170    370
CDU/CSU        200     200    400
Grüne           45      35     80
FDP             25      35     70
PDS             20      30     50
Others          22       5     27
No answer        8       5     13
sum            520     480   1000
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.58/79

χ² independence test

Expected frequencies (categories Others and No answer merged into O/NA):

pol. party    SPD    CDU/CSU   Grüne   FDP    PDS    O/NA
female       192.4    208.0     41.6   36.4   26.0   20.8
male         177.6    192.0     38.4   33.6   24.0   19.2

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.59/79

χ² independence test

pol. party   female   male    sum
SPD            200     170    370
CDU/CSU        200     200    400
Grüne           45      35     80
FDP             25      35     70
PDS             20      30     50
O/NA            30      10     40
sum            520     480   1000

For instance, for the cell (CDU/CSU, female):


Data Mining p.60/79

University of Applied Sciences Braunschweig/Wolfenbuettel
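The whole test can be reproduced in R with a few lines; a minimal sketch using the merged table above (chisq.test also returns the expected frequencies):

obs <- rbind(female = c(200, 200, 45, 25, 20, 30),
             male   = c(170, 200, 35, 35, 30, 10))
colnames(obs) <- c("SPD", "CDU/CSU", "Gruene", "FDP", "PDS", "O/NA")
chisq.test(obs)            # chi-squared statistic, degrees of freedom, p-value
chisq.test(obs)$expected   # expected frequencies under independence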

Treatment of numerical attributes


General Approach: Discretization

Preprocessing I
  Form equally sized or equally populated intervals.

Preprocessing II / Multisplits during tree construction
  Build a decision tree using only the numeric attribute.
  Flatten the tree to obtain a multi-interval discretization.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.61/79

Treatment of numerical attributes


During the tree construction

Sort the example cases according to the attribute's values. Construct a binary symbolic attribute for every possible split (values: <= threshold and > threshold). Compute the evaluation measure for these binary attributes. Possible improvement: Add a penalty depending on the number of splits.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.62/79

Treatment of numerical attributes


Consider a numerical attribute A whose domain should be split into k intervals.

k - 1 cut points t_1 < t_2 < ... < t_{k-1} are needed.

These cut points are chosen in such a way that the entropy induced by this partition is minimised. t_0 and t_k denote the left and right boundary of the domain of the attribute A.

Assume that n_i (i = 1, ..., k) of the data objects fall into the interval between t_{i-1} and t_i for the considered attribute A.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.63/79

Treatment of numerical attributes


n_ij denotes the number of data objects among the n_i objects in this interval that belong to class c_j. Then the entropy in the interval between t_{i-1} and t_i is

H_i = - sum_j (n_ij / n_i) log_2 (n_ij / n_i).

The entropy induced by the partition into the k corresponding intervals is

H = sum_{i=1}^{k} (n_i / n) H_i,

which should be minimised by a suitable choice of the cut points t_i.


University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.64/79

Treatment of numerical attributes


A value in the domain of the attribute is a boundary point if the following holds for the sequence of sorted attribute values: there are two objects belonging to different classes whose values lie immediately below and above this value, with no other object's value strictly in between. For the minimisation of the entropy, it is sufficient to consider boundary points only.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.65/79

Treatment of numerical attributes


[Example: a sorted sequence of attribute values with their classes; the boundary points are marked by lines.]

For binary splits (only one cut point) all boundary points are considered and the one with the smallest entropy is chosen. For multiple splits a recursive procedure is applied.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.66/79
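A minimal sketch in R of choosing a single (binary) cut point by minimising the induced entropy; for simplicity all midpoints between adjacent values are evaluated, although it would suffice to consider the boundary points (Iris data, attribute petal length, class Species):

entropy <- function(x) { p <- prop.table(table(x)); p <- p[p > 0]; -sum(p * log2(p)) }
x  <- iris$Petal.Length
cl <- iris$Species
v  <- sort(unique(x))
cuts <- (v[-1] + v[-length(v)]) / 2                  # candidate cut points
part.entropy <- sapply(cuts, function(t) {
  left <- x <= t
  mean(left) * entropy(cl[left]) + mean(!left) * entropy(cl[!left])
})
cuts[which.min(part.entropy)]                        # best binary split point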

Treatment of missing values

Induction

Weight the evaluation measure with the fraction of cases with known values. Idea: The attribute provides information only if it is known. Try to find a surrogate test attribute with similar properties (CART, Breiman et al. 1984). Assign the case to all branches, weighted in each branch with the relative frequency of the corresponding attribute value (C4.5, Quinlan 1993).
Data Mining p.67/79

University of Applied Sciences Braunschweig/Wolfenbuettel

Treatment of missing values

Classication

Use the surrogate test attribute found during induction. Follow all branches of the test attribute, weighted with their relative number of cases, aggregate the class distributions of all leaves reached, and assign the majority class of the aggregated class distribution.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.68/79

Pruning decision trees


Pruning serves the purpose
  to simplify the tree (improve interpretability),
  to avoid overfitting (improve generalization).

Basic ideas:
  Replace bad branches (subtrees) by leaves.
  Replace a subtree by its largest branch if it is better.

Common approaches:
  Reduced error pruning
  Pessimistic pruning
  Confidence level pruning
  Minimum description length pruning

Data Mining p.69/79

University of Applied Sciences Braunschweig/Wolfenbuettel

Reduced error pruning

Classify a set of new example cases with the decision tree. (These cases must not have been used for the induction!) Determine the number of errors for all leaves. The number of errors of a subtree is the sum of the errors of all of its leaves. Determine the number of errors for leaves that replace subtrees. If such a leaf leads to the same or fewer errors than the subtree, replace the subtree by the leaf. If a subtree has been replaced, recompute the number of errors of the subtrees it is part of.
Data Mining p.70/79

University of Applied Sciences Braunschweig/Wolfenbuettel

Reduced error pruning

Advantage:
  Very good pruning, effective avoidance of overfitting.

Disadvantage:
  Additional example cases needed.
  Number of cases in a leaf has no influence.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.71/79

Pessimistic pruning

Classify a set of example cases with the decision tree. (These cases may or may not have been used for the induction.) Determine the number of errors for all leaves and increase this number by a fixed, user-specified amount. The number of errors of a subtree is the sum of the errors of all of its leaves. Determine the number of errors for leaves that replace subtrees (also increased by the same amount). If such a leaf leads to the same or fewer errors than the subtree, replace the subtree by the leaf and recompute subtree errors.
Data Mining p.72/79

University of Applied Sciences Braunschweig/Wolfenbuettel

Pessimistic pruning

Advantage: No additional example cases needed.

Disadvantage: Number of cases in a leaf has no influence.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.73/79

Condence level pruning

Like pessimistic pruning, but the number of errors is computed as follows: See classification in a leaf as a Bernoulli experiment (error/no error). Estimate an interval for the error probability based on a user-specified confidence level (use the approximation of the binomial distribution by a normal distribution). Increase the error number to the upper limit of the confidence interval times the number of cases assigned to the leaf. Formal problem: Classification is not a random experiment.
Data Mining p.74/79

University of Applied Sciences Braunschweig/Wolfenbuettel

Condence level pruning

Advantage: No additional example cases needed, good pruning.

Disadvantage: Statistically dubious foundation.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.75/79

Decision tree pruning: An example

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.76/79

Decision tree pruning: An example

A decision tree for the Iris data

(induced with information gain ratio, unpruned)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.77/79

Decision tree pruning: An example


A decision tree for the Iris data (pruned with confidence level pruning and with pessimistic pruning).

Left: 7 instead of 11 nodes, 4 instead of 2 misclassifications. Right: 5 instead of 11 nodes, 6 instead of 2 misclassifications. The right tree is minimal for the three classes.
Data Mining p.78/79

University of Applied Sciences Braunschweig/Wolfenbuettel

Predictive vs. descriptive tasks

Predictive tasks:

The decision tree (or more generally, the classier) is constructed in order to apply it to new unclassied data.

Descriptive tasks:

The purpose of the tree construction is to understand how classification has been carried out so far.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.79/79

Bayes theorem

P(H | E) = P(E | H) * P(H) / P(E)

Proof: P(H | E) * P(E) = P(H, E) = P(E | H) * P(H).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.1/48

Bayes theorem

Interpretation: The probability P(H | E) that a hypothesis H is true, given that event E has occurred, can be derived from the probability

of the hypothesis itself, of the event, and the conditional probability of the event given the hypothesis.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.2/48

Bayes classiers
Principle of Bayes classifiers: The value of the nominal attribute C should be predicted based on the values of the attributes A_1, ..., A_m, i.e. the attribute vector. If c is one of the possible values of the attribute C and the other attributes have taken the values a_1, ..., a_m, then Bayes theorem yields the probability for C = c given a_1, ..., a_m:


Data Mining p.3/48

University of Applied Sciences Braunschweig/Wolfenbuettel

Bayes classiers

Compute this probability for all possible values (classes) of the nominal attribute and choose the class with the highest probability. (A cost matrix can also be incorporated.) Since the denominator is independent of the class, it does not have any influence on the decision for the class. Therefore, usually only the likelihoods

are considered.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.4/48

Bayes classiers
The probability P(C = c) can be estimated easily based on the given data:

P(C = c) ≈ (no. of data from class c) / (no. of data)

In principle, the probability P(a_1, ..., a_m | C = c) could be determined analogously:

P(a_1, ..., a_m | C = c) ≈ (no. of data from class c with values a_1, ..., a_m) / (no. of data from class c)
University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.5/48

Bayes classiers
For m nominal attributes, each having three possible values, we would need at least 3^m data objects to have at least one example per combination. Therefore, the computation is carried out under the (naïve, unrealistic) assumption that the attributes are independent given the class, i.e.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.6/48

Bayes classiers

The probabilities P(A_i = a_i | C = c) can be computed easily:

P(A_i = a_i | C = c) ≈ (no. of data from class c with A_i = a_i) / (no. of data from class c)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.7/48

Nave Bayes classier


Given: A data set with only nominal attributes. Based on the values of the attributes A_1, ..., A_m, a prediction for the value of the attribute C should be derived.

For each class c (each value in the domain of C) compute the likelihood

P(C = c) * prod_{i=1}^{m} P(A_i = a_i | C = c)

under the assumption that the attributes A_1, ..., A_m are independent given the class.

University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.8/48

Nave Bayes classier


Assign the object to the class with the highest likelihood.

This Bayes classifier is called naïve because of the (conditional) independence assumption for the attributes A_1, ..., A_m.

Although this assumption is unrealistic in most cases, the classifier often yields good results when not too many attributes are correlated.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.9/48
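A minimal runnable sketch in R of this computation on the 10-case example data set from the decision tree slides (the query object below is hypothetical and chosen only for illustration):

d <- data.frame(
  Height    = c("m","s","t","s","t","s","s","m","m","t"),
  Weight    = c("n","l","h","n","n","l","h","n","l","n"),
  Long.hair = c("n","y","n","y","y","n","n","n","y","n"),
  Sex       = c("m","f","m","f","f","f","m","f","f","m"),
  stringsAsFactors = FALSE)
query <- list(Height = "t", Weight = "n", Long.hair = "n")   # hypothetical object

lik <- sapply(c("m", "f"), function(cls) {
  sub   <- d[d$Sex == cls, ]
  prior <- nrow(sub) / nrow(d)                       # P(Sex = cls)
  cond  <- prod(sapply(names(query), function(a)     # product of P(A_i = a_i | Sex = cls)
             mean(sub[[a]] == query[[a]])))
  prior * cond
})
lik / sum(lik)                                       # posterior probabilities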

Example
How does a naïve Bayes classifier classify the object ? We need to calculate

Sex

Height

Weight long_hair

Height Sex Weight Sex long_hair Sex Sex


and
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.10/48

Example

Sex

Height Weight

Long_hair

Height Sex Weight Sex Long_hair Sex Sex

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.11/48

Example

Height
ID 1 2 3 4 5 6 7 8 9 10

Sex

Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m


Data Mining p.12/48

University of Applied Sciences Braunschweig/Wolfenbuettel

Example

Height
ID 1 2 3 4 5 6 7 8 9 10

Sex

Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m


Data Mining p.13/48

University of Applied Sciences Braunschweig/Wolfenbuettel

Example

Height
ID 1 2 3 4 5 6 7 8 9 10

Sex

Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m


Data Mining p.14/48

University of Applied Sciences Braunschweig/Wolfenbuettel

Example

Weight
ID 1 2 3 4 5 6 7 8 9 10

Sex

Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m


Data Mining p.15/48

University of Applied Sciences Braunschweig/Wolfenbuettel

Example

Long_hair
ID 1 2 3 4 5 6 7 8 9 10

Sex

Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m


Data Mining p.16/48

University of Applied Sciences Braunschweig/Wolfenbuettel

Example

Sex

ID 1 2 3 4 5 6 7 8 9 10

Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m


Data Mining p.17/48

University of Applied Sciences Braunschweig/Wolfenbuettel

Example

Sex

Height

Weight Long_hair

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.18/48

Example

Height
ID 1 2 3 4 5 6 7 8 9 10

Sex

Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m


Data Mining p.19/48

University of Applied Sciences Braunschweig/Wolfenbuettel

Example

Height
ID 1 2 3 4 5 6 7 8 9 10

Sex

Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m


Data Mining p.20/48

University of Applied Sciences Braunschweig/Wolfenbuettel

Example

Height
ID 1 2 3 4 5 6 7 8 9 10

Sex

Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m


Data Mining p.21/48

University of Applied Sciences Braunschweig/Wolfenbuettel

Example

Weight
ID 1 2 3 4 5 6 7 8 9 10

Sex

Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f g n n m


Data Mining p.22/48

University of Applied Sciences Braunschweig/Wolfenbuettel

Example

Long_hair
ID 1 2 3 4 5 6 7 8 9 10

Sex

Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m


Data Mining p.23/48

University of Applied Sciences Braunschweig/Wolfenbuettel

Example

Sex

ID 1 2 3 4 5 6 7 8 9 10

Height Weight Long hair Sex m n n m s l y f t h n m s n y f t n y f s l n f s h n m m n n f m l y f t n n m


Data Mining p.24/48

University of Applied Sciences Braunschweig/Wolfenbuettel

Example

Sex

Height Weight

Long_hair

Sex

Height Weight

Long_hair

Classication of : female (f)


University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.25/48

Example

The object was classified by the naïve Bayes classifier. The data set does not contain any object with this combination of values. A full Bayes classifier would not be able to classify this object.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.26/48

More examples
Input

Class ? m m

The object is classified by the naïve Bayes classifier as belonging to one of the two classes, although the data set contains two such objects, one from each class. The main impact comes from the attribute Long hair, whose value for this object has probability 1 in the predicted class, but a low probability in the other class.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.27/48

Laplace correction
If a single likelihood is zero, then the overall likelihood is automatically zero, even when the other likelihoods are high. Therefore: Laplace correction:

P(A_i = a_i | C = c) ≈ (N(A_i = a_i, C = c) + γ) / (N(C = c) + γ * n_{A_i})   (n_{A_i}: number of values of attribute A_i)

γ is called the Laplace correction. γ = 0: maximum likelihood estimation. Common choices: γ = 1 or γ = 1/2.


University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.28/48

Laplace correction
Laplace correction for P(Height | Sex = m) with γ = 1:

Height    #    # + γ    ML estimate    Laplace estimate
  s       1      2          1/4              2/7
  m       1      2          1/4              2/7
  t       2      3          2/4              3/7

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.29/48
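A one-line check of these numbers in R (heights of the four male cases, IDs 1, 3, 7 and 10 of the example data set; γ = 1):

height.m <- c("m", "t", "s", "t")
counts   <- table(factor(height.m, levels = c("s", "m", "t")))
gamma    <- 1
(counts + gamma) / (sum(counts) + gamma * length(counts))   # 2/7, 2/7, 3/7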

Nave Bayes classier: Implementation

The counting of the frequencies should be carried out once, when the naïve Bayes classifier is constructed. The probability distributions for the single attributes should be stored in a table. When the naïve Bayes classifier is applied to new data, only the corresponding values in the table need to be multiplied.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.30/48

Treatment of missing values

During learning:

The missing values are simply not counted for the frequencies of the corresponding attribute.

During classication:

Only the probabilities (likelihoods) of those attributes are multiplied for which a value is available.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.31/48

Numerical attributes
Estimation of probabilities:
Numerical attributes:

Assume a normal distribution.

Estimation of the mean value:

μ̂_c = (1 / N_c) * (sum of the attribute values of all data objects from class c)



University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.32/48

Numerical attributes

Estimation of the variance:

σ̂²_c = (1 / k) * (sum of (x - μ̂_c)² over the data objects x from class c)

k = N_c : maximum likelihood estimation
k = N_c - 1 : unbiased estimation

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.33/48
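A small sketch in R of these estimates for one numeric attribute of the Iris data (sd uses the unbiased divisor n - 1; the value 4.5 below is only illustrative):

mu    <- tapply(iris$Petal.Length, iris$Species, mean)
sigma <- tapply(iris$Petal.Length, iris$Species, sd)
dnorm(4.5, mean = mu["versicolor"], sd = sigma["versicolor"])   # class-conditional density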

Example

100 data points, 2 classes Small squares: mean values Inner ellipses: one standard deviation Outer ellipses: two standard deviations Classes overlap: classication is not perfect
Nave Bayes classier

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.34/48

Example

20 data points, 2 classes Small squares: mean values Inner ellipses: one standard deviation Outer ellipses: two standard deviations Attributes are not conditionally independent given the class

Nave Bayes classier

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.35/48

Conditional independence

Reminder: stochastic independence (unconditional)

P(A, B) = P(A) * P(B)

(Joint probability is the product of the individual probabilities.) Comparison to the product rule

P(A, B) = P(A | B) * P(B)

shows that this is equivalent to

P(A | B) = P(A).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.36/48

Conditional independence

The same formulae hold conditionally, i.e.

P(A, B | C) = P(A | C) * P(B | C)

and

P(A | B, C) = P(A | C).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.37/48

Conditional independence: Example

Group 1

Group 2

Data Mining p.38/48

University of Applied Sciences Braunschweig/Wolfenbuettel

Conditional independence: Example

Group 1

University of Applied Sciences Braunschweig/Wolfenbuettel


Data Mining p.39/48

Conditional independence: Example

Group 2

Data Mining p.40/48

University of Applied Sciences Braunschweig/Wolfenbuettel

Example: Iris data

150 data points, 3 classes Iris setosa (red) Iris versicolor (green) Iris virginica (blue)

Shown: 2 out of 4 attributes sepal length sepal width petal length (horizontal) petal width (vertical) 6 misclassications on the training data (with all 4 attributes)
Nave Bayes classier

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.41/48

Full Bayes classiers

Restricted to metric/numeric attributes (only the class is nominal/symbolic).

Simplifying assumption: Each class can be described by a multivariate normal distribution.

μ_c : mean value vector for class c
Σ_c : covariance matrix for class c


University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.42/48

Full Bayes classiers


Intuitively: Each class has a bell-shaped probability density. Naive Bayes classifiers: Covariance matrices are diagonal matrices. (Details about this relation are given below.)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.43/48

Full Bayes classiers


Estimation of probabilities:

Estimation of the mean value vector:
μ̂_c = (1 / N_c) * (sum of the data vectors x from class c)

Estimation of the covariance matrix:
Σ̂_c = (1 / k) * (sum of (x - μ̂_c)(x - μ̂_c)ᵀ over the data vectors x from class c)

k = N_c : maximum likelihood estimation
k = N_c - 1 : unbiased estimation


Data Mining p.44/48

University of Applied Sciences Braunschweig/Wolfenbuettel
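A sketch in R of these estimates for the Iris data (class-wise mean vectors and covariance matrices; cov uses the unbiased divisor n - 1):

X <- iris[, 1:4]
by(X, iris$Species, colMeans)   # mean vector per class
by(X, iris$Species, cov)        # covariance matrix per class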

Nave vs. full Bayes Classiers


where the factors are the (univariate) density functions used by a naïve Bayes classifier.


University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.45/48

Nave vs. full Bayes Classiers

Naïve Bayes classifiers for numerical data are equivalent to full Bayes classifiers with diagonal covariance matrices.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.46/48

Nave vs. full Bayes Classiers

Nave Bayes classier

Full Bayes classier

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.47/48

Full Bayes classier: Iris data

150 data points, 3 classes Iris setosa (red) Iris versicolor (green) Iris virginica (blue)

Shown: 2 out of 4 attributes sepal length sepal width petal length (horizontal) petal width (vertical) 2 misclassications on the training data (with all 4 attributes)
Full Bayes classier

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.48/48

k-Nearest neighbour classifiers

Use the given data set as the set of example cases.

Define a suitable distance measure d on the attribute space.

To classify a new object, compute the distances to all example cases.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.1/26

k-Nearest neighbour classifiers

Find the k closest example cases.

Assign the new object to the most frequent class among the k closest example cases.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.2/26

Distance measures

For a nominal attribute two objects can only have the same value (distance = 0) or different values (usually distance = 1). For ordinal attributes the distance should increase depending on the distances of the ranks in the corresponding linear order. For numerical attributes the absolute or the squared difference between values is very common.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.3/26

Distance measures
Unless the importance or weight of each attribute for the distance is known, the distances for the single attributes should yield similar values. For numerical attributes, the distance depends on the measurement unit.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.4/26

Distance measures
Example:

Weight and height of persons.

Measuring the weight in kg and the height in cm leads to distances (differences) approximately in the same range. Measuring the weight in g and the height in m leads to almost negligible differences for the height and to very large differences for the weight.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.5/26

Distance measures
Unless attributes are known to have different importance, all attributes should contribute roughly in the same way to the overall distance. Normalisation techniques for numerical attributes:

x' = (x - min) / (max - min)                  (extremely sensitive to outliers)

x' = (x - mean) / standard_deviation          (sensitive to outliers)

x' = (x - median) / interquartile_range       (robust against outliers)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.6/26

Distance measures
Distance normalisation for arbitrary attributes, given a distance measure d and data x_1, ..., x_n:

d_norm(x, y) = d(x, y) divided by the average pairwise distance of the given data for this attribute.

In this way: the average distance is 1 for all attributes.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.7/26

-Nearest neighbour classiers

Treatment of missing values: Ignore the corresponding attributes for the computation of the distances.

Training: very fast.
Classification: slow for large sample databases.

Interpretability: Justification of the classification based on similar known cases.

Adaptive nearest neighbour classifiers: Adapt the distance and/or select a suitable subset of cases from the sample database.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.8/26
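A minimal sketch in R (assuming the class package, which is usually installed with R): 3-nearest-neighbour classification of the Iris data after z-score normalisation of the numeric attributes.

library(class)
X <- scale(iris[, 1:4])                       # z-score normalisation per attribute
pred <- knn(train = X, test = X, cl = iris$Species, k = 3)
table(pred, iris$Species)                     # confusion matrix on the training data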

Support vector machines (SVM)


Consider a two-class linearly separable classification problem. (The two classes can be separated by a (hyper-)plane.) How to find the best separating hyperplane?

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.9/26

SVM
Training data set: (x_1, y_1), ..., (x_n, y_n), where y_i ∈ {-1, +1} indicates to which of the two classes x_i belongs.

Separating hyperplane given in the form w · x - b = 0.

w : (non-normalised) normal vector of the hyperplane

b / ||w|| : offset of the hyperplane from the origin along the normal vector

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.10/26

SVM
Choose w and b such that the separating hyperplane yields a maximum margin, i.e. such that the distance to the closest points from each of the two classes is as large as possible. For this, introduce the constraints

w · x_i - b >= +1   if y_i = +1
w · x_i - b <= -1   if y_i = -1.

w · x - b = +1 and w · x - b = -1 are then the margin hyperplanes parallel to the separating hyperplane through the closest points of the two classes.

Distance between the margin hyperplanes: 2 / ||w||
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.11/26

SVM

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.12/26

SVM
Optimization problem: Choose w and b to minimise ||w|| under the constraints

y_i (w · x_i - b) >= 1   for all i = 1, ..., n.

||w|| involves a square root. Therefore, instead of ||w||, minimise

(1/2) ||w||².

Can be solved by standard quadratic programming techniques.


University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.13/26

SVM
Dual form of the quadratic programming problem: Maximise

L(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)

under the constraints

Σ_i α_i y_i = 0   and   α_i >= 0 for all i = 1, ..., n.
Data Mining p.14/26

University of Applied Sciences Braunschweig/Wolfenbuettel

SVM

α_i > 0 only for those x_i that lie on one of the two margin hyperplanes. The separating hyperplane depends only on the support vectors on the margin hyperplanes.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.15/26

SVM
Relaxation of the restriction to linearly separable classification problems: Allow for misclassified objects between the two margin hyperplanes. Introduce slack variables ξ_i >= 0 measuring the degree of misclassification of object x_i:

y_i (w · x_i - b) >= 1 - ξ_i.

Penalise nonzero ξ_i.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.16/26

SVM
For example, linear penalty function: Minimise

(1/2) ||w||² + C Σ_i ξ_i

subject to

y_i (w · x_i - b) >= 1 - ξ_i  and  ξ_i >= 0   for all i = 1, ..., n.

C is a constant specifying how strongly misclassifications are penalised.


University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.17/26

SVM: Kernel trick


Replace the dot product by a suitable kernel function k(x, y). This corresponds to a (possibly nonlinear) transformation of the data to another space (of possibly infinite dimension) where the separation of the two classes might be carried out more easily.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.18/26

SVM: Kernel trick


Common kernel functions:

(homogeneous) polynomial:   k(x, y) = (x · y)^d
(inhomogeneous) polynomial: k(x, y) = (x · y + c)^d   (c > 0)

radial basis function: k(x, y) = exp(-γ ||x - y||²)   (γ > 0),
  also called Gaussian kernel for γ = 1 / (2 σ²)

sigmoid kernel: k(x, y) = tanh(κ (x · y) + ϑ)
  (for suitable choices of κ and ϑ)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.19/26

SVM: Multiclass problems

SVM for multiclass problems with more than two classes:

Construct an SVM for each class against all other classes. Assign an object to the class with the highest output (largest distance to the (virtual) separating hyperplane). Construct an SVM for each pair of classes. Assign an object to the class which has won the most competitions against other classes.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.20/26

SVM

Often a very good classier. Strongly dependent on the choice of a suitable kernel. Training has very high computational costs, especially for large data sets and for multiclass problems.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.21/26
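A minimal usage sketch in R, assuming the e1071 package is installed (one common SVM interface; the kernel and cost settings below are purely illustrative):

library(e1071)
model <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
table(predict(model, iris), iris$Species)   # confusion matrix on the training data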

Logistic regression

C : class attribute, X : m-dimensional random vector. Given: A set of data points, each of which belongs to one of the two classes. Desired: A simple description of the function x ↦ P(C = c | X = x). Approach: Describe P(C = c | X = x) by a logistic function.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.22/26

Classication: Logistic regression


Apply the logit transformation to P = P(C = c | X = x):

ln( P / (1 - P) ) = a_0 + a_1 x_1 + ... + a_m x_m   (a linear function of x)

The values of P may be obtained by kernel estimation.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.23/26
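A minimal sketch in R with base R's glm (mtcars is a built-in data set; the choice of response and predictors is purely illustrative):

fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)
summary(fit)
predict(fit, newdata = data.frame(hp = 120, wt = 2.8), type = "response")  # estimated P(am = 1)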

Kernel estimation
Idea:

Define an influence function (kernel), which describes how strongly a data point influences the probability estimate for neighbouring points.

Gaussian kernel:
K(x) = 1 / (sqrt(2π) σ) * exp( -x² / (2 σ²) )

Kernel estimate of the probability density, given a data set x_1, ..., x_n:
f̂(x) = (1/n) Σ_{i=1}^{n} K(x - x_i)


University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.24/26
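A minimal sketch in R: base R's density performs exactly this kind of Gaussian kernel density estimation (the attribute chosen here is only illustrative).

x <- iris$Petal.Length[iris$Species == "versicolor"]
dens <- density(x, kernel = "gaussian")   # kernel density estimate
plot(dens)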

Kernel estimation

Kernel estimation applied to a two class problem:

(The indicator is 1 if the data point belongs to the considered class and 0 otherwise.)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.25/26

Classication: Logistic regression

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.26/26

Regression

Supervised learning:

Given a data set with attributes X_1, ..., X_m and Y for a set of objects.

Find a function f : dom(X_1) x ... x dom(X_m) -> dom(Y) minimizing the error, i.e. f(x_1, ..., x_m) ≈ y for all given data objects.

Classification: Y is a nominal attribute.
Regression: Y is a numerical attribute.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.1/95

Regression line
Given: A data set (x_1, y_1), ..., (x_n, y_n) for two continuous attributes X and Y. It is assumed that there is an approximate linear dependency between X and Y: y ≈ a + b x. Find a regression line (i.e. determine the parameters a and b) such that the line fits the data as well as possible. What is a good fit?

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.2/95

Regression

y-distance vs. Euclidean distance

Usually, the mean square error in y-direction is chosen as error measure (to be minimized). It is equivalent to minimizing the sum of squared errors in y-direction.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.3/95

Regression

Other reasonable error measures:


mean absolute distance in y-direction
mean Euclidean distance
maximum absolute distance in y-direction (or equivalently: the maximum squared distance in y-direction)
maximum Euclidean distance

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.4/95

Regression line
Given data (x_1, y_1), ..., (x_n, y_n), the least squares error function is

E(a, b) = Σ_{i=1}^{n} (y_i - (a + b x_i))².

(If at least two different x-values exist,) a and b are uniquely determined by the necessary conditions for a minimum,

∂E/∂a = 0   and   ∂E/∂b = 0.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.5/95
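A minimal sketch in R with the built-in cars data set (lm fits exactly this least-squares line):

fit <- lm(dist ~ speed, data = cars)
coef(fit)     # intercept a and slope b of the regression line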

Least squares and MLE


A regression line can be interpreted as a maximum likelihood estimator (MLE).

Assumption: The data generation process can be described well by the model

Y = a + b X + ε,

where ε is normally distributed with mean 0 and (unknown) variance σ². (σ² independent of X, i.e. same dispersion of Y for all X.)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.6/95

Least squares and MLE


Therefore,

leading to the likelihood function

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.7/95

Least squares and MLE

To simplify the computation of derivatives for nding the maximum, we compute the logarithm:

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.8/95

Least squares and MLE

From this expression it becomes clear that (provided is independent of , , and ) maximizing the likelihood function is equivalent to minimizing

Interpreting the method of least squares as a maximum likelihood estimator works also for the generalisations to polynomials and multilinear functions discussed later on.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.9/95

Multilinear regression

Minimization of the error function based on partial derivatives w.r.t. the parameters does not work in the other examples of error functions, since

the absolute value and the maximum are not differentiable everywhere, and the Euclidean distance leads to a system of nonlinear equations for which no analytical solution is known.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.10/95

Nonlinear regression
For nonlinear dependencies (in the parameters) taking partial derivatives leads to nonlinear equations: (radioactive decay, (unlimited) Example: growth of bacteria, )

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.11/95

Linear regression
The least squares approach leads to an analytical solution when the regression function is linear in the coefficients (parameters):

f(x_1, ..., x_m) = a_0 + a_1 x_1 + ... + a_m x_m

Note that the attributes x_1, ..., x_m can also be derived from other attributes.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.12/95

Examples

quadratic dependency (for instance, constant , acceleration: : time, : distance): i.e.

linear dependency on different variables (for instance, electricity consumption of a suburban area based on the number of flats with one ( ), two ( ), three ( ) and four or more persons ( ) living in them):

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.13/95

Linear regression
linear regression function

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.14/95

Linear regression

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.15/95

Linear regression

implies

for

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.16/95

Linear regression

. . .

. . .

. . .


Data Mining p.17/95

University of Applied Sciences Braunschweig/Wolfenbuettel

Linear regression

Normal equations:

XᵀX a = Xᵀ y

(X: design matrix of the data, y: vector of observed values of the dependent attribute, a: vector of coefficients)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.18/95

Linear regression

The coefficients are then given by

a = (XᵀX)⁻¹ Xᵀ y

(provided XᵀX is invertible).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.19/95
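The same computation written out in R via the normal equations (same data as in the lm sketch above; solve handles the matrix inversion):

X <- cbind(1, cars$speed)          # design matrix with a column of ones
y <- cars$dist
solve(t(X) %*% X, t(X) %*% y)      # coefficient vector (XᵀX)⁻¹ Xᵀ y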

Model vs. black box


When the principal functional dependency between the predictor variables and the dependent variable is known, an explicit parameterised (possibly nonlinear) regression function can be specified. If such a model is not known, one can still try to construct a suitable regression function.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.20/95

Model vs. black box


When the functional dependency between the predictor variables and the dependent variable is not known, one can try a

linear

quadratic


Data Mining p.21/95

or cubic approach.

University of Applied Sciences Braunschweig/Wolfenbuettel

Model vs. black box


The coefficients can be interpreted as weighting factors, at least when the predictor variables have been normalised. They also provide information about a positive or negative correlation of the predictor variables with the dependent variable. Usually, complex regression functions yield black box models, which might provide a good approximation of the data, but do not admit a useful interpretation (of the coefficients).
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.22/95

Generalisation
Considering a data set as a collection of examples describing the dependency between the predictor variables and the dependent variable, the regression function should learn this dependency from the data and generalise it to new data to make correct predictions. To achieve this, the regression function must be universal (flexible) enough to be able to learn the dependency. This does not mean that a more complex regression function with more parameters leads to better results than a simple one.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.23/95

Overtting

Complex regression functions can lead to overtting:

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.24/95

Overtting

Complex regression functions can lead to overtting:

The regression function learns a description of the data, not of the structure inherent in the data. Prediction can be worse than for a simpler regression function.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.25/95

Approximation vs. extrapolation


Distinction between
approximation extrapolation,

of the data and

corresponding to a prediction in regions where no data points are available.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.26/95

Approximation vs. extrapolation


Distinction between
approximation extrapolation,

of the data and

corresponding to a prediction in regions where no data points are available.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.27/95

Approximation vs. extrapolation


Distinction between
approximation extrapolation,

of the data and

corresponding to a prediction in regions where no data points are available.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.28/95

Approximation vs. extrapolation


Distinction between
approximation extrapolation,

of the data and

corresponding to a prediction in regions where no data points are available.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.29/95

Robust regression
linear model:

computed model:

objective function:

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.30/95

M-estimators
Least squares method: ρ(e) = e².

For other choices of ρ, the function ρ should satisfy at least

ρ(e) >= 0,   ρ(0) = 0,   ρ(e) = ρ(-e),   ρ(e) >= ρ(e') if |e| >= |e'|.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.31/95

M-estimators
Dene

Computing derivatives of

and

leads to

Solution of this system of equations is the same as for the weighted least squares problem

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.32/95

M-estimators
Problem:

The weights depend on the errors the errors

depend on the weights .

and

Solution strategy: Alternating optimisation.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.33/95

Robust regression

Method least squares Huber Tukey

if if

if f

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.34/95

M-estimators: Least squares

[Plots of the least squares ρ-function ρ(e) = e² (left) and of the corresponding weight function w(e) (right), both over e from -6 to 6.]

Data Mining p.35/95

University of Applied Sciences Braunschweig/Wolfenbuettel

M-estimators: Huber

[Plots of Huber's ρ-function ρ(e) (left) and of the corresponding weight function w(e) (right), both over e from -6 to 6.]

Data Mining p.36/95

University of Applied Sciences Braunschweig/Wolfenbuettel

M-estimators: Tukey

[Plots of Tukey's ρ-function ρ(e) (left) and of the corresponding weight function w(e) (right), both over e from -6 to 6.]

Data Mining p.37/95

University of Applied Sciences Braunschweig/Wolfenbuettel

Least squares vs. robust regression

[Scatter plot comparing the fitted lines of least squares and robust regression.]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.38/95

Robust regression: Weights

[Plot of the Huber weights of the individual data points (Huber weight vs. index).]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.39/95

Robust regression with R

> library(MASS)
> reg.huber <- rlm(y ~ x1 + x2 + ..., data = ...)
> summary(reg.huber)

Plotting the weights and enabling clicking on interesting points (here: size of the data set = 100):

> plot(reg.huber$w, ylab = "Huber Weight")
> identify(1:100, reg.huber$w)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.40/95

Robust regression with R

Tukey's approach requires the package lqs. Otherwise, analogous to Huber's approach:

...
> reg.bisq <- rlm(y ~ x1 + x2 + ..., data = ..., method = "MM")
...

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.41/95

Regression & nominal attributes

If most of the predictor variables are numerical and the few nominal attributes have small domains, a regression function can be constructed for each possible combination of the values of the nominal attributes, given that the data set is sufficiently large and covers all combinations.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.42/95

Regression & nominal attributes


Attribute Type/Domain Sex F/M Vegetarian Yes/No Age numerical Height numerical Weight numerical

Example:

Task: Predict the weight based on the other attributes. Possible solution: Construct four separate regression functions for (F,Yes),(F,No),(M,Yes),(M,No) using only age and height as predictor variables.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.43/95

Regression & nominal attributes

Alternative approach: Encode the nominal attributes as numerical attributes.


Binary attributes can be encoded as 0/1 or

For nominal attributes with more than two values, a 0/1 or numerical attribute should be introduced for each possible value of the nominal attribute. Do not encode nominal attributes with more than two values in one numerical attribute, unless the nominal attribute is actually ordinal.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.44/95

Regression trees

Like decision trees, but target variable is not a class, but a numeric quantity.

Simple regression trees: Predict constant values in leaves. (blue line)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.45/95

Regression trees

More complex regression trees: Predict linear functions in leaves. (red line)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.46/95

Regression trees: Attribute selection

The variance/standard deviation is compared to the variance/standard deviation in the branches. The attribute that yields the highest reduction is selected.
Data Mining p.47/95

University of Applied Sciences Braunschweig/Wolfenbuettel

Regression Trees: An Example

A regression tree for the Iris data (petal width) (induced with reduction of sum of squared errors)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.48/95
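A minimal sketch in R, assuming the rpart package is installed (its default splitting criterion for a numeric target is the reduction of the sum of squared errors):

library(rpart)
rt <- rpart(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length, data = iris)
print(rt)                     # the induced splits and leaf predictions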

Neural networks
McCulloch-Pitts model of a neuron (1943)

A neuron is a binary switch, being either active or inactive. Each neuron has a fixed threshold value. A neuron receives input signals from excitatory (positive) synapses (connections to other neurons). A neuron receives input signals from inhibitory (negative) synapses (connections to other neurons).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.49/95

McCulloch-Pitts model

The inputs of a neuron are accumulated (integrated) for a certain time. When the threshold value of the neuron is exceeded, the neuron becomes active and sends signals to its neighbouring neurons via its synapses.

Aim of the McCulloch-Pitts model: neurobiological modelling and simulation to understand very elementary functions of neurons and the brain.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.50/95

The simple perceptron


The perceptron was introduced by Frank Rosenblatt for modelling pattern recognition abilities in 1958.

A simplified retina is equipped with receptors (input neurons) that are activated by an optical stimulus. The stimulus is passed on to an output neuron via a weighted connection (synapse). When the threshold of the output neuron is exceeded, the output is 1, otherwise 0.

Aim: Automatic learning of the weights and the threshold to classify objects shown on the retina correctly.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.51/95

The simple perceptron

A perceptron for identifying the letter F. Two positive and one negative example.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.52/95

The simple perceptron


[Schematic model of a perceptron: an input layer x_1, ..., x_n, weights w_1, ..., w_n and one output neuron with threshold θ; the output is 1 if the weighted sum of the inputs reaches the threshold, and 0 otherwise.]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.53/95

The simple perceptron


Perceptron: classification for two-class problems. (Numerical) input attributes, output class (0 or 1).

For multiclass problems use one perceptron per class.

Two parallel perceptrons


University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.54/95

Perceptron learning algorithm


Initialise the weights and the threshold value randomly. For each data object in the training data set, check whether the perceptron predicts the correct class. Each time, the perceptron predicts the wrong class, adjust the weights and the threshold value. Repeat this until no changes occur.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.55/95

The delta rule


When the perceptron makes a wrong classification, change the weights and the threshold at least in the correct direction.

If the desired output is 1 and the perceptron's output is 0, the threshold is not exceeded, although it should be. Therefore, lower the threshold and adjust the weights depending on the sign and magnitude of the inputs. If the desired output is 0 and the perceptron's output is 1, the threshold is exceeded, although it should not be. Therefore, increase the threshold and adjust the weights depending on the sign and magnitude of the inputs.
Data Mining p.56/95

University of Applied Sciences Braunschweig/Wolfenbuettel

The delta rule


w_i : a weight of the perceptron
θ : the threshold value of the output neuron
x : an input pattern
t : desired output for input x
y : the perceptron's output for input x
η : learning rate

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.57/95

The delta rule


The delta rule recommends adjusting the weights and the threshold value according to:

w_i(new) = w_i(old) + η (t - y) x_i

θ(new) = θ(old) - η (t - y)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.58/95

The delta rule


if if if if if if

and and

and and

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.59/95

Example
Learning of the logical operator AND. Training data:

Learning rate: Initialisation: 1. Epoch 0 0 01 10 11


0 0 0 0 0 0 1 0

0 0 0 1

0 0 0 1


0 0 0 1 0 0 0 1

0 0 0

0 0 0

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.60/95

Example


2. Epoch 0 0 01 10 11 3. Epoch 0 0 01 10 11 0 0 0 1 0 0 0 1 1 1 0 0 0 1 1 0

0 0 0 1 0 0


1 1 1 2 2 2 1 2 1 0 0 1 1 0 0 1 0 1 1 0 0 1 2 1

0 1 0

1 1 0

0 1

0 1 1

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.61/95

Example


4. Epoch 0 0 01 10 11 5. Epoch 0 0 01 10 11 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1

0 0

0 0 0 1 0


2 2 1 2 2 2 2 2 1 1 1 2 2 1 1 1 1 1 2 1 1 2 2 2

1 0 0 0 0

0 0 1

0 0

0 1 0 0

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.62/95

Example


6. Epoch 0 0 01 10 11 0 0 0 0 0 0 1 1

0 0 0 0

0 0 0 0


2 2 2 2 1 1 1 1 2 2 2 2

0 0 0 0

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.63/95
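A runnable sketch in R of the scheme traced by hand above (learning rate, initial weights and threshold are chosen only for illustration, so the exact sequence of updates differs from the table; by the convergence theorem the delta rule terminates for AND in any case):

X <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1))
o <- c(0, 0, 0, 1)                              # desired outputs for logical AND
w <- c(0, 0); theta <- 0; eta <- 1
repeat {
  changed <- FALSE
  for (i in 1:4) {
    y <- as.integer(sum(w * X[i, ]) >= theta)   # perceptron output
    if (y != o[i]) {                            # delta rule update
      w       <- w + eta * (o[i] - y) * X[i, ]
      theta   <- theta - eta * (o[i] - y)
      changed <- TRUE
    }
  }
  if (!changed) break                           # all patterns classified correctly
}
print(list(w = w, theta = theta))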

Perceptron convergence theorem

If, for a given data set with two classes, there exists a perceptron that can classify all patterns correctly, then the delta rule will adjust the weights and the threshold after a finite number of steps in such a way that all patterns are classified correctly.

Which kind of classication problems can be solved by a perceptron?

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.64/95

Linear separability
Consider a perceptron with two input neurons. Let y be the output of the perceptron for input (x_1, x_2). Then the perceptron's output is 1 if and only if the input pattern lies above the line determined by the weights and the threshold.

Data Mining p.65/95

University of Applied Sciences Braunschweig/Wolfenbuettel

Linear separability
[Figure: the line determined by the weights and the threshold, with class 1 above and class 0 below the line.]

The parameters determine a line. All input patterns above this line are assigned to class 1, the input patterns below the line to class 0.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.66/95

Linear separability
More generally: A perceptron with n input neurons can classify all examples from a data set with n input variables and two classes correctly if there exists a hyperplane separating the two classes. Such classification problems are called linearly separable.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.67/95

Linear separability

Example:

The exclusive OR (XOR) defines a classification task which is not linearly separable.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.68/95

XOR with hidden layer


1

1 1 1

1 .5

-1 0 .5

0 .5 1

A perceptron with a hidden layer of neurons: the hidden layer carries out a transformation; the output neuron can then solve the (now linearly separable) problem in the transformed space.
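A minimal sketch of such a network is given below. The concrete weights and thresholds (OR and AND as hidden features) are one standard choice and not necessarily the ones shown on the slide.

# Sketch: XOR computed by a threshold network with one hidden layer.
def step(s, theta):
    return 1 if s > theta else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2, 0.5)    # hidden neuron 1: logical OR
    h2 = step(x1 + x2, 1.5)    # hidden neuron 2: logical AND
    return step(h1 - h2, 0.5)  # output: OR and not AND

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))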

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.69/95

Learning algorithm?

Problem: How can the weights (and thresholds) for the neurons in the hidden layer be adjusted?

Solution: multilayer perceptrons trained with gradient descent. This does not work with the binary (non-differentiable) threshold function as activation function for the neurons; it must be replaced by a differentiable function.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.70/95

Sigmoidal activation function


Sigmoidal (logistic) activation function with bias θ and steepness c:

f(net) = 1 / (1 + exp(−c·(net − θ)))

[Plot: the logistic functions 1/(1+exp(−0.5·x)) and 1/(1+exp(−2·(x−3)))]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.71/95

Multilayer perceptrons
A multilayer perceptron is a neural network with an input layer, one or more hidden layers and an output layer. Connections exist only between neurons from one layer to the next layer. Activation functions for neurons are usually sigmoidal functions.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.72/95

Multilayer perceptrons

[Figure: a multilayer perceptron with an input layer, two hidden layers and an output layer]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.73/95

Error function
(Mean) squared error for an input pattern x:

E = Σ_{u ∈ out} (t_u − o_u)²

out: the set of output neurons
t_u: target output of output neuron u for input x
o_u: output (activation) of output neuron u for input x

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.74/95

Learning rule
Activation of a neuron u: o_u = f(net_u)

Net input of a neuron u (including the bias value): net_u = Σ_{v ∈ pred(u)} w_{uv} · o_v, where pred(u) is the set of neurons in the layer before neuron u and w_{uv} is the weight of the connection from v to u.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.75/95

Learning rule
Consider an input/output pattern (x, t). Adjust the weights based on a gradient descent technique, i.e. proportional to the gradient of the error function:

Δw_{uv} = −η · ∂E/∂w_{uv}    (η: learning rate)

By the chain rule, ∂E/∂w_{uv} = (∂E/∂net_u) · (∂net_u/∂w_{uv}).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.76/95

Learning rule

∂net_u/∂w_{uv} = o_v

Define the error signal δ_u = −∂E/∂net_u. Then the weight change can be written as Δw_{uv} = η · δ_u · o_v.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.77/95

Learning rule
Therefore:

i.e.

net

net

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.78/95

Learning rule

net

net

To compute

, consider the two cases:


is an output neuron:

i.e.

net

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.79/95

Learning rule

is a hidden neuron in layer net net

net

net

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.80/95

Backpropagation
Leading to:

δ_u = f′(net_u) · Σ_{s ∈ succ(u)} δ_s · w_{su}

Result: a recursive equation for updating the weights. Update the weights to the neurons in the output layer first and then go back layer by layer, updating the corresponding weights.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.81/95

Backpropagation
where

δ_u = (t_u − o_u) · f′(net_u)                     if u is an output neuron,
δ_u = f′(net_u) · Σ_{s ∈ succ(u)} δ_s · w_{su}    otherwise.

This learning rule is called backpropagation or the generalised delta rule. Backpropagation can also be applied when there are connections between layers that are not neighbouring, as long as the neural network represents a directed acyclic graph. Such networks are also called feedforward networks.
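The whole procedure can be sketched compactly. The following trains a small multilayer perceptron on XOR with backpropagation; the network size, learning rate, number of epochs and the random initialisation are assumptions for illustration (and biases are added rather than subtracted as thresholds).

# Minimal sketch of backpropagation for a multilayer perceptron with one
# hidden layer, trained on XOR.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

W1 = rng.normal(scale=0.5, size=(2, 3))   # input -> hidden weights
b1 = np.zeros(3)                          # hidden biases
W2 = rng.normal(scale=0.5, size=(3, 1))   # hidden -> output weights
b2 = np.zeros(1)                          # output bias
eta = 0.5                                 # learning rate

for epoch in range(20000):
    # forward pass
    H = sigmoid(X @ W1 + b1)              # hidden activations
    Y = sigmoid(H @ W2 + b2)              # network outputs
    # backward pass: error signals (generalised delta rule)
    delta_out = (Y - T) * Y * (1 - Y)             # output layer
    delta_hid = (delta_out @ W2.T) * H * (1 - H)  # hidden layer
    # gradient descent updates
    W2 -= eta * H.T @ delta_out
    b2 -= eta * delta_out.sum(axis=0)
    W1 -= eta * X.T @ delta_hid
    b1 -= eta * delta_hid.sum(axis=0)

print(np.round(Y, 3))   # outputs should approach [0, 1, 1, 0] (may vary with initialisation)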
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.82/95

Training the bias values


The bias values can be considered as special weights when an artificial input neuron with constant input 1 is introduced:

[Figure: the bias of a neuron represented as the weight of a connection from an extra neuron with constant activation 1]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.83/95

Training the bias values

The bias values can be learned in the same manner as the weights based on the following considerations:

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.84/95

Backpropagation

Backpropagation as a gradient descent technique can only find a local minimum. Training the network with different random initialisations will usually lead to different results. The learning rate defines the step width of the gradient descent technique.

A very large learning rate leads to skipping minima or to oscillation. A very small learning rate leads to starving, i.e. slow convergence or even convergence before the (local) minimum is reached.
Data Mining p.85/95

University of Applied Sciences Braunschweig/Wolfenbuettel

Backpropagation

Introduce a momentum term: for the weight change, the previous weight change is taken into account.

The added term is the weight change in the previous step of the gradient descent algorithm. If a weight is changed continuously in the same direction, the weight change increases, otherwise it decreases.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.86/95

Backpropagation

Usually, weights are not updated after the presentation of each input pattern, but after a whole epoch, i.e. after all patterns have been presented once. There is no general rule for how to choose the number of hidden layers and the size of the hidden layers. Small neural networks might not be flexible enough to fit the data; large neural networks tend to overfitting. The steepness of the activation function is usually fixed and is not adjusted. The multilayer perceptron learns only in those regions where the activation function is not close to zero or one; otherwise the derivative is almost zero.
Data Mining p.87/95

University of Applied Sciences Braunschweig/Wolfenbuettel

Other learning algorithms

Weight decay is sometimes included, pushing all weights in the direction of zero; only those weights that are really needed will survive. Another approach: approximate the error function locally by a quadratic curve and find in each step the minimum of the quadratic curve.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.88/95

Nonlinear PCA
Dimension reduction with multilayer perceptrons:

Input and output are identical, i.e. the neural network should learn the identity function. (Autoassociative network) Introduce a hidden layer with only two neurons (representing the two dimensions for the graphical representation of the dimension reduction), the bottleneck. Train the neural network with the data. After training, input the data into the network and use the outputs of the bottleneck neurons for the graphical representation.
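A minimal sketch of such an autoassociative bottleneck network follows. For brevity it uses linear activations and plain gradient descent; the synthetic data, the network size and the learning rate are assumptions (the lecture's networks use sigmoidal activations).

# Sketch of an autoassociative (autoencoder) network with a 2-neuron
# bottleneck, trained to reproduce its input.
import numpy as np

rng = np.random.default_rng(0)
# synthetic 4-dimensional data that lies close to a 2-dimensional subspace
Z = rng.normal(size=(200, 2))
X = np.column_stack([Z[:, 0], Z[:, 1], Z[:, 0] + Z[:, 1], Z[:, 0] - Z[:, 1]])
X += 0.05 * rng.normal(size=X.shape)

d, b = X.shape[1], 2                     # input dimension, bottleneck size
W_enc = rng.normal(scale=0.1, size=(d, b))
W_dec = rng.normal(scale=0.1, size=(b, d))
eta = 0.01

for _ in range(3000):
    H = X @ W_enc                        # bottleneck activations (linear here)
    Y = H @ W_dec                        # reconstruction of the input
    E = Y - X                            # reconstruction error
    W_dec -= eta * H.T @ E / len(X)
    W_enc -= eta * X.T @ (E @ W_dec.T) / len(X)

H = X @ W_enc                            # 2-dimensional representation for plotting
print("mean squared reconstruction error:", round(float((E ** 2).mean()), 4))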
Data Mining p.89/95

University of Applied Sciences Braunschweig/Wolfenbuettel

Autoassociative bottleneck neural networks

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.90/95

Other regression models


Radial basis function networks (RBF networks)

have only one hidden layer, usually with Gaussian (unimodal) activation functions.

Support vector regression (SVR): similar idea as SVMs; points that are approximated well have little influence on the regression function.

Multivariate adaptive regression splines (MARS): a mathematical approximation method for multivariate functions.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.91/95

Classication as regression

A two-class classification problem (with classes 0 and 1) can be viewed as a regression problem. The regression function will usually not yield exact outputs 0 and 1, but the classification decision can be made by considering 0.5 as a cut-off value. Problem: the objective function aims at minimizing the function approximation error (for example, the mean squared error), but not the misclassifications.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.92/95

Classication as regression

Example: 1000 data objects, 500 belonging to class 0, 500 to class 1.


The first regression function yields 0.1 for all data from class 0 and 0.9 for all data from class 1. The second regression function always yields the exact and correct values 0 and 1, except for 9 data objects where it yields 1 instead of 0 and vice versa.
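The numbers of this example can be verified with a few lines of code; how the 9 wrong outputs are distributed over the two classes is an arbitrary choice here.

# Numeric check: two regression functions for a two-class problem with 1000
# objects (500 per class), compared by MSE and by misclassifications (cut-off 0.5).
targets = [0] * 500 + [1] * 500

# first function: 0.1 for every class-0 object, 0.9 for every class-1 object
f1 = [0.1] * 500 + [0.9] * 500
# second function: exact outputs except for 9 objects that are completely wrong
f2 = [0] * 491 + [1] * 9 + [1] * 500     # here: 9 class-0 objects predicted as 1

def mse(pred):
    return sum((p - t) ** 2 for p, t in zip(pred, targets)) / len(targets)

def misclassifications(pred):
    return sum((p >= 0.5) != (t == 1) for p, t in zip(pred, targets))

for name, pred in (("first", f1), ("second", f2)):
    print(name, "MSE:", round(mse(pred), 4), "errors:", misclassifications(pred))
# -> first: MSE 0.01, 0 errors; second: MSE 0.009, 9 errors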

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.93/95

Classication as regression

Regression function   Misclassifications   MSE
first                 0                    0.01
second                9                    0.009

From the viewpoint of regression the second function is better than the first; from the viewpoint of misclassifications the first function should be preferred.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.94/95

Classication as regression

For multiclass problems, do not enumerate the classes and learn a single regression function: this leads to interpolation errors. For example, a data object between class 1 and class 3 might be classified as class 2 by the regression function. Instead, train a classifier (regression function) for each class against all other classes.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.95/95

Model validation

Does the model fit the data at all? What is the most appropriate model? Are the findings from the model significant at all? How will the model perform on new data?

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.1/25

Model tting
Fitting a linear model to a sample
[Scatter plot: petal length (plength) vs. petal width (pwidth) with a fitted regression line]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.2/25

Model tting
Fitting a linear model to a sample with (almost) no correlation
[Scatter plot: sepal length (slength) vs. sepal width (swidth) with a fitted regression line; the attributes show (almost) no correlation]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.3/25

Model tting
Fitting a linear model to a sample with nonlinear dependency
[Scatter plot: y vs. x with a fitted regression line; the true dependency is nonlinear]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.4/25

Model validation
Fitting a model to the data is (almost) always possible. (A regression line can always be found for two-dimensional data.) But this does not mean that the model that fits the data best fits the data well at all. (The regression line does not necessarily reflect any meaningful relation between the variables.) Therefore, model validation is needed.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.5/25

Simple vs. complex models

How complex should a model be? A complex model like

a decision tree with many nodes or a nonlinear regression function

will usually fit the data better than a simple model. Are complex models always better? No, they tend to overfitting.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.6/25

Simple vs. complex model


Example:

Assume that there is an unknown noisy functional dependency y = f(x) + ε between the attributes x and y (ε is an unknown random noise). Which of the candidate models, e.g. polynomials of increasing degree (linear, quadratic, cubic, ...), is the best one?


University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.7/25

Simple vs. complex model


The error on the training data set will decrease with increasing model complexity (here: higher degree of the polynomial). For n data points the error will even be zero for a polynomial of degree n−1. This tends to overfitting. One way of choosing the model: crossvalidation (will be discussed later on). Alternative: the minimum description length principle (MDL).
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.8/25

MDL
Basic idea of the minimum description length principle: the data should be sent over a channel or should be stored in a compressed form in a file. Use as few bits as possible. This is a similar problem as in data compression: find a coding which is as compact as possible.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.9/25

MDL
The compressed le consists of the compressed data and a rule how to decompress the data (the model). The more complex the rule for decompression, the more the data can be compressed. But a complex compression scheme will need larger memory itself.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.10/25

MDL
Goal: Find a compression for which compressed data + description of the (de-)compression rule is minimal. For the regression example:

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.11/25

MDL
Store (or transmit) the data with a precision of one decimal place. For a model of the form y = f(x), the x-values and the errors y − f(x) (instead of the values y) must be transmitted.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.12/25

MDL


For example: for a given model f, instead of the original y-values only the (typically much smaller) errors y − f(x) are transmitted. In addition, the model parameters must be transmitted.


University of Applied Sciences Braunschweig/Wolfenbuettel


Data Mining p.13/25

MDL
Minimum description length principle:

Choose the model for which the length of the model description plus the length of the description of the data (the model error) under the considered model becomes minimal.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.14/25

MDL
Similar for decision trees:

Complex trees need a longer description, but fewer classification errors need to be transmitted for complex trees.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.15/25

Training and test data

The error on the training data set is called the resubstitution error. The data should be split into training and test data. Only the data from the training set will be used to construct the model. The estimation of the error is based on the test data.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.16/25

Validation data
Sometimes the data set is partitioned into three data sets:

Training data: for learning the models.

Validation data: to choose the best model.

Test data: for the estimation of the prediction error.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.17/25

Validation data
Example: construction of a classifier. The data set is partitioned into training, validation and test data. Based on the training data a decision tree, a naive Bayes classifier and a nearest neighbour classifier are constructed. Choose the classifier which is best on the validation data. Estimate the prediction error of the chosen classifier based on the test data set.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.18/25

Validation data

Example: a neural network (multilayer perceptron) should be trained for a regression or classification problem. The data set is partitioned into training, validation and test data. The learning algorithm (for instance, backpropagation) is carried out with the training data. The validation data are used to compute the error during learning without influencing the training.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.19/25

Validation data
Training of the neural network is stopped when the error on the validation data has dropped to a minimum and starts increasing again.

[Figure: error on the training data and on the validation data over the learning epochs; training is stopped where the validation error begins to rise]

The prediction error of the neural network is calculated based on the test data.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.20/25

Crossvalidation
In order to obtain a more robust estimation of the prediction error, the data set is split into k disjoint subsets of approximately the same size and the model is trained k times. Each time, one of the subsets is left out for testing, while the others are used for training. In this way, a prediction error can be computed k times. Usually the mean of the k values is taken as the prediction error. This is called k-fold crossvalidation.
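A minimal sketch of k-fold crossvalidation follows; the data, the simple 1-nearest-neighbour model and k = 5 are assumptions chosen only for illustration.

# Sketch of k-fold crossvalidation for estimating a prediction error.
import numpy as np

def one_nn_error(train_X, train_y, test_X, test_y):
    errors = 0
    for x, t in zip(test_X, test_y):
        nearest = np.argmin(np.linalg.norm(train_X - x, axis=1))
        errors += int(train_y[nearest] != t)
    return errors / len(test_y)

def k_fold_cv(X, y, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        errs.append(one_nn_error(X[train], y[train], X[test], y[test]))
    return float(np.mean(errs))          # mean of the k prediction errors

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2.5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print("estimated prediction error:", round(k_fold_cv(X, y), 3))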

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.21/25

Leave-One-Out
For small data sets with n data objects, the leave-one-out or jackknife method is applied, meaning n-fold crossvalidation. Only one data object is left out for testing each time.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.22/25

Bootstrapping

For the bootstrap method the training data set is drawn as a sample with replacement from the whole data set.

The 0.632 bootstrap draws a sample of n data objects for training from a data set of size n.

Since sampling is carried out with replacement, some data objects appear in the training set with multiple copies, while others might not be in the training set at all.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.23/25

Bootstrapping
The probability that a data object is not chosen in a single draw is 1 − 1/n. The probability that a data object is not selected at all for the training data is therefore

(1 − 1/n)^n ≈ e^(−1) ≈ 0.368 for large n.

This means that the training set consists on average of about 63.2% of the original data objects.
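This can be checked numerically; the values of n below are arbitrary.

# Numeric illustration: the probability that a particular data object never
# ends up in the bootstrap sample is (1 - 1/n)^n, which approaches e^-1.
import math

for n in (10, 100, 1000, 100000):
    p_never = (1 - 1 / n) ** n
    print(f"n={n:6d}  not selected: {p_never:.4f}  selected: {1 - p_never:.4f}")
print("limit 1/e =", round(math.exp(-1), 4), "-> about 63.2% of the objects are selected")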

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.24/25

Bootstrapping

Testing for bootstrapping is carried out with both the training and the test data. The estimated overall error is a weighted sum of the resubstitution error (the error on the training data) and the error on the test data:

estimated overall error = 0.632 · (error on the test data) + 0.368 · (error on the training data)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.25/25

Bayesian networks

Basic idea of Bayesian networks: Representation of a probability distribution over a high dimensional space. Reasoning: How does the change of a single or some marginal distributions change the other marginal distributions?

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.1/37

Bayesian networks: Example

Customer questionnaires:

Each question has a finite number of answers like: "Are you satisfied with ...?"

1: very satisfied, 2: satisfied, ..., 6: very dissatisfied, 7: not applicable / don't know

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.2/37

Bayesian networks: Example


Each question represents a nominal attribute. The set of questionnaires filled in by customers induces a probability distribution on the product space of all attributes (questions), with dependencies between the questions. What do these dependencies look like? How can this be used to improve customer satisfaction?

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.3/37

Bayesian networks: Example


Assume the management provides 1 million euros to improve customer satisfaction. How should this money be spent? Answers to target questions like

"Are you satisfied with our service technicians?" or "Are you satisfied with our hotline?"

cannot be influenced directly.

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.4/37

Bayesian networks: Example


Answers to other questions with correlations to the target questions can be influenced. For instance:

Did the service technician show up on time? Was the service technician able to solve your problem immediately? How long did you have to wait at the hotline? Could the staff at the hotline answer your questions?

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.5/37

Bayesian networks: Example

Knowing the influence of such questions on the target questions, suitable actions can be taken to improve the answers to the target questions, like

employ more technicians/staff at the hotline provide better training for the technicians/staff at the hotline

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.6/37

Bayesian networks: Example

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.7/37

High-dimensional data spaces

Number of possible combinations for 200 binary attributes: 2^200 ≈ 1.6·10^60. A table with one probability per combination would exceed any realistic number of gigabytes. Most of the combinations will not be admissible or relevant.

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.8/37

Decomposition

6 5 4

1 2 3


4 5

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.9/37

Decomposition

6 5 4 1

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.10/37

Decomposition

6 5 4 1


and 2 3 4 5

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.11/37

Decomposition

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.12/37

Decomposition

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.13/37

Decomposition

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.14/37

Decomposition

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.15/37

Decomposition
Table size

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.16/37

Decomposition
Table size

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.17/37

Decomposition
Table size

(in general)

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.18/37

Decomposition of probability distributions


Bayes theorem:

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.19/37

Decomposition of probability distributions


Bayes theorem:

iterative application:

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.20/37

Decomposition of probability distributions


Bayes' theorem: P(A | B) = P(B | A) · P(A) / P(B)

Iterative application (chain rule): P(A1, ..., An) = P(A1) · P(A2 | A1) · ... · P(An | A1, ..., An−1)

In case of conditional independence, the conditioning sets can be reduced, so that the joint distribution decomposes into small conditional distributions.

Data Mining p.21/37

University of Applied Sciences Braunschweig/Wolfenbuttel

Conditional independence

A and B are conditionally independent given C if

P(A, B | C) = P(A | C) · P(B | C)

holds.
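As a small numeric illustration of this definition, the following sketch builds a three-variable joint distribution that satisfies the factorisation and checks it; all probability values are made up.

# Check of conditional independence P(A,B|C) = P(A|C)·P(B|C) on a small
# hand-made joint distribution over three binary variables.
import itertools

p_c = {0: 0.4, 1: 0.6}
p_a_given_c = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.8, 1: 0.2}}   # p_a_given_c[c][a]
p_b_given_c = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}

# joint constructed as P(C)·P(A|C)·P(B|C), so conditional independence holds
joint = {(a, b, c): p_c[c] * p_a_given_c[c][a] * p_b_given_c[c][b]
         for a, b, c in itertools.product((0, 1), repeat=3)}

def p(condition):
    """Marginal probability of all entries matching the partial assignment."""
    return sum(v for (a, b, c), v in joint.items()
               if all(dict(a=a, b=b, c=c)[k] == val for k, val in condition.items()))

for a, b, c in itertools.product((0, 1), repeat=3):
    lhs = p(dict(a=a, b=b, c=c)) / p(dict(c=c))                               # P(A,B|C)
    rhs = (p(dict(a=a, c=c)) / p(dict(c=c))) * (p(dict(b=b, c=c)) / p(dict(c=c)))
    assert abs(lhs - rhs) < 1e-12
print("A and B are conditionally independent given C")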

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.22/37

Conditional independence

Example:

Age Work experience Salary

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.23/37

Conditional independence

Example:

Age Work experience Salary

All three variables are correlated and pairwise dependent.

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.24/37

Conditional independence

Example:

Age Work experience Salary

All three variables are correlated and pairwise dependent. But:

P(Salary | Age, Work experience) = P(Salary | Work experience),

i.e. Salary is conditionally independent of Age given the work experience.

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.25/37

Conditional independence

[Graph: Age → Work experience → Salary]

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.26/37

Conditional independence

[Graphs: the structure Age → Work experience → Salary, shown separately for the sectors industry and public service]
University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.27/37

Directed acyclic graph (DAG)


[Figure: a directed acyclic graph over the nodes A, B, C, D, E]

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.28/37

Moral graph

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.29/37

Moral graph

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.30/37

Moral graph

[Figure: moral graph of the example Bayesian network]

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.31/37

Hypergraph representation

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.32/37

Learning from data

Estimation of the probability distributions: use the corresponding relative frequencies. Learning the structure: high computational costs.

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.33/37

Structure learning

Strategy 1: Define a measure for the quality of the approximation of the data, for instance the maximum likelihood estimator: maximize the likelihood of the data given the structure of the network and the probability distributions in the nodes of the network (estimated by the relative frequencies of the data for a given structure).

University of Applied Sciences Braunschweig/Wolfenbuttel

Data Mining p.34/37

Structure learning

This leads to a combinatorial explosion: the number of possible network structures increases exponentially with the number of attributes. Therefore, use greedy strategies:

Start with a network with no edges (all attributes are assumed to be independent) and add edges step by step; add the edge that increases the quality measure (maximum likelihood, K2, ...) most.

Or: start with the fully connected network and remove the edge that leads to the smallest decrease of the quality measure.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.35/37

Structure learning
Strategy 2: Apply (conditional) independence tests (for instance, χ² tests) in order to decide which edges should be included in the Bayesian network. Compute the strength of the dependencies of pairs of attributes: attributes with a high dependency are usually neighbouring nodes in a Bayesian network.

Strategy 3: Apply heuristic search strategies (like genetic algorithms or tabu search).

In all cases: find a compromise between a simple structure with a larger error and a complex structure with overfitting (use criteria like AIC, BIC or MDL).
University of Applied Sciences Braunschweig/Wolfenbuttel
Data Mining p.36/37

Propagation
[Figure: the example Bayesian network]

In a Bayesian network, arbitrary attributes can be instantiated (with a single value or with a probability distribution). The computation of the (marginal) distributions of the other attributes is carried out by a message-exchange algorithm.
University of Applied Sciences Braunschweig/Wolfenbuttel
Data Mining p.37/37

Cluster analysis

Aim:

Find groups (clusters) of similar objects in a data

set. Objects within the same cluster should be similar. Objects from different clusters should be dissimilar.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.1/76

Cluster analysis: Applications

Customer segmentation: find groups of customers.

Gene clustering: find groups of genes with similar properties (for instance, expression profiles).

Clustering of stars: find groups of stars based on attributes like size, characteristics of the spectrum, ...

Identification of social or economic groups based on attributes like income, age, education level, ...

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.2/76

Unsupervised classication

Cluster analysis is an unsupervised classification technique. In contrast to supervised classification, the classes (clusters) are not known in advance in the data set.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.3/76

Distance measures
Cluster analysis requires a distance or (dis)similarity measure for the grouping of the data. The choice of a suitable distance measure has a strong influence on the cluster structure. For continuous attributes, a normalisation technique should be applied in order to balance the influence of the single attributes on the overall distance. (See also nearest neighbour classifiers.)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.4/76

Influence of scaling

[Figure: the same data set plotted with an attribute measured in milli units and in kilo units]
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.5/76

Discrete attributes
When discrete attributes are involved in the distance measure, clusters tend to be split based on the discrete attributes, since they automatically lead to well-separated clusters. Therefore, most clustering algorithms focus on continuous attributes.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.6/76

Hierarchical agglomerative clustering

Start with every data point in its own cluster. (i.e., start with so-called singletons: single element clusters) In each step merge those two clusters that are closest to each other. Keep on merging clusters until all data points are contained in one cluster. The result is a hierarchy of clusters that can be visualized in a tree structure, a so-called dendrogram.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.7/76

Measuring the distances

The distance between singletons is simply the distance between the (single) data points contained in them. However: How do we compute the distance between clusters that contain more than one data point?

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.8/76

Measuring the distance between clusters


Centroid (red): distance between the centroids (mean value vectors) of the two clusters.

Average linkage: average distance between two points of the two clusters.

Single linkage (green): distance between the two closest points of the two clusters.

Complete linkage (blue): distance between the two farthest points of the two clusters.

Data Mining p.9/76

University of Applied Sciences Braunschweig/Wolfenbuettel

Measuring the distance between clusters

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.10/76

Measuring the distance between clusters

Single linkage can follow chains in the data (may be desirable in certain applications). Complete linkage leads to very compact clusters. Average linkage also tends clearly towards compact clusters.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.11/76

Measuring the distance between clusters

Single linkage

Complete linkage

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.12/76

Dendrograms

The cluster merging process arranges the data points in a binary tree. Draw the data tuples at the bottom or on the left (equally spaced if they are multi-dimensional). Draw a connection between clusters that are merged, with the distance to the data points representing the distance between the clusters.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.13/76

Dendrograms

distance between clusters

data tuples

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.14/76

Agglomerative clustering

Example: Clustering of the 1-dimensional data set . All three approaches to measure the distance between clusters lead to different dendrograms.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.15/76

Agglomerative clustering

Centroid

Single linkage

Complete linkage

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.16/76

Heatmaps

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.17/76

Heatmaps

One axis: Attributes Other axis: Data objects Colours (colour intensities): Attribute values

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.18/76

Implementation aspects

Hierarchical agglomerative clustering can be implemented by processing the matrix containing the pairwise distances of the data points. (The data points themselves are actually not needed.) In each step the rows and columns corresponding to the two clusters that are closest to each other are deleted. A new row and column corresponding to the cluster formed by merging these clusters is added to the matrix.
Data Mining p.19/76

University of Applied Sciences Braunschweig/Wolfenbuettel

Implementation aspects

The elements of this new row/column are computed according to the Lance-Williams update

d(k, i∪j) = α_i·d(k, i) + α_j·d(k, j) + β·d(i, j) + γ·|d(k, i) − d(k, j)|

i, j: indices of the two clusters that are merged
k: index of an old cluster that is not merged
i∪j: index of the new cluster (result of the merger)
α_i, α_j, β, γ: parameters specifying the method (single linkage etc.)
Data Mining p.20/76

University of Applied Sciences Braunschweig/Wolfenbuettel

Implementation aspects

The parameters defining the different methods are (n_i, n_j, n_k are the numbers of data points in the clusters):

method             α_i                      α_j                      β                      γ
centroid method    n_i/(n_i+n_j)            n_j/(n_i+n_j)            −n_i·n_j/(n_i+n_j)²    0
median method      1/2                      1/2                      −1/4                   0
single linkage     1/2                      1/2                      0                      −1/2
complete linkage   1/2                      1/2                      0                      +1/2
average linkage    n_i/(n_i+n_j)            n_j/(n_i+n_j)            0                      0
Ward's method      (n_i+n_k)/(n_i+n_j+n_k)  (n_j+n_k)/(n_i+n_j+n_k)  −n_k/(n_i+n_j+n_k)     0
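As an illustration of this update scheme, here is a small sketch of agglomerative clustering driven by the Lance-Williams formula with the single-linkage parameters; the example distance matrix is made up.

# Sketch of hierarchical agglomerative clustering on a distance matrix using
# the Lance-Williams update (here with the single-linkage parameters).
import numpy as np

def agglomerative(dist, alpha_i=0.5, alpha_j=0.5, beta=0.0, gamma=-0.5):
    """dist: symmetric (n x n) distance matrix. Returns the merge sequence."""
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)
    clusters = {idx: [idx] for idx in range(len(d))}
    merges = []
    while len(clusters) > 1:
        active = sorted(clusters)
        # find the pair of active clusters with the smallest distance
        i, j = min(((a, b) for a in active for b in active if a < b),
                   key=lambda p: d[p[0], p[1]])
        merges.append((clusters[i], clusters[j], d[i, j]))
        # Lance-Williams: distance of every other cluster k to the merged cluster
        for k in active:
            if k not in (i, j):
                d[i, k] = d[k, i] = (alpha_i * d[k, i] + alpha_j * d[k, j]
                                     + beta * d[i, j]
                                     + gamma * abs(d[k, i] - d[k, j]))
        clusters[i] = clusters[i] + clusters[j]   # cluster i absorbs cluster j
        del clusters[j]
        d[j, :] = d[:, j] = np.inf
    return merges

example = np.array([[0, 1, 4, 5],
                    [1, 0, 3, 6],
                    [4, 3, 0, 2],
                    [5, 6, 2, 0]])
for left, right, dij in agglomerative(example):
    print(left, "+", right, "at distance", dij)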

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.21/76

Choosing the clusters


Simplest approach: specify a minimum desired distance between clusters. Stop merging clusters if the closest two clusters are farther apart than this distance.

Visual approach: merge clusters until all data points are combined into one cluster. Draw the dendrogram and find a good cut level. Advantage: the cut need not be strictly horizontal.
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.22/76

Choosing the clusters


More sophisticated approaches: analyze the sequence of distances in the merging process. Try to find a step in which the distance between the two clusters merged is considerably larger than the distance of the previous step. Several heuristic criteria exist for this step selection.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.23/76

Agglomerative clustering: Complexity

The computational complexity of agglomerative clustering is quadratic in the number of data objects, which is not acceptable for larger data sets.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.24/76

k-Means clustering

Choose a number k of clusters to be found (user input). Initialize the cluster centres randomly (for instance, by randomly selecting k data points).

Data point assignment: assign each data point to the cluster centre that is closest to it (i.e. closer than any other cluster centre).

Cluster centre update: compute new cluster centres as the mean vectors of the assigned data points. (Intuitively: centre of gravity if each data point has unit weight.)
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.25/76

k-Means clustering

Repeat these two steps (data point assignment and cluster centre update) until the cluster centres do not change anymore. It can be shown that this scheme must converge, i.e., the update of the cluster centres cannot go on forever.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.26/76

k-Means clustering

Aim: Minimize the objective function

under the constraints and

for all

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.27/76

Alternating optimization

Assuming the cluster centres to be fixed, each data object should be assigned to the cluster centre to which it has the smallest distance in order to minimize the objective function. Assuming the assignments to the clusters to be fixed, each cluster centre should be chosen as the mean vector of the data objects assigned to the cluster in order to minimize the objective function.

This is again a greedy algorithm.
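A minimal sketch of the two alternating steps follows; the synthetic data, k and the initialisation are assumptions for illustration.

# Minimal sketch of k-means (alternating data point assignment and cluster
# centre update).
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # random data points
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # data point assignment: index of the closest centre for every point
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # cluster centre update: mean of the assigned points
        new_centres = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centres[i] for i in range(k)])
        if np.allclose(new_centres, centres):   # converged
            break
        centres = new_centres
    return centres, labels

X = np.vstack([np.random.default_rng(1).normal(loc, 0.3, size=(50, 2))
               for loc in ([0, 0], [3, 3], [0, 3])])
centres, labels = kmeans(X, k=3)
print(np.round(centres, 2))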

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.28/76

k-Means clustering: Example

Data set to be clustered. Choose the number of clusters k (here by visual inspection; in general k can be difficult to determine).

Initial position of the cluster centres: randomly selected data points.


Data Mining p.29/76

University of Applied Sciences Braunschweig/Wolfenbuettel

Delaunay triangulations and Voronoi diagrams

Dots represent cluster centres (quantization vectors). Left: Delaunay Triangulation (The circle through the corners of a triangle does not contain another point.) Right: Voronoi Diagram (Midperpendiculars of the Delaunay triangulation: boundaries of the regions of points that are closest to the enclosed cluster centre (Voronoi cells)).
Data Mining p.30/76

University of Applied Sciences Braunschweig/Wolfenbuettel

Delaunay triangulations and Voronoi diagrams


Delaunay Triangulation:

simple triangle (shown in

grey on the left)


Voronoi Diagram:

midperpendiculars of the triangles edges (shown in blue on the left, in grey on the right)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.31/76

k-Means clustering: Example

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.32/76

k-Means clustering: Local minima

Clustering is successful in this example: The clusters found are those that would have been formed intuitively. Convergence is achieved after only 5 steps. (This is typical: convergence is usually very fast.) However: The clustering result is fairly sensitive to the initial positions of the cluster centers. With a bad initialisation clustering may fail (the alternating update process gets stuck in a local minimum).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.33/76

k-Means clustering: Local minima

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.34/76

Learning vector quantization (LVQ)

Adaptation of reference vectors/codebook vectors


Like online k-means clustering (update after each data point). For each training pattern, find the closest reference vector. Adapt only this reference vector (the winner neuron). For classified data the class may be taken into account (reference vectors are assigned to classes).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.35/76

Learning vector quantization

Attraction rule (data point and reference vector have the same class): move the reference vector towards the data point.

Repulsion rule (data point and reference vector have different classes): move the reference vector away from the data point.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.36/76

Learning vector quantization

Adaptation of reference vectors / codebook vectors:

attraction rule: r(new) = r(old) + η · (x − r(old))
repulsion rule:  r(new) = r(old) − η · (x − r(old))

x: data point, r: reference vector, η: learning rate
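A minimal sketch of these update rules (LVQ1-style) follows; the data, the reference vectors and the learning rate are made up for illustration.

# Sketch of learning vector quantization with attraction and repulsion.
import numpy as np

def lvq_train(X, y, refs, ref_classes, eta=0.1, epochs=20):
    refs = refs.astype(float).copy()
    for _ in range(epochs):
        for x, cls in zip(X, y):
            winner = np.argmin(np.linalg.norm(refs - x, axis=1))  # closest reference
            if ref_classes[winner] == cls:
                refs[winner] += eta * (x - refs[winner])   # attraction
            else:
                refs[winner] -= eta * (x - refs[winner])   # repulsion
    return refs

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(3, 0.5, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
refs = np.array([[0.5, 0.5], [2.5, 2.5]])
print(np.round(lvq_train(X, y, refs, ref_classes=[0, 1]), 2))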
Data Mining p.37/76

University of Applied Sciences Braunschweig/Wolfenbuettel

LVQ: Learning rate decay


Problem: a fixed learning rate can lead to oscillations.

Solution: a time-dependent, decreasing learning rate.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.38/76

LVQ: Example

Adaptation of reference vectors / codebook vectors

Left: Online training with learning rate Right: Batch training with learning rate

, .
Data Mining p.39/76

University of Applied Sciences Braunschweig/Wolfenbuettel

Self-organizing maps (SOM)

A self-organizing map (also called Kohonen map) is an LVQ model where

a topological structure is assumed on the reference vectors (for instance, a grid in the plane),

not only the closest reference vector is updated in each step, but also the reference vectors in its neighbourhood,

the neighbourhood (and the learning rate) become smaller over time.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.40/76

Unfolding SOM

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.41/76

Fuzzy clustering

Minimize the objective function

J = Σ_{i=1}^{c} Σ_{j=1}^{n} u_ij^m · d_ij²

under the constraints Σ_{i=1}^{c} u_ij = 1 for all j = 1, ..., n.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.42/76

Parameters

u_ij is the membership degree of data object x_j to the i-th cluster.

d_ij is some distance measure specifying the distance between data object x_j and cluster i.

m, called the fuzzifier, controls how much clusters may overlap.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.43/76

Parameters to be optimized

the membership degrees the cluster parameters (not given explicitly here, but hidden in the distances ).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.44/76

Parameters to be optimized

the membership degrees the cluster parameters (not given explicitly here, but hidden in the distances ).

Leads to a non-linear optimization problem.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.45/76

Fuzzy c-means algorithm

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.46/76

Fuzzy c-means algorithm

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.47/76
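A compact sketch of the standard fuzzy c-means alternating optimisation (membership update and centre update); the data, the number of clusters, the fuzzifier m = 2 and the stopping threshold are assumptions chosen only for illustration.

# Sketch of the standard fuzzy c-means alternating optimisation.
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, eps=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                       # memberships sum to 1 per data object
    for _ in range(max_iter):
        # cluster centre update: weighted means with weights u_ij^m
        W = U ** m
        centres = (W @ X) / W.sum(axis=1, keepdims=True)
        # membership update from the squared distances
        d2 = ((X[None, :, :] - centres[:, None, :]) ** 2).sum(axis=2) + 1e-12
        inv = d2 ** (-1.0 / (m - 1))         # d_ij^(-2/(m-1))
        U_new = inv / inv.sum(axis=0)
        if np.abs(U_new - U).max() < eps:    # converged
            U = U_new
            break
        U = U_new
    return centres, U

X = np.vstack([np.random.default_rng(1).normal(mu, 0.4, (40, 2)) for mu in (0.0, 3.0)])
centres, U = fuzzy_c_means(X, c=2)
print(np.round(centres, 2))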

Other cluster shapes


ellipsoidal clusters (Gustafson/Kessel 1979)
clusters as lines/planes/hyperplanes (Bock 1979, Bezdek 1981)
clusters as shells of circles (Davé 1990, Krishnapuram/Nasraoui/Frigui 1992)
clusters in the form of arbitrary quadrics (Krishnapuram/Frigui/Nasraoui 1991-1995)
adaptable cluster volumes (Keller/Klawonn 1999)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.48/76

Example

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.49/76

Gaussian mixture models

[Plot: the probability density functions of two normal distributions]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.50/76

Gaussian mixture models

[Plot: mixture model in which both normal distributions contribute 50%]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.51/76

Gaussian mixture models

[Plot: mixture model in which one normal distribution contributes 10% and the other 90%]

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.52/76

Gaussian mixture models


Assumption:

Data was generated by sampling a set of normal distributions. (The probability density is a mixture of Gaussian distributions.)

Formally:

We assume that the probability density can be described as


University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.53/76

Gaussian mixture models

Θ is the set of cluster parameters.

X is a random vector that has the data space as its domain.

Y is a random variable that has the cluster indices as possible values.

p_Y(y; Θ) is the probability that a data point belongs to (is generated by) the y-th component of the mixture.

f_{X|Y}(x | y; Θ) is the conditional probability density function of a data point given the cluster (specified by the cluster index y).
Data Mining p.54/76

University of Applied Sciences Braunschweig/Wolfenbuettel

Expectation maximization
Basic idea: Problem:

Do a maximum likelihood estimation of the cluster parameters. The likelihood function,


is difcult to optimize, even if one takes the natural logarithm (cf. the maximum likelihood estimation of the parameters of a normal distribution), because

contains the natural logarithms of complex sums.


University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.55/76

Expectation maximization
Approach: assume that there are hidden variables stating the clusters that generated the data points, so that the sums reduce to one term.

Problem: since the hidden variables are not observed, we do not know their values.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.56/76

Expectation maximization
Formally:

Maximize the likelihood of the completed data set , where combines the values of the variables . That is,

Problem:

Since the unknown

are hidden, the values are


cannot be

(and thus the factors computed).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.57/76

Expectation maximization
Approach to find a solution nevertheless: see the hidden cluster assignments as random variables (their values are not fixed) and consider a probability distribution over the possible values. As a consequence the likelihood of the completed data set becomes a random variable, even for a fixed data set and fixed cluster parameters. Try to maximize the expected value of the likelihood or of the log-likelihood (hence the name expectation maximization).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.58/76

Expectation maximization
Formally:

Find the cluster parameters as


that is, maximize the expected likelihood

or, alternatively, maximize the expected log-likelihood

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.59/76

Expectation maximization

Unfortunately, these functionals are still difficult to optimize directly.

Solution: use the equation as an iterative scheme, fixing the cluster parameters in some terms. (Make sure that the iteration scheme converges at least to a local maximum.)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.60/76

Expectation maximization
Iterative scheme for expectation maximization:

Choose some initial set of cluster parameters and then compute updated parameters iteratively.

It can be shown that each EM iteration increases the likelihood of the data and that the algorithm converges to a local maximum of the likelihood function (i.e., EM is a safe way to maximize the likelihood function).
Data Mining p.61/76

University of Applied Sciences Braunschweig/Wolfenbuettel

Expectation maximization
Justification of the last step on the previous slide:

Data Mining p.62/76

University of Applied Sciences Braunschweig/Wolfenbuettel

Expectation maximization

The probabilities are computed as the relative probability densities of the different clusters (as specified by the cluster parameters) at the locations of the data points. They are the posterior probabilities of the clusters given the data point and a set of cluster parameters.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.63/76

Expectation maximization

They can be seen as case weights of a completed data set: split each data point into one data point per cluster and distribute the unit weight of the data point according to the above probabilities.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.64/76

EM: Cookbook recipe

Core Iteration Formula

Expectation Step

For all data points: compute for each normal distribution the probability that the data point was generated from it (ratio of the probability densities at the location of the data point); this probability is the weight of the data point for the estimation.
Data Mining p.65/76

University of Applied Sciences Braunschweig/Wolfenbuettel

EM: Cookbook recipe

Maximization Step

For all normal distributions: Estimate the parameters by standard maximum likelihood estimation using the probabilities (weights) assigned to the data points w.r.t. the distribution in the expectation step.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.66/76

EM: Gaussian mixtures

Expectation step: use Bayes' rule to compute the posterior probability of each cluster for each data point; this is the weight of the data point for the estimation.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.67/76

EM: Gaussian mixtures

Maximization step: use maximum likelihood estimation to compute the prior probability, the mean vector and the covariance matrix of each cluster from the weighted data.

Iterate until convergence (checked, e.g., by the change of the mean vectors).
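A minimal sketch of EM for a mixture of two one-dimensional Gaussians follows; the synthetic data, the initial parameters and the fixed number of iterations are assumptions.

# Sketch of EM for a mixture of two univariate Gaussians.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(4.0, 0.7, 200)])

# initial cluster parameters: priors, means, standard deviations (arbitrary)
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(100):
    # E step: posterior probability of each cluster for each data point
    dens = np.stack([pi[i] * normal_pdf(x, mu[i], sigma[i]) for i in range(2)])
    resp = dens / dens.sum(axis=0)
    # M step: weighted maximum likelihood estimates
    Nk = resp.sum(axis=1)
    pi = Nk / len(x)
    mu = (resp * x).sum(axis=1) / Nk
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)

print(np.round(pi, 2), np.round(mu, 2), np.round(sigma, 2))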
University of Applied Sciences Braunschweig/Wolfenbuettel
Data Mining p.68/76

EM: Technical problems

If a fully general mixture of Gaussian distributions is used, the likelihood function is truly optimized if all normal distributions except one are contracted to single data points and the remaining normal distribution is the maximum likelihood estimate for the remaining data points. This undesired result is rare, because the algorithm gets stuck in a local optimum.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.69/76

EM: Technical problems

Nevertheless it is recommended to take countermeasures, which consist mainly in reducing the degrees of freedom, like Fix the determinants of the covariance matrices to equal values. Use a diagonal instead of a general covariance matrix. Use an isotropic variance instead of a covariance matrix. Fix the prior probabilities of the clusters to equal values.
Data Mining p.70/76

University of Applied Sciences Braunschweig/Wolfenbuettel

Other clustering approaches


Density-based clustering: DBSCAN, DENCLUE, ...

Grid-based clustering (division of the space into a finite number of cells): STING, WaveCluster, ...

Clustering with nominal attributes: ROCK, ...

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.71/76

Determining the number of clusters

Global validity measures:


Define a measure that evaluates clustering results. Cluster the data with different numbers of clusters. Choose the result (number of clusters) with the best validity measure.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.72/76

Separation index
To be maximized:

cluster cluster diamcluster

where diamcluster

cluster

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.73/76

Partition coefcient
(for fuzzy clustering, to be maximized)

The largest value 1 is assumed when the partition is not fuzzy at all, i.e. . The smallest value is assumed when all data are assigned with the same membership degree to all clusters.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.74/76

Partition entropy

(for fuzzy clustering, to be minimized)

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.75/76

Crossvalidation

Partition the data into k subsets of approximately the same size.

Apply clustering k times, each time leaving out one of the subsets. In the ideal case, each pair of data objects should always be either assigned to the same cluster or always to different clusters (coherence). This provides a measure of how stable (how acceptable) a clustering result is. Apply this scheme with different numbers of clusters and choose the one where the coherence is best.
Data Mining p.76/76

University of Applied Sciences Braunschweig/Wolfenbuettel

Association rule mining


Association rule induction: originally designed for market basket analysis. Aims at finding patterns in the shopping behaviour of customers of supermarkets, mail-order companies, on-line shops etc. More specifically:
Find sets of products that are frequently bought together.

Example of an association rule: If a customer buys bread and wine, then she/he will probably also buy cheese.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.1/37

Association rule mining

Possible applications of the found association rules: improve the arrangement of products on shelves or on a catalog's pages; support of cross-selling (suggestion of other products) and product bundling; fraud detection, technical dependence analysis; finding business rules and detection of data quality problems.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.2/37

Association rules

Assessing the quality of association rules:

Support of an item set: fraction of transactions (shopping baskets/carts) that contain the item set.

Support of an association rule A → B: either the support of A ∪ B (more common: rule is correct) or the support of A (more plausible: rule is applicable).

Confidence of an association rule A → B: the support of A ∪ B divided by the support of A (an estimate of P(B | A)).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.3/37

Association rules

Two-step implementation of the search for association rules: find the frequent item sets (also called large item sets), i.e., the item sets that have at least a user-defined minimum support; then form rules using the frequent item sets found and select those that have at least a user-defined minimum confidence.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.4/37

Finding frequent item sets


Subset lattice and a prefix tree for five items:

It is not possible to determine the support of all possible item sets, because their number grows exponentially with the number of items. Efficient methods to search the subset lattice are needed.
Data Mining p.5/37

University of Applied Sciences Braunschweig/Wolfenbuettel

Item set trees

A (full) item set tree for five items, based on a global order of the items. The item sets counted in a node consist of all items labeling the edges to the node (common prefix) and one item following the last edge label.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.6/37

Item set tree pruning


In applications item set trees tend to get very large, so pruning is needed.
Structural Pruning:

Make sure that there is only one counter for each possible item set. This explains the unbalanced structure of the full item set tree.

Size Based Pruning:

Prune the tree if a certain depth (a certain size of the item sets) is reached. Idea: rules with too many items are difficult to interpret.
Data Mining p.7/37

University of Applied Sciences Braunschweig/Wolfenbuettel

Item set tree pruning


Support Based Pruning:

No superset of an infrequent item set can be frequent. No counters for item sets having an infrequent subset are needed.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.8/37

Searching the subset lattice


Boundary between frequent (blue) and infrequent (white) item sets:

Apriori: breadth-first search (item sets of the same size).

Eclat: depth-first search (item sets with the same prefix).


Data Mining p.9/37

University of Applied Sciences Braunschweig/Wolfenbuettel

Apriori: Breadth-first search

Example transaction database with 5 items and 10 transactions. Minimum support: 30%, i.e., at least 3 transactions must contain the item set. All one-item sets are frequent, so the full second level is needed.
Data Mining p.10/37

University of Applied Sciences Braunschweig/Wolfenbuettel

Apriori: Breadth-first search

Determining the support of item sets: for each item set, traverse the database and count the transactions that contain it (highly inefficient). Better: traverse the tree for each transaction and find the item sets it contains (efficient: can be implemented as a simple doubly recursive procedure).
Data Mining p.11/37

University of Applied Sciences Braunschweig/Wolfenbuettel

Apriori: Breadth rst search


1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

Minimum support: 30%, i.e., at least 3 transactions must contain the item set. Infrequent item sets: , , . The subtrees starting at these item sets can be pruned.
Data Mining p.12/37

University of Applied Sciences Braunschweig/Wolfenbuettel

Apriori: Breadth rst search


1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

Generate candidate item sets with 3 items (parents must be frequent).

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.13/37

Apriori: Breadth rst search


1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

Before counting, check whether the candidates contain an infrequent item set. An item set with items has subsets of size . The parent is only one of these subsets.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.14/37

Apriori: Breadth rst search


1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

The item sets and can be pruned, because contains the infrequent item set and contains the infrequent item set .

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.15/37

Apriori: Breadth rst search


1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

Only the remaining four item sets of size 3 are evaluated.

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.16/37

Apriori: Breadth rst search


1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

Minimum support: 30%, i.e., at least 3 transactions must contain the item set. Infrequent item set: .

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.17/37

Apriori: Breadth rst search


1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

Generate candidate item sets with 4 items (parents must be frequent). Before counting, check whether the candidates contain an infrequent item set.
Data Mining p.18/37

University of Applied Sciences Braunschweig/Wolfenbuettel

Apriori: Breadth-first search

The item set can be pruned, because it contains the infrequent item set. Consequence: no candidate item sets with four items. A fourth access to the transaction database is not necessary.
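To make the Apriori scheme concrete, here is a minimal sketch of level-wise candidate generation and support counting; the transaction database and the minimum support are made-up examples, not the ones from the slides.

# Small sketch of Apriori: count candidate item sets level by level and
# extend only frequent ones.
from itertools import combinations

transactions = [
    {"a", "b", "c"}, {"a", "c", "e"}, {"b", "c", "d"}, {"a", "b", "c", "d"},
    {"a", "d", "e"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"c", "d", "e"},
    {"a", "b", "d"}, {"a", "c", "d", "e"},
]
min_support = 3        # absolute minimum support (number of transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions)

items = sorted({i for t in transactions for i in t})
frequent = {frozenset([i]) for i in items if support({i}) >= min_support}
all_frequent = set(frequent)

k = 2
while frequent:
    # candidate generation: unions of frequent (k-1)-item sets with k items,
    # keeping only candidates whose (k-1)-subsets are all frequent
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    frequent = {c for c in candidates if support(c) >= min_support}
    all_frequent |= frequent
    k += 1

for s in sorted(all_frequent, key=lambda s: (len(s), sorted(s))):
    print(set(s), support(s))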
Data Mining p.19/37

University of Applied Sciences Braunschweig/Wolfenbuettel

Eclat: Depth-first search

Form a transaction list for each item (here: bit vector representation; grey: the item is contained in the transaction, white: the item is not contained in the transaction).

The transaction database is needed only once (for the single-item transaction lists).
Data Mining p.20/37

University of Applied Sciences Braunschweig/Wolfenbuettel

Eclat: Depth rst search


1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

Intersect the transaction list for item transaction lists of all other items.

with the

Count the number of set bits (containing transactions). The item set is infrequent and can be pruned.
Data Mining p.21/37

University of Applied Sciences Braunschweig/Wolfenbuettel

Eclat: Depth rst search


1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

Intersect the transaction list for , transaction lists of

with the .

Result: Transaction lists for the item sets and .


Data Mining p.22/37

University of Applied Sciences Braunschweig/Wolfenbuettel

Eclat: Depth rst search


1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

Intersect the transaction list for Result: Transaction list for the item set

and .

With Apriori this item set could be pruned before counting, because it was known that is infrequent.
Data Mining p.23/37

University of Applied Sciences Braunschweig/Wolfenbuettel

Eclat: Depth rst search


1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

Backtrack to the second level of the search tree and intersect the transaction list for and . Result: Transaction list for .

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.24/37

Eclat: Depth-first search

Backtrack to the first level of the search tree and intersect the transaction list for with the transaction lists for , , and . Result: transaction lists for the item sets , , and . Only one item set with sufficient support; prune all subtrees.
Data Mining p.25/37

University of Applied Sciences Braunschweig/Wolfenbuettel

Eclat: Depth rst search


1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

Backtrack to the rst level of the search tree and intersect the transaction list for with the transaction lists for and . Result: Transaction lists for the item sets . and
Data Mining p.26/37

University of Applied Sciences Braunschweig/Wolfenbuettel

Eclat: Depth rst search


1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

Intersect the transaction list for Result: Transaction list for Infrequent item set: . .

and

University of Applied Sciences Braunschweig/Wolfenbuettel

Data Mining p.27/37

Eclat: Depth rst search


1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

Backtrack to the first level of the search tree and intersect the transaction list for with the transaction list for . Result: transaction list for the item set. With this step the search is finished.
Data Mining p.28/37

University of Applied Sciences Braunschweig/Wolfenbuettel

Frequent item sets


1 item: 70%, 30%

2 items: 70%, 60%, 70%, 40%, 50%, 60%, 30%, 40%, 40%, 40%

3 items: 30%, 30%, 40%

Types of frequent item sets


Free item set: any frequent item set (support is higher than the minimal support).

Closed item set: a frequent item set is called closed if no superset has the same support.

Maximal item set: a frequent item set is called maximal if no superset is frequent.
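Both properties can be checked directly once the frequent item sets and their supports are known; a small illustrative sketch (names invented here), assuming a dictionary that maps each frequent item set to its support:

def closed_and_maximal(freq):
    """freq: dict mapping frozenset (item set) -> support.
    Closed: no proper superset has the same support.
    Maximal: no proper superset is frequent at all."""
    closed, maximal = set(), set()
    for s, supp in freq.items():
        supersets = [t for t in freq if s < t]     # frequent proper supersets
        if all(freq[t] != supp for t in supersets):
            closed.add(s)
        if not supersets:
            maximal.add(s)
    return closed, maximal

freq = {frozenset("a"): 0.7, frozenset("c"): 0.7, frozenset("ac"): 0.6}
print(closed_and_maximal(freq))    # all three sets are closed, only the two-item set is maximal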



Generating association rules

For each frequent item set:

Consider all pairs of disjoint sets A (antecedent) and B (consequent) whose union is the frequent item set. Common restriction: only one item in the consequent (then-part).

Form the association rule A → B and compute its confidence

conf(A → B) = supp(A ∪ B) / supp(A),

i.e. the support of the whole item set divided by the support of the antecedent.

Report rules with a confidence higher than the minimum confidence.
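A sketch of this rule generation step in Python (illustrative; it uses the common one-item-consequent restriction and assumes that, by the Apriori property, every subset of a frequent item set also appears in the support table):

def generate_rules(freq, min_conf):
    """freq: dict mapping frozenset (item set) -> support.
    Returns (antecedent, consequent item, confidence) triples."""
    result = []
    for itemset, supp in freq.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            conf = supp / freq[antecedent]          # supp(A ∪ B) / supp(A)
            if conf >= min_conf:
                result.append((antecedent, consequent, conf))
    return result

freq = {frozenset("a"): 0.7, frozenset("c"): 0.7, frozenset("ac"): 0.6}
print(generate_rules(freq, 0.8))    # both rules a -> c and c -> a, confidence about 0.86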



Generating association rules

Further rule filtering can rely on:

Require a minimum difference between rule confidence and consequent support.

Compute information gain or a related measure for antecedent (if-part) and consequent.
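The first of these filters can be applied directly to the rules produced above; a brief sketch (the threshold name min_diff is an assumption):

def filter_rules(rules, freq, min_diff=0.1):
    """Keep only rules whose confidence exceeds the support of the consequent
    by at least min_diff; otherwise the rule says little more than the base
    rate of the consequent.  rules: (antecedent, consequent item, confidence)."""
    return [(a, c, conf) for a, c, conf in rules
            if conf - freq[frozenset({c})] >= min_diff]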


Generating association rules


Example. Minimum confidence: 80%

association rule   support of all items   support of antecedent   confidence
…                  30%                    30%                     100%
…                  50%                    60%                     83.3%
…                  60%                    70%                     85.7%
…                  60%                    70%                     85.7%
…                  40%                    40%                     100%
…                  40%                    50%                     80%


Summary: association rules


Association rule induction is a two-step process:
Find the frequent item sets (minimum support).
Form the relevant association rules (minimum confidence).

Finding the frequent item sets:
Top-down search in the subset lattice / item set tree.
Apriori: breadth first search; Eclat: depth first search.
Other algorithms: FP-growth, H-Mine, LCM, Mafia, Relim etc.
Search tree pruning: no superset of an infrequent item set can be frequent.

Summary: association rules


Generating the association rules:
Form all possible association rules from the frequent item sets.
Filter interesting association rules.


Structured item sets
Sometimes, an additional structure is imposed on the item sets.

The item sets are sequences of events.


For instance: customer contacts (buying, complaint, questionnaire, …). Association rules then have the form: If … happens and then … happens, then probably … happens next.

Item sets are molecules: find frequent substructures.

The additional structure leads to a different tree structure, but the principal algorithm remains the same.

Finding frequent molecule structures


Other applications

Finding business rules and detection of data quality problems:
Association rules with a confidence close to 100% could be business rules. Exceptions might be caused by data quality problems.

Construction of partial classifiers:
Search for association rules with a given conclusion part: If …, then the customer probably buys the product.



Subgroup discovery

Classification: Find a global model (classifier) that assigns the correct class to all objects (or at least to as many as possible).

Subgroup discovery: Find interesting subgroups in the data set.

A subgroup is usually described by a few attribute values. A subgroup is interesting w.r.t. a (binary) target attribute if the distribution of the values of the target attribute differs significantly from the distribution in the whole population.

Subgroup discovery
Example: For a marketing campaign, find subgroups of customers with a high(er) chance to buy a certain product.

Target attribute: buys insurance = YES

Possible result of the subgroup discovery process (similar to association rules with predefined consequent parts of the rules):

buys insurance = YES in the whole population: 5%
buys insurance = YES in the subgroup Age = YOUNG & marital status = MARRIED: 15%

Measures of interest for subgroups

Binomial test:
The quality BT is computed from
the relative frequency of the target variable in the whole population,
the relative frequency of the target variable in the subgroup,
the size of the whole population,
the size of the subgroup.

χ²-test:
(Corresponds to the value of the χ² test for independence w.r.t. the subgroup and the target variable.)

Measures of interest for subgroups

Weighted relative accuracy:
WRAcc = (n/N) · (p − p0),
where n/N is the coverage of the subgroup and p and p0 are the relative frequencies of the target value in the subgroup and in the whole population.

True positives vs. false positives:
The number of true positives TP is weighed against the number of false positives FP via a weighting parameter.
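Both measures are simple to compute once the subgroup statistics are known. A sketch in Python: the WRAcc formula follows the definition above, while the binomial test quality is given in one commonly used form, which is an assumption here; the example subgroup sizes are also invented.

from math import sqrt

def wracc(n, N, p, p0):
    """Weighted relative accuracy: coverage n/N times the difference between
    the target share p in the subgroup and the share p0 in the population."""
    return (n / N) * (p - p0)

def binomial_quality(n, p, p0):
    """One commonly used binomial test quality function (assumed variant):
    the deviation p - p0, standardised by the spread expected under p0 and
    weighted by the square root of the subgroup size."""
    return sqrt(n) * (p - p0) / sqrt(p0 * (1 - p0))

# toy numbers based on the insurance example: 5% buyers in the population,
# 15% in a subgroup of 200 customers out of 10000 (sizes invented)
print(wracc(200, 10000, 0.15, 0.05))          # 0.002
print(binomial_quality(200, 0.15, 0.05))      # about 6.5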


Measures of interest for subgroups

Relative gain:
RG = … if the coverage of the subgroup is at least MC, and … otherwise.

MC: minimum coverage (support) of the subgroup


Subgroup discovery algorithms

The problem of subgroup discovery is similar to finding association rules. Therefore, algorithms for frequent item set and association rule mining are often adapted to subgroup discovery, for example:

Apriori-SD
FP-growth


Feature selection
Very often, some attributes (features) can be irrelevant or redundant.

Depending on the method or model, such attributes can lead to bad results in the data mining process.

Naive Bayes classifiers will be strongly affected when attributes are highly dependent. Decision trees and especially nearest neighbour classifiers are sensitive to irrelevant attributes.

Feature selection
Two basic strategies for feature selection:
Filtering refers to preselecting attributes before the corresponding model is built.

Wrapping refers to feature selection methods that select attributes in combination with the construction of the model.


Feature selection
Filtering methods:

Independence tests
between arbitrary attributes (to find dependent attributes),
between the target attribute and other attributes (to find irrelevant attributes).

Construct a (strictly pruned) decision tree and use only those attributes occurring in the decision tree.
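A sketch of such an independence based filter: the χ² statistic between an attribute and the target is computed from the observed joint frequencies; attributes with a small value (no detectable dependence on the target) could be treated as irrelevant. Data layout and threshold are assumptions here.

from collections import Counter

def chi_square(xs, ys):
    """Chi-square statistic for independence of two categorical attributes,
    computed from their observed joint frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    margin_x, margin_y = Counter(xs), Counter(ys)
    stat = 0.0
    for x in margin_x:
        for y in margin_y:
            expected = margin_x[x] * margin_y[y] / n
            observed = joint.get((x, y), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

# toy usage: attribute values and target values of ten objects
attribute = ["red", "red", "blue", "blue", "red", "blue", "red", "blue", "red", "blue"]
target    = ["yes", "yes", "no",  "no",  "yes", "no",  "yes", "yes", "no",  "no"]
print(chi_square(attribute, target))    # larger values indicate dependence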


Feature selection
Wrapping methods:

Exhaustive search (try all combinations of attributes).

Greedy search (add or delete attributes step by step; in each step choose the attribute that leads to the best increase, or the least decrease, of the performance).
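A sketch of the greedy forward variant; evaluate is assumed to build and score the model on a given attribute subset (for instance by cross-validation), with higher scores being better.

def greedy_forward_selection(candidates, evaluate):
    """Add one attribute per step, always the one whose addition improves the
    model most; stop as soon as no addition helps."""
    selected, best_score = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        scored = [(evaluate(selected + [a]), a) for a in remaining]
        score, attribute = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break                      # no remaining attribute improves the model
        selected.append(attribute)
        remaining.remove(attribute)
        best_score = score
    return selected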


Other topics

Change detection

Evolving systems (online adaptive learning for data streams)

Active learning: Classification (or regression) problems where labeling the data objects is expensive or complicated and most of the data objects are not labeled. The active learning model tries to select those data objects for labeling which best increase the performance of the model.
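As an illustration of the selection step in active learning, a sketch of uncertainty (margin) sampling, which is only one common strategy and an assumption here; predict_proba follows the scikit-learn convention.

import numpy as np

def select_for_labeling(model, unlabeled, batch_size=10):
    """Ask for the labels of the objects the current classifier is least sure
    about: the smaller the margin between the two most probable classes, the
    more informative the label is expected to be."""
    probabilities = model.predict_proba(unlabeled)
    sorted_p = np.sort(probabilities, axis=1)
    margins = sorted_p[:, -1] - sorted_p[:, -2]
    return np.argsort(margins)[:batch_size]    # indices of the most uncertain objects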

