Frank Klawonn
f.klawonn@fh-wolfenbuettel.de
Basic references
M.H. Dunham: Data Mining: Introductory and Advanced Topics. Prentice Hall, Upper Saddle River, NJ (2002).
J. Han, M. Kamber: Data Mining (2nd ed.). Morgan Kaufmann, San Mateo, CA (2005).
D. Hand, H. Mannila, P. Smyth: Principles of Data Mining. MIT Press, Cambridge, MA (2001).
D.T. Larose: Data Mining Methods and Models. Wiley, Chichester (2006).
D.T. Larose: Discovering Knowledge in Data: An Introduction to Data Mining. Wiley, Chichester (2006).
I.H. Witten, E. Frank: Data Mining (2nd ed.). Morgan Kaufmann, San Mateo, CA (2005).
University of Applied Sciences Braunschweig/Wolfenbuettel
Web references
http://www.cs.waikato.ac.nz/ml/ : the WEKA data mining software accompanying the book by Witten/Frank
UCI repository: a collection of example data sets with various properties
http://www.r-project.org/ : R, a powerful statistics software
Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. (Hand/Mannila/Smyth)
Data Mining
Characteristics of data
Characteristics of knowledge
(Customer) data
ID     Name     Age   Sex      Income   ...
...    ...      ...   ...      ...
2448   Miller   35    Male     6000
2449   Smith    39    Female   7000
...    ...      ...   ...      ...
Knowledge:
80% of our customers are between 30 and 40 years old and earn between 5000$ and 9000$ per month.
Not all statements are equally important, equally substantial, equally useful. Knowledge must be assessed. Assessment Criteria
Correctness (probability, success in tests)
Generality (range of validity, conditions of validity)
Usefulness (relevance, predictive power)
Comprehensibility (simplicity, clarity, parsimony)
Novelty (previously unknown, unexpected)
Examples
A bank has decades of experience in giving loans to customers. Large amounts of data of customers who have or have not returned their loans are available. Is it possible to derive an algorithm from the data that decides for a new customer desiring a loan, whether the loan should be granted or not?
Examples
Problems/questions
Are the data complete and correct?
Is the complete available customer information (age, income, address, ...) needed for the prediction?
Is all the necessary information about the customer available?
Are the data representative?
Can an unambiguous prediction/decision be made?
Examples
A producer of porcelain wants to install an automatic quality check device that sorts out broken parts. The produced parts are stimulated by an acoustic signal and a frequency spectrum is measured. The parts are manually classified as broken or OK. Is it possible to derive an algorithm from the data that classifies a new part as broken or OK, given the measured frequency spectrum?
Examples
A bank provides credit cards for customers and wants to detect as many fraud transactions as possible.
Problems/questions
Are the data complete and correct?
Is the complete available information (amount, location of the transaction, customer address, previous customer history, ...) needed for the prediction?
Examples
Churn detection: Given customer data and history, find the possible candidates for churners.
Examples
Various properties of garbage particles are measured (colour, weight, magnetism, ...). Is it possible to sort the particles automatically into groups like paper, glass, metal, plastic?
(Supervised) Classication
Common property of all previous examples: Based on certain measurements/information, an assignment of an object (customer, porcelain part, garbage particle) to one group of a finite number of classes (grant loan yes/no, broken/OK, churner/non-churner, paper/glass/metal/plastic) is needed. Such tasks are called (supervised) classification problems.
Examples
Given (selected) stock prices and indices of today, predict a specific stock price or index value for tomorrow. Given today's weather conditions, predict the (local) temperature for tomorrow. In both cases, historical data for decades are available.
Regression
These problems are also supervised learning problems. Supervised refers to the fact that the outcome/prediction for the given/historical data is known. In contrast to classification problems, the variable to be predicted (stock price, temperature) is continuous. It can take arbitrary real values. Such tasks are called regression problems.
Examples
Given a customer database. Are the properties of the customers just randomly distributed or can they be grouped into typical customer segments? (poor young students, rich yuppies, OAPs with medium capital)
Examples
Given the train tickets bought in Germany. Are there main connections that are used typically (at certain times/days)? (like from region Berlin to region Hamburg, from region Frankfurt to region Munich, )
(Unsupervised) Classification
In the two previous examples, the data should be grouped into classes. Similar data should be put into the same class. The classes are not known in advance. Such tasks are called unsupervised classification problems, usually solved by cluster analysis.
Examples
Market basket analysis: Are there typical combinations of items that customers tend to buy together? For instance: Customers who buy wallpaper and paint also buy wallpaper paste in 90% of the cases.
Examples
Such tasks are called frequent item set mining and association rule mining. One is only interested in substructures of the data set, not in describing or covering the whole data set.
Subgroup discovery
aims at finding subsets where a given class (e.g. customers who tend to buy certain products) is significantly over- or underrepresented.
Subgroup discovery
Subgroup discovery can be viewed as a partial classification problem. Not the whole data set needs to be classified. It is sufficient to find subsets with good classification rates.
Change detection
Are there significant changes in the data over time? E.g. does the company lose a customer group or is it winning a new one? Is the quality of certain materials or products changing?
incremental learning:
Analysing the data online while they arrive in a continuous data stream. It is assumed that the source/model generating the data does not change over time.
evolving systems:
Changes in the model parameters over time might be possible. The information derived from old data might not be applicable to new data.
molecular structures (2D/3D, graphs)
images (How to retrieve images from an image database?)
CRISP
Business understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives.
Data understanding
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.
Data preparation
The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modelling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modelling tools.
Modelling
In this phase, various modelling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.
Evaluation
At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
Deployment
Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organised and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. However, even if the analyst will not carry out the deployment effort it is important for the customer to understand up front what actions will need to be carried out in order to actually make use of the created models.
Feature extraction, transformation, treatment of missing values
Choice of the model (classifier, regression model, cluster analysis technique, ...) and estimation of the model parameters
Evaluation: model validation and selection. Is the derived model suitable to make reliable predictions for new data?
Statistical reasoning
Definition.
Let A and B be events with P(A) > 0.
A supports B, if P(B | A) > P(B) holds.
A speaks against B, if P(B | A) < P(B) holds.
A is irrelevant for B, if P(B | A) = P(B) holds.
Statistical reasoning
Definition.
The event A supports the event B under the general condition (event) C, if P(B | A ∩ C) > P(B | C) holds.
Player one chooses one of the following wheels, then player two will choose a wheel. Then both turn their wheel. The one with the higher number is the winner.
Causality
"A statistical survey has shown that students receiving a grant perform better in their exams."
Are the grants the reason that the students perform better, since they do not have to earn money for their studies and can spend more time for learning? Or do only students with better results in school or early university years receive a grant?
Causality
Two balls are drawn without replacement from a bag with two white and two black balls. What is the probability that the second ball will be white, when the first ball that was drawn is white?
Causality
The first ball was drawn, but hidden from the observer. What is the probability that the first ball was white, when the second ball that was drawn is white? Wrong answer: 0.5, since the second ball cannot have any influence on the first ball.
Causality
Correct answer: 1/3.
(Probability tree: P(second white | first white) = 1/3, P(second white | first black) = 2/3, so P(first white | second white) = (1/2 · 1/3) / (1/2) = 1/3.)
Causality
Another reason why the answer 0.5 is wrong: Assume that there are two black balls and one white ball in the bag in the beginning. What is the probability that the first ball is white? Answer: 1/3. What is the probability that the first ball was white, when the second ball is white? Answer: 0 (the single white ball must then have been drawn second).
Hidden variables
The University of Fantasia provides grants for students based on an individual selection scheme.
         No. of applications   No. of grants
female   3000                  373
male     3000                  1304
total    6000                  1677
Further investigations have shown that the female students applying for grants have more or less the same school marks as the male applicants. Does this statistic prove that the selection scheme favours male students?
Hidden variables
Looking at the distributions over the subjects, it turns out that natural sciences and engineering students are favoured compared to social sciences students.
No. of applications   Natural Sci.   Engineering   Social Sci.
female                400            300           2300
male                  1400           1200          400
No. of grants         Natural Sci.   Engineering   Social Sci.
female                200            150           23
male                  700            600           4
Simpson's paradox
(Tables: patients in groups A and B receiving the old or the new treatment (250 and 1050 patients per cell), with outcomes No effect / Cured of 70 vs. 180 (72%) and 180 vs. 70 (28%).)
Simpson's paradox
(Aggregated table over both groups: Old vs. New treatment, outcomes No effect / Cured.)
Simpson's paradox
(Analogous tables with sales figures: We vs. Competitor in Europe and Asia, 2005 vs. 2006, with changes of +157% and -61% per region, and the aggregated totals for We.)
slength swidth plength pwidth species
5.1 3.5 ...
...
The header (first line) specifies names for the attributes. The following lines contain the data with the values for the defined attributes separated by blanks or tabs. Therefore, each of the lines with the data must contain 5 values.
Statistics tool R
R uses a type-free command language. Assignments are written in the form
x <- y
y is assigned to x.
The object y must be defined (generated), before it can be assigned to x. Declaration of x is not required.
R: Reading a le
iris<-read.table(file.choose(),header=T)
pw <- iris$pwidth
assigns the column named pwidth of the data set contained in the object iris to the object pw. The command
print(pw)
prints the content of pw.
Empirical mean
The empirical mean of a sample x_1, ..., x_n is
  x̄ = (1/n) · Σ_{i=1}^n x_i.
Note the difference between the mean of a random variable and the empirical mean of a sample.
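In R (a minimal sketch, assuming pw contains the petal width values read in as above):
mean(pw)              # built-in empirical mean
sum(pw) / length(pw)  # the same value, computed directly from the definition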
(Empirical) median
For the sorted sample x_(1) ≤ ... ≤ x_(n), the (empirical) median is
  x_((n+1)/2)                   if n is odd,
  (x_(n/2) + x_(n/2+1)) / 2     if n is even.
(Illustration: the median as the middle value for odd n and as the midpoint of the two middle values for even n.)
The mean and median can also be applied to data objects consisting of more than one (numerical) column, yielding a vector of mean/median values.
Empirical variance
Definition.
The sample variance or empirical variance is defined by
  s² = (1/(n−1)) · Σ_{i=1}^n (x_i − x̄)².
s = √(s²) is called the (empirical) standard deviation.
R: Empirical variance
var(pw) computes the empirical variance, sd(pw) the standard deviation of the data in pw.
The difference between the largest and the smallest value of the sample is called span.
Interquartile range
The empirical variance or the standard deviation as well as the span are measures of dispersion. The span is extremely sensitive to outliers, but also the empirical variance is sensitive to outliers. Therefore, the interquartile range (the difference between the 75%- and the 25%-quantile)
is often used as a measure of dispersion. Interquartile range in R (for the data in pw): IQR(pw)
Visualisation
Bar charts
A bar chart shows the relative or absolute frequencies for the values of an attribute with a finite domain. In R, the package lattice is required.
Installation of R-packages
Once the computer is connected to the Internet, packages (e.g. lattice) can be downloaded and installed by the command
install.packages()
After choosing the mirror site for download, available packages are shown in alphabetical order. (Packages need to be downloaded and installed only once, but they must be loaded again with library() in each new R session.)
Bar charts
example:
cl <- iris$species
barchart(cl)
(Bar chart: absolute frequencies (Freq) of Iris setosa, Iris versicolor and Iris virginica.)
Histograms
A histogram shows the absolute number of data or the relative frequency of data in different classes. For numerical samples, bins (intervals) representing the classes must be dened. In most cases, intervals of the same length are chosen. The area of each rectangle is proportional to the number of data in the corresponding range.
R: Histograms
The function
hist(pw,breaks=6,prob=T,main="petal width")
generates a histogram for the (numerical) data in pw, partitioning the domain of pw using breaks=6 (5 intervals of the same length), showing relative frequencies (prob=T) with the caption "petal width". Using the command postscript("outputfile.eps") the generated graphics will not be shown. It is stored in the PostScript file "outputfile.eps".
R: Histograms
(Histogram of pw with the caption "petal width": Density from 0.0 to 0.6 on the vertical axis, petal width from 0.0 to 2.5 on the horizontal axis.)
Empirical cdf
Definition.
The empirical cumulative distribution function is given by
  F(x) = (number of sample values x_i with x_i ≤ x) / n.
plot.ecdf(pw,main="petal width")
generates the empirical cdf for the data set pw, including the title "petal width" in the graphics.
Empirical cdf
(Plot: empirical cdf Fn(x) of the petal width, titled "petal width"; x from 0.0 to 2.5, Fn(x) from 0.0 to 1.0.)
Boxplots
(Figures: anatomy of a boxplot, built up step by step: the median, the box covering the interquartile range (about 50% of the data), whiskers extending about 1.5 times the interquartile range, and outliers beyond the whiskers.)
Boxplots
or box and whiskers plots summarise important characteristics of a sample.
Boxplots
Usually, boxplots for different samples are compared. (Here: The sepal length for setosa, versicolor and virginica.)
Boxplots
1. Determine the median. Draw a thick line at the position of the median.
2. Determine the 25%- and the 75%-quartiles q25 and q75 for the sample. Draw a box limited by these two quartiles. The other dimension of the box can be chosen arbitrarily.
3. iqr = q75 − q25 is the interquartile range. The inner fence is defined by the two values q25 − 1.5·iqr and q75 + 1.5·iqr.
4. Find the smallest data point greater than the lower inner fence and the largest data point smaller than the upper inner fence. Add "whiskers" to the box extending to these two data points.
5. Data points lying outside the box and the whiskers are called outliers. Enter these data points in the diagram, for instance by circles.
6. Sometimes, extreme outliers (outside the outer fence defined by q25 − 3·iqr and q75 + 3·iqr) are drawn in a different way than mild outliers outside the whiskers, but within the inner fence.
(A small R example follows below.)
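A small R sketch of boxplots (assuming the objects pw and iris from above):
boxplot(pw, main="petal width")          # boxplot of a single sample
boxplot(slength ~ species, data=iris)    # boxplots of the sepal length per species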
Scatterplots with R
plot(iris)
(Scatterplot matrix of the attributes slength, swidth, plength, pwidth.)
Scatterplots with R
species <- which(colnames(iris)=="species")
super.sym <- trellis.par.get("superpose.symbol")
splom(iris[1:4], groups = species, data = iris,
      panel = panel.superpose)
Scatterplots of the numerical attributes using different symbols for the classes.
Scatterplots with R
(Scatterplot matrix of the numerical attributes with different symbols for the three species.)
Scatterplots with R
splom(iris[1:4], groups = species, data = iris,
      panel = panel.superpose,
      key = list(title = "Three Varieties of Iris",
                 columns = 3,
                 points = list(pch = super.sym$pch[1:3],
                               col = super.sym$col[1:3]),
                 text = list(c("Setosa", "Versicolor", "Virginica"))))
Note: This needs the additional R package lattice which has to be installed and loaded first.
Scatterplots with R
3D scatterplots with R
scatterplot3d(iris2$pw,iris2$pl,iris2$sw, pch=iris2$species)
(3D scatterplot of iris2$pw, iris2$pl and iris2$sw.)
Enriched 3D scatterplots
PCA
is a technique for dimension reduction, which can also be used for visualisation.
Principal component analysis (PCA)
PCA aims at finding a linear mapping of the data to a lower-dimensional space that preserves as much of the variance of the original data as possible without stretching during the projection.
PCA
A projection of the two-dimensional data to the (one-dimensional) line (principal component) leads to a dimension reduction with little loss of variance.
PCA
Projection of p-dimensional data to a q-dimensional space where q < p. First, the data are centred in the origin of the coordinate system. Then a (q x p)-matrix is needed for the projection.
Covariance
The empirical covariance of two attributes X and Y is defined as
  s_xy = (1/(n−1)) · Σ_{i=1}^n (x_i − x̄)(y_i − ȳ).
In case X and Y are independent, s_xy ≈ 0 should hold.
Covariance
The empirical correlation coefficient of X and Y is defined as the normalised covariance:
  r_xy = s_xy / (s_x · s_y).
−1 ≤ r_xy ≤ 1 always holds. The larger |r_xy|, the stronger X and Y correlate.
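In R, covariance and correlation are computed with cov and cor (a sketch, assuming the iris data from above):
cov(iris$plength, iris$pwidth)   # empirical covariance of two attributes
cor(iris$plength, iris$pwidth)   # empirical correlation coefficient
cor(iris[,1:4])                  # correlation matrix of all numerical attributes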
PCA
The projection matrix for PCA is the matrix whose rows are the normalised eigenvectors of the covariance matrix of the data belonging to the q largest eigenvalues λ_1 ≥ ... ≥ λ_q.
PCA
The sum of the variances of the projected data is the sum of these eigenvalues: λ_1 + ... + λ_q.
When PCA is used for dimension reduction, it is important to know how much of the original variance is covered by the projection to q dimensions: the proportion (λ_1 + ... + λ_q) / (λ_1 + ... + λ_p).
PCA with R
species <- which(colnames(iris)=="species")
iris_pca <- prcomp(iris[,-species], center=T, scale=T)
The nominal attribute species must be omitted for PCA; this is what iris[,-species] does here. If all attributes of a data set dset are numerical, then
dset_pca <- prcomp(dset, center=T, scale=T)
is sufficient.
PCA with R
center: a logical value indicating whether the variables should be shifted to be zero centred. Alternatively, a vector of length equal to the number of columns of the data set can be supplied. The value is passed to scale. scale: a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place. The default is FALSE, but in general scaling is advisable. Alternatively, a vector of length equal to the number of columns of the data set can be supplied.
PCA with R
> print(iris_pca)
Standard deviations:
[1] 1.7061120 0.9598025 0.3838662 0.1435538
Rotation:
              PC1         PC2        PC3        PC4
slength  0.5223716 -0.37231836  0.7210168  0.2619956
swidth  -0.2633549 -0.92555649 -0.2420329 -0.1241348
plength  0.5812540 -0.02109478 -0.1408923 -0.8011543
pwidth   0.5656110 -0.06541577 -0.6338014  0.5235463
PCA with R
> summary(iris_pca)
Importance of components:
                         PC1   PC2    PC3     PC4
Standard deviation     1.706 0.960 0.3839 0.14355
Proportion of Variance 0.728 0.230 0.0368 0.00515
Cumulative Proportion  0.728 0.958 0.9949 1.00000
> iris_pca$center
 slength   swidth  plength   pwidth
5.843333 3.054000 3.758667 1.198667
PCA with R
plot(predict(iris_pca)[])
(Scatterplot of the data projected onto the first two principal components PC1 and PC2.)
Digression: Regression
Given: data (for example: two real-valued attributes)
Choose a model (for example: a straight line).
Define an error or goodness of fit function (for example: the mean squared error).
Find the optimum of the fitting function (for example: standard least squares regression).
Digression: Regression
(Plot: data points and fitted regression function; x from 0.0 to 3.0, y from 0.00 to 0.20.)
Digression: Regression
(3D plot of the error function over the two model parameters.)
Multidimensional scaling
tries to map the high-dimensional data to a low-(2- or 3-)dimensional space with the aim of preserving the distances between the data as much as possible. The definition of a suitable error measure is required.
MDS
Typical error measures compare the distances d_ij between the original data points with the distances D_ij between the projected points, e.g. the (normalised) sum of squared differences Σ_{i<j} (d_ij − D_ij)², or the Sammon stress
  E = (1 / Σ_{i<j} d_ij) · Σ_{i<j} (d_ij − D_ij)² / d_ij.
MDS
The error is often called stress. Optimisation w.r.t. the last error measure is called the Sammon method. For none of these error functions an analytical solution for the minimum is known. Therefore a gradient method is applied for minimising the error function.
To minimise or maximise a function in multiple variables, gradient techniques are one possibility.
Gradient method
The gradient (the vector given by the partial derivatives w.r.t. the single variables) points in the direction of the steepest ascent of the function. In order to maximise the function, one starts at an arbitrary point, computes the gradient, moves in the direction of the gradient, computes the gradient at the new point and continues until convergence. In order to minimise a function, one simply moves in the opposite direction of the gradient. A gradient method can only find local extrema at best!
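A minimal R sketch of a gradient descent; the function, starting point and step width are arbitrary examples, not taken from the slides:
f     <- function(x) (x[1] - 1)^2 + 2 * (x[2] + 0.5)^2    # example function to be minimised
gradf <- function(x) c(2 * (x[1] - 1), 4 * (x[2] + 0.5))  # its gradient
x   <- c(5, 5)   # arbitrary starting point
eta <- 0.1       # step width
for (i in 1:100) {
  x <- x - eta * gradf(x)   # move against the gradient (minimisation)
}
print(x)   # close to the minimum (1, -0.5)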
MDS
Determining the gradient of the stress function w.r.t. the coordinates of the projected points yields the update rule used in the gradient descent.
MDS algorithm
1. Given: data set, projection dimension q, step width, threshold value, maximum number of steps.
2. Compute the pairwise distances in the original space and initialise the projected points.
3. Repeatedly update the projected points in the direction of the negative gradient of the stress, until the improvement falls below the threshold or the maximum number of steps is reached.
> d.iris <- dist(iris[,-species])
> iris.sammon <- sammon(d.iris, k=2)
Initial stress        : 0.00691
stress after 10 iters : 0.00492, magic = 0.092
stress after 20 iters : 0.00447, magic = 0.213
stress after 30 iters : 0.00408, magic = 0.030
stress after 40 iters : 0.00405, magic = 0.500
> plot(iris.sammon)
Note that duplicate tuples must be removed in advance. Otherwise zero distances lead to division by zero.
MDS
(Plot of the Sammon projection of the iris data: iris.sammon$points[,1] against iris.sammon$points[,2].)
Scatterplots
(Scatterplot matrix of the attributes X1, X2, X3.)
PCA
(Scatterplot of the same data projected onto the first two principal components PC1 and PC2.)
MDS
(Sammon projection of the same data: sampo[,1] against sampo[,2].)
Parallel coordinates
parallel(iris)
(Parallel coordinates plot with one vertical axis per attribute: species, pwidth, plength, swidth, ...)
Such a data table is called a data set. The columns are called attributes, variables or features. A single line refers to an individual, instance, case, object, datum, record or entity.
Types of attributes
Nominal attributes
Examples:
Sex (male/female) (binary attribute)
Major subjects (statistics, databases, ...)
Nationality (German/English/French/...)
Types of attributes
Ordinal attributes
have values that can be ordered (ranked), but differences between values are not meaningful.
Types of attributes
Interval variables
are numbers measured in the same unit. However, there is no absolute definition of zero. Differences can be calculated, but sums or products are meaningless. Example: temperature measured in °C.
Types of attributes
Ratio attributes
have a well-defined location of zero, in contrast to interval attributes. Quotients and sums are meaningful. Example: distance.
Types of attributes
Integer attributes
can be interval or ratio attributes. Example: the number of children of a customer.
Types of attributes
Continuous attributes
Examples:
Temperature (interval attribute)
Distance (ratio attribute)
There are special attributes like angles that behave only locally as an ordinal or interval attribute.
Missing Values
For some instances values of single attributes might be missing. Causes for missing values:
broken sensors
refusal to answer a question
irrelevant attribute for the corresponding object (pregnant (yes/no) for men)
When there are only few missing values: remove objects with missing values.
Imputation of missing values (mean value, median, most frequent value, or estimation based on other attributes).
Treatment of missing values before or during the application of data mining algorithms (depending on the problem and the algorithm).
(A small R sketch follows below.)
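A small R sketch of simple imputation; the data frame d and its column x are hypothetical placeholders:
# replace missing values of a numerical attribute by the median of the observed values
d$x[is.na(d$x)] <- median(d$x, na.rm=TRUE)
# or remove all objects (rows) with at least one missing value
d.complete <- na.omit(d)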
Missing completely at random (MCAR):
Let Y denote the (multivariate) (random) variable formed by the other attributes apart from X. A missing value for X is MCAR if the probability that a value for X is missing depends neither on the true value of X nor on the other variables:
  P(X missing | X, Y) = P(X missing).
Example: The maintenance staff sometimes forgets to change the batteries of a sensor, so that the sensor sometimes does not provide any measurements. MCAR is also called Observed At Random (OAR).
Missing at random (MAR):
The probability that a value for X is missing does not depend on the true value of X (but may depend on the other attributes):
  P(X missing | X, Y) = P(X missing | Y).
Example: The maintenance staff does not change the batteries of a sensor when it is raining, so that the sensor does not always provide measurements when it is raining.
Nonignorable:
The probability that a value for X is missing depends on the true value of X.
Example: A sensor for the temperature will not work when there is frost. In the cases of MCAR and MAR, the missing values can be estimated, at least in principle, when the data set is large enough, based on the values of the other attributes. (The cause for the missing values is ignorable.) In the extreme case of the temperature sensor, it is impossible to provide any statement concerning temperatures below 0°C.
Missing values
In the case of MCAR, it can be assumed that the missing values follow the same distribution as the observed values of . In the case of MAR, the missing values might not follow the distribution of . But by taking the other attributes into account, it is possible to derive reasonable imputations for the missing values. In the case of nonignorable missing values it is impossible to provide sensible estimations for the missing values.
A classification problem
Darts results
1D darts distributions
Cost matrix for misclassifications
General assumption: different misclassifications can cause different costs (e.g. posting costs for resending the item, loss of reputation, payment for compensation).
The misclassification rate corresponds to the simple cost matrix with rows indexed by the true class and columns by the predicted class, containing 0 on the diagonal (correct classification) and 1 everywhere else.
Expected loss
The expected loss when predicting class c_i for an object with attribute values x is
  loss(c_i | x) = Σ_j P(c_j | x) · cost(c_i, c_j),
and the class with minimal expected loss is predicted:
  predicted class = argmin_i loss(c_i | x).
Classifying a new case with a decision tree: If the node is a leaf, return the class assigned to the node. Otherwise, test the attribute associated with the node, follow the branch labeled with the outcome of the test, and apply the algorithm recursively.
Build the decision tree from top to bottom (from the root to the leaves):
Compute an evaluation measure for all attributes.
Select the attribute with the best evaluation.
Divide the example cases according to the values of the test attribute.
Apply the procedure recursively to the subsets.
Terminate the recursion if all cases belong to the same class or no more test attributes are available.
(An R example follows below.)
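Decision trees can be induced in R, for example, with the package rpart (a sketch using the iris data from above; rpart is not covered in these slides):
library(rpart)
iris.tree <- rpart(species ~ ., data=iris, method="class")  # grow a classification tree
print(iris.tree)                                            # text representation of the tree
plot(iris.tree); text(iris.tree)                            # draw the tree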
Assignment of drug
(without patient attributes) always drug A or always drug B: 50% correct (in 6 of 12 cases)
Division w.r.t. Sex (male/female):
(Table: patient No, Sex and Drug for the 12 patients.)
Assignment of drug
male: correct in 3 of 6 cases, female: correct in 3 of 6 cases
total: 50% correct (in 6 of 12 cases)
Sort according to age. Find the best age split, here: ca. 40 years.
Assignment of drug
Both age groups: correct in 4 of 6 cases each
total: 67% correct (in 8 of 12 cases)
Division w.r.t. Blood pressure (high/normal/low):
(Table: patient No, Blood pressure and Drug; blood pressure high: drugs A A A, normal: A A A B B B, low: B B B.)
Assignment of drug: high → A (3 of 3 correct), low → B (3 of 3 correct), normal → 3 of 6 correct.
For the patients with normal blood pressure, a further division w.r.t. Sex yields only 2 of 3 and 2 of 3 correct (4 of 6).
Only patients with normal blood pressure: sort according to age and find the best age split, here: ca. 40 years.
Assignment of drug: 3 of 3 and 3 of 3 correct (6 of 6)
total: 100% correct
Notation:
n      total number of case or object descriptions
n_c    absolute frequency of the class c
n_a    absolute frequency of the attribute value a
n_ca   absolute frequency of the combination of the class c and the value a
p_c    relative frequency of the class c, p_c = n_c / n

function grow_tree(S)
begin
  best_v := WORTHLESS;
  for all untested attributes A do
    compute frequencies n_c, n_a, n_ca for the cases in S;
    compute value v of an evaluation measure using n_c, n_a, n_ca;
    if v > best_v then best_v := v; best_A := A; end;
  end;
  if best_v = WORTHLESS
  then create leaf node x; assign majority class of S to x;
  else create test node x; assign test on attribute best_A to x;
       for all values a of best_A do x.child[a] := grow_tree(S restricted to best_A = a); end;
  end;
  return x;
end;
Evaluation Measures
Evaluation measure used in the above example: rate of correctly classified example cases. Advantage: simple to compute, easy to understand. Disadvantage: works well only for two classes. If there are more than two classes, the rate of misclassified example cases neglects a significant amount of the available information. Only the majority class, that is, the class occurring most often in (a subset of) the example cases, is really considered. The distribution of the other classes has no influence. However, a good choice here can be important for deeper levels of the decision tree.
Evaluation measures
Therefore: use evaluation measures that take the whole distribution of the classes into account.
Here: information gain and the χ² measure (well-known in statistics).
Information gain
H(C): entropy of the class distribution (C: class attribute)
H(C|A): expected entropy of the class distribution if the value of the attribute A becomes known
I_gain(C, A) = H(C) − H(C|A): expected entropy reduction or information gain
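A minimal R sketch of the entropy and information gain computation (the two helper functions are written here for illustration and are not part of the original slides):
entropy <- function(x) {                 # Shannon entropy of a discrete sample
  p <- table(x) / length(x)
  p <- p[p > 0]                          # ignore empty categories
  -sum(p * log2(p))
}
info.gain <- function(class, attribute) {              # expected entropy reduction
  ha <- sapply(split(class, attribute), entropy)       # entropy per attribute value
  wa <- table(attribute) / length(attribute)           # relative frequencies of the values
  entropy(class) - sum(wa * ha)
}
# usage, e.g. for the iris data: info.gain(iris$species, cut(iris$pwidth, 3))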
Example
ID   Height   Weight   Long hair   Sex
 1   m        n        n           m
 2   s        l        y           f
 3   t        h        n           m
 4   s        n        y           f
 5   t        n        y           f
 6   s        l        n           f
 7   s        h        n           m
 8   m        n        n           f
 9   m        l        y           f
10   t        n        n           m
Evaluation of the attribute Height: class (Sex) distribution for each of the values s, m, t.
Example
Evaluation of the attribute Weight: class (Sex) distribution for each of the values l, n, h.
Example
Evaluation of the attribute Long hair: class (Sex) distribution for the values y and n.
Example
The attribute Weight is selected for the root node. The remaining data table to be considered in the node Weight = n:
ID   Height   Long hair   Sex
 1   m        n           m
 4   s        y           f
 5   t        y           f
 8   m        n           f
10   t        n           m
Example
Evaluation of the attributes Height and Long hair on this reduced data table.
Example
(Tree so far: Weight at the root, the attribute Long hair tested in the node Weight = n.)
Example
For the remaining node, only the attribute Height is left, with the remaining data table:
ID   Height   Sex
 1   m        m
 8   m        f
10   t        m
Example
(Resulting tree: Weight at the root, then Long hair, then Height.)
A complex tree
(Tree diagram testing Long hair, Height and Weight, the latter in several nodes, with some leaves left undetermined.)
Shannon Entropy:
Suppose there is an oracle, which knows the obtaining alternative, but responds only if the question can be answered with yes or no. A better question scheme than asking for one alternative after the other can easily be found: Divide the set into two subsets of about equal size. Ask for containment in an arbitrarily chosen subset. Apply this scheme recursively: the number of questions is bounded by log2(n), rounded up to the next integer.
Question/Coding Schemes
Shannon entropy: H = −Σ_i p_i log2(p_i) (bit/symbol)
(Example with five alternatives: expected number of questions per symbol for a linear traversal compared to splitting into subsets of about equal size.)
Question/Coding Schemes
Splitting into subsets of about equal size can lead to a bad arrangement of the alternatives into subsets and thus a high expected number of questions. Good question schemes take the probability of the alternatives into account. Shannon-Fano Coding (1948)
Build the question/coding scheme top-down. Sort the alternatives w.r.t. their probabilities. Split the set so that the subsets have about equal probability (splits must respect the probability order of the alternatives).
Question/Coding Schemes
Huffman Coding
(1952)
Build the question/coding scheme bottom-up. Start with one element sets. Always combine those two sets that have the smallest probabilities.
Question/Coding Schemes
Shannon entropy: H = −Σ_i p_i log2(p_i) (bit/symbol)
(Example: the same five alternatives encoded with Shannon-Fano coding (1948) and with Huffman coding (1952); code trees and expected code lengths in bit/symbol.)
Question/Coding Schemes
It can be shown that Huffman coding is optimal if we have to determine the obtaining alternative in a single instance. (No question/coding scheme has a smaller expected number of questions.) Only if the obtaining alternative has to be determined in a sequence of (independent) situations, this scheme can be improved. Idea: Process the sequence not instance by instance, but combine two, three or more consecutive instances and ask directly for the obtaining combination of alternatives.
Question/Coding Schemes
Although this enlarges the question/coding scheme, the expected number of questions per identification is reduced (because each interrogation identifies the obtaining alternative for several situations). However, the expected number of questions per identification cannot be made arbitrarily small. Shannon showed that there is a lower bound, namely the Shannon entropy.
Shannon entropy: H = −Σ_i p_i log2(p_i) bit/symbol
If the probability distribution allows for a perfect Huffman code (code efficiency 1), the Shannon entropy can easily be interpreted as the expected number of yes/no questions needed to identify the obtaining alternative.
Information gain is biased towards many-valued attributes, i.e., of two attributes having about the same information content it tends to select the one having more values. Normalization removes / reduces this bias: the information gain ratio (Quinlan 1986 / 1993) divides the information gain by the entropy of the attribute, the symmetric variants (López de Mántaras 1991) normalise by the joint entropy of class and attribute.
The reasons are quantization effects caused by the finite number of example cases (due to which only a finite number of different probabilities can result in estimations) in connection with the following theorem:
Theorem: Let A, B, and C be three attributes with finite domains and let their joint probability distribution be strictly positive, i.e., P(a, b, c) > 0 for all value combinations. Then
  I_gain(C, A ∪ B) ≥ I_gain(C, B),
with equality holding only if the attributes C and A are conditionally independent given B, i.e., if P(c | a, b) = P(c | b).
χ² measure
Compares the actual joint distribution with a hypothetical independent distribution. Uses absolute comparison. Can be interpreted as a difference measure.
Contingency tables
(Table: joint absolute frequencies n_ij of the values of two attributes, with the row and column sums as marginals.)
The random variable X can take the values x_1, ..., x_k, the random variable Y the values y_1, ..., y_m.
The row sums n_i. and the column sums n_.j are the marginal (absolute) frequencies. If X and Y are independent, then the expected absolute frequencies are
  e_ij = n_i. · n_.j / n
for all i and all j.
χ² independence test
Example. 1000 people were asked which political party they voted for in order to find out whether the choice of the party and the sex of the voter are independent.
pol. party    female   male   sum
SPD              200    170    370
CDU/CSU          200    200    400
Grüne             45     35     80
FDP               25     35     60
PDS               20     30     50
Others            22      5     27
No answer          8      5     13
sum              520    480   1000
χ² independence test
Expected frequencies:
pol. party    female   male
SPD            192.4   177.6
CDU/CSU        208.0   192.0
Grüne           41.6    38.4
FDP             31.2    28.8
PDS             26.0    24.0
O/NA            20.8    19.2
χ² independence test
The χ² statistic sums the terms (n_ij − e_ij)² / e_ij over all cells of the table. For instance, for the cell (CDU/CSU, female): (200 − 208)² / 208 ≈ 0.31.
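In R, the same test can be carried out with chisq.test (a sketch; the counts are taken from the table above, with Others and No answer combined):
votes <- matrix(c(200, 170,    # SPD
                  200, 200,    # CDU/CSU
                   45,  35,    # Gruene
                   25,  35,    # FDP
                   20,  30,    # PDS
                   30,  10),   # Others / No answer
                ncol=2, byrow=TRUE,
                dimnames=list(c("SPD","CDU/CSU","Gruene","FDP","PDS","O/NA"),
                              c("female","male")))
chisq.test(votes)   # test statistic, degrees of freedom and p-value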
Form equally sized or equally populated intervals. Build a decision tree using only the numeric attribute. Flatten the tree to obtain a multi-interval discretization.
Sort the example cases according to the attribute's values. Construct a binary symbolic attribute for every possible split (values: ≤ threshold and > threshold). Compute the evaluation measure for these binary attributes. Possible improvements: Add a penalty depending on the number of splits.
Cut points partitioning the domain of the attribute are needed. These cut points are chosen in such a way that the entropy induced by this partition is minimised.
For binary splits (only one cut point) all boundary points are considered and the one with the smallest entropy is chosen. For multiple splits a recursive procedure is applied.
Induction
Weight the evaluation measure with the fraction of cases with known values. Idea: The attribute provides information only if it is known.
Try to find a surrogate test attribute with similar properties (CART, Breiman et al. 1984).
Assign the case to all branches, weighted in each branch with the relative frequency of the corresponding attribute value (C4.5, Quinlan 1993).
Classication
Use the surrogate test attribute found during induction. Follow all branches of the test attribute, weighted with their relative number of cases, aggregate the class distributions of all leaves reached, and assign the majority class of the aggregated class distribution.
Pruning
Purpose: to simplify the tree (improve interpretability), to avoid overfitting (improve generalization).
Basic ideas: Replace bad branches (subtrees) by leaves. Replace a subtree by its largest branch if it is better.
Common approaches: Reduced error pruning, pessimistic pruning, confidence level pruning, minimum description length pruning.
Reduced error pruning
Classify a set of new example cases with the decision tree. (These cases must not have been used for the induction!) Determine the number of errors for all leaves. The number of errors of a subtree is the sum of the errors of all of its leaves. Determine the number of errors for leaves that replace subtrees. If such a leaf leads to the same or fewer errors than the subtree, replace the subtree by the leaf. If a subtree has been replaced, recompute the number of errors of the subtrees it is part of.
Advantage: Very good pruning, effective avoidance of overfitting.
Disadvantage: Additional example cases needed. Number of cases in a leaf has no influence.
Pessimistic pruning
Classify a set of example cases with the decision tree. (These cases may or may not have been used for the induction.) Determine the number of errors for all leaves and increase this number by a fixed, user-specified amount. The number of errors of a subtree is the sum of the errors of all of its leaves. Determine the number of errors for leaves that replace subtrees (also increased by this amount). If such a leaf leads to the same or fewer errors than the subtree, replace the subtree by the leaf and recompute subtree errors.
Pessimistic pruning
Advantage: Disadvantage:
Confidence level pruning
Like pessimistic pruning, but the number of errors is computed as follows: See classification in a leaf as a Bernoulli experiment (error/no error). Estimate an interval for the error probability based on a user-specified confidence level. (Use the approximation of the binomial distribution by a normal distribution.) Increase the error number to the upper level of the confidence interval times the number of cases assigned to the leaf. Formal problem: Classification is not a random experiment.
Left: 7 instead of 11 nodes, 4 instead of 2 misclassifications. Right: 5 instead of 11 nodes, 6 instead of 2 misclassifications. The right tree is minimal for the three classes.
Predictive tasks:
The decision tree (or more generally, the classier) is constructed in order to apply it to new unclassied data.
Descriptive tasks:
The purpose of the tree construction is to understand how classification has been carried out so far.
Bayes' theorem
  P(H | E) = P(E | H) · P(H) / P(E)
Proof: P(H | E) · P(E) = P(H ∩ E) = P(E | H) · P(H).
Interpretation: The probability that a hypothesis H is true, given that the event E has occurred, can be derived from the probability P(H) of the hypothesis itself, the probability P(E) of the event and the conditional probability P(E | H) of the event given the hypothesis.
Bayes classifiers
Principle of Bayes classifiers: The value of the nominal attribute C should be predicted based on the values of the attributes A_1, ..., A_p, i.e. the attribute vector (a_1, ..., a_p). If c is one of the possible values of attribute C and the other attributes have taken the values a_1, ..., a_p, then Bayes' theorem yields the probability for c given a_1, ..., a_p:
  P(c | a_1, ..., a_p) = P(a_1, ..., a_p | c) · P(c) / P(a_1, ..., a_p)
Bayes classifiers
Compute this probability for all possible values (classes) c of the nominal attribute C and choose the class with the highest probability. (A cost matrix can also be incorporated.) Since the denominator is independent of c, it does not have any influence on the decision for the class. Therefore, usually only the likelihoods
  P(a_1, ..., a_p | c) · P(c)
are considered.
Bayes classifiers
The probability P(a_1, ..., a_p | c) can be estimated easily based on given data:
  P(a_1, ..., a_p | c) ≈ (no. of data from class c with values a_1, ..., a_p) / (no. of data from class c)
Bayes classiers
For nominal attributes , each having three possible values, we would need data objects to have at least one example per combination. Therefore, the computation is carried out under the (nave, unrealistic) asumption that the attributes are independent given the class, i.e.
Bayes classiers
are
Data Mining p.8/48
This Bayes classifier is called naive because of the (conditional) independence assumption for the attributes A_1, ..., A_p.
Although this assumption is unrealistic in most cases, the classifier often yields good results, when not too many attributes are correlated.
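In R, a naive Bayes classifier is provided, e.g., by the package e1071 (a sketch using the iris data from above; e1071 is not part of these slides):
library(e1071)
iris$species <- factor(iris$species)            # make sure the class attribute is a factor
iris.nb <- naiveBayes(species ~ ., data=iris)   # estimate the class-wise distributions
predict(iris.nb, iris[1:5,])                    # predicted classes for the first five objects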
Example
How does a naive Bayes classifier classify an object with given values for Height, Weight and Long hair? We need to calculate the likelihoods
  P(Height | Sex) · P(Weight | Sex) · P(Long hair | Sex) · P(Sex)
for Sex = m and Sex = f.
Example
(Slides: the probabilities P(Height | Sex), P(Weight | Sex), P(Long hair | Sex) and P(Sex) are estimated from the ten objects of the data table, first for Sex = m and then for Sex = f, and the corresponding likelihoods are multiplied for the object to be classified.)
Example
The object was classified by the naive Bayes classifier. The data set does not contain any object with this combination of values. A full Bayes classifier would not be able to classify this object.
More examples
Input: Height = m, Weight = n, Long hair = n. Class: ?
The object is classified by the naive Bayes classifier as m, although the data set contains two such objects, one from class m and one from class f. The main impact comes from the attribute Long hair = n, having probability 1 in class m, but a low probability in class f.
Laplace correction
If a single likelihood is zero, then the overall likelihood is automatically zero, even when the other likelihoods are high. Therefore: Laplace correction:
  P(a | c) ≈ (n(a, c) + λ) / (n(c) + λ·k),
where n(a, c) is the number of objects from class c with attribute value a, n(c) the number of objects from class c, k the number of possible values of the attribute and λ > 0 a small constant.
(Example on the slide: Laplace correction for the attribute Height with the values s, m, t, given Sex.)
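A small R sketch of the Laplace-corrected estimate (the helper function is written for illustration; the example counts correspond to Long hair = y/n among the four male objects of the example data):
laplace.prob <- function(counts, lambda=1) {
  # counts: absolute frequencies of the attribute values within one class
  (counts + lambda) / (sum(counts) + lambda * length(counts))
}
longhair.m <- c(y=0, n=4)
laplace.prob(longhair.m)   # the zero count no longer yields probability 0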
The counting of the frequencies should be carried out once when the naive Bayes classifier is constructed. The probability distributions for the single attributes should be stored in a table. When the naive Bayes classifier is applied to new data, only the corresponding values in the table need to be multiplied.
During learning:
The missing values are simply not counted for the frequencies of the corresponding attribute.
During classification:
Only the probabilities (likelihoods) of those attributes are multiplied for which a value is available.
Numerical attributes
Estimation of probabilities: For numerical attributes, a distribution (typically a normal distribution) is assumed per class; its mean value and variance are estimated from the data of the class, and the density value replaces the probability in the product.
Example
100 data points, 2 classes
Small squares: mean values
Inner ellipses: one standard deviation
Outer ellipses: two standard deviations
Classes overlap: classification is not perfect
Naive Bayes classifier
Example
20 data points, 2 classes
Small squares: mean values
Inner ellipses: one standard deviation
Outer ellipses: two standard deviations
Attributes are not conditionally independent given the class
Conditional independence
  P(A, B | C) = P(A | C) · P(B | C)
(The joint probability is the product of the individual probabilities.) Compare the product rule P(A, B | C) = P(A | B, C) · P(B | C): conditional independence means P(A | B, C) = P(A | C).
(Example: two groups of data points for which the attributes are, respectively are not, conditionally independent given the group.)
150 data points, 3 classes: Iris setosa (red), Iris versicolor (green), Iris virginica (blue)
Shown: 2 out of 4 attributes (sepal length, sepal width, petal length, petal width), namely petal length (horizontal) and petal width (vertical)
6 misclassifications on the training data (with all 4 attributes)
Naive Bayes classifier
Full Bayes classifier
Restricted to metric/numeric attributes (only the class is nominal/symbolic).
Simplifying assumption: Each class can be described by a multivariate normal distribution.
Intuitively: Each class has a bell-shaped probability density.
Naive Bayes classifiers: Covariance matrices are diagonal matrices. Naive Bayes classifiers for numerical data are equivalent to full Bayes classifiers with diagonal covariance matrices.
150 data points, 3 classes: Iris setosa (red), Iris versicolor (green), Iris virginica (blue)
Shown: 2 out of 4 attributes (sepal length, sepal width, petal length, petal width), namely petal length (horizontal) and petal width (vertical)
2 misclassifications on the training data (with all 4 attributes)
Full Bayes classifier
k-nearest neighbour classifier
Use the given data set as the set of example cases.
Define a suitable distance measure on the attribute space.
To classify a new object, compute the distances to the example cases.
Find the k closest example cases and assign to the new object the most frequent class among them.
Distance measures
For a nominal attribute two objects can only have the same value (distance=0) or different values (usually distance=1). For ordinal attributes the distance should increase depending on the distances of the ranks in the corresponding linear order. For numerical attributes the absolute or the squared difference between values is very common.
Distance measures
Unless the importance or weight of each attribute for the distance is known, the distances for the single attributes should yield similar values. For numerical attributes, the distance depends on the measurement unit.
Distance measures
Example:
Measuring the weight in kg and the height in cm leads to distances (differences) approximately in the same range. Measuring the weight in g and the height in m leads to almost neglectable differences for the height and to very large differences for the weight.
Distance measures
Unless attributes are known to have different importance, all attributes should contribute roughly in the same way to the overall distance. Normalisation techniques for numerical attributes:
  z-score: (x − mean) / standard deviation
  robust variant: (x − median) / interquartile_range
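A small R sketch of the two normalisations (assuming the iris data from above):
x  <- iris$slength
z1 <- (x - mean(x)) / sd(x)        # z-score normalisation
z2 <- (x - median(x)) / IQR(x)     # robust variant with median and interquartile range
# scale(iris[,1:4]) normalises all numerical columns at once (z-score)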
Distance measures
Distance normalisation for arbitrary attributes: a given distance measure can be rescaled based on the data, for instance by dividing all distances by the largest distance occurring in the data.
Missing values: Ignore the corresponding attributes for the computation of the distances.
Learning: very fast (the example cases are simply stored).
Classification: slow for large sample databases.
Interpretability: Justification of the classification based on similar known cases.
Tuning: Adapt the distance and/or select a suitable subset of cases from the sample database.
SVM
Training data set: (x_1, y_1), ..., (x_n, y_n), where y_i ∈ {−1, +1} indicates to which of the two classes x_i belongs. Separating hyperplane given in the form w·x − b = 0.
Choose w and b such that the separating hyperplane yields a maximum margin, i.e. such that the distance to the closest points from each of the two classes is as large as possible. For this, introduce the constraints
  w·x_i − b ≥ +1 if y_i = +1,
  w·x_i − b ≤ −1 if y_i = −1.
w·x − b = +1 and w·x − b = −1 are then the margin hyperplanes parallel to the separating hyperplane through the closest points of the two classes.
Distance between the margin hyperplanes: 2 / ||w||.
SVM
Optimization problem: Choose w and b to minimise ||w||² / 2 under the constraints
  y_i · (w·x_i − b) ≥ 1 for all i.
SVM
Dual form of the quadratic programming problem: Maximise
  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)
subject to α_i ≥ 0 and Σ_i α_i y_i = 0.
SVM
α_i > 0 only for those x_i that lie on one of the two margin hyperplanes. The separating hyperplane depends only on these support vectors.
SVM
Relaxation of the restriction to linearly separable classification problems: Allow for misclassified objects between the two margin hyperplanes. Introduce slack variables ξ_i measuring the degree of misclassification of object x_i. Penalise nonzero ξ_i.
SVM
For example, linear penalty function: Minimise
  ||w||² / 2 + C · Σ_i ξ_i
subject to
  y_i · (w·x_i − b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i.
Construct an SVM for each class against all other classes. Assign an object to the class with the highest output (largest distance to the (virtual) separating hyperplane).
Or: Construct an SVM for each pair of classes. Assign an object to the class which has won the most competitions against other classes.
(An R example follows below.)
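In R, support vector machines are available, e.g., in the package e1071; its svm function handles multi-class problems with the pairwise strategy (a sketch using the iris data from above):
library(e1071)
iris$species <- factor(iris$species)                      # class attribute as factor
iris.svm <- svm(species ~ ., data=iris, kernel="radial")  # SVM with RBF kernel
table(predict(iris.svm, iris), iris$species)              # confusion matrix on the training data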
SVM
Often a very good classifier. Strongly dependent on the choice of a suitable kernel. Training has very high computational costs, especially for large data sets and for multiclass problems.
Logistic regression
Y: class attribute, x: p-dimensional random vector. Given: A set of data points, each of which belongs to one of the two classes 0 and 1. Desired: A simple description of the function P(Y = 1 | x). Approach: Describe P(Y = 1 | x) by a logistic function:
  P(Y = 1 | x) = 1 / (1 + exp(−(a_0 + a_1·x_1 + ... + a_p·x_p)))
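In R, logistic regression is fitted with glm and the binomial family (a sketch; the data frame d with the 0/1 class attribute y and the predictors x1, x2 is a hypothetical placeholder):
fit <- glm(y ~ x1 + x2, data=d, family=binomial)   # logistic regression model
summary(fit)                                       # estimated coefficients a_0, a_1, a_2
predict(fit, newdata=d, type="response")           # estimated probabilities P(Y = 1 | x)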
Kernel estimation
Idea: Define an influence function (kernel), which describes how strongly a data point influences the probability estimate for neighbouring points, for instance a Gaussian kernel
  K(x, x_i) = exp(−||x − x_i||² / (2σ²)).
The class probabilities at x are then estimated from the kernel-weighted class frequencies of the data points.
Regression
Supervised learning with a continuous target attribute.
Regression line
Given: A data set (x_1, y_1), ..., (x_n, y_n) for two continuous attributes X and Y. It is assumed that there is an approximate linear dependency between X and Y:
  y ≈ a·x + b.
Find a regression line (i.e. determine the parameters a and b) such that the line fits the data as well as possible. What is a good fit?
Regression
Possible error measures for a data point: its distance to the line in y-direction or its Euclidean distance to the line.
Usually, the mean square error in y-direction is chosen as the error measure (to be minimized). It is equivalent to minimizing the sum of squared errors in y-direction.
Regression
Alternative error measures:
mean absolute distance in y-direction
mean Euclidean distance
maximum absolute distance in y-direction (or equivalently: the maximum squared distance in y-direction)
maximum Euclidean distance
Regression line
Given the data (x_1, y_1), ..., (x_n, y_n), the error function to be minimised is
  E(a, b) = Σ_{i=1}^n (y_i − (a·x_i + b))².
(If at least two different x-values exist,) a and b are uniquely determined by the necessary conditions for a minimum:
  a = (Σ_i x_i·y_i − n·x̄·ȳ) / (Σ_i x_i² − n·x̄²),  b = ȳ − a·x̄.
Probabilistic model: y = a·x + b + ε, where ε is normally distributed with mean 0 and (unknown) variance σ². (σ² is independent of x, i.e. same dispersion of y for all x.)
To simplify the computation of derivatives for finding the maximum, we compute the logarithm of the likelihood function. From this expression it becomes clear that (provided σ² is independent of x, a, and b) maximizing the likelihood function is equivalent to minimizing the sum of squared errors.
Interpreting the method of least squares as a maximum likelihood estimator also works for the generalisations to polynomials and multilinear functions discussed later on.
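In R, the least squares regression line is obtained with lm (a sketch using the iris data from above):
line <- lm(pwidth ~ plength, data=iris)         # least squares fit pwidth = a*plength + b
coef(line)                                      # estimated intercept b and slope a
plot(iris$plength, iris$pwidth); abline(line)   # data and fitted regression line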
Multilinear regression
Minimization of the error function based on partial derivatives w.r.t. the parameters does not work for the other error functions, since
the absolute value and the maximum are not everywhere differentiable, and
the Euclidean distance leads to a system of nonlinear equations for which no analytical solution is known.
Nonlinear regression
For dependencies that are nonlinear in the parameters, taking partial derivatives leads to nonlinear equations. Example: exponential models (radioactive decay, (unlimited) growth of bacteria, ...).
Linear regression
The least squares approach leads to an analytical solution, when the regression function is linear in the coefficients (parameters).
Examples
Linear dependency on different variables (for instance, electricity consumption of a suburban area based on the number of flats with one (x1), two (x2), three (x3) and four or more persons (x4) living in them).
Linear regression
The general linear regression function is
  f(x_1, ..., x_p) = a_0 + a_1·x_1 + ... + a_p·x_p.
Setting the partial derivatives of the sum of squared errors w.r.t. the coefficients to zero yields a linear system of equations for a_0, ..., a_p. Written with the data matrix X (one row per data point, with a leading 1 for the intercept) and the vector y of target values, this is the normal equation:
  XᵀX a = Xᵀy.
Depending on the assumed dependency, a linear, quadratic or cubic approach (polynomial regression) can be chosen; all of these are still linear in the coefficients.
Generalisation
Considering a data set as a collection of examples, describing the dependency between the predictor variables and the dependent variable, the regression function should learn this dependency from the data and generalise it to new data to make correct predictions. To achieve this, the regression function must be universal (flexible) enough to be able to learn the dependency. This does not mean that a more complex regression function with more parameters leads to better results than a simple one.
Overfitting
The regression function learns a description of the data, not of the structure inherent in the data. Prediction can be worse than for a simpler regression function.
Robust regression
Linear model: y = a·x + b plus noise; computed model: the coefficients estimated from the data; objective function: instead of the sum of squared residuals, a function ρ of the residuals e_i is minimised.
M-estimators
The least squares method corresponds to ρ(e) = e². For other choices of ρ, large residuals (outliers) get less influence on the fit.
M-estimators
Define the weight function w(e) = ρ'(e) / e. Computing the derivatives of the objective function and setting them to zero leads to a system of equations whose solution is the same as for the weighted least squares problem with weights w(e_i).
Problem: The weights depend on the residuals and the residuals depend on the fitted model. Therefore the weighted least squares problem is solved iteratively (iteratively reweighted least squares).
Robust regression
if if
if f
M-estimators: Huber and Tukey
(Plots: the ρ-functions and the corresponding weight functions w for Huber's and Tukey's M-estimators, and a plot of the Huber weights of the individual data points against their index.)
> library(MASS)
> reg.huber <- rlm(y ~ x1 + x2 + ..., data=...)
> summary(reg.huber)
Plotting the weights and enabling clicking on interesting points (here: size of the data set = 100):
> plot(reg.huber$w, ylab="Huber Weight")
> identify(1:100, reg.huber$w)
Tukey's approach requires the package lqs. Otherwise, analogous to Huber's approach:
> reg.bisq <- rlm(y ~ x1 + x2 + ..., data=..., method="MM")
If most of the predictor variables are numerical and the few nominal attributes have small domains, a regression function can be constructed for each possible combination of the values of the nominal attributes, given that the data set is sufficiently large and covers all combinations.
Example:
Task: Predict the weight based on the other attributes. Possible solution: Construct four separate regression functions for (F,Yes),(F,No),(M,Yes),(M,No) using only age and height as predictor variables.
For nominal attributes with more than two values, a 0/1-valued (numerical) attribute should be introduced for each possible value of the nominal attribute. Do not encode nominal attributes with more than two values in one numerical attribute, unless the nominal attribute is actually ordinal.
Regression trees
Like decision trees, but target variable is not a class, but a numeric quantity.
Regression trees
More complex regression trees: Predict linear functions in leaves. (red line)
For splitting, the variance/standard deviation of the target in a node is compared to the variance/standard deviation in the branches. The attribute (split) that yields the highest reduction is selected.
A regression tree for the Iris data (petal width) (induced with reduction of sum of squared errors)
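A possible way to induce such a tree in R is the rpart package (the package choice is an assumption; the slides do not name one):

library(rpart)
# regression tree for the petal width of the Iris data
tree <- rpart(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length,
              data = iris, method = "anova")   # splits by reduction of squared error
print(tree)
plot(tree); text(tree)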
Neural networks
McCulloch-Pitts model of a neuron (1943)
A neuron is a binary switch, being either active or inactive. Each neuron has a fixed threshold value. A neuron receives input signals from excitatory (positive) synapses (connections to other neurons) and from inhibitory (negative) synapses (connections to other neurons).
McCulloch-Pitts model
The inputs of a neuron are accumulated (integrated) for a certain time. When the threshold value of the neuron is exceeded, the neuron becomes active and sends signals to its neighbouring neurons via its synapses.
Aim of the McCulloch-Pitts model: neurobiological modelling and simulation to understand very elementary functions of neurons and the brain.
A simplied retina is equipped with receptors (input neurons) that are activated by an optical stimulus. The stimulus is passed on to an output neuron via a weighted connection (synapse). When the threshold of the output neuron is exceeded, the output is 1, otherwise 0.
Aim: Automatic learning of the weights and the threshold to classify objects shown on the retina correctly.
A perceptron for identifying the letter F. Two positive and one negative example.
[Figure: perceptron with an input layer and an output neuron.]
$\text{output} = \begin{cases} 1 & \text{if } \sum_i w_i x_i \ge \theta \\ 0 & \text{otherwise} \end{cases}$
Initialise the weights and the threshold value randomly. For each data object in the training data set, check whether the perceptron predicts the correct class. Each time, the perceptron predicts the wrong class, adjust the weights and the threshold value. Repeat this until no changes occur.
If the desired output is 1 and the perceptron's output is 0, the threshold is not exceeded, although it should be. Therefore, lower the threshold and adjust the weights depending on the sign and magnitude of the inputs. If the desired output is 0 and the perceptron's output is 1, the threshold is exceeded, although it should not be. Therefore, increase the threshold and adjust the weights depending on the sign and magnitude of the inputs.
Delta rule. For an input pattern $x$ with desired output $t$, actual output $o$ and learning rate $\eta$:
$w_i^{\mathrm{new}} = w_i^{\mathrm{old}} + \eta\,(t - o)\, x_i$
$\theta^{\mathrm{new}} = \theta^{\mathrm{old}} - \eta\,(t - o)$
($\eta$: learning rate)
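A minimal R sketch (not from the slides) of the delta rule for the AND example that follows:

x <- matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE)  # input patterns
t <- c(0, 0, 0, 1)                                          # desired outputs (AND)
w <- runif(2); theta <- runif(1); eta <- 1                  # random initialisation
repeat {
  changed <- FALSE
  for (i in 1:nrow(x)) {
    o <- as.numeric(sum(w * x[i, ]) >= theta)               # perceptron output
    if (o != t[i]) {
      w     <- w + eta * (t[i] - o) * x[i, ]                # adjust weights
      theta <- theta - eta * (t[i] - o)                     # adjust threshold
      changed <- TRUE
    }
  }
  if (!changed) break                                       # stop after an error-free epoch
}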
Example
Learning of the logical operator AND. Training data:
x1  x2  target
 0   0    0
 0   1    0
 1   0    0
 1   1    1
[Tables: trace of the weights, the threshold and the perceptron outputs over the training epochs 1 to 6, until no more changes occur.]
If, for a given data set with two classes, there exists a perceptron that can classify all patterns correctly, then the delta rule will adjust the weights and the threshold after a finite number of steps in such a way that all patterns are classified correctly.
Linear separability
Consider a perceptron with two input neurons, weights $w_1, w_2$ and threshold $\theta$. The perceptron's output for input $(x_1, x_2)$ is 1 if and only if $w_1 x_1 + w_2 x_2 \ge \theta$, i.e. if and only if the input pattern lies above the line
$x_2 = -\frac{w_1}{w_2}\, x_1 + \frac{\theta}{w_2}$.
Linear separability
[Figure: the separating line with class 1 on one side and class 0 on the other.]
The parameters $w_1, w_2, \theta$ determine a line. All input patterns above this line are assigned to class 1, the input patterns below the line to class 0.
Linear separability
More generally: A perceptron with $n$ input neurons can classify all examples from a data set with $n$ input variables and two classes correctly if there exists a hyperplane separating the two classes. Such classification problems are called linearly separable.
Linear separability
Example:
The exclusive OR (XOR) defines a classification task which is not linearly separable.
[Figure: a perceptron with a hidden layer solving the XOR problem, with the weights and thresholds of the hidden and output neurons.]
A perceptron with a hidden layer of neurons: The hidden layer carries out a transformation. The output neuron can solve the now linearly separable problem in the transformed space.
Learning algorithm?
Problem: How can the weights (and thresholds) be adjusted for the neurons in the hidden layer?
Solution: Multilayer perceptrons with gradient descent. This does not work with a binary (non-differentiable) threshold function as activation function for the neurons; it must be replaced by a differentiable function.
Logistic activation function with bias $\theta$ and steepness $c$:
$f(\mathrm{net}) = \frac{1}{1 + e^{-c\,(\mathrm{net} - \theta)}}$
[Figure: the logistic functions 1/(1+exp(-0.5*x)) and 1/(1+exp(-2*(x-3))), illustrating different steepness and bias.]
Multilayer perceptrons
A multilayer perceptron is a neural network with an input layer, one or more hidden layers and an output layer. Connections exist only between neurons from one layer to the next layer. Activation functions for neurons are usually sigmoidal functions.
Multilayer perceptrons
[Figure: multilayer perceptron with input layer, two hidden layers and output layer.]
Error function
(mean) squared error:
$E = \sum_{p} E_p, \qquad E_p = \sum_{v \in \mathrm{out}} \bigl(t_v^{(p)} - o_v^{(p)}\bigr)^2$
out: the set of output neurons; $t_v^{(p)}$: target output of output neuron $v$ for input pattern $p$; $o_v^{(p)}$: output (activation) of output neuron $v$ for input pattern $p$.
Learning rule
Activation of a neuron $v$:
$o_v = f(\mathrm{net}_v), \qquad \mathrm{net}_v = \sum_{u} w_{uv}\, o_u$
Consider the input/output pattern $p$. Adjust the weights based on a gradient descent technique, i.e. proportional to the gradient of the error function ($\eta$: learning rate):
$\Delta w_{uv} = -\eta\, \frac{\partial E_p}{\partial w_{uv}}$
Learning rule
Applying the chain rule:
$\frac{\partial E_p}{\partial w_{uv}} = \frac{\partial E_p}{\partial \mathrm{net}_v} \cdot \frac{\partial \mathrm{net}_v}{\partial w_{uv}} = \frac{\partial E_p}{\partial \mathrm{net}_v} \cdot o_u$
Therefore, with $\delta_v = -\frac{\partial E_p}{\partial \mathrm{net}_v}$, i.e.
$\Delta w_{uv} = \eta\, \delta_v\, o_u$
Learning rule
To compute $\delta_v$: if $v$ is an output neuron,
$\delta_v = (t_v - o_v)\, f'(\mathrm{net}_v)$
Backpropagation
If $v$ is a hidden neuron, applying the chain rule over its successor neurons leads to
$\delta_v = f'(\mathrm{net}_v) \sum_{s \in \mathrm{succ}(v)} \delta_s\, w_{vs}$
Result: Recursive equation for updating the weights: update the weights to the neurons in the output layer first and then go back layer by layer and update the corresponding weights.
Backpropagation
where, for the logistic activation function, $f'(\mathrm{net}_v) = o_v (1 - o_v)$, so that
$\delta_v = \begin{cases} o_v (1 - o_v)(t_v - o_v) & \text{if } v \text{ is an output neuron} \\ o_v (1 - o_v) \sum_{s \in \mathrm{succ}(v)} \delta_s\, w_{vs} & \text{if } v \text{ is a hidden neuron} \end{cases}$
This learning rule is called backpropagation or the generalised delta rule. Backpropagation can also be applied when there are connections between layers that are not neighbouring, as long as the neural network represents a directed acyclic graph. Such networks are also called feedforward networks.
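A multilayer perceptron with one hidden layer can be trained in R, for instance with the nnet package (an assumption; nnet uses a quasi-Newton optimiser rather than plain backpropagation):

library(nnet)
set.seed(1)
mlp <- nnet(Species ~ ., data = iris, size = 5, maxit = 200)   # 5 hidden neurons
table(predicted = predict(mlp, iris, type = "class"),
      true = iris$Species)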
The bias values can be learned in the same manner as the weights: a bias can be treated as the weight of a connection from an additional neuron with constant activation 1.
Backpropagation
Backpropagation as a gradient descent technique can only find a local minimum. Training the network with different random initialisations will usually lead to different results. The learning rate defines the step width of the gradient descent technique. A very large learning rate leads to skipping minima or to oscillation. A very small learning rate leads to starving, i.e. slow convergence, or even to convergence before the (local) minimum is reached.
Backpropagation
Introduce a momentum term: For the weight change, the previous weight change is taken into account:
$\Delta w_{uv}(t) = \eta\, \delta_v\, o_u + \beta\, \Delta w_{uv}(t-1)$, where $\Delta w_{uv}(t-1)$ is the weight change in the previous step of the gradient descent algorithm. If a weight is changed continuously in the same direction, the weight change increases, otherwise it decreases. Typical choices for $\eta$ and $\beta$: ...
Backpropagation
Usually, weights are not updated after the presentation of each input pattern, but after a whole epoch, i.e. after all patterns have been presented once. There is no general rule for how to choose the number of hidden layers and the size of the hidden layers. Small neural networks might not be flexible enough to fit the data; large neural networks tend to overfitting. The steepness of the activation function is usually fixed and not adjusted. The multilayer perceptron learns only in those regions where the activation function is not close to zero or one, since otherwise the derivative is almost zero.
Weight decay is sometimes included, pushing all weights in the direction of zero; only those weights survive that are really needed. Another acceleration technique: approximate the error function locally by a quadratic curve and find in each step the minimum of the quadratic curve.
Nonlinear PCA
Dimension reduction with multilayer perceptrons:
Input and output are identical, i.e. the neural network should learn the identity function. (Autoassociative network) Introduce a hidden layer with only two neurons (representing the two dimensions for the graphical representation of the dimension reduction), the bottleneck. Train the neural network with the data. After training, input the data into the network and use the outputs of the bottleneck neurons for the graphical representation.
Radial basis function (RBF) networks have only one hidden layer, usually with Gaussian (unimodal) activation functions.
Similar idea as for SVMs: those points that are approximated well have little influence on the regression function. Mathematical approximation method for multivariate functions.
Classification as regression
A two-class classification problem (with classes 0 and 1) can be viewed as a regression problem. The regression function will usually not yield exact outputs 0 and 1, but the classification decision can be made by considering 0.5 as a cut-off value. Problem: the objective function aims at minimizing the function approximation error (for example, the mean squared error), but not the number of misclassifications.
Classification as regression
Regression function 1 yields 0.1 for all data from class 0 and 0.9 for all data from class 1. Regression function 2 always yields the exact and correct values 0 and 1, except for 9 data objects, where it yields 1 instead of 0 and vice versa.
Classification as regression
From the viewpoint of regression (squared error, for a sufficiently large data set), function 2 is better than function 1; from the viewpoint of misclassifications, function 1 should be preferred, since it misclassifies no objects while function 2 misclassifies 9.
Classification as regression
For multiclass problems, do not enumerate the classes and learn a single regression function: this leads to interpolation errors. For example, a data object between class 1 and class 3 might be classified as class 2 by the regression function. Instead, train a classifier (regression function) for each class against all other classes.
Model validation
Does the model fit the data at all? What is the most appropriate model? Are the findings from the model significant at all? How will the model perform on new data?
Model fitting
Fitting a linear model to a sample
[Figure: scatter plot of petal length (plength) against petal width (pwidth) with the fitted regression line.]
Model fitting
Fitting a linear model to a sample with (almost) no correlation
[Figure: scatter plot of sepal length (slength) against sepal width (swidth) with the fitted regression line.]
Model fitting
Fitting a linear model to a sample with nonlinear dependency
[Figure: scatter plot of y against x showing a clearly nonlinear dependency, with the fitted regression line.]
Model validation
Fitting a model to the data is (almost) always possible. (A regression line can always be found for two-dimensional data.) This does not mean that the model that fits best to the data fits the data at all. (The regression line does not necessarily reflect any meaningful relation between the variables.) Therefore, model validation is needed.
A more complex model will usually fit the data better than a simple model. Are complex models always better? No, they tend to overfitting.
Assume that there is an unknown noisy functional dependency between attributes $x$ and $y$.
MDL
Basic idea of the minimum description length principle: the data should be sent over a channel or stored in compressed form in a file, using as few bits as possible. Similar problem as for data compression: find a coding which is as compact as possible.
MDL
The compressed le consists of the compressed data and a rule how to decompress the data (the model). The more complex the rule for decompression, the more the data can be compressed. But a complex compression scheme will need larger memory itself.
MDL
Goal: Find a compression for which compressed data + description of the (de-)compression rule is minimal. For the regression example:
MDL
Store (or transmit) the data with a precision of one decimal place. For a model of the form
$y = f(x)$
with fitted parameters, the parameter values and the errors $e_i = y_i - f(x_i)$ (instead of the values $y_i$) must be transmitted.
MDL
For example, for a linear model $y = a_0 + a_1 x$, instead of the values $y_i$ one transmits the coefficients $a_0$ and $a_1$ and the errors $e_i$, which require fewer digits if the model fits well.
MDL
Minimum description length principle:
Choose the model for which the length of the model description plus the length of the description of the data (the model error) under the considered model becomes minimal.
MDL
Similar for decision trees:
Complex trees need a longer description, but fewer classification errors need to be transmitted for complex trees.
The error on the training data set is called resubstitution error. The data should be split into training and test data. Only the data from the training set will be used to construct the model. The estimation of the error is based on the test data.
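A sketch of a random training/test split in R; the data frame d and its class attribute are assumptions:

n <- nrow(d)
train.idx <- sample(n, size = round(2 * n / 3))   # 2/3 for training
train <- d[train.idx, ]
test  <- d[-train.idx, ]
# construct the model on train only, estimate the error on test, e.g.:
# m <- rpart::rpart(class ~ ., data = train)
# mean(predict(m, test, type = "class") != test$class)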
Validation data
Sometimes the data set is partitioned into three sets: training data, validation data and test data.
Example: Construction of a classifier. The data set is partitioned into training, validation and test data. Based on the training data, a decision tree, a naïve Bayes classifier and a nearest neighbour classifier are constructed. Choose the classifier which is best on the validation data. Estimate the prediction error of the chosen classifier based on the test data set.
Validation data
Example: A neural network (multilayer perceptron) should be trained for a regression or classification problem. The data set is partitioned into training, validation and test data. The learning algorithm (for instance, backpropagation) is carried out with the training data. The validation data is used to compute the error during learning without influencing the training.
Validation data
Training of the neural network is stopped, when the error on the validation data has dropped to a minimum and is increasing again.
[Figure: error on the training and validation data plotted against the learning epochs.]
The prediction error of the neural network is calculated based on the test data.
Crossvalidation
In order to obtain a more robust estimate of the prediction error, the data set is split into $k$ disjoint subsets of approximately the same size and the model is trained $k$ times. Each time, one of the subsets is left out for testing, while the others are used for training. In this way, a prediction error can be computed $k$ times. Usually the mean of the $k$ values is taken as the prediction error. This is called $k$-fold crossvalidation.
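A sketch of k-fold crossvalidation in R (the data frame d, its class attribute and the decision tree model are assumptions):

k <- 10
folds <- sample(rep(1:k, length.out = nrow(d)))   # random assignment to k folds
errs <- numeric(k)
for (i in 1:k) {
  train <- d[folds != i, ]
  test  <- d[folds == i, ]
  m <- rpart::rpart(class ~ ., data = train)
  errs[i] <- mean(predict(m, test, type = "class") != test$class)
}
mean(errs)   # estimated prediction error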
Leave-One-Out
For small data sets with $n$ data objects, the Leave-One-Out or jackknife method is applied, meaning $n$-fold crossvalidation. Only one data object is left out for testing each time.
Bootstrapping
For the bootstrap method, the training data set is drawn as a sample with replacement from the whole data set.
0.632 bootstrap
The 0.632 bootstrap draws a sample of $n$ data objects for training from a data set of size $n$.
Since sampling is carried out with replacement, some data objects appear in the training set with multiple copies, while others might not be in the training set at all.
Bootstrapping
The probability that a data object is not chosen in a single draw is $1 - \frac{1}{n}$. The probability that a data object is not selected at all for the training data is
$\left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368$ for large $n$.
This means that the training set consists on average of 63.2% of the original data objects.
Bootstrapping
Testing for bootstrapping is carried out with both the training and the test data. The error is a weighted sum of the resubstitution error and the error on the test data:
estimated overall error $= 0.632 \cdot$ (error on the test data set) $+ 0.368 \cdot$ (error on the training data set)
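One bootstrap repetition of this estimate might look as follows in R (the data frame d and the model are assumptions; in practice the estimate is averaged over many repetitions):

n <- nrow(d)
idx <- sample(n, n, replace = TRUE)               # bootstrap sample of size n
train <- d[idx, ]
test  <- d[-unique(idx), ]                        # objects never drawn
m <- rpart::rpart(class ~ ., data = train)
err.train <- mean(predict(m, train, type = "class") != train$class)
err.test  <- mean(predict(m, test,  type = "class") != test$class)
0.632 * err.test + 0.368 * err.train              # weighted overall error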
Bayesian networks
Basic idea of Bayesian networks: Representation of a probability distribution over a high dimensional space. Reasoning: How does the change of a single or some marginal distributions change the other marginal distributions?
Customer questionnaires:
Each question has a finite number of answers and represents a nominal attribute. Questions are of the form: Are you satisfied with ...?
The set of questionnaires filled in by customers induces a probability distribution on the product space of all attributes (questions), with dependencies between the questions. What do these dependencies look like? How can this be used to improve customer satisfaction?
Are you satisfied with our service technicians? Are you satisfied with our hotline?
Did the service technician show up on time? Was the service technician able to solve your problem immediately? How long did you have to wait at the hotline? Could the staff at the hotline answer your questions?
Knowing the influence of such questions on the target questions, suitable actions can be taken to improve the answers to the target questions, like
employ more technicians/staff at the hotline; provide better training for the technicians/staff at the hotline
Storing the full joint distribution over all attributes explicitly would require a table of enormous size (Gigabytes). Therefore: decomposition.
Decomposition
[Figures and tables lost in extraction: a joint distribution over several attributes (numbered 1 to 6) is decomposed into smaller distributions over subsets of the attributes; the table sizes of the decomposed representation are compared with the size of the full joint table.]
(in general) Iterative application of the chain rule and of conditional independences allows the joint distribution to be written as a product of smaller conditional distributions.
Conditional independence
Attributes $A$ and $B$ are conditionally independent given $C$ if
$P(A, B \mid C) = P(A \mid C)\, P(B \mid C)$
holds.
Conditional independence
Example: [figures lost in extraction: plots over the attributes Age, Work experience and Salary, shown separately for industry and public service, illustrating conditional (in)dependence.]
[Figure: example Bayesian network with nodes A, B, C, D, E.]
Moral graph
To obtain the moral graph of a Bayesian network, the parents of each node are connected ("married") and the edge directions are dropped.
[Figure: the example network over the attributes S, B, L, A, T, E, X, D with its cliques, e.g. {X, E} and {D, B, E}.]
Hypergraph representation
Estimation of the (conditional) distributions: use the corresponding relative frequencies. Learning the structure: high computational costs.
Structure learning
Strategy 1: Define a measure for the quality of the approximation of the data by the network (estimated by the relative frequencies of the data for a given structure), for instance the likelihood of the data.
Structure learning
Evaluating all possible structures leads to combinatorial explosion: the number of possible network structures increases exponentially with the number of attributes. Therefore: use greedy strategies.
Start with a network with no edges (all attributes are assumed to be independent) and add edges step by step; add the edge that increases the quality measure (maximum likelihood, K2, ...) most. Or: start with a fully connected network and remove the edge that leads to the smallest decrease of the quality measure.
Structure learning
Strategy 2: Apply (conditional) independence tests (for instance the χ² test) in order to decide which edges should be included in the Bayesian network.
Strategy 3: Compute the strength of the dependencies of pairs of attributes. Attributes with high dependency are usually neighbouring nodes in a Bayesian network. Apply heuristic search strategies (like genetic algorithms or tabu search).
In all cases: find a compromise between a simple structure with a larger error and a complex structure with overfitting (use criteria like AIC, BIC or MDL).
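Greedy score-based structure learning is available, for instance, in the R package bnlearn (an assumption; the slides do not name a tool):

library(bnlearn)
data(asia)                   # small discrete example data set shipped with bnlearn
net <- hc(asia)              # greedy hill-climbing search over network structures
fitted <- bn.fit(net, asia)  # estimate the conditional probability tables (relative frequencies)
fitted$D                     # conditional distribution of one attribute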
Propagation
[Figure: clique representation of the example network used for propagation.]
In a Bayesian network, arbitrary attributes can be instantiated (with a single value or a probability distribution). The computation of the (marginal) distributions of the other attributes is carried out by a message exchange (propagation) algorithm.
Cluster analysis
Aim: Find groups (clusters) of similar objects in the data set. Objects within the same cluster should be similar. Objects from different clusters should be dissimilar.
Customer segmentation: find groups of customers with similar properties. Gene clustering: find groups of genes with similar properties (for instance, expression profiles). Clustering of solar stars: find groups of stars based on attributes like size, characteristics of the spectrum, ... Identification of social or economic groups based on attributes like income, age, education level, ...
Unsupervised classification
Cluster analysis is an unsupervised classification technique. In contrast to supervised classification, the classes (clusters) are not known in advance in the data set.
Distance measures
Cluster analysis requires a distance or (dis)similarity measure for the grouping of the data. The choice of a suitable distance measure has a strong influence on the cluster structure. For continuous attributes, a normalisation technique should be applied in order to balance the influence of the single attributes on the overall distance. (See also nearest neighbour classifiers.)
Influence of scaling
[Figure: the same data displayed with an attribute measured in milli units and in kilo units; the apparent cluster structure depends on the scaling.]
Discrete attributes
When discrete attributes are involved in the distance measure, clusters tend to be split based on the discrete attributes, since they automatically lead to well-separated clusters. Therefore, most clustering algorithms focus on continuous attributes.
Start with every data point in its own cluster. (i.e., start with so-called singletons: single element clusters) In each step merge those two clusters that are closest to each other. Keep on merging clusters until all data points are contained in one cluster. The result is a hierarchy of clusters that can be visualized in a tree structure, a so-called dendrogram.
The distance between singletons is simply the distance between the (single) data points contained in them. However: How do we compute the distance between clusters that contain more than one data point?
Centroid method (red): distance between the centroids (mean value vectors) of the two clusters.
Single linkage (green): distance between the two closest points of the two clusters.
Complete linkage (blue): distance between the two farthest points of the two clusters.
Average linkage: average distance between all pairs of points from the two clusters.
Single linkage can follow chains in the data (may be desirable in certain applications). Complete linkage leads to very compact clusters. Average linkage also tends clearly towards compact clusters.
[Figures: example clusterings obtained with single linkage and with complete linkage.]
Dendrograms
The cluster merging process arranges the data points in a binary tree. Draw the data tuples at the bottom or on the left (equally spaced if they are multi-dimensional). Draw a connection between clusters that are merged, with the height of the connection representing the distance between the clusters.
Dendrograms
[Figure: schematic dendrogram over the data tuples.]
Agglomerative clustering
Example: Clustering of the 1-dimensional data set . All three approaches to measure the distance between clusters lead to different dendrograms.
Agglomerative clustering
[Figures: dendrograms obtained with the centroid method, single linkage and complete linkage.]
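In R, hierarchical agglomerative clustering is available through hclust(), e.g. for the numeric Iris attributes:

d <- dist(iris[, 1:4])                        # pairwise distances
hc.single   <- hclust(d, method = "single")
hc.complete <- hclust(d, method = "complete")
hc.average  <- hclust(d, method = "average")
plot(hc.complete)                             # dendrogram
cutree(hc.complete, k = 3)                    # cut the dendrogram into 3 clusters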
Heatmaps
One axis: attributes. Other axis: data objects. Colours (colour intensities): attribute values.
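A heatmap of this kind can be produced, for example, with R's heatmap() function:

# rows: data objects, columns: attributes, colour intensity: (scaled) attribute value
heatmap(as.matrix(iris[, 1:4]), scale = "column")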
Implementation aspects
Hierarchical agglomerative clustering can be implemented by processing the matrix containing the pairwise distances of the data points. (The data points themselves are actually not needed.) In each step the rows and columns corresponding to the two clusters that are closest to each other are deleted. A new row and column corresponding to the cluster formed by merging these clusters is added to the matrix.
Implementation aspects
The distance matrix update can be written in the general recursive form
$d(k, i \cup j) = \alpha_i\, d(k,i) + \alpha_j\, d(k,j) + \beta\, d(i,j) + \gamma\, |d(k,i) - d(k,j)|$
where $i, j$ are the indices of the two clusters that are merged, $k$ is the index of an old cluster that is not merged, $i \cup j$ is the new cluster (result of the merger), and $\alpha_i, \alpha_j, \beta, \gamma$ are parameters specifying the method (single linkage etc.).
Implementation aspects
The parameters defining the different methods are ($n_i, n_j, n_k$ are the numbers of data points in the clusters):
method            | alpha_i                 | alpha_j                 | beta                     | gamma
centroid method   | n_i/(n_i+n_j)           | n_j/(n_i+n_j)           | -n_i*n_j/(n_i+n_j)^2     | 0
median method     | 1/2                     | 1/2                     | -1/4                     | 0
single linkage    | 1/2                     | 1/2                     | 0                        | -1/2
complete linkage  | 1/2                     | 1/2                     | 0                        | 1/2
average linkage   | n_i/(n_i+n_j)           | n_j/(n_i+n_j)           | 0                        | 0
Ward's method     | (n_i+n_k)/(n_i+n_j+n_k) | (n_j+n_k)/(n_i+n_j+n_k) | -n_k/(n_i+n_j+n_k)       | 0
Distance threshold: specify a maximum distance between clusters. Stop merging clusters if the closest two clusters are farther apart than this distance.
Visual approach: draw the dendrogram and find a good cut level below which the data objects are combined into one cluster. Advantage: the cut need not be strictly horizontal.
Another approach: monitor the distances during the merging process. Try to find a step in which the distance between the two clusters merged is considerably larger than the distance of the previous step. Several heuristic criteria exist for this step selection.
The computational complexity of hierarchical agglomerative clustering is at least quadratic in the number of data, which is not acceptable for larger data sets.
k-Means clustering
Choose a number k of clusters to be found (user input). Initialize the cluster centres randomly (for instance, by randomly selecting k data points). Assign each data point to the cluster centre that is closest to it (i.e. closer than any other cluster centre).
Compute new cluster centres as the mean vectors of the assigned data points. (Intuitively: centre of gravity if each data point has unit weight.)
k-Means clustering
Repeat these two steps (data point assignment and cluster centre update) until the cluster centres do not change anymore. It can be shown that this scheme must converge, i.e., the update of the cluster centres cannot go on forever.
k-Means clustering
Objective function to be minimized:
$E = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}\, \lVert x_j - c_i \rVert^2, \qquad u_{ij} \in \{0,1\}, \quad \sum_{i=1}^{k} u_{ij} = 1 \text{ for all } j$
($u_{ij}$ indicates whether data object $x_j$ is assigned to cluster $i$, $c_i$ is the centre of cluster $i$.)
Alternating optimization
Assuming the cluster centres to be fixed, $u_{ij} = 1$ should be chosen for the cluster to which data object $x_j$ has the smallest distance, in order to minimize the objective function. Assuming the assignments to the clusters to be fixed, each cluster centre should be chosen as the mean vector of the data objects assigned to the cluster, in order to minimize the objective function.
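In R, this alternating optimization is implemented in kmeans(); a sketch on the numeric Iris attributes:

km <- kmeans(iris[, 1:4], centers = 3, nstart = 10)  # nstart: several random initialisations
km$centers                                            # final cluster centres
table(km$cluster, iris$Species)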
Data set to be clustered. Choose the number of clusters (from visual inspection; it can be difficult to determine in general).
Dots represent cluster centres (quantization vectors). Left: Delaunay Triangulation (The circle through the corners of a triangle does not contain another point.) Right: Voronoi Diagram (Midperpendiculars of the Delaunay triangulation: boundaries of the regions of points that are closest to the enclosed cluster centre (Voronoi cells)).
midperpendiculars of the triangles edges (shown in blue on the left, in grey on the right)
Clustering is successful in this example: The clusters found are those that would have been formed intuitively. Convergence is achieved after only 5 steps. (This is typical: convergence is usually very fast.) However: The clustering result is fairly sensitive to the initial positions of the cluster centers. With a bad initialisation clustering may fail (the alternating update process gets stuck in a local minimum).
Learning vector quantisation (LVQ): like online k-means clustering (update after each data point). For each training pattern find the closest reference vector. Adapt only this reference vector (winner neuron). For classified data the class may be taken into account (reference vectors are assigned to classes).
Attraction rule (reference vector $r$ and data object $x$ belong to the same class):
$r^{\mathrm{new}} = r^{\mathrm{old}} + \eta\,(x - r^{\mathrm{old}})$
Repulsion rule (different classes):
$r^{\mathrm{new}} = r^{\mathrm{old}} - \eta\,(x - r^{\mathrm{old}})$
($\eta$: learning rate)
LVQ: Example
Left: online training with learning rate ... Right: batch training with learning rate ...
[Figures lost in extraction.]
Self-organizing maps
Self-organizing maps are LVQ models where a topological structure is assumed on the reference vectors (for instance, a grid in the plane). Not only the closest reference vector is updated in each step, but also the reference vectors in its neighbourhood; the neighbourhood (and the learning rate) become smaller over time.
Unfolding SOM
Fuzzy clustering
Objective function:
$E = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m\, d_{ij}^2, \qquad \sum_{i=1}^{c} u_{ij} = 1 \text{ for all } j$
Parameters: $u_{ij} \in [0,1]$ is the membership degree of data object $x_j$ to the $i$-th cluster; $d_{ij}$ is some distance measure specifying the distance between data object $x_j$ and cluster $i$; $m > 1$ controls how much clusters may overlap.
Parameters to be optimized: the membership degrees $u_{ij}$ and the cluster parameters (not given explicitly here, but hidden in the distances $d_{ij}$).
Special cluster forms: ellipsoidal clusters (Gustafson/Kessel 1979), clusters as lines/planes/hyperplanes (Bock 1979, Bezdek 1981), clusters as shells of circles (Davé 1990, Krishnapuram/Nasraoui/Frigui 1992), clusters in the form of arbitrary quadrics (Krishnapuram/Frigui/Nasraoui 1991-1995), adaptable cluster volumes (Keller/Klawonn 1999).
Example
[Figures: probability density functions of mixtures of two normal distributions.]
Mixture model (one normal distribution contributes 10%, the other 90%)
Data was generated by sampling a set of normal distributions. (The probability density is a mixture of Gaussian distributions.)
Formally:
$f(x) = \sum_{i=1}^{c} p_i\, f(x \mid i)$
where $p_i$ is the probability that a data point belongs to (is generated by) the $i$-th component of the mixture and $f(x \mid i)$ is the conditional probability density function of a data point given the cluster (specified by the cluster index $i$).
Expectation maximization
Basic idea: maximize the likelihood of the data. Problem: the likelihood
$L(X; \theta) = \prod_{j=1}^{n} \sum_{i=1}^{c} p_i\, f(x_j \mid i; \theta)$
is difficult to optimize, even if one takes the natural logarithm (cf. the maximum likelihood estimation of the parameters of a normal distribution), because the logarithm cannot be drawn into the inner sum over the mixture components.
Expectation maximization
Approach: Assume that there are hidden variables $y_j$ stating the clusters that generated the data points $x_j$, so that the sums reduce to one term. Problem: since the $y_j$ are hidden, we do not know their values.
Expectation maximization
Formally: maximize the likelihood of the completed data set $(X, Y)$, where $Y$ combines the values of the hidden variables $y_j$, that is,
$L(X, Y; \theta) = \prod_{j=1}^{n} p_{y_j}\, f(x_j \mid y_j; \theta)$.
Problem: the values of the $y_j$ are unknown.
Expectation maximization
Approach to find a solution nevertheless: see the $y_j$ as random variables (their values are not fixed) and consider a probability distribution over the possible values. As a consequence, $L(X, Y; \theta)$ becomes a random variable, even for a fixed data set $X$ and fixed cluster parameters $\theta$. Try to maximize the expected value of $L(X, Y; \theta)$ or $\ln L(X, Y; \theta)$.
Expectation maximization
Formally: maximize
$E\bigl[L(X, Y; \theta) \mid X; \theta\bigr]$ or $E\bigl[\ln L(X, Y; \theta) \mid X; \theta\bigr]$.
Expectation maximization
Unfortunately, these functionals are still difficult to optimize directly. Solution: use the equation as an iterative scheme, fixing the current parameter estimates in some terms. (Make sure that the iteration scheme converges at least to a local maximum.)
Expectation maximization
Iterative scheme for expectation maximization: start with an initial (e.g. random) choice of cluster parameters and alternate the expectation step and the maximization step.
It can be shown that each EM iteration increases the likelihood of the data and that the algorithm converges to a local maximum of the likelihood function (i.e., EM is a safe way to maximize the likelihood function).
Expectation maximization
Justification of the last step on the previous slide:
Expectation maximization
The probabilities $p_{ij} = P(y_j = i \mid x_j; \theta)$ are computed as
$p_{ij} = \frac{p_i\, f(x_j \mid i; \theta)}{\sum_{l=1}^{c} p_l\, f(x_j \mid l; \theta)},$
that is, as the relative probability densities of the different clusters (as specified by the cluster parameters) at the location of the data points $x_j$. The $p_{ij}$ are the posterior probabilities of the clusters given the data point $x_j$ and a set of cluster parameters $\theta$.
Expectation maximization
They can be seen as case weights of a completed data set: split each data point $x_j$ into $c$ data points $(x_j, i)$, $i = 1, \dots, c$. Distribute the unit weight of the data point according to the above probabilities, i.e., assign to $(x_j, i)$ the weight $p_{ij}$.
Expectation Step
For all data points $x_j$: compute for each normal distribution $i$ the probability $p_{ij}$ that the data point was generated from it (ratio of probability densities at the location of the data point); this is the weight of the data point for the estimation.
Maximization Step
For all normal distributions: Estimate the parameters by standard maximum likelihood estimation using the probabilities (weights) assigned to the data points w.r.t. the distribution in the expectation step.
Use the weights $p_{ij}$ to compute the new prior probability, the mean vector and the covariance matrix of each cluster (as weighted relative frequency, weighted mean vector and weighted covariance matrix).
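EM for Gaussian mixtures is implemented, for instance, in the R package mclust (an assumption; the slides do not name a tool); a sketch:

library(mclust)
em <- Mclust(iris[, 1:4], G = 3)   # mixture of 3 Gaussians fitted by EM
summary(em)
head(em$z)                         # posterior probabilities p_ij per data point
em$parameters$mean                 # estimated mean vectors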
If a fully general mixture of Gaussian distributions is used, the likelihood function is truly optimized if all normal distributions except one are contracted to single data points and the remaining normal distribution is the maximum likelihood estimate for the remaining data points. This undesired result is rare, because the algorithm gets stuck in a local optimum.
Nevertheless it is recommended to take countermeasures, which consist mainly in reducing the degrees of freedom, like Fix the determinants of the covariance matrices to equal values. Use a diagonal instead of a general covariance matrix. Use an isotropic variance instead of a covariance matrix. Fix the prior probabilities of the clusters to equal values.
Density-based clustering: DBSCAN, DENCLUE, ... Grid-based clustering (division of the space into a finite number of cells): STING, WaveCluster, ... Clustering with nominal attributes: ROCK, ...
Define a measure that evaluates clustering results. Cluster the data with different numbers of clusters. Choose the result (number of clusters) with the best validity measure.
Separation index
To be maximized:
$D = \frac{\min_{i \ne j} d(C_i, C_j)}{\max_k \operatorname{diam}(C_k)}$
where $\operatorname{diam}(C_k)$ is the diameter of cluster $C_k$ and $d(C_i, C_j)$ is the distance between clusters $C_i$ and $C_j$.
Partition coefficient
$PC = \frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^2$
(for fuzzy clustering, to be maximized)
The largest value 1 is assumed when the partition is not fuzzy at all, i.e. $u_{ij} \in \{0,1\}$. The smallest value $1/c$ is assumed when all data are assigned with the same membership degree to all clusters.
Partition entropy
Crossvalidation
The data set is split into $k$ subsets of approximately the same size.
Apply clustering $k$ times, each time leaving out one of the subsets. In the ideal case, each pair of data objects should either always be assigned to the same cluster or always to different clusters (coherence). This provides a measure of how stable (how acceptable) a clustering result is. Apply this scheme with different numbers of clusters and choose the one where the coherence is best.
Association rule induction: originally designed for market basket analysis. Aims at finding patterns in the shopping behaviour of customers of supermarkets, mail-order companies, on-line shops etc. More specifically:
Find sets of products that are frequently bought together.
Example of an association rule: If a customer buys bread and wine, then she/he will probably also buy cheese.
Possible applications of found association rules: Improve arrangement of products on shelves or on a catalogue's pages. Support of cross-selling (suggestion of other products), product bundling. Fraud detection, technical dependence analysis. Finding business rules and detection of data quality problems.
Association rules
Assessing the quality of association rules: Support of an item set: fraction of transactions (shopping baskets/carts) that contain the item set. Support of an association rule $A \to B$: either the support of $A \cup B$ (more common: measures how often the rule is correct) or the support of $A$ (more plausible: measures how often the rule is applicable). Confidence of an association rule $A \to B$: support of $A \cup B$ divided by support of $A$ (an estimate of $P(B \mid A)$).
Association rules
Two step implementation of the search for association rules: Find the frequent item sets (also called large item sets), i.e., the item sets that have at least a user-defined minimum support. Form rules using the frequent item sets found and select those that have at least a user-defined minimum confidence.
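Both steps are implemented, for instance, in the R package arules (an assumption; the slides do not name a tool); a sketch on the Groceries data shipped with the package:

library(arules)
data(Groceries)                                  # example market basket transactions
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5, target = "rules"))
inspect(sort(rules, by = "confidence")[1:5])     # the 5 rules with highest confidence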
It is not possible to determine the support of all possible item sets, because their number grows exponentially with the number of items. Efficient methods to search the subset lattice are needed.
Based on a global order of the items. The item sets counted in a node consist of all items labeling the edges to the node (common prex) and one item following the last edge label.
Make sure that there is only one counter for each possible item set. This explains the unbalanced structure of the full item set tree.
Prune the tree if a certain depth (a certain size of the item sets) is reached. Idea: rules with too many items are difficult to interpret.
No superset of an infrequent item set can be frequent. Therefore no counters are needed for item sets having an infrequent subset.
Apriori: breadth-first search through the item set tree; candidates are generated and counted level by level (by item set size).
Eclat: depth-first search; the support is determined by intersecting transaction lists.
Example transaction database with 5 items and 10 transactions. Minimum support: 30%, i.e., at least 3 transactions must contain the item set. All one-item sets are frequent, hence the full second level of the item set tree is needed.
Determining the support of item sets: for each item set, traverse the database and count the transactions that contain it (highly inefficient). Better: traverse the tree for each transaction and find the item sets it contains (efficient: can be implemented as a simple doubly recursive procedure).
Minimum support: 30%, i.e., at least 3 transactions must contain the item set. Infrequent item sets: , , . The subtrees starting at these item sets can be pruned.
Before counting, check whether the candidates contain an infrequent item set. An item set with $k$ items has $k$ subsets of size $k-1$. The parent is only one of these subsets.
The item sets and can be pruned, because contains the infrequent item set and contains the infrequent item set .
Minimum support: 30%, i.e., at least 3 transactions must contain the item set. Infrequent item set: .
Generate candidate item sets with 4 items (parents must be frequent). Before counting, check whether the candidates contain an infrequent item set.
The item set can be pruned because it contains the infrequent item set. Consequence: no candidate item sets with four items. A fourth access to the transaction database is not necessary.
Form a transaction list for each item. Here: bit vector representation. grey: item is contained in transaction white: item is not contained in transaction
Transaction database is needed only once (for the single item transaction lists).
Intersect the transaction list for the first item with the transaction lists of all other items.
Count the number of set bits (containing transactions). The item set is infrequent and can be pruned.
Intersect the transaction list for the current item set with the transaction list of the next item. Result: the transaction list for the extended item set. With Apriori this item set could have been pruned before counting, because it was already known that one of its subsets is infrequent.
Backtrack to the second level of the search tree and intersect the transaction list for and . Result: Transaction list for .
Backtrack to the first level of the search tree and intersect the transaction list of the current item with the transaction lists of the remaining items. Result: transaction lists for the corresponding two-item sets. Only one item set has sufficient support; prune all other subtrees.
Backtrack to the first level of the search tree again and intersect the transaction list of the next item with the transaction lists of the remaining items. Result: transaction lists for the corresponding item sets.
Intersect the two transaction lists. Result: the transaction list for the combined item set, which turns out to be infrequent.
Backtrack to the first level of the search tree and intersect the transaction list of the last remaining item pair. Result: the transaction list for this item set. With this step the search is finished.
[Table: the frequent item sets found, grouped by size (1, 2 and 3 items), with their supports (e.g. 70%, 40%, 30%).]
Any frequent item set (support is higher than the minimal support).
Closed item sets (marked with a symbol in the table): a frequent item set is called closed if no superset has the same support.
Consider all pairs of disjoint item sets $A$ and $B$ whose union is a frequent item set. Common restriction: only one item in the consequent (then-part). Form the association rule $A \to B$ and compute its confidence, i.e.
$\mathrm{conf}(A \to B) = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)}$.
Further filtering criteria: require a minimum difference between rule confidence and consequent support; compute the information gain or a dependence measure (e.g. χ²) for antecedent (if-part) and consequent.
Example: [a rule formed from a frequent item set together with its support and confidence values; the concrete numbers were lost in extraction.]
Find the frequent item sets by searching the subset lattice / item set tree. Apriori: breadth-first search; Eclat: depth-first search. Other algorithms: FP-growth, H-Mine, LCM, MaFIA, RElim etc. Search tree pruning: no superset of an infrequent item set can be frequent. Then form the relevant association rules (minimum confidence).
Form all possible association rules from the frequent item sets. Filter interesting association rules.
Structured itemsets
Sometimes, an additional structure is imposed on the item sets.
For instance: sequences of customer contacts (buying, complaint, questionnaire, ...). Association rules then have the form: if one event happens and then another event happens, then probably a third event happens next.
The additional structure leads to a different tree structure, but the principal algorithm remains the same.
Other applications
Association rules with condence close to 100% could be business rules. Exceptions might be caused by data quality problems. Search for association rules with a given conclusion part. If , then the customer probably buys the product.
Subgroup discovery
Classification: find a global model (classifier) that assigns the correct class to all objects (or at least to as many as possible). Subgroup discovery: find interesting subgroups in the data set.
A subgroup is usually described by a few attribute values. A subgroup is interesting w.r.t. a (binary) target attribute if the distribution of the values of the target attribute differs significantly from the distribution in the whole population.
Subgroup discovery
Example: For a marketing campaign, find subgroups of customers with a high(er) chance to buy a certain product. Target attribute: buys insurance = YES. Possible result of the subgroup discovery process (similar to association rules with predefined consequent parts of the rules):
buys insurance = YES in the whole population: 5%; buys insurance = YES in the subgroup Age = YOUNG & marital status = MARRIED: 15%
Binomial test BT:
$p_0$: relative frequency of the target value in the whole population
$p$: relative frequency of the target value in the subgroup
$N$: size of the whole population
$n$: size of the subgroup
χ²-test
(a further quality measure uses a weighting parameter)
Relative gain RG: compares $p$ with $p_0$, set to 0 if the subgroup does not reach a minimum coverage (MC).
The problem of subgroup discovery is similar to finding association rules. Therefore, algorithms for frequent item set and association rule mining are often adapted to subgroup discovery, like Apriori-SD or FP-growth-based methods.
Feature selection
Very often, some attributes (features) are irrelevant or redundant. Depending on the method or model, such attributes can lead to bad results in the data mining process.
Naïve Bayes classifiers will be strongly affected when attributes are highly dependent. Decision trees and especially nearest neighbour classifiers are sensitive to irrelevant attributes.
Feature selection
Two basic strategies for feature selection:
Filtering refers to preselecting attributes before the corresponding model is built.
Wrapping refers to feature selection methods that select attributes in combination with the construction of the model.
Feature selection
Filtering methods:
Independence tests: between arbitrary attributes (to find dependent attributes), and between the target attribute and the other attributes (to find irrelevant attributes). Construct a (strictly pruned) decision tree and use only those attributes occurring in the decision tree.
Feature selection
Wrapping methods:
Exhaustive search (try all combinations of attributes). Greedy search (add or delete attributes step by step; choose in each step the one that leads to the best increase, or least decrease, of the performance).
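One simple greedy wrapper-style search in R is stepwise selection for a regression model (a sketch; the AIC criterion used by step() is an assumption, not from the slides):

full  <- lm(Petal.Width ~ ., data = iris[, 1:4])   # model with all attributes
empty <- lm(Petal.Width ~ 1, data = iris[, 1:4])   # model with no attributes
step(empty, scope = formula(full), direction = "forward")  # add attributes step by step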
Other topics
Change detection. Evolving systems (online adaptive learning for data streams). Active learning: classification (or regression) problems where labeling the data objects is expensive or complicated and most of the data objects are not labeled. The active learning model tries to select those data objects for labeling which best increase the performance of the model.