
Data Mining – Credibility:

Evaluating What’s Been Learned


Chapter 5
Evaluation
• Performance on the training data is not representative – it amounts to
cheating, since the model has already seen every test instance during training
• If testing on the training data were allowed, k-NN with k = 1 would look like
the best possible technique (it simply memorizes every instance)!
• Simplest fair evaluation = large training data AND
large test data
• We have been using 10-fold cross-validation
extensively – not just fair, but also more likely to be
accurate – less chance of unlucky or lucky results
• Better – repeated cross-validation (as in the experimenter
environment in WEKA) – this also allows statistical tests
Validation Data
• Some learning schemes involve testing what has been
learned on other data – AS PART OF THEIR TRAINING !!
• Frequently, this process is used to “tune” parameters that
can be adjusted in the method to obtain the best
performance (e.g. the threshold for accepting a rule in Prism)
• The test during learning cannot be done on training data or
test data
– Using training data would mean that the learning is being checked
against data it has already seen
– Using test data would mean that the test data would have already
been seen during (part of) learning
• A separate (3rd) data set should be used – the “validation” data
Confidence Intervals
• If an experiment shows 75% correct, we may want to
know what the true success rate can actually be
expected to be (the experiment is the result of
sampling)
• We can develop a confidence interval around the
result
• Skip Math
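Even with the math skipped, the idea can be sketched in a few lines of Python. This is a minimal sketch using the simple normal approximation for a proportion (the book's own formula is a bit more exact), with made-up test-set sizes just to show how the interval tightens as the test set grows:

```python
import math

def accuracy_confidence_interval(p_hat, n, z=1.96):
    """Normal-approximation interval for an observed success rate.
    p_hat: observed proportion correct, n: test-set size, z: 1.96 for ~95%."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)      # standard error of a proportion
    return p_hat - z * se, p_hat + z * se

# 75% correct: the interval tightens as the (made-up) test set grows
print(accuracy_confidence_interval(0.75, 100))     # roughly (0.67, 0.83)
print(accuracy_confidence_interval(0.75, 10000))   # roughly (0.74, 0.76)
```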
Cross-Validation
• The foundation is a simple idea – “holdout” – hold out a
certain amount of the data for testing and use the rest for training
• The separation should NOT be done by “convenience”
– It should at least be random
– Better – “stratified” random – the division preserves the relative
proportion of the classes in both training and test data
• Enhancement: repeated holdout
– Repeating with different random splits and averaging lets more of the data
be used in training, while still getting a good test
• 10-fold cross-validation has become the standard
• It is improved further if the folds are chosen in a “stratified”
random way (a small sketch follows)
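A rough plain-Python sketch of stratified fold assignment (WEKA and other toolkits do this for you; the class labels below are made up):

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=1):
    """Assign instance indices to k folds so each fold keeps roughly the same
    class proportions as the whole dataset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)                 # random order within each class
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)         # deal each class out round-robin
    return folds

labels = ["yes"] * 30 + ["no"] * 70          # made-up class labels
for test_fold in stratified_folds(labels, k=10):
    train_idx = [i for i in range(len(labels)) if i not in set(test_fold)]
    # ... train on train_idx, test on test_fold, then average the 10 results
```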
Repeated Cross Validation
• The folds within one cross-validation are not independent samples
– The contents of one fold are influenced by the contents of the other folds
(they have no instances in common)
– So statistical tests (e.g. the t-test) are not appropriate
• If you do repeated cross-validation, the different cross-
validations are independent samples – the folds drawn for
one are different from the others
– You will get some variation in results
– Any good / bad luck in the forming of folds is averaged out
– Statistical tests are appropriate
• It has become common to run 10 separate 10-fold cross-validations
• Supported by the experimenter environment in WEKA
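A sketch of ten repetitions of stratified 10-fold cross-validation. We normally do this in WEKA's experimenter; here scikit-learn and one of its built-in datasets stand in, purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# 10 separate 10-fold cross-validations, each with its own random fold assignment
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
scores = cross_val_score(GaussianNB(), X, y, cv=cv)   # 100 individual fold accuracies

per_run = scores.reshape(10, 10).mean(axis=1)   # one overall accuracy per full 10-fold CV
print(per_run)   # these ten values are the independent samples used for statistical tests
```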
For Small Datasets
• Leave One Out
• Bootstrapping
• To be discussed in turn
Leave One Out
• Train on all but one instance, test on that one (the per-instance result
is always 100% or 0% correct)
• Repeat until every instance has been tested on, then average the results
• Really equivalent to N-fold cross-validation where N =
the number of instances available
• Plusses:
– Always trains on the maximum possible amount of training data (without
cheating)
– Efficient in the sense that no repetition is needed (the fold contents are
not randomized, so the result is deterministic)
– No stratification or random sampling necessary
• Minuses
– Guarantees a non-stratified sample – the correct class will always
be at least a little bit under-represented in the training data
– Statistical tests are not appropriate
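A minimal sketch of the procedure; `train_and_predict` is a hypothetical hook standing in for whatever learning scheme is being evaluated, and the toy majority-class "learner" is only there to make the example runnable:

```python
def leave_one_out(instances, labels, train_and_predict):
    """Leave-one-out: n deterministic folds of size one, results averaged.
    train_and_predict(train_X, train_y, test_instance) is a hypothetical hook."""
    correct = 0
    for i in range(len(instances)):
        train_X = instances[:i] + instances[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        prediction = train_and_predict(train_X, train_y, instances[i])
        correct += (prediction == labels[i])          # each fold is 100% or 0%
    return correct / len(instances)

# toy "learner" that always predicts the majority class of its training data
majority = lambda X, y, test: max(set(y), key=y.count)
print(leave_one_out(list(range(10)), ["yes"] * 6 + ["no"] * 4, majority))   # 0.6
```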
Bootstrapping
• Sampling is done with replacement to form the training dataset
• Particular approach – the 0.632 bootstrap
– A dataset of n instances is sampled n times
– Some instances will be included multiple times
– Those never picked are used as test data
– For a large enough dataset, about 0.632 of the distinct instances end up in the
training dataset; the rest are in the test set
• This gives a somewhat pessimistic estimate of performance, since only about
63% of the data is used for training (vs 90% in 10-fold cross-validation)
• May try to balance this by also weighting in performance on the training
data (p 129) <but this doesn’t seem fair>
• This procedure can be repeated any number of times, allowing
statistical tests
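A sketch of forming one bootstrap split; for a reasonably large n the fractions come out close to 0.632 / 0.368:

```python
import random

def bootstrap_split(n, seed=None):
    """Sample n instance indices with replacement for training; the instances
    never picked become the test set (about 36.8% of them for large n)."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]     # duplicates allowed
    in_train = set(train)
    test = [i for i in range(n) if i not in in_train]
    return train, test

train_idx, test_idx = bootstrap_split(1000, seed=1)
print(len(set(train_idx)) / 1000)   # distinct instances in training, close to 0.632
print(len(test_idx) / 1000)         # instances left for testing, close to 0.368
```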
Comparing Data Mining
Methods Using T-Test
• Don’t worry about the math
– You probably should have had it (MATH 140?)
– WEKA will do it automatically for you – experimenter
environment
– Excel can do it easily
• See the examplettest.xls file on my www site
• (formula: =TTEST(A1:A8,B1:B8,2,1))
– the two ranges being compared
– 2 – a two-tailed test, since we don’t know which scheme to expect to be higher
– 1 – indicates a paired test – OK when the results being compared come from the
same samples (same splits into folds)
– the result is the probability of getting differences at least this large by chance alone
– a difference is generally accepted as real if this is below .05, though sometimes we look for .01 or better
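The same test in Python, using SciPy's paired t-test on made-up accuracies from the same eight folds, mirroring the Excel formula above:

```python
from scipy import stats

# made-up accuracies for two schemes on the SAME eight folds (paired samples)
scheme_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80]
scheme_b = [0.76, 0.77, 0.80, 0.74, 0.79, 0.75, 0.78, 0.76]

# paired, two-tailed test -- the same role as Excel's =TTEST(A1:A8,B1:B8,2,1)
t_stat, p_value = stats.ttest_rel(scheme_a, scheme_b)
print(p_value)   # well below .05 here, so the difference is unlikely to be chance alone
```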
5.6 Predicting Probabilities
• Skip
5.7 Counting the Cost
• Some mistakes are more costly to make than others
• Giving a loan to a defaulter is more costly than denying
somebody who would be a good customer
• Sending mail solicitation to somebody who won’t buy is
less costly than missing somebody who would buy
(opportunity cost)
• Looking at a confusion matrix, each position could have an
associated cost (or benefit from correct positions)
• Measurement could be average profit/ loss per prediction
• To be fair in cost benefit analysis, should also factor in cost
of collecting and preparing the data, building the model …
Lift Charts
• In practice, costs are frequently not known
• Decisions may be made by comparing possible
scenarios
• Book Example – Promotional Mailing
– Situation 1 – previous experience predicts that 0.1% of all
(1,000,000) households will respond
– Situation 2 – classifier predicts that 0.4% of the 100000 most
promising households will respond
– Situation 3 – classifier predicts that 0.2% of the 400000 most
promising households will respond
– The increase in response rate is the lift (0.4 / 0.1 = 4 in
situation 2, compared to sending to everybody)
– A lift chart allows a visual comparison …
Figure 5.1: A hypothetical lift chart
Generating a lift chart
• Best done if classifier generates probabilities for its predictions
• Sort test instances based on probability of class we’re interested in
(e.g. would buy from catalog = yes)
Table 5.6
Rank Probability of Yes Actual class
1 .95 Yes
2 .93 Yes
3 .93 No
4 .88 Yes
5 .86 Yes …
• To get the y-value (number correct) for a given x (sample size), read down the sorted list
as far as the sample size, counting the instances that actually belong to the class we want
• (e.g. a sample size of 5 gives 4 correct – on the lift chart shown, the sample size of 5 would
be converted to a % of the total sample)
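A sketch of turning the ranked list above into lift-chart points (sample size, number correct), using the five rows of Table 5.6:

```python
# ranked test instances: (predicted probability of Yes, actual class), as in Table 5.6
ranked = [(0.95, "Yes"), (0.93, "Yes"), (0.93, "No"), (0.88, "Yes"), (0.86, "Yes")]
ranked.sort(key=lambda row: row[0], reverse=True)   # highest probability first

points = []      # (sample size, number actually Yes so far) -- the lift-chart coordinates
hits = 0
for size, (_, actual) in enumerate(ranked, start=1):
    hits += (actual == "Yes")
    points.append((size, hits))

print(points)    # sample size 5 -> 4 correct, as in the example above
```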
Cost Sensitive Classification
• For classifiers that generate probabilities of each class
• If not cost sensitive, we would simply predict the most probable class
• Cost matrix (rows = actual class, columns = predicted class):

               Predicted A   Predicted B   Predicted C
  Actual A          0             10            20
  Actual B          5              0             5
  Actual C         10              2             0

• Suppose the classifier’s probabilities are A = .2, B = .3, C = .5
• Expected cost of each possible prediction:
– Predict A: .2 * 0 + .3 * 5 + .5 * 10 = 6.5
– Predict B: .2 * 10 + .3 * 0 + .5 * 2 = 3.0
– Predict C: .2 * 20 + .3 * 5 + .5 * 0 = 5.5
• Considering costs, B would be predicted
even though C is considered most likely
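The same calculation as a short sketch, with the cost matrix and probabilities from this slide:

```python
# cost[actual][predicted], from the cost matrix on this slide
cost = {"A": {"A": 0,  "B": 10, "C": 20},
        "B": {"A": 5,  "B": 0,  "C": 5},
        "C": {"A": 10, "B": 2,  "C": 0}}
prob = {"A": 0.2, "B": 0.3, "C": 0.5}       # the classifier's probability estimates

expected_cost = {pred: sum(prob[act] * cost[act][pred] for act in prob)
                 for pred in ("A", "B", "C")}
print(expected_cost)                               # {'A': 6.5, 'B': 3.0, 'C': 5.5}
print(min(expected_cost, key=expected_cost.get))   # 'B', the minimum expected cost
```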
Cost Sensitive Learning
• Most learning methods are not sensitive to cost structures (e.g. a higher
cost for a false positive than for a false negative) (Naïve Bayes is; decision
tree learners are not)
• A simple method for making a learner cost sensitive –
– Change the proportion of the different classes in the training data
– E.g. suppose we have a dataset with 1000 Yes and 1000 No, but incorrectly predicting Yes
is 10 times more costly than incorrectly predicting No
– Filter and sample the data so that we have 1000 No and only 100 Yes
– A learning scheme trying to minimize errors will then tend toward predicting
No
• If we don’t have enough data to set some aside, “re-sample” the No’s (bring
duplicates in), provided the learning method can deal with duplicates (most can)
• With some methods, instances can be “weighted” so that some count
more than others – the No’s could be weighted more heavily
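A rough sketch of the re-proportioning idea, following the slide's Yes/No example (the class names, the 10:1 ratio, and the random down-sampling are just for illustration):

```python
import random

def undersample(instances, labels, shrink_class="Yes", keep_fraction=0.1, seed=1):
    """Change the class proportions: keep only a fraction of the class whose
    incorrect predictions are expensive, so an error-minimizing learner leans
    toward the other class."""
    rng = random.Random(seed)
    kept = [(x, y) for x, y in zip(instances, labels)
            if y != shrink_class or rng.random() < keep_fraction]
    return [x for x, _ in kept], [y for _, y in kept]

X = list(range(2000))
y = ["Yes"] * 1000 + ["No"] * 1000
X2, y2 = undersample(X, y)
print(y2.count("Yes"), y2.count("No"))   # roughly 100 Yes, 1000 No
```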
Information Retrieval (IR) Measures
• E.g., Given a WWW search, a search engine
produces a list of hits supposedly relevant
• Which is better?
– Retrieving 100, of which 40 are actually relevant
– Retrieving 400, of which 80 are actually relevant
– Really depends on the costs
Information Retrieval (IR) Measures
• The IR community has developed 3 measures:
– Recall = (number of relevant documents retrieved) / (total number of relevant documents)
– Precision = (number of relevant documents retrieved) / (total number of documents retrieved)
– F-measure = (2 * recall * precision) / (recall + precision)
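A small sketch of the three measures applied to the two searches from the previous slide; the total number of relevant documents (200 here) is made up, since the slide does not give it:

```python
def ir_measures(relevant_retrieved, total_retrieved, total_relevant):
    """Recall, precision and F-measure as defined above."""
    recall = relevant_retrieved / total_relevant
    precision = relevant_retrieved / total_retrieved
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure

# the two searches from the previous slide; total_relevant = 200 is a made-up figure
print(ir_measures(40, 100, 200))   # recall 0.20, precision 0.40, F about 0.27
print(ir_measures(80, 400, 200))   # recall 0.40, precision 0.20, F about 0.27
```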
WEKA
• Part of the results provided by WEKA (that we’ve ignored so far)
• Let’s look at an example (Naïve Bayes on my-weather-nominal)
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure Class
0.667 0.125 0.8 0.667 0.727 yes
0.875 0.333 0.778 0.875 0.824 no
=== Confusion Matrix ===
a b <-- classified as
4 2 | a = yes
1 7 | b = no

• TP rate and recall are the same = TP / (TP + FN)


– For Yes = 4 / (4 + 2); For No = 7 / (7 + 1)
• FP rate = FP / (FP + TN) – For Yes = 1 / (1 + 7); For No = 2 / (2 + 4)
• Precision = TP / (TP + FP) – For yes = 4 / (4 + 1); For No = 7 / (7 + 2)
• F-measure = 2TP / (2TP + FP + FN)
– For Yes = 2*4 / (2*4 + 1 + 2) = 8 / 11
– For No = 2 * 7 / (2*7 + 2 + 1) = 14/17
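These by-class figures can be checked directly against the confusion matrix; a short sketch using the counts above:

```python
def class_metrics(tp, fp, fn, tn):
    """Per-class measures, using the formulas on this slide."""
    return {"TP rate / recall": tp / (tp + fn),
            "FP rate":          fp / (fp + tn),
            "precision":        tp / (tp + fp),
            "F-measure":        2 * tp / (2 * tp + fp + fn)}

# from the confusion matrix: 4 2 / 1 7 with a = yes, b = no
print(class_metrics(tp=4, fp=1, fn=2, tn=7))   # yes: 0.667, 0.125, 0.8, 0.727
print(class_metrics(tp=7, fp=2, fn=1, tn=4))   # no:  0.875, 0.333, 0.778, 0.824
```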
In terms of true positives etc
• True positives = TP; False positives = FP
• True Negatives = TN; False negatives = FN
• Recall = TP / (TP + FN) // true positives / actually positive
• Precision = TP / (TP + FP) // true positives / predicted
positive
• F-measure = 2TP / (2TP + FP + FN)
– This form is obtained algebraically from the previous formula
– It is easier to understand this way – correct predictions are double
counted, once for recall and once for precision, and the denominator adds in
both kinds of incorrect predictions (relevant but not retrieved, and retrieved
but not relevant)
• There is no mathematics that says recall and precision must
be combined this way – the F-measure (the harmonic mean of the two) is
somewhat ad hoc – but it does balance the two
Kappa Statistic
• A way of checking success against how hard the
problem is
• Compare to the results expected from random prediction
– with predictions in the same proportions as the predictions
made by the classifier being evaluated
• This is different from predicting in proportion to the
actual class values
– Which might be considered having an unfair advantage
– But which I would consider a better measure
Kappa Statistic
Actual results (rows = actual class, columns = predicted class):

             A     B     C   Total
Actual A    88    10     2     100
Actual B    14    40     6      60
Actual C    18    10    12      40
Total      120    60    20

Expected results with stratified random prediction (each actual class total
spread across the predicted classes in the classifier’s prediction proportions 120 : 60 : 20):

             A     B     C   Total
Actual A    60    30    10     100
Actual B    36    18     6      60
Actual C    24    12     4      40
Total      120    60    20
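From these two tables the kappa statistic can be computed as (observed correct - expected correct) / (total - expected correct); a quick check using the counts above:

```python
# diagonals of the two tables above
observed_correct = 88 + 40 + 12     # what the classifier actually got right
expected_correct = 60 + 18 + 4      # what random prediction would be expected to get right
total = 100 + 60 + 40

kappa = (observed_correct - expected_correct) / (total - expected_correct)
print(kappa)   # about 0.49 -- roughly 49% of the possible improvement over chance
```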


WEKA
• For many occasions, this borders on “too much
information”, but it’s all there
• We can decide whether we are more interested in
Yes or in No
• Are we more interested in recall or precision?
WEKA – with more than two classes
• Contact Lenses with Naïve Bayes
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure Class
0.8 0.053 0.8 0.8 0.8 soft
0.25 0.1 0.333 0.25 0.286 hard
0.8 0.444 0.75 0.8 0.774 none
=== Confusion Matrix ===
a b c <-- classified as
4 0 1 | a = soft
0 1 3 | b = hard
1 2 12 | c = none
• Class exercise – show how to calculate recall, precision, f-
measure for each class
Evaluating Numeric Prediction
• Here, not a matter of right or wrong, but rather,
“how far off”
• There are a number of possible measures, with
formulas shown in Table 5.6
WEKA
• IBK w/ k = 5 on baskball.arff
=== Cross-validation ===
=== Summary ===
Correlation coefficient 0.548
Mean absolute error 0.0715
Root mean squared error 0.0925
Relative absolute error 83.9481 %
Root relative squared error 85.3767 %
Total Number of Instances 96
Root Mean-Squared Error
• Square root of (sum of the squared errors / number of
predictions)
• Algorithm:
– Initialize – especially subtotal = 0
– Loop through all test instances
• Make a prediction
• Compare it to the actual value – calculate the difference
• Square the difference; add it to the subtotal
– Divide the subtotal by the number of test instances
– Take the square root to obtain the root mean squared error
• The error is on the same scale as the predictions – a root mean squared
error of .0925 compared to a mean of .42 and a range of .67
seems decent
• Exaggerates the effect of any bad predictions, since the differences
are squared
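A direct translation of the algorithm above into Python, with made-up prediction/actual pairs on a 0-1 scale:

```python
import math

def root_mean_squared_error(predicted, actual):
    """The algorithm above: square each error, average, take the square root."""
    subtotal = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(subtotal / len(actual))

# made-up prediction/actual pairs on a 0-1 scale
print(root_mean_squared_error([0.40, 0.50, 0.35], [0.42, 0.44, 0.47]))   # about 0.078
```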
Mean Absolute Error
• (Sum of Absolute Values of Errors / number of predictions)
• Algorithm:
– Initialize – especially subtotal = 0
– Loop through all test instances
• Make prediction,
• compare to actual – calculate difference
• Take absolute value of difference; add to subtotal
– Divide subtotal by number of test instances to obtain mean
absolute error
• The error is on the same scale as the predictions – a mean absolute error
of .0715 compared to a mean of .42 and a range of .67 seems
decent
• Does not exaggerate the effect of any bad predictions –
NOTE that this value is smaller in my example than the
squared version.
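The same sketch with the squaring removed; on the same made-up values the result is smaller than the root mean squared error, matching the note above:

```python
def mean_absolute_error(predicted, actual):
    """Average of the absolute errors; a single bad prediction is not exaggerated."""
    subtotal = sum(abs(p - a) for p, a in zip(predicted, actual))
    return subtotal / len(actual)

# same made-up values as before: about 0.067, smaller than the RMSE of about 0.078
print(mean_absolute_error([0.40, 0.50, 0.35], [0.42, 0.44, 0.47]))
```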
Relative Error Measures
• The error is divided by the corresponding differences of the actual values from their mean
– Root Relative Squared Error
– Relative Absolute Error
• See upcoming slides
Root Relative Squared Error
• Square root of (sum of the squared errors / sum of the squared
differences of the actual values from their mean)
• Gives idea of scale of error compared to how variable the actual
values are (the more variable the values are, really the harder the task)
• Algorithm:
– Initialize – especially numerator and denominator subtotals = 0
– Determine mean of actual test instances
– Loop through all test instances
• Make prediction,
• compare to actual – calculate difference
• Square difference; add to numerator subtotal
• Compare actual to mean of actual – calculate difference
• Square difference; add to denominator subtotal
– Divide numerator subtotal by denominator subtotal
– Take the square root of the result to obtain the root relative squared error
• The error is normalized
• The use of squares once again exaggerates bad predictions
Relative Absolute Error
• (Sum of the absolute errors / sum of the absolute
differences of the actual values from their mean)
• Gives idea of scale of error compared to how variable the actual
values are (the more variable the values are, really the harder the task)
• Algorithm:
– Initialize – especially numerator and denominator subtotals = 0
– Determine mean of actual test instances
– Loop through all test instances
• Make prediction,
• compare to actual – calculate difference;
• take absolute value of difference; add to numerator subtotal
• Compare actual to mean of actual – calculate difference
• take absolute value of difference; add to denominator subtotal
– Divide numerator subtotal by denominator subtotal
• The error is normalized
• Does not exaggerate the effect of bad predictions
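Both relative measures in one short sketch (same made-up values as before; on the WEKA run shown earlier they came out around 84-85%):

```python
import math

def relative_errors(predicted, actual):
    """Root relative squared error and relative absolute error: the error is
    normalized by how much the actual values vary around their own mean."""
    mean = sum(actual) / len(actual)
    sq_err  = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    sq_var  = sum((a - mean) ** 2 for a in actual)
    abs_err = sum(abs(p - a) for p, a in zip(predicted, actual))
    abs_var = sum(abs(a - mean) for a in actual)
    return math.sqrt(sq_err / sq_var), abs_err / abs_var

# made-up values; anything above 1.0 is worse than just predicting the mean
print(relative_errors([0.40, 0.50, 0.35], [0.42, 0.44, 0.47]))
```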
Correlation Coefficient
• Tells whether the predictions and actual values “move
together” – one goes up when the other goes up …
• Not as tight a measurement as the others
– E.g. if the predictions are all exactly double the actual values, the
correlation is a perfect 1.0, but the predictions are not that good
– We want a good correlation, but we want MORE than
that
• A little bit complicated, and well established (can do
easily in Excel), so let’s skip the math
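Skipping the math, a two-line check of the "double the actual" point, using Python's statistics module (the numbers are made up):

```python
from statistics import correlation   # Python 3.10+

predicted = [0.40, 0.45, 0.50, 0.38, 0.47]    # made-up numeric predictions
actual    = [0.20, 0.225, 0.25, 0.19, 0.235]  # exactly half of each prediction

print(correlation(predicted, actual))   # essentially 1.0 -- perfect correlation,
                                        # even though every prediction is double the actual
```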
What to use?
• Depends some on philosophy
– Do you want to punish bad predictions a lot? (then use a
root squared method)
– Do you want to compare performance on different data
sets and one might be “harder” (more variable) than
another? (then use a relative method)
• In many real world cases, any of these work fine
(comparisons between algorithms come out the
same regardless of which measurement is used)
• Basic framework same as with predicting category –
repeated 10-fold cross validation, with paired
sampling …
Minimum Description Length
Principle
• What is learned in Data Mining is a form of “theory”
• Occam’s Razor – in science, other things being equal,
simple theories are preferable to complex ones
• The mistakes a theory makes are really exceptions to it, so to
keep “other things equal” they should be added to the
theory, making it more complex
• Simple example – a simple decision tree (other things
being equal) is preferred over a more complex decision
tree
• Details will be skipped (along with section 5.10)
End Chapter 5
