Jason Dew
September ,
Abstract
In this homework I will explore the efficacy of different parameters in
the k-nearest neighbor algorithm, including the value of k, weighting tactics,
and the distance measure used to calculate similarity. The breast cancer
Wisconsin (diagnostic) data¹ from the UCI Machine Learning repository² is
used.
Weka installation
This was very straightforward on my platform of choice, Mac OS. I also put
the weka.jar file in a standard location so that it can be used programmatically
via JRuby.
¹ http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
² http://archive.ics.uci.edu/ml/index.html
Boxplots of each attribute
[Boxplots of the attributes mitosis, normal_nucleoli, bland_chromatin, bare_nuclei, epithelial_cell_size, marginal_adhesion, cell_shape_uniformity, cell_size_uniformity, and clump_thickness, plotted on a common scale from 2 to 10; outliers shown as points.]
attribute               mean    standard deviation
clump_thickness          .       .
cell_size_uniformity     .       .
cell_shape_uniformity    .       .
marginal_adhesion        .       .
epithelial_cell_size     .       .
bare_nuclei              .       .
bland_chromatin          .       .
normal_nucleoli          .       .
mitosis                  .       .
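The per-attribute statistics in the table above can be computed with Python's standard library; the values below are hypothetical clump_thickness measurements for illustration only, not the actual dataset:

```python
import statistics

# Hypothetical clump_thickness values (1-10 scale, as in the UCI data).
# The real table is computed over all instances of the dataset.
values = [5, 5, 3, 6, 4, 8, 1, 2, 3, 4]

mean = statistics.mean(values)            # arithmetic mean
std = statistics.stdev(values)            # sample standard deviation
print(round(mean, 2), round(std, 2))      # 4.1 2.02
```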
Results
In order to learn more about how the k-NN classifier works, I varied several
options in addition to k, including weighting the similarities and varying the
distance measures. The accuracy of a classifier is defined as
accuracy = (# correct) / (# of instances)
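This definition translates directly into code; the following is a minimal sketch with made-up labels (the function name and data are my own, not part of the assignment):

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Hypothetical benign/malignant labels, for illustration only.
predicted = ["benign", "malignant", "benign", "benign"]
actual    = ["benign", "malignant", "malignant", "benign"]
print(accuracy(predicted, actual))  # 0.75
```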
Figure shows how the accuracy varies in k using the Euclidean distance measure
and no weighting. There does not seem to be a clear pattern here. Figure shows
how the distance metric used affects the accuracy achieved. The differences
between these are slight, and the Euclidean distance does the best overall. Figure
shows pretty clearly that weighting the results either by using the inverse distance
or the similarity is a good idea. In this case, using the inverse distance does a better
job.
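The variations compared in these experiments were run with Weka, but the idea can be sketched in plain Python. This is a minimal illustration under my own naming, not the Weka implementation; the "similarity" weight assumes distances normalized to [0, 1], matching Weka's "weight by 1−distance" option:

```python
import math
from collections import defaultdict

# The three distance measures compared in the experiments.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, query, k=5, distance=euclidean, weighting="none"):
    """Classify `query` from (features, label) pairs in `train`.

    weighting: "none" (plain majority vote), "inverse" (weight = 1/distance),
    or "similarity" (weight = 1 - distance; assumes normalized distances).
    """
    neighbors = sorted(train, key=lambda pair: distance(pair[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        d = distance(features, query)
        if weighting == "inverse":
            votes[label] += 1.0 / (d + 1e-9)  # small epsilon avoids division by zero
        elif weighting == "similarity":
            votes[label] += 1.0 - d
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)
```

For example, with the toy training set `[((1, 1), "benign"), ((9, 9), "malignant"), ((8, 9), "malignant")]`, a query at `(2, 2)` with `k=1` lands on the benign neighbor, while a query at `(8, 8)` with `k=3` and inverse-distance weighting is dominated by the two nearby malignant points.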
[Figure: k-NN accuracy for k ranging from 5 to 10; y-axis: accuracy (roughly 96.2 to 96.6), x-axis: k.]
[Figure: Comparison of distance metrics (Euclidean, Manhattan, Chebyshev); y-axis: accuracy (roughly 96.0 to 96.8), x-axis: k from 5 to 10.]
Figure : Graph of the effect of the distance metric used on the k-NN algorithm.
[Figure: Comparison of weighting options (None, Inverse distance, Similarity); y-axis: accuracy (roughly 96.2 to 97.2), x-axis: k from 5 to 10.]
Figure : Graph of the effect of the use of weighting on the k-NN algorithm.