ASSIGNMENT #1
Submitted to: Eng. Eman Hossam
Team: #19
Submitted By:
Ahmed Mohamed Abd El-Hameed
Karem Mohamed El-Farra
KMEANS
d)
e) What can change each time you cluster the data? Why?
ANSWER: The positions of the centers change each time, because the initial
centers are chosen randomly, and that affects the result of the clustering.
How do you prevent these changes from occurring?
ANSWER: By assigning specific initial centers, or by fixing the random seed
with set.seed(145).
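The assignment appears to use R's set.seed(145); the same idea in a hedged, self-contained Python sketch (the kmeans helper and the random toy data below are illustrative, not the assignment's dataset): seeding the random number generator fixes the randomly sampled initial centers, so repeated runs return identical clusters.

```python
import random

def kmeans(points, k, seed=None, iters=20):
    """Tiny Lloyd's-algorithm k-means; the seed fixes the random initial centers."""
    rng = random.Random(seed)        # seeded RNG -> reproducible initialization
    centers = rng.sample(points, k)  # random initial centers: the source of run-to-run variation
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:             # assign each point to its nearest center
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        centers = [                  # recompute each center as its cluster's mean
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers

random.seed(0)                       # only to build a toy dataset
pts = [(random.random(), random.random()) for _ in range(100)]

# Same seed -> identical centers on every run; with no seed they can differ.
print(kmeans(pts, 3, seed=145) == kmeans(pts, 3, seed=145))  # True
```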
f)
Best K = 5
h)
No.
Reason: K = 5 is the elbow value of the plot (the point after which increasing
K yields little further reduction in the within-cluster sum of squares).
From the plot:
Best K = 5
i) Yes, there is an outlier.
K=4
END OF KMEANS
Association Rules
b)
# item13 is the most frequent item
# The largest transaction has 25 items.
c)
iii - When the minimum confidence increases, the number of rules that achieve
that confidence decreases.
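This follows because the confidence filter is monotone: each rule's confidence is fixed, so raising the threshold can only shrink the set of rules that pass it. A hedged sketch with made-up (rule, confidence) pairs, not the assignment's data:

```python
# Assumed rules and confidence values, for illustration only.
rules = [("A->B", 0.90), ("A->C", 0.60), ("B->C", 0.40), ("C->A", 0.25)]

def passing(min_conf):
    # rules surviving a given minimum-confidence threshold
    return [name for name, conf in rules if conf >= min_conf]

# A higher threshold never admits more rules than a lower one.
print(len(passing(0.2)), len(passing(0.5)), len(passing(0.8)))  # 4 2 1
```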
d)
i - The interesting rules are the ones with the darkest shading (highest lift
values); they are positioned at the top left of the plot.
e)
i - Observation: there are no rules with high support and high lift at the
same time.
ii - The rules at the top left of the plot are more interesting, as these are
the high-lift (darker-colored) rules.
iii - These rules are located at the positions with high support and low
lift.
Lhs  Rhs       Support  Confidence
{}   {item5}   0.3699   0.3699
{}   {item13}  0.4948   0.4948
The expected observation for these itemsets: the customer will buy these
items even if he did not buy anything else, since the Lhs is empty ({}).
The actual observation: nothing; these rules have no meaning and cannot be
real rules, because with an empty Lhs the confidence simply equals the
support of the Rhs item.
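Why such rules carry no information can be checked directly: for {} -> {X}, supp({}) = 1, so confidence = supp(X) and lift = 1. A short check of that arithmetic using the item5 support from the table (this is the computation, not the assignment's R output):

```python
supp_lhs = 1.0           # the empty itemset {} occurs in every transaction
supp_item5 = 0.3699      # supp({item5}) from the table above

confidence = supp_item5 / supp_lhs   # equals supp(rhs): no information gained
lift = confidence / supp_item5       # equals 1: rhs independent of the (empty) lhs
print(confidence, lift)              # 0.3699 1.0
```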
iv
Rule  Lhs       Rhs       Support  Confidence  Lift
3034  {item10}  {item13}  0.1492   0.4915980   0.9935287
3037  {item5}   {item30}  0.1276   0.3449581   1.0427996
3011  {item13}  {item92}  0.1290   0.2607114   0.9912981
f) i
Rule  Lhs
3042  {item10,item44}
3111  {item82,item99}
162   {item23}
g)
i
{item15,item30,item49} -> {item56} (lift: 16.58456)
{item15,item49} -> {item56} (lift: 14.88358)
{item49,item56} -> {item15} (lift: 9.153028)
{item30,item49,item56} -> {item15} (lift: 9.240199)
ii - The darker blue rules are the rules with the highest confidence and the
lowest lift.
h)
i
lhs / rhs
3. Naïve Bayes
a)
A-priori probabilities:
Y
     10-50K     50-80K     GT 80K
 0.80266371 0.12563818 0.07169811

Conditional probabilities:
   age
Y    20-30   31-45   GT 45

   educ
Y    College   Others   Prof/Phd
b)
         predicted
actual    10-50K
10-50K       787
50-80K       127
GT 80K        67

Accuracy = 0.795
Overall misclassification = 0.205
10-50K misclassification rate = 0.007566204
50-80K misclassification rate = 1
GT 80K misclassification rate = 0.8933333
- Explain the variation in the model's predictive power across income classes:
[10-50K] has the highest a-priori probability (0.80), so it has the lowest
misclassification rate.
[50-80K] and [GT 80K] have a-priori probabilities of similar, small magnitude
(0.126 and 0.072), but [50-80K] does not have the highest conditional
probability in any variable (sex, age, educ), so the model never assigns any
record to class [50-80K]; its misclassification rate is therefore 1.
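A hedged sketch of why this happens (the probabilities below are made up for illustration, not the fitted model's values): naive Bayes predicts the class maximizing prior times the product of conditional probabilities, so a class whose prior is modest and whose conditionals are never the largest can never win the argmax, and every one of its records is misclassified.

```python
# Assumed numbers, chosen so that "50-80K" never has the largest
# prior or conditional probability for this feature combination.
priors = {"10-50K": 0.80, "50-80K": 0.13, "GT 80K": 0.07}
p_age  = {"10-50K": 0.40, "50-80K": 0.30, "GT 80K": 0.45}  # P(age=31-45 | class), assumed
p_educ = {"10-50K": 0.65, "50-80K": 0.30, "GT 80K": 0.55}  # P(educ=Others | class), assumed

def predict():
    # naive Bayes decision rule: argmax_c  P(c) * P(age | c) * P(educ | c)
    return max(priors, key=lambda c: priors[c] * p_age[c] * p_educ[c])

print(predict())  # 10-50K: the dominant prior wins; 50-80K is never predicted
```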
4. Naïve Bayes
a)
         predicted
actual     F    M
     F   106  321
     M    97  476

Accuracy = 0.582
Overall misclassification = 0.418
Since the male class has a higher a-priori probability than the female class,
it has the lower misclassification rate.
b)
A-priori probabilities:
Y
    F    M
  0.5  0.5

Conditional probabilities:
   age
Y       20-30     31-45     GT 45
  F 0.1865714 0.3411429 0.4722857
  M 0.1891429 0.3454286 0.4654286

   educ
Y     College    Others  Prof/Phd
  F 0.3228571 0.6531429 0.0240000
  M 0.2848571 0.6791429 0.0360000

   income
Y       10-50K     50-80K     GT 80K
  F 0.88171429 0.08314286 0.03514286
  M 0.74885714 0.15628571 0.09485714
c)
Accuracy = 0.53
Misclassification rate = 0.47
Male misclassification rate = 0.1358314
Female misclassification rate = 0.7190227
- The male class is predicted better than the female class (its
misclassification rate is much lower).
- I think that is because both classes have the same a-priori probability
(0.5), so the difference comes from the conditional probabilities, which
favor the male class.
d) It has no effect on the accuracy or the confusion matrix.
e) The classification of a specific class becomes better when its a-priori
probability becomes higher.
Decision tree
a)
b)
Class 1 is better, as it has the higher TPR.
predicted
actual 0 1
0 314 26
1 19 241
tpr.0 = 0.923529411764706
tpr.1 = 0.926923076923077
c)
def.: the resubstitution error is the rate of misclassification errors
committed on the training records.
resub. = 0.075
- It is a poor measure of prediction performance because it is computed on
the training data; the test data should be used to get a fair estimate.
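A minimal sketch of why resubstitution error is optimistic (toy data, not the assignment's): a classifier that simply memorizes the training records scores a resubstitution error of 0, yet that says nothing about its performance on unseen records.

```python
train = [((1, 0), "a"), ((0, 1), "b"), ((1, 1), "a")]  # toy (features, label) pairs
memory = {x: y for x, y in train}                      # "training" = memorize everything

def predict(x, default="a"):
    return memory.get(x, default)  # unseen inputs fall back to a blind default

resub_error = sum(predict(x) != y for x, y in train) / len(train)
print(resub_error)  # 0.0 -- perfect on the training data, meaningless for test data
```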
d)
Area under the curve = 0.925226
- The elbow of the ROC curve is the threshold that gives the best trade-off
between TPR and FPR, so we use the curve to choose this threshold.
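One common way to locate that elbow (a sketch with assumed scores and labels, not the assignment's tree output) is to sweep the threshold and maximize Youden's J = TPR - FPR:

```python
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]  # assumed classifier scores
labels = [1,   1,   0,   1,   0,   0,   0]    # assumed true classes

def rates(th):
    # TPR and FPR when predicting class 1 for every score >= th
    tp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 0)
    return tp / labels.count(1), fp / labels.count(0)

# the threshold at the ROC "elbow": the largest gap between TPR and FPR
best = max(scores, key=lambda th: rates(th)[0] - rates(th)[1])
print(best)  # 0.6 -> TPR = 1.0, FPR = 0.25 on this toy data
```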
e)
         predicted
actual     0    1
     0    76   10
     1     6   58
Accuracy = 0.8933333
Resub. Error = 0.1066667
f)
g)
Features: (price), (income)
h)
AFTER
         predicted
actual     0    1
     0   314   26
     1    19  241
Accuracy = 0.8866667
Resub. Error = 0.1133333
BEFORE
         predicted
actual     0    1
     0    76   10
     1     6   58
Accuracy = 0.8933333
Resub. Error = 0.1066667