
BIG DATA

ASSIGNMENT #1
Submitted to: Eng. Eman Hossam

Team: #19
Submitted By:
Ahmed Mohamed Abd El-Hameed
Karem Mohamed El-Farra

KMEANS
d)

e) What can change each time you cluster the data? WHY?
ANSWER: The positions of the cluster centers change on each run, because the
initial centers are chosen at random, and that random initialization affects
the result of the clustering.
How do you prevent these changes from occurring?
ANSWER: By assigning specific initial centers, or by fixing the random seed,
e.g. with set.seed(145).
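A minimal R sketch of both options (assuming the data is already loaded in a
numeric data frame called df, which is an assumed name):

# Option 1: fix the random number generator so the randomly chosen
# initial centers are the same on every run
set.seed(145)
km1 <- kmeans(df, centers = 5)

# Option 2: pass a fixed set of initial centers instead of just a count
init_centers <- df[1:5, ]   # any 5 distinct rows chosen in advance
km2 <- kmeans(df, centers = init_centers)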

f)

From the plot:

Best K = 5

Reason: K = 5 is the elbow of the plot; beyond that point, adding more clusters
gives only a small further reduction in the within-cluster sum of squares.
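A hedged sketch of how such an elbow plot is typically produced in R (again
assuming the data frame df):

# Total within-cluster sum of squares for K = 1..10
wss <- sapply(1:10, function(k) {
  set.seed(145)
  kmeans(df, centers = k, nstart = 10)$tot.withinss
})
# The "elbow" of this curve (here K = 5) marks the best K
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")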

g) How has the clustering changed? Why?


ANSWER: The distances between the clusters decreased and the cluster assignments changed.
Reason: Applying the log function compresses the values, so the values (and the
distances between them) became small, and the clustering changed accordingly.
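For reference, a minimal sketch of that transformation (assuming all values in
df are positive):

# Log-transform the data, then cluster again; distances shrink because
# the log compresses large values
log_df <- log(df)          # use log(df + 1) instead if zeros are present
set.seed(145)
km_log <- kmeans(log_df, centers = 5)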

h)

NO.
From the plot: Best K = 5
Reason: K = 5 is still the elbow value of the plot.

i) Yes, there is an outlier.
K = 4

END OF KMEANS

ASSOCIATION RULES
b)
# item13 is the most frequent item.
# The largest transaction has 25 items.
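These two facts can be read off the transactions summary in the arules package;
a sketch, assuming the transactions object is called trans (an assumed name):

library(arules)
summary(trans)                        # reports the most frequent items and the
                                      # distribution of transaction sizes
itemFrequencyPlot(trans, topN = 10)   # bar plot of the 10 most frequent items
max(size(trans))                      # size of the largest transaction (25)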

c)

i - No. of rules for (support = 0.01 and confidence = 0.00) = 11524

ii - No. of rules for confidence = 50% = 1165

iii - As the confidence threshold increases, the number of rules that satisfy
it decreases.
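A sketch of the corresponding apriori calls (same assumed trans object;
parameter names follow the arules package):

# All rules with support >= 0.01 and no confidence restriction
rules_all <- apriori(trans, parameter = list(support = 0.01, confidence = 0.0))
length(rules_all)    # 11524

# Raising the confidence threshold to 50% keeps fewer rules
rules_50 <- apriori(trans, parameter = list(support = 0.01, confidence = 0.5))
length(rules_50)     # 1165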

d)
i - The interesting rules are the rules with the darkest shading (highest lift
values); they are positioned at the top left of the plot.
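Such a plot can be produced with the arulesViz package, for example:

library(arulesViz)
# Scatter plot: support on x, confidence on y, points shaded by lift;
# the darkest (highest-lift) rules appear at the top left
plot(rules_all, measure = c("support", "confidence"), shading = "lift")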

e)
i - Observation: no rules have both high support and high lift at the same time.

ii - The rules at the top left of the plot are more interesting, as they have
high lift (the darker-colored rules).

iii - These rules are located at the positions with high support and low lift:
Lhs   Rhs        Support   Confidence   Lift   Order
{}    {item5}    0.3699    0.3699
{}    {item13}   0.4948    0.4948

The expected observation from these itemsets is that the customer will buy
these items even without buying anything else, since the Lhs is empty {}.
The actual observation is that these rules carry no meaning; a rule with an
empty Lhs cannot be a real rule.
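Rules with an empty left-hand side can be isolated (or dropped) explicitly, e.g.:

# Meaningless rules of the form {} => {item}
inspect(rules_all[size(lhs(rules_all)) == 0])
# Keep only rules whose left-hand side contains at least one item
rules_nonempty <- rules_all[size(lhs(rules_all)) > 0]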
iv
Rule number   Lhs        Rhs        Support   Confidence   Lift        Order
3034          {item10}   {item13}   0.1492    0.4915980    0.9935287
3037          {item5}    {item30}   0.1276    0.3449581    1.0427996
3011          {item13}   {item92}   0.1290    0.2607114    0.9912981

f) i
Rule number   Lhs                        Rhs        Support   Confidence   Lift       Order
10348         {item30,item95,item96} => {item13}    0.0118    0.8027211    1.622314
3051          {item23,item5}          => {item13}   0.0105    0.8400000    1.697656
92            {item83}                => {item13}   0.0119    0.8439716    1.705682
3042          {item10,item44}         => {item13}   0.0101    0.8487395    1.715318
3111          {item82,item99}         => {item13}   0.0154    0.8555556    1.729094
162           {item23}                => {item13}   0.0292    0.8613569    1.740818
10444         {item3,item84,item95}   => {item13}   0.0108    0.8780488    1.774553
10264         {item5,item82,item99}   => {item13}   0.0134    0.8933333    1.805443
3048          {item20,item23}         => {item13}   0.0114    0.9120000    1.843169
10328         {item16,item25,item77}  => {item5}    0.0104    0.8062016    2.179512

ii - No, because these rules all have lift greater than 1.

g)
i
{item15,item30,item49} -> {item56} (lift: 16.58456)
{item15,item49} -> {item56} (lift: 14.88358)
{item49,item56} -> {item15} (lift: 9.153028)
{item30,item49,item56} -> {item15} (lift: 9.240199)

These rules have both high lift and high confidence, so when we raise the
confidence limit these rules remain valid.
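These are simply the top rules after sorting by lift, e.g. (a sketch, using the
assumed rules_50 object from part c):

# Highest-lift rules; their confidence is also high, so raising the
# confidence threshold does not remove them
inspect(head(sort(rules_50, by = "lift"), 5))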

ii - The darker blue rules are the rules with the highest confidence and the lowest lift.

h)
i
       Lhs                        Rhs       Support   Confidence   Lift
10251  {item15,item30,item56} => {item49}   0.0101    0.7709924    19.42046
10255  {item30,item56,item84} => {item49}   0.0100    0.7407407    18.65846
10250  {item15,item30,item49} => {item56}   0.0101    0.9619048    16.58456
In step 3 we had rules with low confidence, but here only rules with high
confidence appear, and some rules combine high lift with comparatively lower
confidence, so these rules differ from the rules in step 3.
ii-

iii - Observation: the low-support rules have high confidence.

Support and order have an inverse relationship: larger itemsets (higher order)
appear in fewer transactions, so their support is lower.

3. Naive Bayes
a)
A-priori probabilities:
Y
    10-50K      50-80K      GT 80K
0.80266371  0.12563818  0.07169811

Conditional probabilities:
         age
Y             20-30       31-45       GT 45
  10-50K  0.20796460  0.34457965  0.44745575
  50-80K  0.08303887  0.39752650  0.51943463
  GT 80K  0.06811146  0.34055728  0.59133127

         sex
Y
  10-50K  0.4798119  0.5201881
  50-80K  0.2871025  0.7128975
  GT 80K  0.2058824  0.7941176

         educ
Y            College      Others    Prof/Phd
  10-50K  0.24585177  0.73976770  0.01438053
  50-80K  0.49558304  0.44257951  0.06183746
  GT 80K  0.53869969  0.29566563  0.16563467
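These tables are the standard printout of a naive Bayes fit; a minimal sketch
with the e1071 package (assuming a training data frame named train with columns
income, age, sex and educ, which are assumed names):

library(e1071)
# Learns P(income) plus P(age | income), P(sex | income), P(educ | income)
nb_model <- naiveBayes(income ~ age + sex + educ, data = train)
nb_model    # prints the a-priori and conditional probability tables above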

b)

          predicted
actual      10-50K   50-80K   GT 80K
  10-50K       787
  50-80K       127
  GT 80K        67

Accuracy = 0.795
Overall misclassification = 0.205
10-50K misclassification rate = 0.007566204
50-80K misclassification rate = 1
GT 80K misclassification rate = 0.8933333
- Explain the variation in the model's predictive power across income classes:

[10-50K] has the highest a priori probability (0.80), so it has the lowest misclassification rate.
[50-80K] and [GT 80K] have small a priori probabilities of similar magnitude (about 0.13 and 0.07),
but [50-80K] does not have the highest conditional probability for any variable (sex, age, educ),
so the model never assigns an entry to class [50-80K] and its misclassification rate is 1.
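A sketch of how the confusion matrix and the per-class error rates can be
computed (assuming a test data frame named test):

pred <- predict(nb_model, test)
cm <- table(actual = test$income, predicted = pred)
cm
sum(diag(cm)) / sum(cm)        # overall accuracy
1 - sum(diag(cm)) / sum(cm)    # overall misclassification
1 - diag(cm) / rowSums(cm)     # per-class misclassification rates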

4. Naive Bayes
a)
a)
          predicted
actual        F      M
     F      106    321
     M       97    476

Accuracy = 0.582
Overall misclassification = 0.418

Male misclassification rate = 0.1692845
Female misclassification rate = 0.7517564

Explain the variation in the model's predictive power across the two classes:

Since the male class has a higher a priori probability than the female class,
it has the lower misclassification rate.

b)
A-priori probabilities:
Y
    F    M
  0.5  0.5

Conditional probabilities:
   age
Y        20-30      31-45      GT 45
  F  0.1865714  0.3411429  0.4722857
  M  0.1891429  0.3454286  0.4654286

   educ
Y      College     Others   Prof/Phd
  F  0.3228571  0.6531429  0.0240000
  M  0.2848571  0.6791429  0.0360000

   income
Y        10-50K      50-80K      GT 80K
  F  0.88171429  0.08314286  0.03514286
  M  0.74885714  0.15628571  0.09485714

c)
Accuracy = 0.53
Misclassification rate = 0.47
Male misclassification rate = 0.1358314
Female misclassification rate = 0.7190227
- The female class is predicted better than the male class.
- I think that is because both classes have the same a priori probability and
the female class has better conditional probabilities.
d) It has no effect on the accuracy or on the confusion matrix.

e) The classification of a specific class becomes better when its a priori
probability becomes higher.

Decision tree
a)

- FEATURES: (Age) (Income) (Price)
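A sketch of how such a tree might be fitted with rpart (the data frame train_dt
and the label column buy are assumed names, not taken from the assignment):

library(rpart)
# Classification tree on the three candidate features
tree_model <- rpart(buy ~ Age + Income + Price, data = train_dt,
                    method = "class")
plot(tree_model); text(tree_model)   # the splits show which features are used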

b)
Class 1 is predicted better, as it has the higher TPR:
predicted
actual 0 1
0 314 26
1 19 241
tpr.0 = 0.923529411764706
tpr.1 = 0.926923076923077

c)
Definition: the resubstitution error is the misclassification error rate
committed on the training records.
resub. = 0.075
- It is not a good measure of prediction performance because it is computed on
the training data; the test data should be used to judge how well the model predicts.

d)
Area under the curve = 0.925226
- The elbow of the ROC curve gives the best trade-off between TPR and FPR, so
we use the curve to choose this threshold.
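The ROC curve and the AUC can be obtained, for example, with the ROCR package
(a sketch; test_dt and buy are the same assumed names as above):

library(ROCR)
probs <- predict(tree_model, test_dt, type = "prob")[, 2]  # class-1 probabilities
pred_obj <- prediction(probs, test_dt$buy)
perf <- performance(pred_obj, "tpr", "fpr")
plot(perf)                                    # ROC curve
performance(pred_obj, "auc")@y.values[[1]]    # area under the curve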
e)
          predicted
actual        0      1
     0       76     10
     1        6     58

Accuracy = 0.8933333
Resub. Error = 0.1066667

f)

No difference except in the xerror and xstd values (the cross-validated error
and its standard deviation).

g)
Features: (Price), (Income)

- Why were certain variables not used?

Because those variables appeared only in branches that were pruned away; those
branches contributed too little relative to the complexity parameter, so they
were removed.
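Pruning is driven by the complexity parameter table; a sketch of how the pruned
tree is obtained and inspected:

printcp(tree_model)    # CP table with xerror and xstd for each subtree
# Prune at the CP value with the smallest cross-validated error
best_cp <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"]
pruned <- prune(tree_model, cp = best_cp)
unique(pruned$frame$var)   # variables still used (plus "<leaf>" for leaf nodes)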

h)
AFTER
          predicted
actual        1      2
     0      314     26
     1       19    241
Accuracy = 0.8866667
Resub. Error = 0.1133333

BEFORE
          predicted
actual        0      1
     0       76     10
     1        6     58
Accuracy = 0.8933333
Resub. Error = 0.1066667

Prediction of class 0 increases.

Prediction of class 0 decreases.
