
BIG DATA

ASSIGNMENT #1
Submitted to: Eng. Eman Hossam

Team: #19
Submitted By:
Ahmed Mohamed Abd El-Hameed
Karem Mohamed El-Farra

KMEANS
d)

e) What can change each time you cluster the data? WHY?
ANSWER: The positions of the cluster centers change on each run, because the
initial centers are chosen at random, and that random initialization affects
the result of the clustering.
How do you prevent these changes from occurring?
ANSWER: By assigning specific initial centers, or by fixing the random seed,
e.g. with set.seed(145).
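A minimal R sketch of both options (assuming the data is already loaded in a
numeric data frame called df, which is an assumed name):

# Option 1: fix the random number generator so the randomly chosen
# initial centers are the same on every run
set.seed(145)
km1 <- kmeans(df, centers = 5)

# Option 2: pass a fixed set of initial centers instead of just a count
init_centers <- df[1:5, ]   # any 5 distinct rows chosen in advance
km2 <- kmeans(df, centers = init_centers)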

f)

From the plot:

Best K = 5

Reason: K = 5 is the elbow of the plot; beyond that point, adding more clusters
gives only a small further reduction in the within-cluster sum of squares.
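A hedged sketch of how such an elbow plot is typically produced in R (again
assuming the data frame df):

# Total within-cluster sum of squares for K = 1..10
wss <- sapply(1:10, function(k) {
  set.seed(145)
  kmeans(df, centers = k, nstart = 10)$tot.withinss
})
# The "elbow" of this curve (here K = 5) marks the best K
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")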

g) How has the clustering changed? Why?


ANSWER: The distances between the clusters decreased and the cluster assignments changed.
Reason: Applying the log function compresses the values, so the values (and the
distances between them) became small, and the clustering changed accordingly.
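For reference, a minimal sketch of that transformation (assuming all values in
df are positive):

# Log-transform the data, then cluster again; distances shrink because
# the log compresses large values
log_df <- log(df)          # use log(df + 1) instead if zeros are present
set.seed(145)
km_log <- kmeans(log_df, centers = 5)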

h)

NO.
From the plot: Best K = 5
Reason: K = 5 is still the elbow value of the plot.

i) Yes, there is an outlier.
K = 4

END OF KMEANS

ASSOCIATION RULES
b)
# item13 is the most frequent item.
# The largest transaction has 25 items.
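These two facts can be read off the transactions summary in the arules package;
a sketch, assuming the transactions object is called trans (an assumed name):

library(arules)
summary(trans)                        # reports the most frequent items and the
                                      # distribution of transaction sizes
itemFrequencyPlot(trans, topN = 10)   # bar plot of the 10 most frequent items
max(size(trans))                      # size of the largest transaction (25)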

c)

i - No. of rules for (support = 0.01 and confidence = 0.00) = 11524

ii - No. of rules for confidence = 50% = 1165

iii - As the confidence threshold increases, the number of rules that satisfy
it decreases.
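A sketch of the corresponding apriori calls (same assumed trans object;
parameter names follow the arules package):

# All rules with support >= 0.01 and no confidence restriction
rules_all <- apriori(trans, parameter = list(support = 0.01, confidence = 0.0))
length(rules_all)    # 11524

# Raising the confidence threshold to 50% keeps fewer rules
rules_50 <- apriori(trans, parameter = list(support = 0.01, confidence = 0.5))
length(rules_50)     # 1165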

d)
i - The interesting rules are the rules with the darkest shading (highest lift
values); they are positioned at the top left of the plot.
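Such a plot can be produced with the arulesViz package, for example:

library(arulesViz)
# Scatter plot: support on x, confidence on y, points shaded by lift;
# the darkest (highest-lift) rules appear at the top left
plot(rules_all, measure = c("support", "confidence"), shading = "lift")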

e)
i - Observation: no rules have both high support and high lift at the same time.

ii - The rules at the top left of the plot are more interesting, as they have
high lift (the darker-colored rules).

iii - These rules are located at the positions with high support and low lift:
Lhs   Rhs        Support   Confidence   Lift   Order
{}    {item5}    0.3699    0.3699
{}    {item13}   0.4948    0.4948

The expected observation from these itemsets is that the customer will buy
these items even without buying anything else, since the Lhs is empty {}.
The actual observation is that these rules carry no meaning; a rule with an
empty Lhs cannot be a real rule.
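Rules with an empty left-hand side can be isolated (or dropped) explicitly, e.g.:

# Meaningless rules of the form {} => {item}
inspect(rules_all[size(lhs(rules_all)) == 0])
# Keep only rules whose left-hand side contains at least one item
rules_nonempty <- rules_all[size(lhs(rules_all)) > 0]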
iv
Rule number   Lhs        Rhs        Support   Confidence   Lift        Order
3034          {item10}   {item13}   0.1492    0.4915980    0.9935287
3037          {item5}    {item30}   0.1276    0.3449581    1.0427996
3011          {item13}   {item92}   0.1290    0.2607114    0.9912981

f) i
Rule number   Lhs                        Rhs        Support   Confidence   Lift       Order
10348         {item30,item95,item96} => {item13}    0.0118    0.8027211    1.622314
3051          {item23,item5}          => {item13}   0.0105    0.8400000    1.697656
92            {item83}                => {item13}   0.0119    0.8439716    1.705682
3042          {item10,item44}         => {item13}   0.0101    0.8487395    1.715318
3111          {item82,item99}         => {item13}   0.0154    0.8555556    1.729094
162           {item23}                => {item13}   0.0292    0.8613569    1.740818
10444         {item3,item84,item95}   => {item13}   0.0108    0.8780488    1.774553
10264         {item5,item82,item99}   => {item13}   0.0134    0.8933333    1.805443
3048          {item20,item23}         => {item13}   0.0114    0.9120000    1.843169
10328         {item16,item25,item77}  => {item5}    0.0104    0.8062016    2.179512

ii - No, because these rules all have lift greater than 1.

g)
i
{item15,item30,item49} -> {item56} (lift: 16.58456)
{item15,item49} -> {item56} (lift: 14.88358)
{item49,item56} -> {item15} (lift: 9.153028)
{item30,item49,item56} -> {item15} (lift: 9.240199)

These rules have both high lift and high confidence, so when we raise the
confidence limit these rules remain valid.
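These are simply the top rules after sorting by lift, e.g. (a sketch, using the
assumed rules_50 object from part c):

# Highest-lift rules; their confidence is also high, so raising the
# confidence threshold does not remove them
inspect(head(sort(rules_50, by = "lift"), 5))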

ii - The darker blue rules are the rules with the highest confidence and the lowest lift.

h)
i
       Lhs                        Rhs       Support   Confidence   Lift
10251  {item15,item30,item56} => {item49}   0.0101    0.7709924    19.42046
10255  {item30,item56,item84} => {item49}   0.0100    0.7407407    18.65846
10250  {item15,item30,item49} => {item56}   0.0101    0.9619048    16.58456
In step 3 we had rules with low confidence, but here only rules with high
confidence appear, and some rules combine high lift with comparatively lower
confidence, so these rules differ from the rules in step 3.
ii-

iii - Observation: the low-support rules have high confidence.

Support and order have an inverse relationship: larger itemsets (higher order)
appear in fewer transactions, so their support is lower.

3. Naive Bayes
a)
A-priori probabilities:
Y
    10-50K      50-80K      GT 80K
0.80266371  0.12563818  0.07169811

Conditional probabilities:
         age
Y             20-30       31-45       GT 45
  10-50K  0.20796460  0.34457965  0.44745575
  50-80K  0.08303887  0.39752650  0.51943463
  GT 80K  0.06811146  0.34055728  0.59133127

         sex
Y
  10-50K  0.4798119  0.5201881
  50-80K  0.2871025  0.7128975
  GT 80K  0.2058824  0.7941176

         educ
Y            College      Others    Prof/Phd
  10-50K  0.24585177  0.73976770  0.01438053
  50-80K  0.49558304  0.44257951  0.06183746
  GT 80K  0.53869969  0.29566563  0.16563467
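These tables are the standard printout of a naive Bayes fit; a minimal sketch
with the e1071 package (assuming a training data frame named train with columns
income, age, sex and educ, which are assumed names):

library(e1071)
# Learns P(income) plus P(age | income), P(sex | income), P(educ | income)
nb_model <- naiveBayes(income ~ age + sex + educ, data = train)
nb_model    # prints the a-priori and conditional probability tables above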

b)

          predicted
actual      10-50K   50-80K   GT 80K
  10-50K       787
  50-80K       127
  GT 80K        67

Accuracy = 0.795
Overall misclassification = 0.205
10-50K misclassification rate = 0.007566204
50-80K misclassification rate = 1
GT 80K misclassification rate = 0.8933333
- Explain the variation in the model's predictive power across income classes:

[10-50K] has the highest a priori probability (0.80), so it has the lowest misclassification rate.
[50-80K] and [GT 80K] have small a priori probabilities of similar magnitude (about 0.13 and 0.07),
but [50-80K] does not have the highest conditional probability for any variable (sex, age, educ),
so the model never assigns an entry to class [50-80K] and its misclassification rate is 1.
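A sketch of how the confusion matrix and the per-class error rates can be
computed (assuming a test data frame named test):

pred <- predict(nb_model, test)
cm <- table(actual = test$income, predicted = pred)
cm
sum(diag(cm)) / sum(cm)        # overall accuracy
1 - sum(diag(cm)) / sum(cm)    # overall misclassification
1 - diag(cm) / rowSums(cm)     # per-class misclassification rates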

4. Naive Bayes
a)
a)
          predicted
actual        F      M
     F      106    321
     M       97    476

Accuracy = 0.582
Overall misclassification = 0.418

Male misclassification rate = 0.1692845
Female misclassification rate = 0.7517564

Explain the variation in the model's predictive power across the two classes:

Since the male class has a higher a priori probability than the female class,
it has the lower misclassification rate.

b)
A-priori probabilities:
Y
    F    M
  0.5  0.5

Conditional probabilities:
   age
Y        20-30      31-45      GT 45
  F  0.1865714  0.3411429  0.4722857
  M  0.1891429  0.3454286  0.4654286

   educ
Y      College     Others   Prof/Phd
  F  0.3228571  0.6531429  0.0240000
  M  0.2848571  0.6791429  0.0360000

   income
Y        10-50K      50-80K      GT 80K
  F  0.88171429  0.08314286  0.03514286
  M  0.74885714  0.15628571  0.09485714

c)
Accuracy = 0.53
Misclassification rate = 0.47
Male misclassification rate = 0.1358314
Female misclassification rate = 0.7190227
- The female class is predicted better than the male class.
- I think that is because both classes have the same a priori probability and
the female class has better conditional probabilities.
d) It has no effect on the accuracy or on the confusion matrix.

e) The classification of a specific class becomes better when its a priori
probability becomes higher.

Decision tree
a)

- FEATURES: (Age) (Income) (Price)
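A sketch of how such a tree might be fitted with rpart (the data frame train_dt
and the label column buy are assumed names, not taken from the assignment):

library(rpart)
# Classification tree on the three candidate features
tree_model <- rpart(buy ~ Age + Income + Price, data = train_dt,
                    method = "class")
plot(tree_model); text(tree_model)   # the splits show which features are used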

b)
Class 1 is predicted better, as it has the higher TPR:
predicted
actual 0 1
0 314 26
1 19 241
tpr.0 = 0.923529411764706
tpr.1 = 0.926923076923077

c)
Definition: the resubstitution error is the misclassification error rate
committed on the training records.
resub. = 0.075
- It is not a good measure of prediction performance because it is computed on
the training data; the test data should be used to judge how well the model predicts.

d)
Area under the curve = 0.925226
- The elbow of the ROC curve gives the best trade-off between TPR and FPR, so
we use the curve to choose this threshold.
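The ROC curve and the AUC can be obtained, for example, with the ROCR package
(a sketch; test_dt and buy are the same assumed names as above):

library(ROCR)
probs <- predict(tree_model, test_dt, type = "prob")[, 2]  # class-1 probabilities
pred_obj <- prediction(probs, test_dt$buy)
perf <- performance(pred_obj, "tpr", "fpr")
plot(perf)                                    # ROC curve
performance(pred_obj, "auc")@y.values[[1]]    # area under the curve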
e)
          predicted
actual        0      1
     0       76     10
     1        6     58

Accuracy = 0.8933333
Resub. Error = 0.1066667

f)

No difference except in the xerror and xstd values (the cross-validated error
and its standard deviation).

g)
Features: (Price), (Income)

- Why were certain variables not used?

Because those variables appeared only in branches that were pruned away; those
branches contributed too little relative to the complexity parameter, so they
were removed.
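Pruning is driven by the complexity parameter table; a sketch of how the pruned
tree is obtained and inspected:

printcp(tree_model)    # CP table with xerror and xstd for each subtree
# Prune at the CP value with the smallest cross-validated error
best_cp <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"]
pruned <- prune(tree_model, cp = best_cp)
unique(pruned$frame$var)   # variables still used (plus "<leaf>" for leaf nodes)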

h)
AFTER
          predicted
actual        1      2
     0      314     26
     1       19    241
Accuracy = 0.8866667
Resub. Error = 0.1133333

BEFORE
          predicted
actual        0      1
     0       76     10
     1        6     58
Accuracy = 0.8933333
Resub. Error = 0.1066667

Prediction of class 0 increases.

Prediction of class 0 decreases.
