
ISHAN CHAWLA (0029031621) chawla7@purdue.edu

CS573 DATA MINING HW 1

(Q0)

(1) C
(2) B
(a) http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

____________________________________________________________________________

(Q1)

Before running logistic regression, since we are using regularisation we want all features to be
penalised proportionately, so I normalise each feature by subtracting the mean of that feature
and dividing by its standard deviation, and then use the normalised values in the regression
formulation. I also remove the 'time' field before processing.
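A minimal sketch of this preprocessing, assuming the data is loaded in a pandas DataFrame df with a 'time' column and a target column called 'label' (both names are assumptions for illustration):

import pandas as pd

# Assumed: df holds the training data with a 'time' column and a 'label' target.
X = df.drop(columns=['time', 'label'])
y = df['label']

# z-score each feature so the L2 penalty treats all features proportionately
X = (X - X.mean()) / X.std()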

(1)

While training we maximise

M = Σ_all examples [ t_c log y(φ_c) + (1 - t_c) log(1 - y(φ_c)) ]

As the number of examples in the negative class is much larger than in the positive class,
training learns parameters which classify the negative class correctly. So some of the positive
samples may be incorrectly classified as negative, and hence the accuracy for the negative class
is much better (99%) than that of the positive class (42%).
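One way to compute these per-class accuracies, assuming arrays y_test (the true labels) and y_pred (the model's predictions on the test set); both names are assumed:

import numpy as np

# Assumed: y_test holds the true labels, y_pred the fitted model's predictions.
for c in (0, 1):
    mask = (y_test == c)
    print('class', c, 'accuracy:', np.mean(y_pred[mask] == y_test[mask]))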

(2)

Two effective solutions are:

1. Resample the data so that you have a comparable population of both classes, or get
more data for the class which has fewer samples (a sketch of oversampling follows this list).
2. Change the objective function to downweight each class by its probability, so even if a
class occurs more often, the corresponding term still carries nearly equal weight to that of
the other class, and equal importance is given to positive and negative samples in
training.
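A minimal sketch of the first option, oversampling the minority class with replacement (X and y are assumed numpy arrays of features and binary labels, with class 1 the minority class):

import numpy as np

# Assumed: X is the feature matrix, y the binary label vector.
pos_idx = np.where(y == 1)[0]
neg_idx = np.where(y == 0)[0]

# oversample the minority class with replacement until the classes match
extra = np.random.choice(pos_idx, size=len(neg_idx) - len(pos_idx), replace=True)
idx = np.concatenate([neg_idx, pos_idx, extra])
X_balanced, y_balanced = X[idx], y[idx]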
(3)

Increasing or decreasing C does not largely change the accuracy for the negative class, but it
hugely affects the accuracy for the positive class (the one which has fewer samples).

Bayes' theorem says that

P(y | data) ∝ P(data | y) P(y)

The LHS is the posterior, which is proportional to the product of the likelihood and the prior.
To maximise P(y | data) there is a tradeoff between the likelihood and the prior.
Now, thinking of the P(y) term as a regulariser: for the class with a large amount of data, the
likelihood P(data | y) is high and almost entirely decides the posterior, so the strength of the
regulariser hardly matters. In contrast, for the class with fewer samples the likelihood is weak,
so the regulariser controls the MAP estimate, and changing C strongly affects that class.

(4)

To fix this issue, while training we maximise

M = Σ_all examples (1/p_c) [ t_c log y(φ_c) + (1 - t_c) log(1 - y(φ_c)) ]

Basically we calculate the probabilities of the two classes and maximise, in this case,

Σ_all examples with class 1 (1/p_1) t_1 log y(φ_1) + Σ_all examples with class 0 (1/p_0) (1 - t_0) log(1 - y(φ_0))

p_1 = total examples with class 1 / total examples

p_0 = total examples with class 0 / total examples
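As a concrete sketch of this reweighted objective (numpy; y_true and y_prob are assumed names for the label vector and the predicted probabilities y(φ)):

import numpy as np

# Assumed: y_true holds the binary labels, y_prob the predicted P(t = 1).
p1 = np.mean(y_true == 1)   # fraction of class-1 examples
p0 = np.mean(y_true == 0)   # fraction of class-0 examples

# each term of the log-likelihood is scaled by the inverse class probability
M = np.sum(y_true / p1 * np.log(y_prob)
           + (1 - y_true) / p0 * np.log(1 - y_prob))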

(5)

I simply calculated the class weight of each class as the inverse of its probability of occurring.

Specifically I used

pos_weight = total / pos
neg_weight = total / neg

And then used

from sklearn import linear_model

logreg = linear_model.LogisticRegression(penalty='l2', C=1e-8,
    class_weight={1: pos_weight, 0: neg_weight})

New accuracies: 97%, 76%

(6)

The original objective function for the SVM is as follows. We are minimising

L(w, b, a) = ½ ||w||² - Σ_n a_n ( t_n (wᵀφ(x_n) + b) - 1 )

Now, if examples of one class, for example the negative class, are more prevalent, then the
formulation will learn parameters driven mostly by the second term, i.e. parameters which
always correctly classify the negative class. We would instead modify this function as follows:

L(w, b, a) = ½ ||w||² - Σ_n a_n ( (t_n / p_n) (wᵀφ(x_n) + b) - 1 )

where p_n is the probability of the class of example n, calculated as
(samples of that class) / (total samples).
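scikit-learn's SVC exposes the same idea through per-class weights (it scales the per-class penalty C rather than rescaling t_n, but the rebalancing effect is the same). A minimal sketch under the same assumed X, y arrays:

from sklearn import svm

# Assumed: X, y as before; each class is weighted by its inverse probability.
weights = {1: len(y) / (y == 1).sum(), 0: len(y) / (y == 0).sum()}
clf = svm.SVC(kernel='linear', class_weight=weights)
clf.fit(X, y)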

____________________________________________________________________________

(Q2)

(1)

ε ~ N(0, λI)

P(y_i | x_i, β) = ?

As y_i = x_iᵀβ + ε, and given x_i and β the term x_iᵀβ is a constant,

we have P(y_i | x_i, β) = N(x_iᵀβ, λI), since c + N(μ, σ²) ~ N(μ + c, σ²).
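A quick numerical check of this shift property, with assumed toy values for x_i, β, and λ:

import numpy as np

# Assumed toy values: x_i = 2.0, beta = 3.0, lam = 0.5 (the noise variance).
rng = np.random.default_rng(0)
eps = rng.normal(0.0, np.sqrt(0.5), size=100_000)  # eps ~ N(0, lam)
y_i = 2.0 * 3.0 + eps                              # y_i = x_i * beta + eps
print(y_i.mean(), y_i.var())  # ~6.0 and ~0.5: the mean shifts, the variance does not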

(2)

If β is known to be normally distributed, we can use this knowledge as a prior over β.

MAP estimate of β = argmax_β P({y_i, x_i}_1^n | β) P(β)

Assuming each (y_i, x_i) pair is independent of the others, we can write

MAP estimate of β = argmax_β ∏ P(y_i, x_i | β) P(β)

= argmax_β ∏ P(y_i | β, x_i) P(β) P(x_i)

= argmax_β ∏ P(y_i | β, x_i) P(β)   [assuming the x_i are uniformly sampled]

Now ∏ P(y_i | β, x_i) = ∏ N(x_iᵀβ, λI) ∝ λ⁻ⁿ exp( -(1/(2λ²)) (y - Xᵀβ)ᵀ(y - Xᵀβ) )

So our MAP estimate = argmax_β λ⁻ⁿ exp( -(1/(2λ²)) (y - Xᵀβ)ᵀ(y - Xᵀβ) ) · σ⁻¹ exp( -βᵀβ/(2σ²) )

= argmax_β λ⁻ⁿ σ⁻¹ exp( -βᵀβ/(2σ²) - (1/(2λ²)) (y - Xᵀβ)ᵀ(y - Xᵀβ) )

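Maximising this expression is equivalent to minimising (1/(2λ²))(y - Xᵀβ)ᵀ(y - Xᵀβ) + βᵀβ/(2σ²), which is ridge regression. A minimal sketch of its closed-form solution (numpy; X is assumed to be the d×n matrix with the x_i as columns, and λ, σ assumed known):

import numpy as np

# Assumed: X is d x n (columns are the examples x_i), y is length n,
# lam is the noise scale and sigma the prior scale from the derivation.
def map_beta(X, y, lam, sigma):
    d = X.shape[0]
    # closed form of the regularised least-squares problem:
    # (X X^T + (lam^2 / sigma^2) I) beta = X y
    return np.linalg.solve(X @ X.T + (lam**2 / sigma**2) * np.eye(d), X @ y)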
____________________________________________________________________________

(Q3)

At the decision boundary the Euclidean distance from μ_+ and from μ_- is the same.

y = ||x - μ_+||² - ||x - μ_-||² = 0 at the decision boundary

Expanding the squares, y = (||μ_+||² - ||μ_-||²) + 2(μ_- - μ_+)ᵀx

So we have

wᵀ = 2(μ_- - μ_+)ᵀ => w = 2(μ_- - μ_+)

b = ||μ_+||² - ||μ_-||²
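A quick numerical check that this w and b reproduce the distance difference (the two means and the test point are assumed toy values):

import numpy as np

mu_pos = np.array([1.0, 2.0])   # assumed class means
mu_neg = np.array([3.0, 0.0])

w = 2 * (mu_neg - mu_pos)
b = mu_pos @ mu_pos - mu_neg @ mu_neg

x = np.array([0.5, -1.0])        # any test point
dist_diff = np.sum((x - mu_pos)**2) - np.sum((x - mu_neg)**2)
print(np.isclose(w @ x + b, dist_diff))  # True: w^T x + b equals the distance difference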

____________________________________________________________________________
