
IV. Statistical Methods
Statistics is the science of collecting and organizing data and drawing conclusions from
data sets. The organization and description of the general characteristics of data sets is the
subject area of descriptive statistics. How to draw conclusions from data is the subject of
statistical inference. In this chapter, the emphasis is on basic principles of statistical
inference; other related topics will be described briefly, only enough to understand the basic
concepts.
Mean and Median
Two measures of central tendency are the median and the mean, illustrated by the following example. Suppose we have a sample of five measurements of a woman's body weight: 50 kg, 50 kg, 65 kg, 70 kg, and 75 kg.
To find the median, we first sort the data from smallest to largest. If the number of data points is odd, the median is the middle value. If the number of data points is even, the median is the average of the two middle values. For the sample above, the median is 65 kg.
The mean of a sample or a population is computed by adding all the data values and then dividing by the number of values. For the sample above, the mean = (50 + 50 + 65 + 70 + 75) / 5 = 310 / 5 = 62 kg. The equations for computing the mean are:
Mean of a population = μ = ΣX / N    or    Mean of a sample = x̄ = Σx / n
where ΣX is the sum of all values in the population, N is the number of values in the population, Σx is the sum of all values in the sample, and n is the number of values in the sample.
The mean of a population is denoted by μ, while the mean of a sample is denoted by the symbol x̄.
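As a quick check of the arithmetic above, both measures can be computed with Python's standard statistics module (a minimal sketch; the five weight values are the sample from the example):

```python
import statistics

# The five body-weight measurements (kg) from the example
weights = [50, 50, 65, 70, 75]

# Median: the middle value of the sorted data (odd number of points)
median = statistics.median(weights)

# Mean: the sum of all values divided by the number of values
mean = statistics.mean(weights)

print(median, mean)  # 65 and 62, as computed by hand above
```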
In a population, the variance is given by the following equation:
σ² = Σ (Xᵢ − μ)² / N
where σ² is the variance of the population, μ is the mean of the population, Xᵢ is the i-th element of the population, and N is the number of elements in the population.
The variance of a sample is defined as:
s² = Σ (xᵢ − x̄)² / (n − 1)
where s² is the variance of the sample, x̄ is the mean of the sample, xᵢ is the i-th element of the sample, and n is the number of elements in the sample.
The standard deviation is the square root of the variance.
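Both variance formulas can be sketched directly in Python, reusing the weight sample from the mean/median example; note the different denominators (N for the population formula, n − 1 for the sample formula):

```python
import statistics

data = [50, 50, 65, 70, 75]
n = len(data)
x_bar = sum(data) / n   # mean = 62

# Sample variance: s^2 = sum((x_i - x_bar)^2) / (n - 1)
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Population variance: sigma^2 = sum((X_i - mu)^2) / N
sigma2 = sum((x - x_bar) ** 2 for x in data) / n

# Standard deviation: the square root of the variance
s = s2 ** 0.5

# The standard library agrees with the hand-rolled formulas
assert s2 == statistics.variance(data)       # sample variance
assert sigma2 == statistics.pvariance(data)  # population variance
print(s2, sigma2, s)
```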
Boxplots
A boxplot is a type of graph used to display the pattern of a set of quantitative data.
(Figure: example boxplot)
A boxplot divides the data set into quartiles. It consists of a box that runs from the first quartile (Q1) to the third quartile (Q3).
If the data contain one or more outliers, those outliers are plotted separately. In the boxplot above there are two outliers below Q1 and three outliers above Q3, and the median is 400.
How do we interpret a boxplot?
A boxplot displays two measures of variability, or spread, in the data: the range and the interquartile range (IQR).
Range. If the spread of the data needs to be shown, the boxplot can display the range from the lowest to the highest value, including any outliers. In the boxplot above, for example, the data run from −700 (the smallest outlier) to 1700 (the largest outlier), so the range = 2400. If the outliers are ignored, the range in the boxplot above is 1000.
Interquartile range (IQR). The interquartile range covers the middle 50% of the data. In a boxplot, it is represented by the width of the box (Q3 − Q1). In the figure above, the interquartile range = 600 − 300 = 300.
A boxplot also gives information about the shape of the distribution of the data in the data set, as in the following example.
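The quantities a boxplot summarizes can be computed with the standard library. The data below are illustrative (not taken from the figure above), and the common 1.5 × IQR rule is used to flag outliers; note that different quartile conventions can give slightly different Q1 and Q3 values:

```python
import statistics

# Illustrative quantitative data set (not from the figure above)
data = [3, 5, 7, 8, 9, 11, 13, 14, 15, 40]

# Quartiles split the sorted data into four equal parts;
# statistics.quantiles(..., n=4) returns [Q1, median, Q3]
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Interquartile range: the width of the box, covering the middle 50%
iqr = q3 - q1

# 1.5 * IQR rule: points beyond the fences are plotted as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3, iqr, outliers)  # the value 40 is flagged as an outlier
```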
Comprehension Check
Problem 1
Consider the boxplot below.
Which of the following statements is true?
I. The distribution is skewed right.
II. The interquartile range is about 8.
III. The median is about 10.
(A) I only
(B) II only
(C) III only
(D) I and III
(E) II and III
Answer
The correct answer is (B). Most of the data are on the right side of the plot, so the distribution is skewed left. The interquartile range is indicated by the length of the box: 18 − 10 = 8. And the median is indicated by the vertical line near the middle of the box, at about 15.
4.1. Naïve Bayes Classifier
The Bayes Theorem represents a theoretical background for a statistical approach to
inductive-inferencing classification problems. We will first explain the basic concepts
defined in the Bayes Theorem and then use this theorem in the explanation of the Naïve
Bayesian Classification Process, or the Simple Bayesian Classifier.
Let X be a data sample whose class label is unknown. Let H be some hypothesis, such
that the data sample X belongs to a specific class C. We want to determine P(H/X), the
probability that the hypothesis H holds given the observed data sample X. P(H/X) is the
posterior probability, representing our confidence in the hypothesis after X is given. In
contrast, P(H) is the prior probability of H for any sample, regardless of how the data in the
sample look. The posterior probability P(H/X) is based on more information than the prior
probability P(H). The Bayes Theorem provides a way of calculating the posterior
probability P(H/X) using the probabilities P(H), P(X), and P(X/H). The basic relation is:
P(H/X) = [P(X/H) · P(H)] / P(X)
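As a numeric illustration of this relation, the three probabilities below are invented for the sketch (they do not come from the text):

```python
# Bayes Theorem: P(H/X) = [P(X/H) * P(H)] / P(X)
p_h = 0.3          # P(H): prior probability of the hypothesis (assumed value)
p_x_given_h = 0.8  # P(X/H): probability of observing X when H holds (assumed)
p_x = 0.5          # P(X): overall probability of observing X (assumed)

# Posterior probability: our confidence in H after observing X
p_h_given_x = (p_x_given_h * p_h) / p_x
print(p_h_given_x)
```

For these assumed numbers, observing X raises the probability of H from the prior 0.3 to a posterior of 0.48.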
Suppose now that there is a set of m samples S = {S1, S2, ..., Sm} (the training data set),
where every sample Si is represented as an n-dimensional vector {x1, x2, ..., xn}. The values
xi correspond to the attributes A1, A2, ..., An, respectively. Also, there are k classes C1, C2, ..., Ck,
and every sample belongs to one of these classes. Given an additional data sample X (whose class
is unknown), it is possible to predict the class for X as the one with the highest conditional probability
P(Ci/X), where i = 1, 2, ..., k. That is the basic idea of the Naïve Bayesian Classifier. These
probabilities are computed using the Bayes Theorem:
P(Ci/X) = [P(X/Ci) · P(Ci)] / P(X)
As P(X) is constant for all classes, only the product P(X/Ci) · P(Ci) needs to be
maximized. We compute the prior probabilities of the classes as:
P(Ci) = (number of training samples of class Ci) / m, where m is the total number of training samples.
Because the computation of P(X/Ci) is extremely complex, especially for large data sets,
the naïve assumption of conditional independence between attributes is made. Using this
assumption, we can express P(X/Ci) as a product:
P(X/Ci) = Π (t = 1 to n) P(xt/Ci)
where xt are the values of the attributes in the sample X. The probabilities P(xt/Ci) can be
estimated from the training data set.
Table 5.1. Training data set for a classification using the Naïve Bayesian Classifier

Sample   Attribute1 (A1)   Attribute2 (A2)   Attribute3 (A3)   Class (C)
1        1                 2                 1                 1
2        0                 0                 1                 1
3        2                 1                 2                 2
4        1                 2                 1                 2
5        0                 1                 2                 1
6        2                 2                 2                 2
7        1                 0                 1                 1
Given a training data set of seven four-dimensional samples (Table 5.1), it is necessary to
predict the classification of the new sample X = {1, 2, 2, class = ?}. For each sample, A1, A2, and A3
are input dimensions and C is the output classification.
In our example, we need to maximize the product P(X/Ci) · P(Ci) for i = 1, 2 because there
are only two classes. First, we compute the prior probabilities P(Ci) of the classes:
P(C1) = 4/7 = 0.5714
P(C2) = 3/7 = 0.4286
Second, we compute the conditional probabilities P(xt/Ci) for every attribute value given in
the new sample X = {1, 2, 2, C = ?} (or, more precisely, X = {A1 = 1, A2 = 2, A3 = 2, C = ?}) using the
training data set:
P(A1 = 1/C1) = 2/4 = 0.50
P(A1 = 1/C2) = 1/3 = 0.33
P(A2 = 2/C1) = 1/4 = 0.25
P(A2 = 2/C2) = 2/3 = 0.66
P(A3 = 2/C1) = 1/4 = 0.25
P(A3 = 2/C2) = 2/3 = 0.66
Under the assumption of conditional independence of attributes, the conditional probabilities
P(X/Ci) will be:
P(X/C1) = P(A1 = 1/C1) · P(A2 = 2/C1) · P(A3 = 2/C1) = 0.50 · 0.25 · 0.25 = 0.03125
P(X/C2) = P(A1 = 1/C2) · P(A2 = 2/C2) · P(A3 = 2/C2) = 0.33 · 0.66 · 0.66 = 0.14375
Finally, multiplying these conditional probabilities by the corresponding prior probabilities,
we obtain values proportional to P(Ci/X) and find their maximum:
P(C1/X) ∝ P(X/C1) · P(C1) = 0.03125 · 0.5714 = 0.0179
P(C2/X) ∝ P(X/C2) · P(C2) = 0.14375 · 0.4286 = 0.0616
Max {P(C1/X), P(C2/X)} = Max {0.0179, 0.0616} = 0.0616
Based on these two values, which are the final results of the Naïve Bayes Classifier,
we can predict that the new sample X belongs to the class C2.
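The whole worked example can be reproduced by a short script. This sketch hard-codes the training set from Table 5.1 and estimates every probability by simple counting (no smoothing); because it keeps exact fractions instead of the rounded values 0.33 and 0.66, the score for C2 comes out as 0.0635 rather than 0.0616, but the predicted class is the same:

```python
from collections import Counter

# Training samples from Table 5.1: ((A1, A2, A3), class C)
samples = [
    ((1, 2, 1), 1), ((0, 0, 1), 1), ((2, 1, 2), 2), ((1, 2, 1), 2),
    ((0, 1, 2), 1), ((2, 2, 2), 2), ((1, 0, 1), 1),
]
m = len(samples)
class_counts = Counter(c for _, c in samples)   # {1: 4, 2: 3}

def naive_bayes_scores(x):
    """Return P(X/Ci) * P(Ci) for each class Ci."""
    scores = {}
    for c, count in class_counts.items():
        prior = count / m                        # P(Ci)
        likelihood = 1.0
        for t, value in enumerate(x):            # product of P(xt/Ci)
            matches = sum(1 for attrs, cls in samples
                          if cls == c and attrs[t] == value)
            likelihood *= matches / count
        scores[c] = likelihood * prior
    return scores

scores = naive_bayes_scores((1, 2, 2))           # new sample X = {1, 2, 2}
predicted = max(scores, key=scores.get)
print(scores, predicted)   # class 2 wins, as in the text
```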
4.2. Predictive Regression
The prediction of continuous values can be modeled by a statistical technique called
regression. The objective of regression analysis is to determine the best model that can relate
the output variable to various input variables.
Linear regression with one input variable is the simplest form of regression. It models a
random variable Y (called a response variable) as a linear function of another random
variable X (called a predictor variable). Given n samples or data points of the form (x1, y1),
(x2, y2), ..., (xn, yn), where xi ∈ X and yi ∈ Y, linear regression can be expressed as:
Y = α + β · X
Using standard relations for the mean values, the regression coefficients for this simple case
of optimization are:
β = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
α = ȳ − β · x̄
For example, if the sample data set is given in the form of a table, and we are analyzing the
linear regression between two variables (predictor variable A and response variable B), then
the linear regression can be expressed as:
B = α + β · A
where the coefficients α and β can be calculated based on the previous formulas (using the
mean of A = 5 and the mean of B = 6), and they have the values:
α = 0.8
β = 1.04
The optimal regression line is:
B = 0.8 + 1.04 · A
Additional preprocessing steps may estimate the quality of the linear-regression model.
Correlation analysis attempts to measure the strength of the relationship between two variables
(in our case this relationship is expressed through the linear regression equation). One
parameter that shows this strength of linear association between two variables by means of
a single number is called the correlation coefficient r:
r = S_xy / √(S_xx · S_yy)
where:
S_xx = Σ (xi − x̄)²
S_yy = Σ (yi − ȳ)²
S_xy = Σ (xi − x̄)(yi − ȳ)
The value of r is between −1 and 1. Negative values of r correspond to regression lines
with negative slopes, and a positive r shows a positive slope. We must be very careful in
interpreting the r value. For example, values of r equal to 0.3 and 0.6 only mean that we
have two positive correlations, the second somewhat stronger than the first. It is wrong to
conclude that r = 0.6 indicates a linear relationship twice as strong as that indicated by the
value r = 0.3.
For our example of linear regression given at the beginning of this section, the model
obtained was B = 0.8 + 1.04 · A. We may estimate the quality of the model using the
correlation coefficient r as a measure. Based on the available data in Figure 4.3, we obtain the
intermediate results:
S_AA = 62
S_BB = 60
S_AB = 52
and the final correlation coefficient:
r = 52 / √(62 · 60) = 0.85
A correlation coefficient of r = 0.85 indicates a good linear relationship between the two
variables. Additional interpretation is possible: because r² = 0.72, we can say that
approximately 72% of the variation in the values of B is accounted for by a linear
relationship with A.
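The regression and correlation formulas of this section can be collected into one short sketch. The five (A, B) pairs below are invented for illustration; they are not the data behind Figure 4.3:

```python
import math

# Hypothetical paired observations: predictor A, response B
A = [1, 2, 3, 4, 5]
B = [2, 4, 5, 4, 5]

n = len(A)
mean_a = sum(A) / n   # 3.0
mean_b = sum(B) / n   # 4.0

# Sums of squared deviations and cross-deviations
s_aa = sum((a - mean_a) ** 2 for a in A)
s_bb = sum((b - mean_b) ** 2 for b in B)
s_ab = sum((a - mean_a) * (b - mean_b) for a, b in zip(A, B))

# Regression coefficients: beta = S_AB / S_AA, alpha = mean_B - beta * mean_A
beta = s_ab / s_aa
alpha = mean_b - beta * mean_a

# Correlation coefficient: r = S_AB / sqrt(S_AA * S_BB)
r = s_ab / math.sqrt(s_aa * s_bb)

print(alpha, beta, r)   # regression line: B = alpha + beta * A
```

For these numbers the fitted line is roughly B = 2.2 + 0.6 · A with r ≈ 0.77, a moderately strong positive association.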
4.3. Exercise
1. A data set for analysis includes only one attribute X:
X = {7, 12, 5, 18, 5, 9, 13, 12, 19, 7, 12, 12, 13, 3, 4, 5, 13, 8, 6, 6}
a) What is the mean of the data set X?
b) What is the median?
c) What is the mode of the data set X?
d) Find the standard deviation for X.
e) Give a graphical summarization of the data set X using a boxplot representation.
f) Find the outliers in the data set X. Discuss the results.
2. For the training set given in the previous table, predict the classification of the following
samples using the Simple Bayesian Classifier:
a. {2, 1, 1}
b. {0, 1, 1}
3. Given a data set with two dimensions, X and Y:

X   Y
1   5
4   2.75
3   3
5   2.5

a. Use the linear regression method to calculate the parameters α and β, where y = α + β · x.
b. Estimate the quality of the model obtained in a) using the correlation coefficient r.
