
Commonly used terms in Statistics and choosing statistical analysis
A. Dhandapani
The difficulty many face in understanding statistics is largely due to misunderstanding the terms used in statistics. For example, in everyday English, significant implies something sufficiently great or important to be worth noting. Statistical significance, on the other hand, tells us how likely it is that the observed results arose merely through chance. This hand-out provides a brief explanation of the most commonly used terms in statistics.
The origin of the word statistics implies that it is the study of human populations within a political unit (State). However, Statistics can be regarded (i) as the study of populations, (ii) as the study of variation and (iii) as methods for the reduction of data (Fisher, 1954, Statistical Methods for Research Workers). When we collect information on assets owned from 100,000 households in a state, we are not interested in individual households; we are interested in the assets owned by the households. So we consider them as a population of assets rather than a population of households. Thus, in statistics, we study a Population, or aggregate of individuals, whether living beings or materials/things. The term Population in statistics is not restricted to living beings or materials; even measurements that can be taken indefinitely can be regarded as a population. The number of units in the population is called the Population Size. The population size may be finite, such as the number of households, or infinite, such as the number of vehicles arriving at a particular traffic signal every minute. Even when the population size is finite, it may be treated as infinite, such as the number of fish in a lake. Also, when the population is finite, another aspect worth noting is whether the individual units can be identified.
When there is no variability among the units of the population, there is no need to study the entire population; one can study just one individual from the population. However, variability is everywhere, and we need to study these variations. A measurement taken on the individuals of the population is called a Variate or a Variable. The values taken by a variable can be binary, such as the gender of a child born (Male/Female); discrete, such as the number of children in a household (0, 1, 2, ...); or continuous, such as the height of children.
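As a small illustration in R, the three kinds of variables can be represented as follows; the variable names and values here are purely hypothetical.

```r
# Hypothetical data illustrating the three types of variables
gender   <- factor(c("Male", "Female", "Female", "Male"))  # binary / nominal
children <- c(0L, 2L, 1L, 3L)                               # discrete (counts)
height   <- c(112.5, 98.3, 104.0, 120.7)                    # continuous (cm)

str(gender)    # factor with 2 levels
str(children)  # integer vector
str(height)    # numeric vector
```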
While studying a population, we are usually not interested in the individual units of the population, particularly when there is variability and the size of the population is large or infinite. The interest is to describe the population or to identify relationships between the population units. The population is described through some constants; these constants are called Population Parameters. Some population parameters are widely known, such as the Arithmetic Mean (or simply mean), the Median, etc. Note that, in general, the population parameters are unknown.
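As a toy illustration with made-up numbers, the mean and median of a set of values can be computed in R as follows:

```r
# Hypothetical values for a small population
assets <- c(2.1, 3.5, 1.8, 4.2, 2.9)

mean(assets)    # arithmetic mean
median(assets)  # median
```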
In order to find the unknown population parameters of the variable(s), the values for the individual units of the population must be collected. If the population is large or infinite, it is clearly not possible to obtain the values for all the units in the population. In order to estimate the population parameters, we resort to selecting a few individuals/measurements of the population and observing their values. To apply various statistical tools, it is necessary that the subset of the population selected/observed is chosen at random; this subset is called a Sample. Thus, a sample is a subset of population units on which measurements will be collected. There are various ways by which one can select a sample from a population. If the population units are not identifiable (or the population is very large/infinite), we can select units using Simple Random Sampling, or simply random sampling. In simple random sampling, the probability of selecting a unit from the population is the same for all units. The number of units selected for observation in a sample is called the Sample Size. While selecting a sample, particularly when the number of units in the population is finite and identifiable, it is possible to employ various procedures in which the probabilities of selecting the units may be unequal. Also, in finite population sampling, two methods of sampling are available, namely with replacement (a single unit may be selected more than once) and without replacement (a single unit may be selected at most once).
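A small sketch of simple random sampling in R, using a hypothetical population of 1,000 identifiable unit labels:

```r
# Hypothetical population of 1,000 identifiable units
population <- 1:1000

set.seed(1)
# Simple random sampling without replacement: each unit appears at most once
srs_wor <- sample(population, size = 50, replace = FALSE)

# Simple random sampling with replacement: a unit may be drawn more than once
srs_wr <- sample(population, size = 50, replace = TRUE)
```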
The values collected/measured from the sampled units contain information about the population. These values, commonly referred to as Data, are used to estimate the population parameters. In the case of infinite populations, the variation observed in the data can be modelled using known forms of probability distributions. Some well-known probability distributions are the Binomial, Poisson, Normal, Gamma, Exponential, etc. The functional forms of these distributions depend on one or more constants, which are called Parameters. Many of these parameters are again related to the Mean, Variance, etc. For example, the Normal distribution depends on two parameters, which are nothing but its mean and variance.
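For example, in R a Normal distribution is specified by its mean and standard deviation (the square root of the variance); the parameter values below are arbitrary:

```r
# Density of a Normal distribution with (hypothetical) mean 10 and variance 4
dnorm(x = 9.5, mean = 10, sd = sqrt(4))

# Probability that a Binomial variable with n = 20 trials and p = 0.3 equals 5
dbinom(5, size = 20, prob = 0.3)
```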
Population parameters are estimated using some function of the sample values, or data. These functions are called Estimators. Various estimators are possible, but estimators with certain properties are preferred, such as Unbiasedness (roughly, the estimator equals the true population parameter value on average), Consistency (the estimator tends to the true value as the sample size increases to infinity) and Efficiency (the estimator has smaller variance than other possible estimators).
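A rough simulation sketch in R, with arbitrary parameter values, of the sample mean as an estimator of the population mean:

```r
set.seed(123)
# Hypothetical population: Normal with mean 10 and standard deviation 2
true_mean <- 10

# Draw 5,000 samples of size 30 and compute the sample mean of each
sample_means <- replicate(5000, mean(rnorm(30, mean = true_mean, sd = 2)))

mean(sample_means)  # close to 10, illustrating unbiasedness on average
var(sample_means)   # sampling variance of the estimator
```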
There are various statistical tools developed over the years by statisticians to analyse different types of data. Initially, many of these tools were based on three assumptions, namely (i) independence, (ii) identical distribution and (iii) normality. The assumption of independent observations is satisfied when the units are selected randomly and the selection of one unit does not depend on the other units. Some situations which violate this assumption are (i) when the sampling is done without replacement (the selection of the second and subsequent units depends on the units already selected) and (ii) when measurements are taken on the same individual/household/unit of the population (usually over time). The second assumption implies that the random process is the same, i.e. all units in the population follow the same distribution. Again, this assumption may not hold good; in particular, the variance may not be the same for all units. The first two assumptions together are usually called i.i.d. (independently and identically distributed). The third assumption is that the data are generated from a normal distribution. The normality assumption may be justified in many situations, but sometimes it may not hold good.
It is imperative for anyone who wishes to apply any statistical tool to know the assumptions under which the tool has been developed and the consequences when those assumptions are not met. Tools are now available with which one can check whether the assumptions are reasonably justified by the data, and there are also newer methods in which some of these assumptions can be relaxed.
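As an illustration of checking the normality assumption in R (with simulated data standing in for real observations):

```r
set.seed(7)
y <- rnorm(40, mean = 5, sd = 1.5)  # simulated data in place of real observations

# Graphical check: points should lie close to the reference line
qqnorm(y)
qqline(y)

# Formal check: Shapiro-Wilk test of normality
shapiro.test(y)
```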
Choosing Statistical Analysis
The following tables provide information on how to choose an appropriate statistical analysis and the SAS procedures/R functions to use. Note that many statistical analyses require several assumptions to be satisfied. One should look up the assumptions of each analysis and validate those assumptions before using it. This material has been adapted from these web pages: http://bama.ua.edu/~jleeper/627/choosestat.html & http://www.ats.ucla.edu/stat/mult_pkg/whatstat/
Population & No. of variables | Type of Variables | Analysis | SAS Procedure | R Function (Package)
One Population, one variable | Ratio & Interval | Single sample t-test (mean) | TTEST | t.test
One Population, one variable | Ordinal & Interval | Single sample Median test | UNIVARIATE | wilcox.test
One Population, one variable | Nominal | Chi-Square Goodness of fit | FREQ | chisq.test
One Population, two groups (independent) | Ratio & Interval | Two sample t-test (mean) | TTEST | t.test
One Population, two groups (independent) | Ordinal & Interval | Wilcoxon Mann-Whitney test | UNIVARIATE | wilcox.test
One Population, two groups (independent) | Nominal | Chi-Square test | FREQ | chisq.test
One Population, two groups (dependent or matched) | Ratio & Interval | Paired t-test | TTEST | t.test
One Population, two groups (dependent or matched) | Ordinal & Interval | Wilcoxon signed rank test | UNIVARIATE | wilcox.test
One Population, two groups (dependent or matched) | Nominal | McNemar test | FREQ | mcnemar.test
One Population, more than 2 groups | Ratio & Interval | One way ANOVA | GLM (or ANOVA) | aov
One Population, more than 2 groups | Ordinal & Interval | Kruskal Wallis test | NPAR1WAY | kruskal.test
One Population, more than 2 groups | Nominal | Chi-Square test | FREQ | chisq.test
One Population, more than 2 groups (dependent / matched) | Ratio & Interval | One way repeated measures ANOVA | GLM | lm & Anova (requires car & foreign packages)
One Population, more than 2 groups (dependent / matched) | Ordinal & Interval | Friedman test | FREQ | friedman.test
One Population, more than 2 groups (dependent / matched) | Nominal | Repeated measures Logistic Regression | GENMOD | glmer (requires lme4)
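As a brief sketch of how some of the R functions in the table above are called (using simulated data in place of real measurements):

```r
set.seed(42)
group1 <- rnorm(15, mean = 10, sd = 2)  # simulated measurements, group 1
group2 <- rnorm(15, mean = 11, sd = 2)  # simulated measurements, group 2

t.test(group1, group2)                  # two sample t-test (means)
wilcox.test(group1, group2)             # Wilcoxon Mann-Whitney test
t.test(group1, group2, paired = TRUE)   # paired t-test (if the groups were matched)

# Chi-square test on a hypothetical 2 x 2 table of counts
counts <- matrix(c(20, 15, 10, 25), nrow = 2)
chisq.test(counts)
```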

Population & No. of variables | Type of Variables | Analysis | SAS Procedure | R Function (Package)
One dependent & two or more independent variables | Dependent: Ratio or Interval; Independent: ordinal/nominal | Factorial ANOVA | GLM | anova
One dependent & two or more independent variables | Dependent: Ratio or Interval; Independent: ordinal/nominal & ratio/interval | Analysis of Covariance / Multiple Regression | GLM, REG | lm, aov
One dependent & two or more independent variables | Dependent: ordinal or nominal; Independent: ordinal/nominal/ratio/interval | Logistic Regression | LOGISTIC | glm
One dependent and one independent variable | Ratio/Interval | Correlation, Simple Regression | CORR | cor
One dependent and one independent variable | Ordinal/Interval | Nonparametric Correlation | CORR | cor
One dependent and one independent variable | Nominal | Simple logistic regression | LOGISTIC | glm
2 or more related variables | Ratio/Interval | Factor Analysis | FACTOR | fa (requires psych)