
Commonly used terms in Statistics and choosing statistical analysis
A. Dhandapani
The difficulty many face in understanding statistics is largely due to misunderstanding the terms used in statistics. For example, in everyday English, significant implies something sufficiently great or important to be worth noting. Statistical significance, on the other hand, tells us how likely it is that the observed results arose merely through chance. This hand-out provides a brief explanation of the most commonly used terms in statistics.
The origin of the word statistics implies that it is the study of human populations within a political unit (State). However, Statistics can be regarded (i) as the study of populations, (ii) as the study of variation and (iii) as methods for the reduction of data (Fisher, 1954, Statistical Methods for Research Workers). When we collect information on assets owned from 100,000 households in a state, we are not interested in individual households; we are interested in the assets owned by the households. So we consider them as a population of assets rather than a population of households. Thus, in statistics, we study a Population, or aggregate of individuals, whether living beings or materials/things. The term Population in statistics is not restricted to living beings or materials; even measurements that can be taken indefinitely can be regarded as a population. The number of units in the population is called the Population Size. The population size may be finite, such as the number of households, or infinite, such as the number of vehicles arriving at a particular traffic signal every minute. Even when the population size is finite, it may be treated as infinite, such as the number of fish in a lake. Also, when the population is finite, another aspect worth noting is whether the individual units can be identified.
When there is no variability among the units of the population, there is no need to study the entire population; one can study just one individual from the population. However, variability is everywhere, and we need to study these variations. A measurement taken on the individuals of the population is called a Variate or a Variable. The values taken by a variable can be binary, such as the gender of a child born (Male/Female); discrete, such as the number of children in a household (0, 1, 2, ...); or continuous, such as the height of children.
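As a small illustration in R, the three kinds of variables can be represented as follows; the variable names and values here are purely hypothetical.

```r
# Hypothetical data illustrating the three types of variables
gender   <- factor(c("Male", "Female", "Female", "Male"))  # binary / nominal
children <- c(0L, 2L, 1L, 3L)                               # discrete (counts)
height   <- c(112.5, 98.3, 104.0, 120.7)                    # continuous (cm)

str(gender)    # factor with 2 levels
str(children)  # integer vector
str(height)    # numeric vector
```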
While studying a population, we are usually not interested in the individual units of the population, particularly when there is variability and the size of the population is large or infinite. The interest is to describe the population or to identify relationships between the population units. The population is described through some constants; these constants are called Population Parameters. Some population parameters are widely known, such as the Arithmetic Mean (or simply mean), the Median, etc. Note that, in general, the population parameters are unknown.
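As a toy illustration with made-up numbers, the mean and median of a set of values can be computed in R as follows:

```r
# Hypothetical values for a small population
assets <- c(2.1, 3.5, 1.8, 4.2, 2.9)

mean(assets)    # arithmetic mean
median(assets)  # median
```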
In order to find the unknown population parameters of the variable(s), the values for the individual units of the population must be collected. If the population is large or infinite, it is clearly not possible to obtain the values for all the units in the population. In order to estimate the population parameters, we resort to selecting a few individuals/measurements of the population and observing their values. To apply various statistical tools, it is necessary that the subset of the population selected/observed is chosen at random; this subset is called a Sample. Thus, a sample is a subset of population units on which measurements will be collected. There are various ways by which one can select a sample from a population. If the population units are not identifiable (or the population is very large/infinite), we can select units using Simple Random Sampling, or simply random sampling. In simple random sampling, the probability of selecting a unit from the population is the same for all units. The number of units selected for observation in a sample is called the Sample Size. While selecting a sample, particularly when the number of units in the population is finite and identifiable, it is possible to employ various procedures in which the probabilities of selecting the units may be unequal. Also, in finite population sampling, two methods of sampling are available, namely with replacement (a single unit may be selected more than once) and without replacement (a single unit may be selected at most once).
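A small sketch of simple random sampling in R, using a hypothetical population of 1,000 identifiable unit labels:

```r
# Hypothetical population of 1,000 identifiable units
population <- 1:1000

set.seed(1)
# Simple random sampling without replacement: each unit appears at most once
srs_wor <- sample(population, size = 50, replace = FALSE)

# Simple random sampling with replacement: a unit may be drawn more than once
srs_wr <- sample(population, size = 50, replace = TRUE)
```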
The values collected/measured from the sampled units contain information about the population. These values, commonly referred to as Data, are used to estimate the population parameters. In the case of infinite populations, the variation observed in the data can be modelled using known forms of probability distributions. Some well-known probability distributions are the Binomial, Poisson, Normal, Gamma, Exponential, etc. The functional forms of these distributions depend on one or more constants, which are called Parameters. Many of these parameters are again related to the Mean, Variance, etc. For example, the Normal distribution depends on two parameters, which are nothing but its mean and variance.
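For example, in R a Normal distribution is specified by its mean and standard deviation (the square root of the variance); the parameter values below are arbitrary:

```r
# Density of a Normal distribution with (hypothetical) mean 10 and variance 4
dnorm(x = 9.5, mean = 10, sd = sqrt(4))

# Probability that a Binomial variable with n = 20 trials and p = 0.3 equals 5
dbinom(5, size = 20, prob = 0.3)
```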
Population parameters are estimated using some function of the sample values, or data. These functions are called Estimators. Various estimators are possible, but estimators with certain properties are preferred, such as Unbiasedness (roughly, the estimator equals the true population parameter value on average), Consistency (the estimator tends to the true value as the sample size increases to infinity) and Efficiency (the estimator has smaller variance than other possible estimators).
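A rough simulation sketch in R, with arbitrary parameter values, of the sample mean as an estimator of the population mean:

```r
set.seed(123)
# Hypothetical population: Normal with mean 10 and standard deviation 2
true_mean <- 10

# Draw 5,000 samples of size 30 and compute the sample mean of each
sample_means <- replicate(5000, mean(rnorm(30, mean = true_mean, sd = 2)))

mean(sample_means)  # close to 10, illustrating unbiasedness on average
var(sample_means)   # sampling variance of the estimator
```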
There are various statistical tools developed over the years by statisticians to analyse different types of data. Initially, many of these tools were based on three assumptions, namely (i) independence, (ii) identical distribution and (iii) normality. The assumption of independent observations is satisfied when the units are selected randomly and the selection of one unit does not depend on the other units. Some situations which violate this assumption are (i) when the sampling is done without replacement (the selection of the second and subsequent units depends on the units already selected) and (ii) when measurements are taken on the same individual/household/unit of the population (usually over time). The second assumption implies that the random process is the same, i.e. all units in the population follow the same distribution. Again, this assumption may not hold good; in particular, the variance may not be the same for all units. The first two assumptions together are usually called i.i.d. (independently and identically distributed). The third assumption is that the data are generated from a normal distribution. The normality assumption may be justified in many situations, but sometimes it may not hold good.
It is imperative for anyone who wishes to apply any statistical tool to know the assumptions under which the tool has been developed and the consequences when those assumptions are not met. Tools are now available with which one can check whether the assumptions are reasonably justified by the data, and there are also newer methods in which some of these assumptions can be relaxed.
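As an illustration of checking the normality assumption in R (with simulated data standing in for real observations):

```r
set.seed(7)
y <- rnorm(40, mean = 5, sd = 1.5)  # simulated data in place of real observations

# Graphical check: points should lie close to the reference line
qqnorm(y)
qqline(y)

# Formal check: Shapiro-Wilk test of normality
shapiro.test(y)
```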
Choosing Statistical Analysis
The following tables provide information on how to choose an appropriate statistical analysis and the SAS procedures/R functions to use. Note that many statistical analyses require several assumptions to be satisfied. One should look up the assumptions of each analysis and validate those assumptions before using it. This material has been adapted from these web pages: http://bama.ua.edu/~jleeper/627/choosestat.html & http://www.ats.ucla.edu/stat/mult_pkg/whatstat/
Population & No. of variables | Type of Variables | Analysis | SAS Procedure | R Function (Package)
One Population, one variable | Ratio & Interval | Single sample t-test (mean) | TTEST | t.test
One Population, one variable | Ordinal & Interval | Single sample Median test | UNIVARIATE | wilcox.test
One Population, one variable | Nominal | Chi-Square Goodness of fit | FREQ | chisq.test
One Population, two groups (independent) | Ratio & Interval | Two sample t-test (mean) | TTEST | t.test
One Population, two groups (independent) | Ordinal & Interval | Wilcoxon Mann-Whitney test | UNIVARIATE | wilcox.test
One Population, two groups (independent) | Nominal | Chi-Square test | FREQ | chisq.test
One Population, two groups (dependent or matched) | Ratio & Interval | Paired t-test | TTEST | t.test
One Population, two groups (dependent or matched) | Ordinal & Interval | Wilcoxon signed rank test | UNIVARIATE | wilcox.test
One Population, two groups (dependent or matched) | Nominal | McNemar test | FREQ | mcnemar.test
One Population, more than 2 groups | Ratio & Interval | One way ANOVA | GLM (or ANOVA) | aov
One Population, more than 2 groups | Ordinal & Interval | Kruskal Wallis test | NPAR1WAY | kruskal.test
One Population, more than 2 groups | Nominal | Chi-Square test | FREQ | chisq.test
One Population, more than 2 groups (dependent / matched) | Ratio & Interval | One way repeated measures ANOVA | GLM | lm & Anova (requires car & foreign packages)
One Population, more than 2 groups (dependent / matched) | Ordinal & Interval | Friedman test | FREQ | friedman.test
One Population, more than 2 groups (dependent / matched) | Nominal | Repeated measures Logistic Regression | GENMOD | glmer (requires lme4)
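As a brief sketch of how some of the R functions in the table above are called (using simulated data in place of real measurements):

```r
set.seed(42)
group1 <- rnorm(15, mean = 10, sd = 2)  # simulated measurements, group 1
group2 <- rnorm(15, mean = 11, sd = 2)  # simulated measurements, group 2

t.test(group1, group2)                  # two sample t-test (means)
wilcox.test(group1, group2)             # Wilcoxon Mann-Whitney test
t.test(group1, group2, paired = TRUE)   # paired t-test (if the groups were matched)

# Chi-square test on a hypothetical 2 x 2 table of counts
counts <- matrix(c(20, 15, 10, 25), nrow = 2)
chisq.test(counts)
```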

Population & No. of variables | Type of Variables | Analysis | SAS Procedure | R Function (Package)
One dependent & two or more independent variables | Dependent: Ratio or Interval; Independent: ordinal/nominal | Factorial ANOVA | GLM | anova
One dependent & two or more independent variables | Dependent: Ratio or Interval; Independent: ordinal/nominal & ratio/interval | Analysis of Covariance / Multiple Regression | GLM, REG | lm, aov
One dependent & two or more independent variables | Dependent: ordinal or nominal; Independent: ordinal/nominal/ratio/interval | Logistic Regression | LOGISTIC | glm
One dependent and one independent variable | Ratio/Interval | Correlation, Simple Regression | CORR | cor
One dependent and one independent variable | Ordinal/Interval | Nonparametric Correlation | CORR | cor
One dependent and one independent variable | Nominal | Simple logistic regression | LOGISTIC | glm
2 or more related variables | Ratio/Interval | Factor Analysis | FACTOR | fa (requires psych)