The Significance of Sampling Design On Inference An Analysis of Binary Outcome Model of Children's Schooling Using Indonesian Large Multi-Stage Sampling Data

Working Paper
in Economics and
Development Studies
Department of Economics
Padjadjaran University
No. 200809
The significance of Sampling Design on

Inference:
An Analysis of Binary Outcome Model
of Childrens Schooling Using
Indonesian Large Multi-stage Sampling
Data
Ekki Syamsulhakim
Department of Economics,
Padjadjaran University
Oktober, 2008
Center for Economics and Development Studies,

Department of Economics, Padjadjaran University
Jalan Cimandiri no. 6, Bandung, Indonesia.
Phone/Fax: +62-22-4204510
http://www.lp3e-unpad.org
For more titles on this series, visit:
http:// econpapers.repec.org/paper/unpwpaper/
The significance of Sampling Design on Inference:
An Analysis of Binary Outcome Model of Childrens Schooling
Using Indonesian Large Multi-stage Sampling Data
Ekki Syamsulhakim
Lecturer, Department of Economics and Development Studies, Universitas Padjadjaran, Indonesia and
PhD student, Department of Econometrics and Business Statistics, Monash University, Australia
Email: esya2@student.monash.edu.au
Presented on the Voice of Indonesian Future Leader International Conference

Victoria University of Technology, Melbourne, Australia
3 May 2008
Preliminary Version: Do Not Quote Without Permission
Abstract
This paper aims to exercise a rather recent trend in applied microeconometrics, namely the
effect of sampling design on statistical inference, especially on binary outcome model. Many
theoretical research in econometrics have shown the inappropriateness of applying i.i.d-
assumed statistical analysis on non-i.i.d data. These research have provided proofs showing
that applying the iid-assumed analysis on a non -iid observations would result in an inflated
standard errors which could make the estimated coefficients inefficient if not biased.
Consequently, a policy-affecting quantitative research would give an incorrect - usually of
type-1 errors - in its conclusion.
Using a dataset sourced from the third cycle of the Indonesia Family Life Survey (IFLS) ,
which sampling design involved multi-stage clustering and stratification, this paper shows
discrepancies in the estimation result of probit regressions of a child attending school when
the estimated standard errors are adjusted and not. The computation also shows a
considerable change in the level of confidence in not-rejecting the null hypothesis of the
explanatory variables. This paper provides more evidence that statistical analysis should
always take into account the sampling design in collecting the data.
Keywords: Applied microeconometrics, survey data, IFLS , design effects, economics of

education, demand for schooling
JEL classification: C12,C42, C81, I21
Acknowledgment: I wish to thank my supervisors Professor Brett Inder and Dr. Kathryn
Cornwell for their valuable suggestions and comments.
1. Introduction
Theoretical econometrics analyses have shown that disregarding sampling design of
data collection in conducting practical econometrics analysis would yield in an inefficient
estimator (Deaton, 1997; Pepper, 2002. Bhattacarya, 2005). This is especially true if there is
no match between the assumed distribution of the data and the assumptions underlying a
model to be used in the analysis. This means that a regression analysis assuming that a
vector of variables x is independently and identically distributed i.i.d should be
connected to data collection involving simple random sampling mechanism. However, very
rarely data especially those which are collected through surveys follows simple random
sampling of the population, mostly owing to the economical consideration of the surveys.
While inefficient estimator may not as unpleasant as biased estimator1 the standard
statistical tests could not be done in the analysis because the standard errors of the estimated
coefficients are no longer reliable. In some cases, the changes in the standard errors are
sufficiently noticeable that it could change the p-value of the estimated coefficients.
Despite of the breakthrough in either the theoretical or the computer package side,
hardly empirical microeconometrics research involving non-i.i.d data use appropriate
methods in analysing such observations. Wooldridge (2006) argues that this situation might
be because the applied researchers are not quite certain when certain estimators are robust
to various kinds of misspecification.
1 It is worth noting that Eltinge and Sribney (1996) believe that the estimator will be biased in a
complex sampling mechanism.
1
This situation to some extent is also captured in applied microeconometric analyses
using the Indonesian Family Life Survey (IFLS) Data. For instance, Deolalikar (1993) does an
analysis of gender differences in returns to schooling and in school enrolment rates, but
assuming that the observations distributed identically and independently. Another example,
Alisjahbana (1999) uses binary outcome model (probit model) to analyse the probability of a
child being enrolled in school, but there is no indication that the standard errors have been
adjusted.
This paper will be organised in the following way. Section 2 will elaborate the
sampling design of the IFLS data according to its manual. Section 3 will provide a brief
theoretical reasoning of the inappropriateness of conducting statistical or econometrical
inference assuming i.i.d observation when data is not collected through simple random
sampling mechanism. This section will also attempt to link this theoretical consideration to
the IFLS sampling mechanism, as well as to the probit model that will be used in the
application. Section 4 will give an empirical result comparing the magnitude of the standard
errors of the estimated coefficients when adjustment for capturing the non-i.i.d situation of
the data is accomplished and not. As an application, this paper will discuss factors affecting
the probability of a child enrolled in a school. The conclusion will be given in section 5.
2
2. IFLS data sampling design2
The IFLS is the only large-scale longitudinal survey available for Indonesia, and in
fact it is one of the large longitudinal surveys in the world with more than 30,000 individuals
from 7,224 households were interviewed for the baseline survey in 1993 (Frankenberg and
Thomas 2000, Strauss, et al. 2004). For each of its surveys (household and community facility
surveys), the IFLS follows a multi-stage sampling design involving clustering and
stratification. Moreover, Strauss, et. al. (2004) states that IFLS is a rich but complex survey.
The stages of the sampling mechanism of the IFLS data are as follows. First, 13 out of
the 1993 Indonesias 27 provinces were selected (see figure 2.1 below).3 Then, it selects
Enumeration Areas (EAs) in each province randomly, based on a representative
socioeconomic survey (the 1993 SUSENAS). 321 EAs were selected from the 13 provinces,
and an over-sampling mechanism had been done to facilitate the Javanese-non-Javanese and
urban-rural comparison.
The next stages of the sampling involves a random selection of 20 households within
each urban EAs and 30 households within each rural EAs by the interviewers, and the
random selection of some household members to be interviewed (two children of the head
and spouse aged 0 to 14; a person age 50 and over and his/her spouse; and an individual age
15 to 49). The result of this sampling mechanism represents 83 percent of Indonesian
population. In 2000 (IFLS wave 3), 10,435 households (43,649 persons) were interviewed,
2 This section draws heavily from Frankenberg and. Karoly (1995) and Strauss,et al. (2004).
3
The reasons of not including a province vary, but most of them were expensive survey costs aside
from political violence in Aceh (Frankenberg and Karoly, 1995).
3
consisting of 6,661 IFLS3 target households and 3,774 new split-off househoulds. For
longitudinal analysis, 6,564 households were interviewed in all waves of the IFLS (1993,
1997, and 2000).
Figure 2.1
IFLS Provinces in Indonesia
Source: http://rand.org/labor/FLS/IFLS/index.html#map, edited by author.
Therefore it should be sufficiently clear that the observations obtained from the IFLS
data can not follow the i.i.d distribution of observation, because the sampling design
involves multi-stage sampling mechanism. Or to simply put, the collection of IFLS data did
not follow simple random sampling mechanism . The next section discusses this matter in a
theoretical way.
4
3. Theoretical Background
3.1 The Efficiency of Estimator
From previous sections, it can be seen that the IFLS data (and many other survey-
based data) uses a multi-stage sampling design involving clustering and also stratification.
Unlike simple random sampling, the probability of an element to be included in the sample
is no longer the same for all individuals in this sampling design, i.e., the probability could be
higher (or lower) for some clusters (or strata) depending on the sampling design and rates
(Cameron and Trivedi. 2005).4
While clustering can save a great deal of survey costs, it causes observations to be no
longer independent because each cluster normally shares similar characteristics. If, for
instance, the data is clustered across households, the individuals living in the same
households are likely to be similar in terms of, say, their consumption pattern. However, the
degree of independency depends on the degree of heterogeneity across clusters and across
members of each cluster. In other words, the degree of independency is influenced by the
intra-cluster correlation.
Formally, assuming (1) one-stage cluster sampling; (2) clusters are equal in size; and
(3) simple random sampling is conducted in the selection of the clusters, the variance of an
unbiased estimator of the total Primary Sampling Unit (PSU) t, is given by:
4In this paper, cluster is defined the same as PSU or the selection of the geographical division where
the survey is taken, whereas strata is defined as population sub-groups defined either by urban/rural
status, gender, etc (see Eltinge and Sribney (1996) and Deaton, 1997 for a discussion about this
matter).
5
N 2 St n
2
var(t) 1
(1)
n N
Where:
S t2 is population variance
N is the number of PSU in the population
Since
t is an unbiased estimator of t , St2 can be estimated using sample variance st2 ,
and hence the standard error of

t is:
s n
2
SE (t
) N t
1 (2)
n N
It can be seen that standard errors of the unbiased estimate of total PSU is positively
related with the number of PSU (N), even in the case of simplest one. In a multi-stage
clustering design that is used by the IFLS (and clusters are not equally distributed), the
standard error of the unbiased estimator of the total PSU is still positively related to the
number of PSU.5
The IFLS data also uses some sorts of stratification, namely layering the sampling
into urban or rural and Javanese or non-Javanese. This is evident by giving different weights
or rates for both of the stratifications (Frankenberg and Karoly, 1995, page 7). Theoretically,
stratification should give a more precise estimation partly because the layering is done with
a purpose to take into account significant differences across observations. Assuming that
simple random sampling is used in selecting the members of the strata, the variance of the
unbiased estimator
t is given by (Lohr, 1999)
5 Readers should consult standard texts in sampling design for a thorough discussion on the
derivation of the sampling statistics, such as Levy and Lemeshow (1999) or Lohr (1999).
6
2 S n
H 2
str )
var(t Nh h 1 h (3)
h 1 nh Nh
Where:
S h2 is population variance in stratum h
Nh is the number of sampling unit in the h-th stratum
nh is the element in the h-th stratum
The standard error of

t str is given by
SE (t
str ) var(t str ) (4)
From this short exposition, it should be clear that the standard deviation of both
clustering and stratifying the observation depends on the number of PSU or sampling unit
in the strata. The magnitude of the standard error is different from the standard error
assuming simple random sampling (or assuming that the distribution of the observation is
i.i.d.).6 Therefore, it should also be clear that when a statistical inference assuming i.i.d is
being applied to non i.i.d data, the standard errors of the estimated coe fficient must be
adjusted.
In the case of binary outcome model, Lohr (1999) believes that sample design will
affect standard errors of the coefficient of logistic regression. Hence steps should be taken to
make sure that the standard errors are adjusted to the clustering and stratification of the
data.
s2 n
6
SE (x ) 1

n N
7
3.2 Adjustment for the Probit Estimator
Wooldridge (2002) provides brief theoretical steps in applying Probit model for
clustered samples. Using vector notations, the probit model can be defined as:
P ( y 1 | x) G (x) (5)
with
z
G( z ) ( z)
(v) dv (6)

(z ) is the standard normal density function that takes the form of
1
(z ) (2) exp(z 2 / 2)
2
(7)
Equation (5) is estimated using the maximum likelihood estimation. However, in order to do
that correctly, the observation entered to the model must be i.i.d. To correctly estimate
clustered observation which is not i.i.d Wooldridge (2002) suggests Partial Maximum
Likelihood Estimator. Hence, for cluster i and element g, the specification becomes
( ig)
P ( y 1| xig ) x (8)
The coefficient is then estimated by pooled probit analysis through maximum likelihood

estimation techniques, which solves :
2 N 1N ()
0 (9)

and in order to adjust the standard errors so that it can take into account intra-cluster
correlation, the Huber-estimated variance matrix (Pepper, 2002, and Wooldridge, 2002,
Cameron and Trivedi, 2005)
8
1 1
N Gi
N N i
G
ig i i ig )

A ( ) s ( ) s ( )' A ( (10)
i1 g1 i1 i1 g 1
where

{(xig )}2 x x
A ig ()
ig ig

is the conditional information matrix
( xig )[1 ( xig )]
G G
( xig
)}xig[ yig ( xig )]

si () sig (
) is the score matrix
g 1 g 1

( xig)[1 ( xig)]
g indexes elements in the PSU
i denotes cluster or PSU
Gi is the size of cluster i
Using this adjustment, the estimation of the probit model applied to a non-i.i.d data could be
done. The application of this approach is provided in the next section.
4. Application
As the application of the theoretical discussion above, a binary choice model of the
probability of a child attending school will be estimated. Similar to Deolalikar (1993) and
Alisjahbana (1999), the models to be estimated and compared are:
9
P (Attsch i 1| xi ) (age, age-squared, gender, parental education,
household income, provinces, rural/urban status) (11)
P (Attsch ig 1| xig ) (age, age-squared, gender, parental education,

household income, provinces, rural/urban status) (12)
Where:
- Attsch takes the value of 1 if a child is attending school or 0 otherwise. In the household
roster of IFLS 3, there is a question of the primary activity of a person during the past
week, and one of the available options is attending school.
- Age denotes the age of the school aged children (5 18 years old).
- Gender equals 1 if male , 0 if female
- Parental education is divided into 4 categories: whether mother or father attend primary,
secondary or tertiary school. The ommitted variables are whether mother or father did
not attend school at all.
- Household income is in the form of logarithm considering the large spread of the
income, and is proxied by household yearly expenditure. This variable is calculated by
summing up all of the households expenditure, both food and non-food, and then
divided by the number of adults.
- Provinces is a vector of dummy variables of 13 IFLS provinces, excluding Jakarta to
avoid dummy variable trap
- Rural / Urban status takes the value of 1 if urban, 0 uf rural
- g indexes elements in the PSU
- i denotes cluster or PSUz
Different with Deolalikar (1993) and Alisjahbana (1999), this paper will estimate the
model using probit instead of logit model. Equation (11) is estimated using usual probit
10
model assuming i.i.d, while equation (12) is estimated with the adjustment mechanism for
the standard error due to the non-i.i.d nature of the IFLS data. The sample is obtained from
the 3rd wave of the IFLS, comprising 11,763 children aged 5 to 18. The reason for choosing
this age range is due to the fact that some children in Indonesia (about 50% in the IFLS 3
data) start primary school at the age of 5, and some of them (approximately 35%) are still in
senior secondary school at age 18. To take into account the probability of the observations
are sampled, both equations (11) and (12) are estimated with person weight which is also
taken from the IFLS 3 data.
The estimation result is given in table 4.1 below. From the table, three important
notes emerge. Firstly, it can be observed that the estimated coefficients does not change. This
fact indicates that that the estimators are unbiased.
Secondly, the standard errors of the coefficients have changed after sampling design
issues (i.e., non-i.i.d data or sampling) is taken into account or intra-cluster correlation
has been resolved. The implication of this is that relying the analysis based on a model
without taking into account the non-i.i.d. issues in the data, and then estimating the probit
equation without adjusting the standard errors could have made a researcher trapped in the
type-1 error in drawing conclusion.
Thirdly, from the robust model (equation 12), it can be seen that the probability of a
boy attending school in a household is higher than a girl. This indicates that gender
difference exists in children educations issue in Indonesia. However, if one does not take
into account the sampling design issue in the inference, he or she can draw different or
even wrong conclusion.
11
Table 4.1
A comparison of probit models with out and with cluster-corrected standard err ors
Model (11) Model (12)

Variables
Coefficient Standard Coefficient Corrected
Errors Standard Errors
Log of household expenditure per adult 0.147*** 0.020 0.147*** 0.018
Age 0.770*** 0.025 0.770*** 0.031
Age squared -0.037*** 0.001 -0.037*** 0.001
Male 0.055 0.030 0.055* 0.022
Father highest education is primary school 0.006 0.042 0.006 0.039
Father highest education is secondary school 0.208*** 0.054 0.208** 0.056
Father highest education is tertiary school 0.128 0.098 0.128 0.096
Mother highest education is primary school 0.068 0.041 0.068 0.064
Mother highest education is secondary school 0.227*** 0.058 0.227** 0.074
Mother highest education is tertiary school 0.150 0.112 0.150 0.110
Urban area 0.356*** 0.035 0.356*** 0.066
North Sumatera 0.344*** 0.075 0.344*** 0.082
West Sumatera 0.523*** 0.089 0.523*** 0.072
South Sumatera 0.058 0.081 0.058 0.124
Lampung 0.357*** 0.086 0.357*** 0.075
West Java 0.024 0.062 0.024 0.091
Central Java 0.403*** 0.070 0.403*** 0.063
Yogyakarta 0.988*** 0.106 0.988*** 0.072
East Java 0.312*** 0.069 0.312*** 0.071
Bali 0.382*** 0.089 0.382*** 0.072
West Nusa Tenggara 0.096 0.074 0.096 0.065
South Kalimantan -0.054 0.088 -0.054 0.082
South Sulawesi -0.062 0.079 -0.062 0.099
Constant -5.280*** 0.327 -5.280*** 0.365
# obs: 11,763; * = significant at 10% level, ** = significant at 5% level, ***=significant at 10% level.
Equation 12 is estimated adjusting to the multi-stage sampling of the IFLS data; Stage 1 : clustering based on 13
PSU (province) and 1 strata (urban/rural status); Stage 2: SRS with SSU = household
12
5. Conclusion
This paper has attempted to provide a summary of the importance in matching the
assumption underlying sampling design in the data collection, especially that of survey,
with the statistical or econometrical inference. The use of i.i.d assumed inference or analysis
to a non-i.i.d data has been proven theoretically to distort standard errors.
From the application side, using IFLS data a non-i.i.d data due to multi-stage
sampling mechanism the comparison of the probit model estimation has shown a new
evidence of the importance to take into account intra-cluster correlation problem. Although
the estimated coefficients are not biased, the standard errors are no longer reliable. In a
simple application, it can be seen that a failure to adjust the standard errors leads a to an
incorrect conclusion drawing.
13
References
Alisjahbana, A., 1999. Does demand for childrens schooling quantity and quality in
Indonesia differ across expenditure classes? Mimeo, Department of Economics and
Development Studies, Padjadjaran University.
Bhattacarya, D., 2005. Asymptotic inference from multi-stage samples. Journal of

Econometrics 125, 145 171.
Cameron, A.C., and P.K. Trivedi, 2005. Microeconometrics: Methods and Applications. 1st
Edition. Cambridge
Deaton, A., 1997. The analysis of household surveys: A microeconometric approach to

development policy. 1 st edition. The World Bank.
Deolikar, A.B, 1993. Gender Differences in the returns to schooling and in school enrolment
rates in Indonesia. The Journal of Human Resources 28 (4) 899 932.
Eltinge, J.L., and W.M. Sribney, 1996. Some basic concepts for design -based analysis of
complex survey data. Stata Technical Bulletin 6 (31), 1-44.
Frankenberg, E., and L. Karoly, 1995. The Indonesian Family Life Survey: Overview and
Field Report. RAND. DRU-1195/1-NICHD/AID.
Frankenberg, E., and Thomas, D. (2000), The Indonesia Family Life Survey (Ifls): Study
Design and Results from Waves 1 and 2 (Vol. DRU -2238/1/1-NIA/NICHD), Rand
Corporation.
Levy, P.S., and S. Lemeshow, 1999. Sampling of Populations: Methods and Applications, 3 rd
edition. Wiley.
Lohr, S.L., 1999. Sampling: Design and Analysis. 1st edition. Duxbury Press.
Pepper, J.V., 2002. Robust Inferences from random clustered samples: an application using
data from the panel study of income dynamics. Economics Letters 75, 341 345.
Strauss, J., K. Beegle, B.. Sikoki, A.Dwiyanto, Y.Herawati and F. Witoelar, 2004. The third
wave of the Indonesia Family Life Survey (IFLS3): Overview and Field Report. WR-
144/1-NIA/NICHD.
Tarozzi, A. Undated. Applied Econometrics Lecture Notes. Duke University Department of

Economics. Downloadable from http:// www.econ.duke.edu/~taroz/SurveyDesign.pdf
Wooldridge, J.M., 2002. Econometrics analysis of cross-section and panel data. MIT Press.
14
_______________, 2003. Cluster-sample methods in applied econometrics. American
Economic Review 93, 133-138.
_______________, 2006. Cluster-sample methods in applied econometrics: an extended

analysis. Mimeo. Department of Economics, Michigan State University.
15

The Significance of Sampling Design On Inference An Analysis of Binary Outcome Model of Children's Schooling Using Indonesian Large Multi-Stage Sampling Data

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

The Significance of Sampling Design On Inference An Analysis of Binary Outcome Model of Children's Schooling Using Indonesian Large Multi-Stage Sampling Data

Uploaded by

Copyright:

Available Formats

Working Paper

The significance of Sampling Design on

Center for Economics and Development Studies,

Presented on the Voice of Indonesian Future Leader International Conference

Preliminary Version: Do Not Quote Without Permission

Keywords: Applied microeconometrics, survey data, IFLS , design effects, economics of

JEL classification: C12,C42, C81, I21

Theoretical econometrics analyses have shown that disregarding sampling design of

data collection in conducting practical econometrics analysis would yield in an inefficient

vector of variables x is independently and identically distributed i.i.d should be

hardly empirical microeconometrics research involving non-i.i.d data use appropriate

to various kinds of misspecification.

theoretical reasoning of the inappropriateness of conducting statistical or econometrical

Enumeration Areas (EAs) in each province randomly, based on a representative

15 to 49). The result of this sampling mechanism represents 83 percent of Indonesian

1997, and 2000).

IFLS Provinces in Indonesia

Source: http://rand.org/labor/FLS/IFLS/index.html#map, edited by author.

3.1 The Efficiency of Estimator

(Cameron and Trivedi. 2005).4

N is the number of PSU in the population

and hence the standard error of

S h2 is population variance in stratum h

Nh is the number of sampling unit in the h-th stratum

nh is the element in the h-th stratum

The standard error of

(z ) is the standard normal density function that takes the form of

Cameron and Trivedi, 2005)

g indexes elements in the PSU

i denotes cluster or PSU

Gi is the size of cluster i

done. The application of this approach is provided in the next section.

Alisjahbana (1999), the models to be estimated and compared are:

P (Attsch ig 1| xig ) (age, age-squared, gender, parental education,

week, and one of the available options is attending school.

- Gender equals 1 if male , 0 if female

not attend school at all.

income, and is proxied by household yearly expenditure. This variable is calculated by

divided by the number of adults.

- Provinces is a vector of dummy variables of 13 IFLS provinces, excluding Jakarta to

avoid dummy variable trap

- Rural / Urban status takes the value of 1 if urban, 0 uf rural

- g indexes elements in the PSU

- i denotes cluster or PSUz

taken from the IFLS 3 data.

fact indicates that that the estimators are unbiased.

type-1 error in drawing conclusion.

even wrong conclusion.

Model (11) Model (12)

Log of household expenditure per adult 0.147*** 0.020 0.147*** 0.018

Age 0.770*** 0.025 0.770*** 0.031

Age squared -0.037*** 0.001 -0.037*** 0.001

Male 0.055 0.030 0.055* 0.022

Father highest education is primary school 0.006 0.042 0.006 0.039

Father highest education is secondary school 0.208*** 0.054 0.208** 0.056

Father highest education is tertiary school 0.128 0.098 0.128 0.096

Mother highest education is primary school 0.068 0.041 0.068 0.064

Mother highest education is secondary school 0.227*** 0.058 0.227** 0.074

Mother highest education is tertiary school 0.150 0.112 0.150 0.110

Urban area 0.356*** 0.035 0.356*** 0.066

North Sumatera 0.344*** 0.075 0.344*** 0.082

West Sumatera 0.523*** 0.089 0.523*** 0.072

South Sumatera 0.058 0.081 0.058 0.124

Log of household expenditure per adult 0.147* 0.020 0.147* 0.018

Age 0.770* 0.025 0.770* 0.031

Age squared -0.037* 0.001 -0.037* 0.001

Father highest education is secondary school 0.208* 0.054 0.208 0.056

Mother highest education is secondary school 0.227* 0.058 0.227 0.074

Urban area 0.356* 0.035 0.356* 0.066

North Sumatera 0.344* 0.075 0.344* 0.082

West Sumatera 0.523* 0.089 0.523* 0.072

Lampung 0.357* 0.086 0.357* 0.075

Central Java 0.403* 0.070 0.403* 0.063

Yogyakarta 0.988* 0.106 0.988* 0.072

East Java 0.312* 0.069 0.312* 0.071

Bali 0.382* 0.089 0.382* 0.072

Constant -5.280* 0.327 -5.280* 0.365