
The empirical distribution function

Hailin Sang Indiana University

We begin with the problem of estimating a CDF (cumulative distribution function) $F(x) = P(X \le x)$. Suppose $X_1, \dots, X_n \sim F$. The empirical distribution function, $\hat{F}_n$, is the CDF that puts mass $1/n$ at each data point $X_i$:

$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x),$$

where $I$ is the indicator function.
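As a minimal sketch, the definition can be coded directly as an average of indicators and compared with R's built-in `ecdf`. The helper name `ecdf_by_hand` and the toy sample are illustrative assumptions, not part of the original notes.

```r
# Hypothetical helper: evaluate the empirical CDF of `data` at the points `x`
# by averaging the indicators I(X_i <= x).
ecdf_by_hand <- function(data, x) {
  sapply(x, function(t) mean(data <= t))
}

# Toy sample and evaluation grid, made up for illustration only.
toy  <- c(0.8, 0.2, 0.5, 0.2, 1.1)
grid <- c(0, 0.2, 0.5, 1, 2)

by_hand  <- ecdf_by_hand(toy, grid)
built_in <- ecdf(toy)(grid)

print(by_hand)
print(all.equal(by_hand, built_in))
```

The tied value 0.2 contributes a jump of 2/5 rather than 1/5, which is how the mass-1/n-per-observation rule behaves when data points repeat.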

The empirical distribution function in R

R provides the very useful function ecdf for working with the empirical distribution function. Cox and Lewis (1966) reported 799 waiting times between successive pulses along a nerve fiber. Here we list the first 5 observations.

      time
    1 0.21
    2 0.03
    3 0.05
    4 0.11
    5 0.59

Figure 1 shows the data and the empirical CDF $\hat{F}_n$. The following is the code to get the figure.
[Figure 1: Empirical CDF of the nerve pulse data. Plot title: Empirical CDF; horizontal axis: time; vertical axis: $\hat{F}(x)$.]

    Data <- read.delim("nerve-pulse.txt")   # data frame with a column named time
    Fhat <- ecdf(Data$time)                 # Fhat is itself a function of x
    Fhat(0.1)
    # [1] 0.3829787
    Fhat(0.6)
    # [1] 0.933667
    plot(Fhat, main = "Empirical CDF", xlab = "time",
         ylab = expression(hat(F)(x)))
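Because the data file itself is not reproduced here, the behaviour of an `ecdf` object can be illustrated on the five listed observations alone. This is only a sketch: the values differ from the full-data output above, and the names `first5` and `F5` are introduced for illustration.

```r
# First five waiting times, as listed in the text (not the full data set).
first5 <- c(0.21, 0.03, 0.05, 0.11, 0.59)
F5 <- ecdf(first5)

F5(0.1)               # proportion of the five values that are <= 0.1
F5(c(0.05, 0.3, 1))   # an ecdf object accepts a vector of evaluation points
knots(F5)             # sorted distinct data values, where the jumps occur
```

Two of the five values (0.03 and 0.05) are at most 0.1, so `F5(0.1)` is 2/5.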

Properties of $\hat{F}_n$
1. At any fixed value of x,

$$E(\hat{F}_n(x)) = F(x), \qquad V(\hat{F}_n(x)) = \frac{1}{n} F(x)(1 - F(x)).$$

Note that these two facts imply that

$$\hat{F}_n(x) \xrightarrow{P} F(x).$$

2. An even stronger convergence result is given by the Glivenko-Cantelli Theorem, which is uniform in x and holds almost surely:

$$\sup_x |\hat{F}_n(x) - F(x)| \xrightarrow{a.s.} 0.$$

3. A finite sample result is given by the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. For any $\epsilon > 0$,

$$P\Big(\sup_x |\hat{F}_n(x) - F(x)| > \epsilon\Big) \le 2e^{-2n\epsilon^2}.$$
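The three properties can be checked by simulation. The following is a minimal sketch under stated assumptions: the data are drawn from Exp(1), chosen only because its CDF `pexp` is available in closed form; `sup_dist` is a hypothetical helper; the seed, sample sizes and replication counts are arbitrary.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Exact sup_x |Fhat_n(x) - F(x)| for a continuous CDF `cdf`:
# the supremum is approached at a data point or just to its left.
sup_dist <- function(x, cdf) {
  n  <- length(x)
  Fx <- cdf(sort(x))
  max(pmax((1:n) / n - Fx, Fx - (0:(n - 1)) / n))
}

# Glivenko-Cantelli: the sup distance should shrink as n grows.
for (n in c(50, 500, 5000)) {
  cat("n =", n, " sup distance =", sup_dist(rexp(n), pexp), "\n")
}

# DKW: for n = 100 and eps = 0.1, compare the simulated exceedance
# frequency with the bound 2 * exp(-2 * n * eps^2).
n   <- 100
eps <- 0.1
freq  <- mean(replicate(2000, sup_dist(rexp(n), pexp) > eps))
bound <- 2 * exp(-2 * n * eps^2)
cat("simulated frequency:", freq, " DKW bound:", bound, "\n")

# Property 1 at the fixed point x0 = 1: mean and variance of Fhat_n(x0).
x0   <- 1
vals <- replicate(2000, mean(rexp(n) <= x0))
cat("mean:", mean(vals), " target:", pexp(x0), "\n")
cat("var: ", var(vals),  " target:", pexp(x0) * (1 - pexp(x0)) / n, "\n")
```

The simulated exceedance frequency should sit at or below the DKW bound up to Monte Carlo noise. The bound is fairly tight for these settings, so the two numbers can be close.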

The DKW inequality can be applied to construct a confidence band. Note that for fixed x, $(L(x), U(x))$ is called a $1 - \alpha$ confidence interval for $F(x)$ if

$$P(L(x) < F(x) < U(x)) \ge 1 - \alpha.$$

$(L(x), U(x))$ is called a $1 - \alpha$ confidence band for $F$ if

$$P\{L(x) < F(x) < U(x), \ \forall x\} \ge 1 - \alpha.$$

Let

$$\epsilon_n^2 = \log(2/\alpha)/(2n),$$

$$L(x) = \max\{\hat{F}_n(x) - \epsilon_n, 0\}, \qquad U(x) = \min\{\hat{F}_n(x) + \epsilon_n, 1\}.$$
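The choice of the half-width comes from setting the right-hand side of the DKW inequality equal to $\alpha$. A short derivation, using only the quantities defined above:

```latex
\begin{align*}
2e^{-2n\epsilon_n^2} = \alpha
  &\iff \epsilon_n^2 = \frac{1}{2n}\log\frac{2}{\alpha}, \\
P\Big(\sup_x |\hat{F}_n(x) - F(x)| > \epsilon_n\Big)
  &\le 2e^{-2n\epsilon_n^2} = \alpha .
\end{align*}
```

So with probability at least $1 - \alpha$ the distance between $\hat{F}_n(x)$ and $F(x)$ is at most $\epsilon_n$ for every x simultaneously. Truncating the band at 0 and 1 costs nothing, because $F(x)$ always lies in $[0, 1]$.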


[Figure 2: Confidence band for the Empirical CDF of the nerve pulse data. Plot title: 95 percent confidence band for ECDF; horizontal axis: Time; vertical axis: $\hat{F}(x)$.]

From the DKW inequality,

$$P\{L(x) < F(x) < U(x), \ \forall x\} \ge 1 - \alpha.$$

Example: For the nerve-pulse data, the red lines in Figure 2 give a 95 percent confidence band using

$$\epsilon_n = \sqrt{\frac{1}{2n} \log\Big(\frac{2}{0.05}\Big)} = 0.048.$$

Here is the code:

    pulse <- read.delim("nerve-pulse.txt")
    x <- sort(pulse$time)
    Fhat <- ecdf(x)
    n <- length(x)
    # Half-width of the band; it does not depend on x
    epsilon <- sqrt((1 / (2 * n)) * log(2 / .05))
    # Lower and upper band at each sorted data point, truncated to [0, 1]
    l <- pmax(Fhat(x) - epsilon, 0)
    u <- pmin(Fhat(x) + epsilon, 1)
    par(mar = c(5, 5, 4, 1))
    plot(Fhat, verticals = TRUE, do.points = FALSE, xlab = "Time",
         ylab = expression(hat(F)(x)), main = "")
    lines(x, l, col = 2)
    lines(x, u, col = 2)
    title("95 percent confidence band for ECDF")
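The band computation can also be wrapped in a small reusable function. This is a sketch, not the author's code: `dkw_band` is a hypothetical helper name, and the only number taken from the text is the sample size of 799 waiting times.

```r
# Hypothetical helper: DKW confidence band evaluated at the sorted data points.
dkw_band <- function(x, alpha = 0.05) {
  n    <- length(x)
  xs   <- sort(x)
  Fhat <- ecdf(x)
  eps  <- sqrt(log(2 / alpha) / (2 * n))
  data.frame(x     = xs,
             lower = pmax(Fhat(xs) - eps, 0),
             upper = pmin(Fhat(xs) + eps, 1))
}

# Half-width for the sample size quoted in the text (n = 799, alpha = 0.05).
eps_799 <- sqrt(log(2 / 0.05) / (2 * 799))
print(round(eps_799, 3))

# Usage on the five observations listed earlier; with n = 5 the band is wide.
band5 <- dkw_band(c(0.21, 0.03, 0.05, 0.11, 0.59))
print(band5)
```

To draw the band as a step function rather than with straight segments between data points, one option is `lines(band5$x, band5$lower, type = "s")`, and likewise for `upper`.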
