Marita Olsson
Department of Mathematics,
Chalmers University of Technology, and Göteborg University
June 1998
Contents
1 Introduction
1.1 Phase-type distributions
1.2 What does EMpht do
2 How to run EMpht
2.1 Setup and compilation
2.2 Input
2.2.1 Sample as input
2.2.2 Distribution as input
2.3 Starting the EM-algorithm
2.3.1 Specification of the order
2.3.2 Specification of the structure
2.3.3 Starting values for (π, T)
2.3.4 Number of iterations
2.3.5 Step-length in Runge-Kutta
2.4 Output
3 PHplot
4 Examples
References
1 Introduction
EMpht is a programme for fitting phase-type distributions. It can be used either to fit
a phase-type distribution to a sample (which may contain censored observations), or to
make a phase-type approximation of another continuous distribution. The fitting procedure
consists of an iterative estimation of the parameters of the phase-type distribution, using
an EM-algorithm. The programme is an implementation of the EM-algorithm presented
in Asmussen et al (1996), and in Olsson (1996).
The EMpht-programme is written in ANSI-standard C. It is complemented by a
Matlab programme, PHplot, for graphical display of the fitted phase-type distribution.
EMpht is an extension of the EMPHT-programme (Häggström et al 1992); the main
difference is that EMpht can handle samples which contain right-censored and/or
interval-censored observations. The three parts of EMPHT (EMPHTenter, EMPHTdensity,
and EMPHTmain) are all contained in EMpht. EMPHTgraphics is replaced by PHplot.
2.2 Input
The first question displayed when EMpht has been started is about the input:
Type of input:
1. Sample
2. Density
Choose 1 if you want to fit a phase-type distribution to data, and 2 if you wish to
approximate another continuous distribution with a phase-type distribution.
2.2.1 Sample as input
EMpht needs to know whether or not your sample contains any censored observations.
Type of observations:
1. no censored observations
2. some right- and/or interval-censored observations
If Y denotes the variable which is observed, a right-censored observation c of Y corresponds
to the event {Y > c}, and an interval-censored observation (a, b] of Y corresponds to the
event {a < Y ≤ b}. (In the programme observations are sometimes referred to as times,
or failure times.) The observations (censored and non-censored) can be entered from the
keyboard or from a file, with or without weights.
Way of entering the observations:
1. Unweighted, from keyboard.
2. Weighted, from keyboard.
3. Unweighted, from file 'unweighted'
4. Weighted, from file 'sample'
The weight of an observation is simply the number of observations with the same value.
If observations are entered without weights, EMpht assigns a weight equal to one to each
observation. To indicate end of data, the input is always ended by -1.
In whatever way you enter a set of observations, EMpht creates a file named sample
(or overwrites an existing file) and stores the observations and their weights in it. Thus,
the next time you run the programme to do another phase-type fit to the same data-set,
you can use sample as input (option 4 above).
Let us illustrate the different ways of entering data with two examples.
If you have selected option 2 above, you will be asked to enter the number of cases
(the weight) after each entered observation:
Select (1-4): 2
Enter failure times and number of cases. Quit with time = -1.
Time 1:4
Number of cases:1
Time 2:9
Number of cases:1
Time 3:6
Number of cases:2
Time 4:1
Number of cases:1
Time 5:-1
To use option 3 above, you must first store the observations in a column (or in a row)
ending with -1, in a file named unweighted. In the file sample (option 4), each observation
must be followed by its weight. This is most easily done by letting each row of the file
consist of an observation and its weight (in that order). The file must end with -1.
4               4 1
9               9 1
6               6 2
1               1 1
6               -1
-1
Figure 1: The contents of the files unweighted (to the left), and sample (to the right), for
the data considered in Example 1.
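The relation between the two files in Figure 1 can be sketched in C. This is only an illustration and not code taken from EMpht: the function name collapse_to_weighted and its array interface are assumptions made for the example, and EMpht itself may simply store each unweighted observation with weight one. The sketch merges repeated values of an uncensored, unweighted sample into (value, weight) pairs, in order of first appearance.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of EMpht): merge repeated observations
   into (value, weight) pairs, in order of first appearance.
   obs[0..n-1] holds the unweighted sample WITHOUT the terminating -1.
   values[] and weights[] must have room for n entries.
   Returns the number of distinct values written. */
size_t collapse_to_weighted(const double *obs, size_t n,
                            double *values, int *weights)
{
    size_t m = 0;                       /* distinct values found so far */
    assert(obs != NULL && values != NULL && weights != NULL);
    for (size_t i = 0; i < n; i++) {
        size_t j = 0;
        while (j < m && values[j] != obs[i])
            j++;                        /* look for an earlier copy */
        if (j == m) {                   /* first occurrence */
            values[m] = obs[i];
            weights[m] = 1;
            m++;
        } else {
            weights[j]++;               /* repeated value: bump weight */
        }
    }
    return m;
}
```

Applied to the observations 4, 9, 6, 1, 6 it yields four pairs, with the value 6 carrying weight 2, which is the weighted form shown for the file sample.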
1 7             1 7 1
0 25            0 25 2
0 25            0 7 1
0 7             2 10 12 1
2 10 12         2 15 18 1
2 15 18         -1
-1
Figure 2: The contents of the files unweighted (to the left), and sample (to the right), for
the set of observations given in Example 2.
The densities 2-6 are defined for arbitrarily large x, and therefore you are asked to specify
an upper truncation point. The normal distribution is automatically truncated to the left,
so that only positive failure times are allowed. EMpht will also ask you to specify the
parameters of the density you have selected, using the same notation of the parameters as
in the density formulas given above.
Besides an upper truncation point and the parameter values, you will be asked to specify
the maximum acceptable probability in one point, and the maximum time interval
corresponding to one point. This is because the distribution given as input to EMpht is
discretised into a weighted sample, where the weights are the probability mass in a small
interval. The discretisation can be made more or less refined; the smaller the maximum time
interval and maximum probability you specify, the more refined the discretisation is, and
the larger the size of the weighted sample becomes. However, the size of the sample affects
the amount of computation involved in the EM-algorithm; the more observations, the more
computation is involved and the more time each iteration will require. There is usually not
much to gain in precision by letting the maximum acceptable probability be too small, (no
less than maybe 0.01, but it depends on the time scale involved). Usually, by choosing the
time interval l such that (truncation point / l) ≤ 500, the computations involved in each
iteration of the EM-algorithm are manageable. (The time each iteration requires, however,
depends mostly on the order p of the phase-type distribution. See section 2.3.1.)
A phase-type distribution as input
When option 6 is selected, you are asked to specify the number of phases, p, the initial
distribution π of the underlying Markov process, and the phase-type generator T. This
can be done via the keyboard, but it is more convenient to use a file input-phases. In both
cases the parameters are entered in the following order:
p
π1 T11 ... T1p
π2 T21 ... T2p
...
πp Tp1 ... Tpp
Let us consider an Erlang (3,2) distribution as an example of a phase-type distribution
as input. An Erlang (3,2) is the distribution of the sum of three independent exponential
random variables, each with expectation 1/2 (parameter=2). It can be represented as a
phase-type distribution of order p = 3, where the underlying Markov process starts in state
1 (implying π1 = 1, π2 = 0, π3 = 0), and jumps successively to state 2, to state 3, and
finally to the absorbing state. In each state it spends an exponentially distributed time
with parameter equal to 2, so T11 = T22 = T33 = -2. Thus, the file input-phases contains the
following:
3
1 -2 2 0
0 0 -2 2
0 0 0 -2
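The structure of this file can be reproduced with a short C sketch. The function names erlang_phases and exit_rates and the fixed order P are assumptions made for the example, not part of EMpht. The second function computes the exit-rate vector t = -T1 (minus the row sums of T), a standard quantity for phase-type distributions.

```c
#include <assert.h>

#define P 3   /* order of the example: Erlang (3,2) */

/* Fill pi and T with the Erlang (P, rate) representation: start in
   phase 1, move through the phases in order, each with the same rate. */
void erlang_phases(double rate, double pi[P], double T[P][P])
{
    assert(rate > 0.0);
    for (int i = 0; i < P; i++) {
        pi[i] = (i == 0) ? 1.0 : 0.0;
        for (int j = 0; j < P; j++)
            T[i][j] = 0.0;
        T[i][i] = -rate;              /* leave phase i at this rate */
        if (i + 1 < P)
            T[i][i + 1] = rate;       /* jump to the next phase */
    }
}

/* Exit-rate vector t = -T 1: the rate of absorption from each phase. */
void exit_rates(double T[P][P], double t[P])
{
    for (int i = 0; i < P; i++) {
        t[i] = 0.0;
        for (int j = 0; j < P; j++)
            t[i] -= T[i][j];
    }
}
```

With rate 2 the rows of (pi, T) agree with the three rows of the file, and the exit rates are (0, 0, 2): absorption is possible only from the last phase.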
printf("a:");
scanf("%lf", &parameter[0]);
printf("b:");
scanf("%lf", &parameter[1]);
break;
Now, line number 355 (in function density) should contain an expression for the Rayleigh
density at t. Thus, you change this row to
return( (par[0]+par[1]*t )*exp(-par[0]*t-0.5*par[1]*t*t) );
and the specification of the Rayleigh density is completed. After compilation EMpht will
have the Rayleigh distribution as option 7 on the list of densities.
Note that in the function density, the parameter given as par[0] is always the
parameter first entered in input-density, and par[1] is the second parameter entered. If
the distribution you want to approximate has more than two parameters, you must also
increase the number of elements that the vector parameter in function input-density is
allowed to contain. For instance, if you have three parameters you change parameter[2]
to parameter[3] on line 499.
2.4 Output
The estimates of π and T are written on the screen; the leftmost column is the π-vector,
the rightmost column is the t-vector, and in between is the T-matrix. The estimates of
the first 5 iterations are displayed, thereafter every 25th iteration, and the last one. Also,
the value of the log-likelihood function is given together with every displayed set of
parameter estimates.
It is not hard to change, in the programme code, which iterations are displayed.
On line 1573 in EMpht.c you will find
if ((k < 6) || ( (k % 25)==0 || k==NoOfEMsteps ))
The iteration number is k, and NoOfEMsteps is the specified number of iterations to be
performed. If, for instance, you want to see the first 3 estimates, every 10th estimate and
the last estimate of π and T on the screen, you can change line 1573 to
if ((k < 4) || ( (k % 10)==0 || k==NoOfEMsteps ))
and recompile the programme.
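The same test can be written as a small function with the three choices as arguments, which makes the rule easier to read. This is a hypothetical rewrite for illustration; the name show_iteration and its parameters do not occur in EMpht.c.

```c
#include <assert.h>

/* Hypothetical generalisation of the test on line 1573: show iteration k
   if it is among the first `first`, a multiple of `every`, or the last. */
int show_iteration(int k, int first, int every, int last)
{
    assert(every > 0);
    return (k <= first) || (k % every == 0) || (k == last);
}
```

The original condition corresponds to show_iteration(k, 5, 25, NoOfEMsteps), and the modified one to show_iteration(k, 3, 10, NoOfEMsteps).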
Two files, inputdistr and phases, are created when running EMpht. The file inputdistr
contains the input, either a sample or a discretised density, and is used as input to the
Matlab programme PHplot. (However, if your input is a sample containing interval-censored
observations, the file inputdistr will be empty. Calculation of the empirical distribution for
such a sample has not yet been implemented.)
The file phases contains the final estimates of π and T calculated in EMpht (it is
actually updated every time new parameters are displayed). It is used as input to PHplot,
but can also serve as starting values for the EM-algorithm (choice number 7 when selecting
structure), if EMpht is restarted to continue the last performed fit.
3 PHplot
To run PHplot you must first start Matlab. In Matlab, PHplot is started by typing PHplot
(provided PHplot has been saved in a file named PHplot.m). The following menu will be
shown on the screen:
1. Mean and standard deviation
2. Display survival function
3. Display distribution function
4. Display density
5. Display failure rate
6. Print graph
7. Save graph on file "PHgraph"
8. Load new estimates of pi and T
9. Quit
Select (1-9):
The first option gives the mean and standard deviation of the fitted phase-type
distribution. Options 2-5 give a graph of the fitted phase-type distribution. If you have used
EMpht to approximate another continuous distribution, this distribution will be plotted
together with the fitted phase-type distribution.
If your input to EMpht is a sample containing no censored observations, the empirical
distribution function is plotted in option 3 (and one minus the empirical distribution in
option 2), together with the fitted phase-type distribution. If your input was a sample
containing some right-censored observations, the Kaplan-Meier estimate of the survival
function of the failure time is plotted together with the survival function of the fitted
phase-type distribution in option 2 (and one minus the Kaplan-Meier estimate in option 3).
Options 4 and 5 give the density and failure rate, respectively, of the fitted phase-type
distribution only.
Options 6 and 7 can be used to print the graph currently displayed and to save it to a file.
Option 8 is convenient if you run EMpht and PHplot simultaneously in different windows,
and want to plot the current fit. Using option 8, you do not have to quit and restart
PHplot every time you have updated a fit (performed more iterations) in EMpht.
4 Examples
In this section some examples of approximations of other distributions by phase-type dis-
tributions are presented. Hopefully these examples can provide some guidance on how
to choose the truncation point, maximal time interval, maximal probability, number of
phases, etc.
[Plot omitted; vertical axis: densities.]
Figure 1: The density of an inverse Gaussian distribution (solid curve) and two phase-type
approximations; p=4 (dashed curve), and p=6 (dotted curve).
Second example; a lognormal distribution
A lognormal distribution with parameters μ=1 and σ=1 is approximated in our second
example by phase-type distributions of order 2 and 4 (Figure 2). This lognormal distribution
has expectation about 4.5 and standard deviation about 5.9. In EMpht we truncated
the distribution at 25, set the maximum probability to 0.01 and the maximum time
interval to 0.05. The phase-type approximation with p=2 is based on 2000 iterations of the
EM-algorithm, and for p=4 we used 4000 iterations.
[Plot omitted; vertical axis: densities.]
Figure 2: The density of a lognormal distribution (solid curve), and two phase-type
approximations; p=2 (dashed curve), and p=4 (dotted curve).
[Plot omitted; vertical axis: densities.]
Figure 3: The density of a Weibull distribution with parameters 5 and 1 (solid curve), and
three phase-type approximations of order 10, 15 and 20. The pointedness of the phase-type
densities increases with the order.
[Plot omitted; vertical axis: densities.]
Figure 4: The density of a Weibull distribution with parameters 1.8 and 1 (solid curve), and
a phase-type approximation of order 5 (dashed curve).
References
[1] D. Aldous and L. Shepp (1987) The least variable phase-type distribution is Erlang.
Commun. Statist. -Stochastic Models 3, 467-473.
[2] S. Asmussen, O. Nerman, and M. Olsson (1996) Fitting phase type distributions via
the EM algorithm. Scand. J. Statist. 23, 419-441.
[3] S. Asmussen and M. Olsson (1997) Phase-type distribution. Encyclopedia of Statistical
Sciences Update volume 2, 525-530.
[4] O. Häggström, S. Asmussen and O. Nerman (1992) EMPHT - A program for fitting
phase type distributions. Technical report. Department of Mathematics, Chalmers
University of Technology, Göteborg, Sweden.
[5] M. F. Neuts (1981) Matrix-Geometric Solutions in Stochastic Models. Johns Hopkins,
Baltimore.
[6] C. A. O'Cinneide (1990) Characterizations of phase-type distributions. Commun.
Statist. -Stochastic Models 6, 1-57.
[7] M. Olsson (1996) Estimation of phase type distributions from censored data. Scand.
J. Statist. 23, 443-460.
Marita Olsson
Department of Mathematics
Chalmers University of Technology
S-412 96 Göteborg
SWEDEN
E-mail: marita@math.chalmers.se