Abstract
As computer systems continue to be applied to mission-critical environments, techniques to evaluate their dependability become more and more important. Of the dependability measures used to characterize a system, availability is one of the most important. Techniques to evaluate a system's availability can be broadly categorized as measurement-based and model-based. Measurement-based evaluation is expensive, as it requires building a real system, taking measurements, and then analyzing the data statistically. Model-based evaluation, on the other hand, is inexpensive and relatively easier to perform. In this paper, we first look at some availability modeling techniques and take up a case study from an industrial setting to illustrate the application of the techniques to a real problem. Although easier to perform, model-based availability analysis poses problems such as the largeness and complexity of the models developed, which make the models difficult to solve. This paper also illustrates several techniques to deal with largeness and complexity issues.
1 Introduction
Complex computer systems are widely used in different applications ranging from flight control and command and control systems to commercial systems like information and financial services. These applications demand high performance and high availability. Availability evaluation addresses failure and recovery aspects of a system, while performance evaluation addresses processing aspects and assumes that the system components do not fail. For gracefully degrading systems, a measure that combines system performance and availability aspects is more meaningful than separate measures of performance and availability. These composite measures are called performability measures. The two basic approaches to evaluate the availability/performance measures of a system are measurement-based and model-based. In measurement-based evaluation, the required measures are estimated from measured data using statistical inference techniques. The data is measured from a real system or its prototype. In the case of availability evaluation, measurements are not always feasible, either because the system has not been built yet or because it is too expensive to conduct experiments. In a high-availability system, one would need to measure data from several systems to gather good sample data, and injecting faults into a system can be an expensive procedure. Model-based evaluation is the cost-effective solution, as it allows system evaluation without having to build and measure a system. In this paper we discuss availability modeling techniques and their usage in practice. To emphasize the practicality of these techniques, we discuss their pros and cons with respect to a case study: the VAXcluster systems of Digital Equipment Corporation (DEC).
In this paper, we first discuss different availability modeling approaches in Section 2. In Section 3, we discuss the benefits of utilizing a composite availability and performance model in practice instead of a pure availability model. Our discussion emphasizes this point using a model developed for multiprocessors to determine the optimal number of processors in the system. In Section 4, we present a case study to demonstrate the utility of availability modeling in a corporate environment.
2 Modeling Approaches
Model-based evaluation can be through discrete-event simulation, analytic models, or hybrid models combining simulation and analytic parts. A discrete-event simulation model can depict detailed system behavior, as it is essentially a program whose execution simulates the dynamic behavior of the system and evaluates the required measures. An analytic model consists of a set of equations describing the system behavior. The evaluation measures are obtained by solving these equations. In simple cases closed-form solutions are obtained, but more frequently numerical solutions of the equations are necessary.

The main benefit of discrete-event simulation is the ability to depict detailed system behavior in the models. Also, the flexibility of discrete-event simulation allows its use in performance, availability and performability modeling. The main drawback of discrete-event simulation is the long execution time, particularly when tight confidence bounds are required in the solutions obtained. Also, carrying out a "what if" analysis requires rerunning the model for different input parameters. Advances in simulation speed-up, such as regenerative simulation, importance sampling, importance splitting, and parallel and distributed simulation, also need to be considered.

Analytic models are more of an abstraction of the real system than a discrete-event simulation model. In general, analytic models tend to be easier to develop and faster to solve than a simulation model. The main drawback is the set of assumptions that are often necessary to make analytic models tractable. Recent advances in model generation and solution techniques, as well as in computing power, make analytic models more attractive. In this paper, we discuss model-based evaluation using analytic techniques and how one can achieve results that are useful in practice.
A system modeler can choose either state space or non-state space analytic modeling techniques. The choice of an appropriate modeling technique to represent the system behavior is dictated by factors such as the measures of interest, the level of detailed system behavior to be represented and the capability of the model to represent it, the ease of construction, and the availability of software tools to specify and solve the model. In this section, we discuss several non-state space and state space modeling techniques.
In a reliability block diagram (RBD) each component of the system is represented as a block [9, 12]. The blocks are then connected in series and/or parallel based on the operational dependency between the components. If all the components need to be operational for the system to be up, the blocks in an RBD are connected in series. On the other hand, if the system can survive with at least one component, the blocks are connected in parallel. An RBD can be used to model availability if the repair times (and failure times) are all independent. Figure 1(a) shows a multiprocessor availability model with n processors where at least one processor is required for the system to be up; the RBD thus represents a simple parallel system. Given a failure rate λ and repair rate μ, the availability of processor Proc_i is given by

    A_i = μ / (λ + μ).    (1)

The system availability is then

    A = 1 − ∏_{i=1}^{n} (1 − A_i) = 1 − (λ / (λ + μ))^n.    (2)
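As a quick numerical check, Equations (1) and (2) can be evaluated directly; the parameter values below are illustrative only, not taken from the case study:

```python
def processor_availability(lam: float, mu: float) -> float:
    """Steady-state availability of one processor: A_i = mu / (lam + mu)."""
    return mu / (lam + mu)

def parallel_availability(n: int, lam: float, mu: float) -> float:
    """System is up if at least one of n independent processors is up."""
    a_i = processor_availability(lam, mu)
    return 1.0 - (1.0 - a_i) ** n

lam = 1.0 / 6000.0   # illustrative: one failure per 6000 hours
mu = 1.0             # illustrative: one-hour mean repair time
for n in (1, 2, 4):
    print(n, parallel_availability(n, lam, mu))
```

Adding processors drives the unavailability term (λ/(λ+μ))^n toward zero geometrically, which is why even a second processor improves availability dramatically under the independence assumption.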
A fault tree [9], like a reliability block diagram, is useful for availability analysis. It is a pictorial representation of the sequence of events/conditions that must be satisfied for a failure to occur. A fault tree uses AND, OR and k-of-n gates to represent this combination of events in a tree-like structure. To represent situations where one failure event propagates failure along multiple paths in the fault tree, fault trees can have repeated nodes. Several efficient algorithms for solving fault trees exist. These include algorithms for series-parallel systems (for fault trees without repeated components) [17], a multiple variable inversion (MVI) algorithm called the LT algorithm to obtain a sum of disjoint products (SDP) from the mincut set [18], and the factoring/conditioning algorithm that works by factoring a fault tree with repeated nodes into a set of fault trees without repeated nodes [20]. Binary decision diagram (BDD)-based algorithms can be used to solve very large fault trees [21, 22]. Figure 1(b) shows the fault tree model for our multiprocessor system, in which UA_i represents the unavailability of Proc_i.
Figure 1: (a) Reliability block diagram and (b) fault tree model for the multiprocessor system.
i. Markovian Models

Markov Chains

In this section we consider homogeneous Markov chains. A homogeneous continuous time Markov chain (CTMC) [12] is a state space model in which each state represents a condition of the system. In a homogeneous CTMC, transitions from one state to another occur after a time that is exponentially distributed. The arcs representing a transition from one state to another are labeled by the constant rate corresponding to the exponentially distributed time of the transition. If a state in the CTMC has no transitions leaving it, that state is called an absorbing state, and a CTMC with one or more such states is said to be an absorbing CTMC. For the multiprocessor example, we now illustrate how a Markov chain can be developed to capture shared repair and multiple failure modes.
The parameters associated with the system availability model that we now develop for our multiprocessor system are the failure rate λ of each processor and the processor repair rate μ. A processor fault is covered with probability c and not covered with probability 1 − c. After a covered fault, the system is up in a degraded mode after a reconfiguration delay. On the other hand, an uncovered fault is followed by a longer delay imposed by a reboot action. The reconfiguration and reboot delays are assumed to be exponentially distributed with means 1/δ and 1/β respectively. In practice the reconfiguration and reboot times are extremely small compared to the times between failures and repairs, hence we assume that no further failure occurs during reconfiguration or reboot.
Figure 2: CTMC availability model of the multiprocessor system with coverage, reconfiguration and reboot.
Figure 3: GSPN model of an M/M/n/b queue (places buffer and serving; transitions arr, request, service).

An incoming task is rejected when there are b − n tokens in place buffer, since there can only be b tasks in the system: n in place serving and b − n in place buffer. Therefore the probability that an incoming task is rejected is the steady-state probability that place buffer contains b − n tokens.
In practical system design, a pure availability model may not be enough for systems such as gracefully degradable ones. In conjunction with availability, the performance of the system as it degrades needs to be considered. This requires a "performability" model that includes both performance and availability measures. In the next section, we present an example of a system for which a performability measure is needed.
A CTMC model for the failure/repair characteristics of the multiprocessor system is shown in Figure 2. The details of the model were already discussed when introducing Markov chains in the previous section. The downtime during an observation interval of duration T is given by UA(n) · T. The results shown in Figure 4 assume T is 1 year, i.e., 8760 hours. In Figure 4(a) we plot the downtime D(n) against n for varying values of the mean reconfiguration delay, using c = 0.9, λ = 1/6000 per hour, and μ = 12 per hour. In Figure 4(b), we plot D(n) against n for different coverage values with a mean reconfiguration delay of 10 seconds. We conclude from these results that the availability benefits of multiprocessing (i.e., an increase in availability with an increase in the number of operational processors) are possible only if the coverage is near-perfect and the reconfiguration delay is very small, or if most of the other processors are able to carry out useful work while a fault is being handled. We further observe that for most practical parameter values, the optimal number of processors is 2 or 3. In the next subsection we consider a performance-based model for the multiprocessor sizing problem.
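A minimal sketch of such a downtime computation, using a deliberately simplified shared-repair chain (coverage, reconfiguration and reboot states are omitted, so the numbers will not match Figure 4):

```python
from math import factorial

def shared_repair_probs(n: int, lam: float, mu: float):
    """Steady-state probabilities of a simplified shared-repair model:
    state i = number of working processors, failure rate i*lam from i
    to i-1, and a single repair facility with rate mu from i to i+1.
    The chain is birth-death, so pi_i has a product form:
    pi_i proportional to (mu/lam)^i / i!."""
    weights = [(mu / lam) ** i / factorial(i) for i in range(n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def downtime_minutes_per_year(n: int, lam: float, mu: float) -> float:
    """D(n) = UA(n) * T with T = 1 year; in this sketch the system is
    down only in state 0 (no quorum requirement)."""
    pi = shared_repair_probs(n, lam, mu)
    return pi[0] * 8760.0 * 60.0

# lam = 1/6000 per hour and mu = 12 per hour are the values quoted in the text.
for n in (1, 2, 3):
    print(n, downtime_minutes_per_year(n, 1.0 / 6000.0, 12.0))
```

Even this stripped-down chain reproduces the qualitative effect discussed above: downtime falls steeply with each added processor when failures are rare relative to repairs.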
Figure 4: (a) Downtime D(n) versus n for varying mean reconfiguration delays; (b) D(n) versus n for different coverage values.
Along the lines of the GSPN example discussed in the previous section, we used an M/M/n/b queuing model with finite buffer capacity (see Figure 3) to compute the probability that a task is rejected because the buffer is full. In Figure 5 we plot the loss probability as a function of n for different values of the arrival rate. We observe that the loss probability decreases as the number of processors increases. The conclusion from the performance model of the fault-free system is that the system improves as the number of processors is increased. The details of this model and results are presented in [2].
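The loss probability of an M/M/n/b queue can be computed from its birth-death steady-state distribution; a sketch (parameter values illustrative):

```python
def mmnb_loss_probability(n: int, b: int, lam: float, mu: float) -> float:
    """Probability that an arriving task is rejected in an M/M/n/b queue
    (b = total capacity, b >= n). By the PASTA property this is the
    steady-state probability that all b slots are occupied."""
    assert b >= n
    rho = lam / mu
    # Unnormalized steady-state probabilities p_j for j tasks in the system;
    # the service rate with j tasks present is min(j, n) * mu.
    p = [1.0]
    for j in range(1, b + 1):
        p.append(p[-1] * rho / min(j, n))
    return p[b] / sum(p)

# Loss probability shrinks as processors are added, matching Figure 5's trend.
for n in (1, 2, 4):
    print(n, mmnb_loss_probability(n, 8, 1.5, 1.0))
```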
The above models point out the deficiency of considering only a pure availability or a pure performance measure. The pure availability measure ignores the different levels of performance in various system states, while the pure performance measure ignores the failure/repair behavior of the system. The next section considers combined measures of performance and availability.
Different levels of performance can be taken into account by attaching a reward rate r_i, corresponding to some measure of performance, to each state i of the failure/repair Markov model in Figure 2. The resulting Markov reward model can then be analyzed for various combined measures of performance and availability. The simplest reward rate assignment is to let r_i = i for states with i operational processors and r_i = 0 for down states. With the reward assignment shown in Table 1, we can compute the capacity-oriented availability COA(n) as the expected reward rate in the steady state. COA(n) is an upper bound on system performance that equates performance with system capacity. When i processors are operational we used an M/M/i/b queuing model (such as the one in Figure 3) to compute performance-oriented rewards.
Table 1: Reward rate assignments for the multiprocessor Markov model. For the capacity-oriented availability COA(n), states with i operational processors (1 ≤ i ≤ n) receive reward rate i, while the down state and the reconfiguration and reboot states (X_i and Y_i, i = 0, …, n − 2) receive reward rate 0. For the throughput-oriented measure, states with i operational processors receive T(i), the throughput of a system with i processors and b buffers, and the remaining states receive 0. For the total loss probability, states with i operational processors receive q(i) + (1 − q(i)) · P(R_i(b) > d), and the down, reconfiguration and reboot states receive 1.
Figure 7: Total loss probability versus number of processors for different task arrival rates
Here q(i) is the probability that an arriving task is rejected in a system with i operational processors, R_i(b) is the response time for a system with i operational processors and b buffers, and d is the deadline on task response time. In Figure 7, we plot the total loss probability as a function of n for different values of the task arrival rate. We observe that the optimal number of processors increases with the task arrival rate, with tighter deadlines, and with smaller buffer spaces.
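The reward r_i = q(i) + (1 − q(i)) P(R_i(b) > d) can be sketched numerically. Since the distribution of R_i(b) is not given here, the sketch below approximates P(R > d) by assuming an exponentially distributed response time with mean obtained from Little's law; that is an assumption of this sketch, not of the model in [2]:

```python
from math import exp

def state_probs_mmib(i: int, b: int, lam: float, mu: float):
    """Steady-state probabilities of an M/M/i/b queue (j tasks in system)."""
    p = [1.0]
    for j in range(1, b + 1):
        p.append(p[-1] * (lam / mu) / min(j, i))
    s = sum(p)
    return [x / s for x in p]

def total_loss_reward(i: int, b: int, lam: float, mu: float, d: float) -> float:
    """Reward for a state with i operational processors:
    q(i) + (1 - q(i)) * P(R > d), with P(R > d) approximated by an
    exponential response-time assumption (assumption of this sketch)."""
    p = state_probs_mmib(i, b, lam, mu)
    q = p[b]                                   # rejection probability
    mean_jobs = sum(j * pj for j, pj in enumerate(p))
    lam_eff = lam * (1.0 - q)                  # accepted arrival rate
    mean_resp = mean_jobs / lam_eff            # Little's law: E[R] = E[N]/lam_eff
    p_miss = exp(-d / mean_resp)               # P(R > d) under the assumption
    return q + (1.0 - q) * p_miss

print(total_loss_reward(2, 8, 2.0, 1.0, 3.0))
```

Weighting these per-state rewards by the steady-state probabilities of the failure/repair CTMC (with reward 1 in down states, as in Table 1) yields the total loss probability plotted in Figure 7.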
Figure 8: VAXcluster system architecture: VAX processors and HSC storage controllers with their disks, connected by the Computer Interconnect through a Star Coupler.
The first model of VAXclusters uses a non-state space method, namely the reliability block diagram. This approach was taken in the availability model of VAXclusters by Balkovich et al. [6]. We use this approach to partition the VAXcluster along functional lines, which allows us to model each component type separately. In Figure 9, the block diagram represents a VAXcluster configuration with n processors, n HSCs and n disks. We assume that the VAXcluster is down if all the components of any one of its subsystems are down.
Figure 9: Reliability block diagram of the VAXcluster, partitioned into a processing subsystem (VAXes) and a storage subsystem (HSCs and disks).
    A = [1 − (λ_P / (λ_P + (1/μ_P + 1/μ_F)^{−1}))^n] · [1 − (λ_H / (λ_H + (1/μ_H + 1/μ_F)^{−1}))^n] · [1 − (λ_D / (λ_D + (1/μ_D + 1/μ_F)^{−1}))^n]    (3)

Here,
1/λ_P is the mean time between VAX processor failures;
1/λ_H is the mean time between HSC failures;
1/λ_D is the mean time between disk failures;
1/μ_F is the mean field service travel time;
1/μ_P, 1/μ_H and 1/μ_D are the mean times to repair a VAX processor, an HSC and a disk respectively.
The assumption that a VAXcluster is down only when all the components of any of the three subsystems are down is not in tune with reality. For a VAXcluster to be operational, the system should meet quorum, where quorum is the minimum number of VAXes required for the VAXcluster to function.
Figure 10: Fault tree model of the VAXcluster: an OR gate over three (n − k + 1)-of-n gates, one each for the processor (U_P), HSC (U_H) and disk (U_D) failure events.
In this section, we present a fault tree model for the VAXcluster configuration discussed in Section 4.1. Figure 10 is a model for the VAXcluster with n processors, n HSCs, and n disks. Observe that in a block diagram model the structure tells us when the system is functioning, while in a fault tree model the structure tells us when the system has failed. In addition, we have extended the model to include a quorum required for operation. The cluster is operational as long as k out of n processors, HSCs and disks are up. The negation of this operational information is depicted in the fault tree as follows. The topmost node denotes "Cluster Failure" and the associated OR gate specifies that a cluster fails if (n − k + 1) processors, (n − k + 1) HSCs, or (n − k + 1) disks are down. The steady-state unavailability of the cluster, U_cluster, can then be computed from this fault tree.
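The quorum condition can be evaluated numerically for independent, identically distributed components; a sketch of the (n − k + 1)-of-n computation, with illustrative component unavailabilities:

```python
from math import comb

def subsystem_unavailability(n: int, k: int, u: float) -> float:
    """Probability that at least n-k+1 of n i.i.d. components are down,
    i.e., that fewer than the quorum k are up, given per-component
    unavailability u."""
    return sum(comb(n, j) * u**j * (1 - u)**(n - j)
               for j in range(n - k + 1, n + 1))

def cluster_unavailability(n: int, k: int, u_p: float, u_h: float, u_d: float) -> float:
    """OR gate over the three subsystems: the cluster fails if any subsystem
    loses quorum (stochastic independence assumed, as in the fault tree)."""
    a = 1.0
    for u in (u_p, u_h, u_d):
        a *= 1.0 - subsystem_unavailability(n, k, u)
    return 1.0 - a

# Illustrative: 4 of each component type, quorum of 3, small unavailabilities.
print(cluster_unavailability(4, 3, 0.01, 0.02, 0.03))
```

The special cases check out: with k = n the subsystem fails as soon as one component is down, and with k = 1 it fails only when all n are down, recovering the series and parallel RBD formulas respectively.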
The RBD and fault tree VAXcluster availability models are very limited in their depiction of the failure/recovery behavior of the VAXcluster. For example, they assume that each component has its own repair facility and that there is only one failure/recovery type. In fact, combinatorial models like RBDs, fault trees and reliability graphs require system components to behave in a stochastically independent manner. Dependencies of many different kinds exist in VAXclusters, and hence combinatorial models are not entirely satisfactory for such systems. State space modeling techniques like Markovian models can include different kinds of dependencies. In the following sections, we develop state space models for the processing subsystem and the storage subsystem separately, and use a hierarchical technique to combine the models of the two subsystems to obtain an overall system availability model.
Figure 11: CTMC model of the processing subsystem, with states labeled (abc, d).

The steady-state availability of the processing subsystem is the sum of the steady-state probabilities of the up states of this CTMC:

    A = Σ_{(abc,d) ∈ Up} P_{abc,d},    (5)
where P_{abc,d} denotes the steady-state probability that the process is in state (abc, d). We computed the availability of the VAXcluster system by solving the above CTMC using SHARPE [9], a software package for availability/performability analysis. The main problem with this approach was that the size of the CTMC grew exponentially with the number of processors in the VAXcluster system. This largeness posed two challenges: (1) the capability of the software to solve models with thousands of states, which arise for VAXclusters with n > 5 processors, and (2) the problem of actually generating the state space. In the next section we address these two drawbacks.
    A_n = Σ_{n_X = 1}^{n} (n! / (n_X! (n − n_X)!)) P_X^{n_X} P_Z^{n − n_X} = (P_X + P_Z)^n − P_Z^n    (6)
The authors could analyze different VAXcluster configurations by simply varying the number of processors n in the above equation. The main drawbacks of this approach are the approximate nature of the solution versus an exact solution, and the need to make simplifying assumptions, one of these being an independent repairman for each processor. In the next section, we illustrate another approximation technique to deal with large subsystems.
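The closed form in Equation (6) is a binomial identity and can be checked numerically; P_X and P_Z below are illustrative single-processor state probabilities, not values from the case study:

```python
from math import comb, isclose

def availability_sum(n: int, p_x: float, p_z: float) -> float:
    """Direct binomial sum over configurations with at least one processor up."""
    return sum(comb(n, j) * p_x**j * p_z**(n - j) for j in range(1, n + 1))

def availability_closed_form(n: int, p_x: float, p_z: float) -> float:
    """Closed form of Equation (6): the j = 0 term of the binomial expansion
    of (p_x + p_z)^n is p_z^n, so removing it leaves the sum above."""
    return (p_x + p_z) ** n - p_z ** n

p_x, p_z = 0.95, 0.05
for n in range(1, 8):
    assert isclose(availability_sum(n, p_x, p_z),
                   availability_closed_form(n, p_x, p_z))
    print(n, availability_closed_form(n, p_x, p_z))
```

The closed form is what makes the "vary n and re-evaluate" analysis cheap: no state space has to be regenerated for each configuration.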
In this section, we discuss a novel availability model for VAXclusters with large storage subsystems. In [10], a fixed-point iteration scheme was used over a set of CTMC sub-models. The decomposition of the model into sub-models controlled the state space explosion, and the iteration modeled the repair priorities between the different storage components. The model considered configurations with shadowed (commonly known as mirrored) disks and characterized system-level along with application-level recovery.

In Figure 12, we show the block diagram of an example storage system configuration that will be used to demonstrate the technique. The configuration shown consists of two HSCs and a set of disks. The disks are further classified into two system disks and two application disks. The operating system resides on the system disks, and the user accounts and other application software on the application disks. Further, it is assumed that the disks are shadowed and dual pathed and ported between the two HSCs [10]. A disk dual pathed between two HSCs can be accessed cluster-wide in a coordinated way through either HSC. In case one of the HSCs fails, a dual ported disk can be accessed through the other HSC after a brief failover period.
Figure 12: Block diagram of the example storage subsystem configuration: two HSCs (HSC1, HSC2), two system disks and two application disks.

Figure 13: CTMC models of (a) the HSC subsystem, (b) the system disk subsystem and (c) the application disk subsystem, with states 2X, 1X and 0X.
State nX represents that n components of the subsystem are operational, where n can take the values 0, 1 or 2.
Figure 14: Top-level reliability block diagram for the storage subsystem: HSC, SDISK and ADISK blocks in series.
State TnX represents that the field service has arrived, (n − 1) components of the subsystem are operational, and the rest are under repair.

In the above notation, the value of X is H, S or A, where H is associated with the HSC subsystem model, S with the system disk subsystem model and A with the application disk subsystem model. The steady-state availability of the storage subsystem in Figure 14 is given by

    A = A_H · A_S · A_A,    (7)

where each subsystem availability A_X is the sum of the steady-state probabilities of the up states of the corresponding sub-model,

    A_X = Σ_{up states} (P_{iX} + P_{TiX}),    (8)

with P_{iX} and P_{TiX} the steady-state probabilities that the Markov chain is in state iX and state TiX respectively.
In the third approximation we took into account disk reload and system recovery. This covers the following activities. When a disk subsystem experiences a failure, data on the disk may be corrupted or lost. After the disk is repaired, the data is reloaded onto the disk from an external source, such as a backup disk or tape. While the reload is a local activity of a disk subsystem, recovery is a global, system-wide activity. This behavior is incorporated in the Markov models of Figure 15(a), (b) and (c) as follows. The HSC Markov model is enhanced by including application recovery states R2H and R1H after the loss of both HSCs in the HSC subsystem. The system disk Markov model is extended by incorporating reload states L2S and L1S, and application recovery states R2S and R1S. The reload followed by application recovery starts immediately after the first disk is repaired. We further assume that a component can suffer failures during a reload and/or recovery. The application disk Markov model is extended similarly to the system disk model by including reload states L2A and L1A, and recovery states R2A and R1A. The expression for the steady-state availability of the storage subsystem is similar to the expression obtained in the second approximation.
In the fourth approximation, the assumption of an independent repair facility for each subsystem is eliminated. In this approximation, the repair facility is shared between subsystems, and when more than one component is down, the following repair priority is assumed: (1) any subsystem with all of its components failed is repaired first; (2) otherwise, an HSC is repaired first, a system disk second, and an application disk third. This repair priority scheme does not change the Markov model for the HSC subsystem, but it changes the models for the system and application disk subsystems. The system disk has the second highest priority, and hence the system disk repair rate μ_D is slowed down by multiplying it by P_1, the probability that both HSCs are operational given that field service is present and the system is not in a recovery mode. P_1 is given by

    P_1 = P_{2H} / (P_{2H} + P_{T1H} + P_{T2H}).    (9)
In [10] it is assumed that a component can be repaired during recovery. The system disk repair rate μ_D from the recovery states is then slowed down by multiplying it by P_2, where

    P_2 = P_{R2H} / (P_{R1H} + P_{R2H}).    (10)
Figure 15: CTMC models:
(a) system recovery included for the HSC subsystem;
(b) disk reload and system recovery included for the SDisk subsystem;
(c) disk reload and system recovery included for the ADisk subsystem.
    P_3 = (P_{2H} / A) · (P_{2S} / B),    (11)

where A = P_{2H} + P_{T1H} + P_{T2H} and B = P_{2S} + P_{T1S} + P_{T2S} + P_{L1S} + P_{L2S}. P_3 thus expresses the probability that both HSCs are operational, given that the HSC subsystem is not in the recovery states or in states with fewer than two HSCs operational, and that both system disks are operational, given that the system disk subsystem is in non-recovery states or states with more than one system disk up.
The steady-state availability is computed as in the first approximation.

In the above approximations we included the field service travel time for each subsystem. In the real world, if a field service person is already present and repairing a component in one subsystem, he would respond to a failure in another subsystem; in that case we should not include the travel time twice. Also, the field service would follow the repair priority described above. The Markov model for each subsystem can be modified by iteratively checking for the presence of the field service person in the other two Markov models. The field service person is assumed to wait on site until reload and recovery are completed in the SDisk and ADisk subsystems, and until recovery is completed in the HSC subsystem.
The HSC subsystem model is extended as follows. The rate of each transition due to a component failure is probabilistically split using the variable y_1 (or 1 − y_1), where y_1 is the probability that the field service is present repairing a component in either of the two disk subsystems (Equation 12). The initial value of y_1 is taken to be 0 in the first iteration; thereafter, the value of y_1 computed from the other sub-models is used in the next iteration.
The system (application) disk subsystem model is extended as follows. The rate of every transition due to a component failure that occurs in the absence of the repair person in the system (application) disk subsystem is multiplied by y_2 (or 1 − y_2). The expression for y_2 is similar to the expression for y_1, except that S (A) is replaced by H. This takes into account that the field service may be present in the HSC and/or the application (system) disk subsystem.
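The y_1/y_2 coupling is an instance of fixed-point iteration between sub-models; a minimal sketch with a toy update function (in the real scheme, the update would re-solve the other two CTMC sub-models at each step, which is not reproduced here):

```python
def fixed_point(update, y0: float = 0.0, tol: float = 1e-9, max_iter: int = 1000) -> float:
    """Generic fixed-point iteration: start from y0 and apply `update`
    until successive values agree within tol."""
    y = y0
    for _ in range(max_iter):
        y_new = update(y)
        if abs(y_new - y) < tol:
            return y_new
        y = y_new
    raise RuntimeError("fixed-point iteration did not converge")

# Toy coupling: this sub-model's "field service busy elsewhere" probability
# depends (here, linearly) on the value fed back from the other sub-models.
# The coefficients are purely illustrative.
y1 = fixed_point(lambda y: 0.3 * (1.0 - y) + 0.05)
print(y1)
```

Starting from y_1 = 0, as in the text, the iteration converges whenever the update is a contraction, which is the usual informal justification for such decomposition schemes.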
In a similar manner, the next approximation refined the model by taking into account the global nature of system recovery. That is, if a recovery is ongoing in one subsystem, the other two subsystems are forced to go into recovery. The approximated effect of global recovery is achieved with an iterative scheme that allows for interaction between the sub-models. The final approximation only modified the HSC subsystem model to incorporate the effect of an HSC failover, as shown in Figure 16. (Failover is the procedure of switching to an alternate path or component after the failure of a path or a component [19]. During the HSC failover period all the disks are switched to the operational HSC.)

In state 2H, instead of a single failure transition labeled 2λ_H, we now have three failure transitions. If the primary HSC fails, the model transitions from state 2H to state PFAIL with rate λ_H, and PFAIL transitions to state 1H after a failover to the secondary HSC.

Figure 16: CTMC model of the HSC subsystem with failover, including primary and secondary HSC failure states (PFAIL, SFAIL) and the corresponding recovery states (RPFAIL, RSFAIL).
The steady-state availability of the storage subsystem is given by Equation 7. In [10], after various experiments, it was observed that the storage downtime is more sensitive to the detection of a secondary HSC failure than to the average failover time.
In this section we discuss a VAXcluster availability model that tolerates largeness and automates the generation of large Markov models. Ibe, Sathaye et al. [1] use generalized stochastic Petri nets (GSPNs) to model VAXclusters. The authors used the software tool SPNP [8] to generate and solve the Markov model underlying the SPN. In fact, the SPN model in [1] allows extensions that permit specifications at the net level, hence the resulting model is a stochastic reward net.

Figure 17 shows a partial SPN model of the VAXcluster system. The details of the entire model in [1] are beyond the scope of this paper. The place P_UP with N tokens represents the initial condition that all N processors are up. The processors can suffer a permanent or an intermittent failure, represented by the timed transitions t_PERM and t_INT respectively. The firing rates of the transitions t_PERM and t_INT are marking dependent: they are expressed as #(P_UP, i)·λ_P and #(P_UP, i)·λ_I respectively, where #(P_UP, i) represents the number of tokens in place P_UP in marking i. The place P_PERM (P_INT) represents that a permanent (intermittent) failure has occurred. When a permanent (intermittent) failure occurs, it is covered with probability c (k) and uncovered with probability 1 − c (1 − k). Covered permanent and intermittent failures are represented by the immediate transitions t_PC and t_IC respectively. Uncovered permanent and intermittent failures are represented by the immediate transitions t_PU and t_IU respectively. A failure is considered to be covered only if the number of operational processors is at least l, the quorum. The input and output arcs with multiplicity l from and to the place P_UP ensure quorum maintenance.
Figure 17: Partial SPN model of the VAXcluster system, showing the permanent and intermittent failure blocks, the reboot and reconfiguration places, and the cluster reconfiguration block.

The model also captures the following behavior:
A covered permanent failure is not possible while the cluster is being rebooted after an uncovered failure (token in either P_UIF or P_UPF). This is represented by inhibitor arcs from P_UIF and P_UPF to the immediate transition t_PC.

It is assumed that a failure does not occur while the cluster is being reconfigured. This is represented by the inhibitor arcs from the "Cluster Reconfiguration Block" to t_PERM and t_INT.

A processor under reboot can suffer a permanent failure. This is represented by the fact that when there is a token in P_REB, both the transitions t_REB and t_IP are enabled.
The steady-state availability is given by

    A = Σ_{i ∈ Ω} r_i π_i,    (14)

where Ω is the set of markings, π_i is the steady-state probability of marking i, and

    r_i = 1 if #(P_UP, i) ≥ l, [#(cluster reboot places, i) < 1 ∨ #(P_UIF, i) < 1], and #(cluster reconfiguration block, i) < 1;
    r_i = 0 otherwise.
In this section, we present an SPN model that considers realistic VAXcluster configurations which include uniprocessors and multiprocessors [3]. The heterogeneity in the model allows each multiprocessor to contain a varying number of processors, and each VAX in the VAXcluster to belong to a different VAX family. Henceforth, we refer to this SPN model as the heterogeneous model. Throughout this section we refer to a single multiprocessor system as a machine, which consists of two components: one or more processors and a platform. The platform consists of the memory, power, cooling, console interface module and I/O channel adapters [3]. As in the uniprocessor case in the above sections, we depict covered and uncovered permanent and intermittent failures. In addition, we depict the following failure/recovery behavior for a multiprocessor. A processor failure in a machine requires a machine reboot to map the faulty processor offline. The entire machine is not operational during the processor repair and the reboot following the repair. Before and after every machine reboot, a cluster reconfiguration maps the machine out of and into the cluster. The platform components of a machine are single points of failure for that machine [3]. In addition, the following features are included:
The option of including a quorum disk. A quorum disk acts as a virtual node in the cluster, contributing a vote toward the cluster quorum.

Figure 18: Structure of the heterogeneous SRN model: N machines, the quorum disk, field service and cluster transitions.

The number of machines that are up, NU, is computed by the following marking-dependent function:

    Initial: NU = 0
    For i = 1, …, N:
        If mark(Platform_failure, i) = 0 AND mark(P_UP, i) > 0 AND
           the repair and reboot transitions are disabled, then
            NU = NU + 1
That is, a machine counts toward NU if no platform failure has occurred, at least one of its processors is up, and no repair or reboot is in progress.
This heterogeneous model was evaluated using the SPNP package [8]. This package solved the
SPN by analyzing the underlying CTMC. We resolved the problem in [3] by using a technique that
involved the truncation of the state space [7]. The state space cardinality of the CTMC isomorphic
with the heterogeneous model increased with the number of machines in the VAXcluster, as well
as the number of processors in each machine. To implement this state space reduction technique
by specifying a truncation level K for processor failures in the model, the maximum value of K is
M1 + M2 + + MN . The value K species that the reachability graph and hence the corresponding
CTMC be generated up to K processor failures. This is implemented in the model by means of an
enabling function associated
PN with all the failure transitions. The enabling function disables all the
failure transitions if ( i=1 Mi , mark(PUP; i)) K . This technique is justied as follows:
In real systems, most of the time the system has a majority of its components operational [7].
This means the probability mass is concentrated on a relatively small number of states in
comparison to the total number of states in the model.

We observed the impact of varying the truncation level on the availability measures for an
example heterogeneous cluster, and concluded that the effect was minimal.
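As a sketch (again with an illustrative marking interface rather than SPNP's actual one), the enabling function shared by all the failure transitions simply compares the accumulated number of processor failures against the truncation level K:

```python
def failure_transitions_enabled(mark, M, K):
    """M[i] is the number of processors in machine i+1; the failure
    transitions stay enabled only while fewer than K processors have
    failed across the whole cluster."""
    failed = sum(M[i] - mark("P_UP", i + 1) for i in range(len(M)))
    return failed < K
```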
We used the heterogeneous model to evaluate measures associated not only with standard system
availability, but also with system reliability and task completion. In the system reliability class of
measures, we evaluated measures such as the frequency of failures and the frequency of disruptive
outages. The term disruptive is defined as follows: any outage that exceeds the specified tolerance
limit of the user. The task completion measures evaluate the probability that the application is
interrupted during its execution.
In this paper we discuss an example measure from each of the three classes of measures [3]:

Mean Cluster Downtime D in minutes per year: This is a system availability measure and
represents the average amount of time the cluster is not operating during a one-year observation
period. The expression for D in terms of the steady-state cluster availability A is given by

    D = (1 - A) × 8760 × 60.    (15)
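Eq. (15) is just a unit conversion from steady-state unavailability to minutes per year; as a quick sketch:

```python
def downtime_min_per_year(A):
    """Mean cluster downtime D (minutes/year) from steady-state availability A."""
    return (1.0 - A) * 8760 * 60  # 8760 hours in a year, 60 minutes per hour
```

For instance, an availability of five nines (A = 0.99999) corresponds to roughly 5.3 minutes of downtime per year.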
Frequency of Disruptive Reconfigurations (F_DR): This is a system reliability measure and
represents the mean number of reconfigurations which exceed the specified tolerance duration
during the one-year observation period. We evaluate F_DR as

    F_DR = 8760 × 3600 × r_g e^{-r_g · thresh} P_rg,    (16)

where r_g is the cluster reconfiguration (in, out, or formation) rate, thresh is the specified
tolerance duration on the reconfiguration times (in the same time units as 1/r_g), and P_rg is
the probability that a reconfiguration is in progress.
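Under the common assumption that reconfiguration durations are exponentially distributed with rate r_g (per second), the fraction of reconfigurations exceeding the tolerance thresh is e^{-r_g · thresh}, and the yearly count of disruptive reconfigurations can be sketched as:

```python
import math

def disruptive_reconfigs_per_year(r_g, thresh, p_rg):
    """Mean yearly number of reconfigurations longer than thresh seconds,
    assuming exponentially distributed reconfiguration durations with rate
    r_g (per second) and P(reconfiguration in progress) = p_rg."""
    per_second = r_g * p_rg              # completed reconfigurations per second
    frac_long = math.exp(-r_g * thresh)  # fraction exceeding the threshold
    return per_second * frac_long * 8760 * 3600
```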
Probability of Task Interruption under the Pessimistic Assumption (Prob_Psm): This is a task
completion measure. It is the probability that a task initially finds the system available and
needs x hours for execution, but is interrupted by any failure in the cluster. This assumption is
pessimistic because the system is assumed not to tolerate any interruption, including the brief
reconfiguration delays. The expression for Prob_Psm is given by

    Prob_Psm = (1/A) \sum_{j \in Upstate} \left( 1 - e^{-\left( \sum_{k=1}^{N} c_{k,j}(\lambda_{p,k} + \lambda_{i,k}) + \lambda_{plt,k} + \lambda_{plt\_int,k} \right) x} \right) P_j,    (17)

where, for machine k, c_{k,j} is the number of operational processors in state j, P_j is the
probability of being in operational state j, \lambda_{p,k} (\lambda_{i,k}) is the processor permanent
(intermittent) failure rate, \lambda_{plt,k} (\lambda_{plt\_int,k}) is the platform permanent
(intermittent) failure rate, and A is the cluster availability.
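Eq. (17) is straightforward to evaluate once the state probabilities are available; a minimal numerical sketch (names and data layout are illustrative):

```python
import math

def prob_task_interruption(up_states, P, c, lam_p, lam_i, lam_plt, lam_plt_int, x, A):
    """Pessimistic task-interruption probability of Eq. (17).
    up_states: indices of the operational states; P[j]: state probabilities;
    c[j][k]: operational processors of machine k in state j; the lam_* lists
    hold the per-machine failure rates; x: task duration; A: availability."""
    total = 0.0
    for j in up_states:
        # Total failure rate seen by the task in state j.
        rate = sum(c[j][k] * (lam_p[k] + lam_i[k]) + lam_plt[k] + lam_plt_int[k]
                   for k in range(len(lam_p)))
        total += (1.0 - math.exp(-rate * x)) * P[j]
    return total / A
```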
In [3], we used these three measures for a particular configuration to study the impact of
truncation. In Table 4, we present the number of states and the number of transitions of the
underlying CTMC, together with the three measures, at each truncation level.
    Trunc.   No. of   No. of   Mean Cluster          Freq. of Disruptive          Prob. of Task
    Level    States   Arcs     Downtime (min./yr.)   Reconfig. (threshold=10s)    Interruption (t=1000s)
    1        348      948      12.91732078           9.96432050                   0.00032767
    2        2088     7110     13.00257283           9.96751584                   0.00032767
    3        6394     26686    13.00258549           9.96751604                   0.00032767
    4        13236    66596    13.00258549           9.96751604                   0.00032767
    5        20728    122746   13.00258549           9.96751604                   0.00032767
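The downtime column makes the truncation argument concrete: the relative change between successive truncation levels collapses to zero by level 3, which is a quick sanity check one can script:

```python
# Mean cluster downtime (min./yr.) from Table 4, truncation levels 1..5.
downtime = [12.91732078, 13.00257283, 13.00258549, 13.00258549, 13.00258549]

# Relative change between successive truncation levels.
rel_change = [abs(b - a) / b for a, b in zip(downtime, downtime[1:])]
```

The first step is about 0.7%, the second below 10^-6, and the remaining steps are exactly zero, matching the observation that the effect of truncation beyond a small K is minimal.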
5 Conclusion
We started the paper by briefly discussing various non-state-space and state-space availability and
performance modeling approaches. Using the problem of deciding the optimal number of processors
in an n-component parallel multiprocessor system, we showed the limitations of a pure availability or
performance model, and emphasized the need for a composite availability and performance model.
Finally, we took a case study from a corporate environment and demonstrated an application of the
techniques in a real situation. Several approximations and assumptions were made, and validated
before use, in order to deal with the size and complexity of the models encountered.
References
[1] O. Ibe, A. Sathaye, R. Howe and K. S. Trivedi, "Stochastic Petri Net Modeling of VAXcluster
System Availability", Proc. Third International Workshop on Petri Nets and Performance
Models (PNPM89), pp. 112-121, Kyoto, Japan, 1989.
[2] K. S. Trivedi, A. Sathaye, O. Ibe, and R. Howe, "Should I Add a Processor?", Proc. 23rd
Annual Hawaii Conference on System Sciences, pp. 214-221, January 1990.
[3] A. Sathaye, K. S. Trivedi and R. Howe, "Availability Modeling of Heterogeneous VAXcluster
Systems: A Stochastic Petri Net Approach", Proc. of International Conference on Fault-Tolerant
Systems, Varna, January 1990.
[4] J. Muppala, A. Sathaye, R. Howe and K. S. Trivedi, "Dependability Modeling of a Heterogeneous
VAXcluster System Using Stochastic Reward Nets", Hardware and Software Fault
Tolerance in Parallel Computing Systems, D. Avresky (ed.), pp. 33-59, Ellis Horwood Ltd.,
1992.
[5] N. P. Kronenberg, H. M. Levy, W. D. Strecker, R. J. Merwood, "VAXclusters: A Closely
Coupled Distributed System", ACM Trans. Computer Systems, Vol. 4, pp. 130-146, May 1986.
[6] E. Balkovich, P. Bhabhalia, W. Dunnington, and T. Weyant, "VAXcluster Availability Modeling",
Digital Technical Journal, No. 5, pp. 69-79, September 1987.
[7] R. Muntz, E. de Souza e Silva, and A. Goyal, "Bounding Availability of Repairable Computer
Systems", IEEE Trans. on Computers, Vol. 38, No. 12, pp. 1714-1723, December 1989.
[8] G. Ciardo, J. Muppala and K. S. Trivedi, "SPNP: Stochastic Petri Net Package", Proc. Third
Int. Workshop on Petri Nets and Performance Models (PNPM89), pp. 142-151, Kyoto, Japan,
1989.
[9] R. Sahner, A. Puliafito and K. S. Trivedi, Performance and Reliability Analysis of Computer
Systems: An Example-Based Approach Using the SHARPE Software Package, Kluwer Academic
Publishers, Boston, 1995.
[10] A. Sathaye, K. Trivedi and D. Heimann, "Approximate Availability Models of the Storage
Subsystem", Technical Report, DEC, September 1988.
[11] D. Siewiorek and R. Swarz, The Theory and Practice of Reliable System Design, Digital Press,
1982.
[12] K. S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science
Applications, Prentice-Hall, Englewood Cliffs, NJ, 1982.
[13] L. Tomek and K. S. Trivedi, "Fixed Point Iteration in Availability Modeling", Informatik-Fachberichte,
Vol. 283: Fehlertolerierende Rechensysteme, M. Dal Cin (ed.), pp. 229-240,
Springer-Verlag, Berlin, 1991.
[14] D. R. Avresky, Hardware and Software Fault Tolerance in Parallel Computing Systems, Ellis
Horwood Ltd., New York, 1992.
[15] H. Sun, X. Zang and K. S. Trivedi, "A BDD-based Algorithm for Reliability Analysis of
Phased-Mission Systems", IEEE Transactions on Reliability, Vol. 48, No. 1, pp. 50-60, March
1999.