Proceedings of the 2004 International Symposium on Applications and the Internet Workshops (SAINTW’04)
0-7695-2050-2/04 $20.00 © 2004 IEEE
significant probing effect to the Grid system subject to monitoring, as the monitoring itself already adds some level of performance intrusiveness.

Given such a background, our goal is to develop a generalized autonomous management framework for Grid monitoring systems. Currently, a prototype has been built on top of NWS that automatically configures its "clique" groups and copes with single-node faults without user intervention. An experimental deployment on the Tokyo Institute of Technology's Campus Grid (the Titech Grid), consisting of over 15 sites and 800 processors on campus interconnected by a multi-Gigabit backbone, has shown the system to be robust in handling faults and reconfigurations, automatically deriving an ideal clique configuration for all the nodes in less than two minutes.

2. Overview of Existing Grid Monitoring Systems

We first briefly survey the existing Grid monitoring systems to investigate their component-level architecture.

2.1. Grid Monitoring Architecture

A Grid Monitoring and Performance working group at the Global Grid Forum (GGF)[5] proposed a Grid Monitoring Architecture (GMA)[10] specification in one of its documents. The paper addresses the basic architecture of Grid monitoring systems, identifying the required functionality of each component, and allows for interoperability between different Grid monitoring systems. In particular, the GMA specification defines three components:

Producer retrieves performance data from various Sensors and makes them available to other GMA components. Producers can be regarded as a component class of data sources. The GMA specification does not define how the Producer and the Sensors mutually interact.

Consumer receives performance data from Producers and processes them, for example by filtering or archiving their info. Consumers can be regarded as a component class of data sinks.

Directory Service supports information publication and discovery of components as well as monitored data.

The GMA components, those defined above as well as external ones such as Sensors, are not stand-alone. Rather, they are interdependent amongst themselves, and an autonomous monitoring system must identify and maintain such dependencies in a low-overhead, automated fashion. In particular, there are two kinds of dependencies amongst GMA components, including the sensors.

Data transfer dependency Since Producers, Consumers, as well as "external" GMA components such as sensors or archives communicate with each other via networks to transfer monitored data, there will be natural dependencies amongst them. For example, a Producer may assume that a certain sensor exists within its accessible network reach in an efficient fashion, so that monitored data can be pulled off the sensor. Any faults or reconfigurations that would invalidate such a dependence assumption will require the system to take some action to cope with the situation, based on the dependency information.

Registration dependency In a similar fashion, Producers, Consumers, Sensors, etc. register themselves with the Directory Service, and as such there will be a natural central dependence assumption that registrations will persist and remain accessible from the Directory Service. Any change in the system must have an automated effect on the component registrations, and the system must also cope with the failure of the directory service itself.

Here are examples of actual implementations of Grid monitoring systems used in some form of production:

2.2. The Network Weather Service (NWS)

The Network Weather Service (NWS) is a wide-area distributed monitoring system developed at SDSC, and consists of the following four components:

Sensorhost measures CPU, memory and disk usage, as well as the performance of the network.

Memoryhost stores and manages the short-term monitored data temporarily, and provides them to client programs.

Nameserver Synonymous with the GMA Directory Service. The sensorhost and memoryhost register themselves and their relevant info (e.g., the IP address of the node they will be running on) with the Nameserver.

Client program & Forecaster The client program extracts monitored data for its own use, while the forecaster is a special client that makes near-term predictions of monitored data.

NWS requires the Grid administrators to make configuration decisions upon starting up the system. For example, each sensorhost must be registered to the nameserver upon startup, at which point an administrator must decide which memoryhost this sensorhost sends data to. This also implies that the nameserver and memoryhost must be running prior to sensorhost startup.

Another issue with the older versions of NWS had been that bandwidth measurements between the Grid nodes were required for every valid pair of nodes in the Grid; this meant continuous bandwidth measurement of O(n²) complexity (n: number of machines), which would have put considerable network traffic pressure on the entire Grid. To resolve this issue, the later versions of NWS introduced the network "clique" feature, which greatly reduces the measurement cost. The administrator at each site groups machines, typically located on local sites with full mutual connectivity and without significant wide-area bandwidth differences, into "clique" groups,
and picks a representative node in each clique for performing O(m²) bandwidth measurements, where m is now the number of cliques. The bandwidth measured between the representative nodes of two cliques is then regarded as the bandwidth between any pair of nodes in the respective cliques.

The problem then is that clique grouping may sometimes not be obvious, nor is it obvious how the representative nodes should be chosen. In these days of high-bandwidth wide-area networks, some machines in a local area may have less mutual bandwidth than is available over the wide area. And it is not obvious, without performing proper bandwidth measurements, which Grid node would best serve as a representative clique node. Moreover, when faults and changes occur in the network, cliques as well as representative nodes may have to be changed to best suit the new network configuration. To expect the network administrators to coordinate to maintain such a structure in a very large Grid is a daunting task.

2.3. Globus MDS

The Globus Alliance[6] distributes the Monitoring and Discovery Service (MDS)[1] as part of the Globus Toolkit. The MDS focuses on being a scalable and reliable Grid information system, embodying various Grid configuration as well as monitoring information. The MDS consists of two components:

GRIS (Grid Resource Information Service) serves as a repository and collector of local resource information. For example, it collects and sends to GIIS, by request, information on the local hardware (CPU, memory, storage, etc.) and software (OS version, installed applications and libraries, etc.). The Globus toolkit itself uses GIIS for various purposes, but arbitrary applications and middleware can also utilize GRIS by registering their own private information.

GIIS (Grid Index Information Services) is part of an MDS tree or DAG that maintains a hierarchical organization of information services, with GIIS as the interior nodes and GRIS as the leaf nodes. GIIS receives registrations from sibling GRIS and GIIS using a dedicated protocol called GRRP (Grid Registration Protocol).

2.4. Hawkeye

Hawkeye, developed by the Condor project, consists of modules, agents, and a manager that gathers data remotely from the agents. These roughly correspond to Sensors, Producers and Consumers in the GMA model. There are command-line, GUI, and web front-end tools to observe the monitored data.

2.5. R-GMA

R-GMA is being developed as part of Work Package 3 of the EU DataGrid Project[4]. It is based on the GMA architecture but also combines relational database and Java servlet technology, in that R-GMA implements the producer, consumer and registry as Java servlets, and uses relational databases for registration. In addition, the Producer and Consumer each provide their own interface(s). Although R-GMA is flexible enough to accommodate different types of sensors, the EU DataGrid will employ Nagios for data collection.

2.6. Summary of Existing Monitoring Systems

As pointed out above, the Grid monitoring systems have substantial commonalities in their architecture, although each system names the components differently, and there are some subtle differences among them. Table 1 is a summary of Grid monitoring components.

[Figure: GMA components (Consumer, Producer, Directory Service) connected by registration and data transfer dependencies, with data sources such as Sensors, application monitoring systems, and databases.]
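As a concrete illustration of the two dependency kinds, the sketch below records data-transfer and registration edges and asks which components are affected when one fails. This is our own hypothetical model, not an API of any of the surveyed systems:

```python
# Hypothetical sketch: GMA components and the two dependency kinds
# (data transfer, registration) modeled as a simple dependency graph.
from collections import defaultdict

class DependencyGraph:
    def __init__(self):
        # maps component name -> list of (kind, depended-on component)
        self.deps = defaultdict(list)

    def add(self, component, kind, depends_on):
        assert kind in ("data_transfer", "registration")
        self.deps[component].append((kind, depends_on))

    def affected_by(self, failed):
        """Components that must react when `failed` goes down."""
        return sorted(c for c, edges in self.deps.items()
                      if any(target == failed for _, target in edges))

g = DependencyGraph()
g.add("producer", "data_transfer", "sensor")    # data is pulled off the sensor
g.add("consumer", "data_transfer", "producer")
g.add("producer", "registration", "directory")  # all register with the Directory Service
g.add("consumer", "registration", "directory")
g.add("sensor",   "registration", "directory")

print(g.affected_by("directory"))  # ['consumer', 'producer', 'sensor']
print(g.affected_by("sensor"))     # ['producer']
```

The query for "directory" returns every registered component, which is exactly why the directory service is the critical single dependency discussed above.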
Table 1. Summary of Grid Monitoring Systems

                    NWS          MDS                Hawkeye          R-GMA
Directory Service   nameserver   GIIS               Manager          Registry Servlet
Producer            memoryhost   GRIS               Agent & Manager  Producer Servlet
Consumer            Commands     Commands           Commands         Consumer Servlet
Source of Data      sensorhost   GRIS & middleware  modules          some sensors
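The measurement-cost argument behind the NWS cliques of Section 2.2 is easy to quantify. The sketch below counts the continuously measured bandwidth pairs with and without cliques; the machine and clique counts are illustrative, not taken from the paper:

```python
# Illustrative cost comparison for NWS clique-based bandwidth measurement.

def full_mesh_pairs(n):
    # unordered pairs among n machines measured in a full mesh
    return n * (n - 1) // 2

def clique_pairs(clique_sizes):
    # full mesh within each clique, plus a full mesh among the
    # m clique representatives for the inter-clique bandwidths
    m = len(clique_sizes)
    intra = sum(full_mesh_pairs(s) for s in clique_sizes)
    return intra + full_mesh_pairs(m)

n = 800  # hypothetical machine count
print(full_mesh_pairs(n))       # 319600 pairs without cliques
print(clique_pairs([80] * 10))  # 10 cliques of 80 machines: 31645 pairs
```

Even with the intra-clique full meshes retained, the clique scheme cuts the continuous measurement load by roughly an order of magnitude in this example, matching the O(n²) versus O(m²) argument of Section 2.2.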
system will be "aware" of the correct configuration of the Grid, based on various info, including its own probing info that the system uses to determine the configuration of the Grid, especially the status of the nodes and the network topologies. Here is the set of requirements we imposed on such a system:

• Applicability: Support of multiple, existing Grid monitoring systems.

• Scalability: Scalable to large numbers of nodes interconnected with complex network topologies.

• Autonomy: Management of the Grid monitoring system, including coping with dynamic faults and reconfigurations, must be largely autonomous with very little user intervention.

• Extensibility: The framework should be extensible to incorporate various autonomic, self-management features.

The autonomic management of Grid monitoring systems divides largely into three issues. The first is the configuration of the monitoring system: identifying component dependencies, registering with the directory service, starting the sensor, producer, and consumer processes, preparing the storage for data collection, etc. This may not be done all at once; rather, it must be possible for (re-)configurations to occur gradually as nodes enter and leave the Grid. Any groupings such as NWS cliques must also be handled here, by observing and choosing the appropriate groupings as well as the representative nodes via dynamic instrumentation.

The second issue is the detection and handling of faults in the Grid monitoring system itself. There can be several types of faults:

• Monitoring process termination: when a process of the monitoring system gets terminated, for example by accidental process signalling or an OS reboot.

• Node loss: when a node is physically lost due to hardware failure, power loss, etc. In this case the system must recover what it can of the current monitoring info, and also configure an alternate node for running the component. This depends on what component the node had been running.

• Network loss: although difficult to distinguish from node loss, sometimes a network may become disconnected while alternate paths remain available for indirect communication. In this case some proxies may be designated, or the monitoring system may be temporarily split up to operate in parts, to be merged later when communication recovers.

In all cases, the loss of communication between components is the first sign of failure. The system then must proceed to determine what fault has actually occurred, by probing the system dynamically to discover whether the subject node is dead or alive, whether alternate network paths exist, etc. The monitoring feature of the monitoring system itself may be used for this purpose when appropriate.

To satisfy the above goals, we apply the following component allocation and execution strategies in our current prototype.

• The system first forecasts the network topology of the Grid nodes, and diagnoses whether particular Grid monitoring components will correctly execute on each node.

• It then determines and forms node groups that serve as cliques. For newly added nodes it will edit and reform group memberships accordingly.

• It next decides on which nodes the respective Grid monitoring components should execute.

• Finally, it actually starts up the components on the assigned nodes, and registers them to the directory service(s).

The executability of each component could be determined by ready-made diagnostic utilities that come with each Grid monitoring system. Such a check is essential, since some components serve roles so important that all other monitoring components depend on them; one such example is the directory server component.

Once the system starts executing, it must support dynamic removal of faulty nodes from a group, and addition of machines which have recovered or have been designated as replacements. If the group configuration changes in any way, such a change must be registered, advertised, and readily noted by other parts of the system. For example, when the representative node of a particular group changes, such a change must be made known to all other representative nodes in the group as well as to other necessary components. Such an action must be performed without incurring significant CPU or networking costs.
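The four allocation and execution steps above can be sketched as a small pipeline. All function names and the grouping heuristic below are our own simplifications, not the prototype's actual interfaces:

```python
# Hypothetical sketch of the allocation and execution pipeline:
# probe -> form groups -> assign representatives (-> start & register).

def probe(nodes, can_run):
    """Keep only nodes on which monitoring components can execute."""
    return [n for n in nodes if can_run(n)]

def form_groups(nodes, rtt, threshold_ms):
    """Greedy grouping: put a node into the first group whose members
    are all within `threshold_ms` RTT (a stand-in for clique formation)."""
    groups = []
    for n in nodes:
        for g in groups:
            if all(rtt[(n, m)] < threshold_ms for m in g):
                g.append(n)
                break
        else:
            groups.append([n])
    return groups

def assign(groups, rtt):
    """Pick one representative per group (minimum average RTT to peers)."""
    reps = []
    for g in groups:
        avg = lambda n: sum(rtt[(n, m)] for m in g if m != n) / max(len(g) - 1, 1)
        reps.append(min(g, key=avg))
    return reps

# Toy RTT table: a and b are close to each other, c is far from both.
nodes = ["a", "b", "c"]
rtt = {(x, y): (1 if {x, y} <= {"a", "b"} else 50)
       for x in nodes for y in nodes if x != y}
groups = form_groups(probe(nodes, lambda n: True), rtt, threshold_ms=10)
print(groups)               # [['a', 'b'], ['c']]
print(assign(groups, rtt))  # ['a', 'c']
```

The final start-and-register step is omitted, since in the prototype it amounts to remotely invoking the components with the appropriate startup options.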
4. Prototype Implementation

We have implemented a prototype of an autonomously configuring Grid monitoring system on top of NWS. The current prototype configures all the NWS components automatically, and recovers from process and node failures of some, if not all, of the components. Currently, the autonomous management functions are executed on a single Grid node; this is not ideal, as it hampers scalability and makes that node a single point of failure within the system. We plan to replicate and distribute the management functionality to solve both problems. As a proof of concept for many of the features, the current system suffices.

The prototype configures NWS automatically in the following manner:

• Sensorhosts are executed on all the nodes that the administrator listed.

• The nameserver is executed on one of the nodes the administrator listed.

• Network distances between the nodes are determined by actively measuring and averaging the RTT between the nodes.

• Given the RTT info, nodes are grouped into cliques, and a representative node is chosen. Also, a node is chosen to act as the memoryhost of each group.

To determine the RTT for the node grouping above, we systematically ICMP "ping" the nodes in parallel in an n-by-n fashion. In particular, from a list of machines provided by the Grid administrator, we generate two shell scripts: one that runs on a machine that acts as the autonomic monitor manager, and another that runs on all machines and pings all other nodes in the network. The scripts are transferred to the necessary nodes and executed using some secure invocation mechanism provided by the Grid itself, or some other mechanism such as ssh.

For each Grid node, we measure the node with minimum RTT, excluding the node itself, and record it as the most proximal node. Then, each node calculates the average RTT amongst all other nodes that have responded. After all the RTT measurements have finished, the average RTT data are sent to the autonomic monitor manager by every node in the system. The autonomic monitor manager in turn organizes the nodes into (clique) groups in the following bottom-up fashion. A node is chosen as the "current node", and a singleton group is created with the current node as its only member and designated as the "current group". Then, the following process is repeated:

• If the most proximal node from the current node belongs to another group, the two groups are merged. Then a new group is created with some arbitrary non-member node as its singleton member, and these are designated as the new current node and the new current group, respectively.

• If the most proximal node belongs to the same group, a new group is likewise created with some arbitrary non-member node as its singleton member, and these are designated as the new current node and the new current group, respectively.

• If the proximal node does not belong to any group, it is added to the current group and is chosen as the new current node.

The autonomic monitor manager then designates the Grid node with a) the most ping connectivity with other machines, and b) the minimum average RTT from the other nodes recorded above, as the NWS nameserver. Then, for each group, the node designated as most proximal by the largest number of nodes in the group is designated as the NWS memoryhost, and that particular node is also chosen to be the group (clique) representative.

Now that the system has sufficient configuration, it determines the dependency information between the NWS components and the nodes the components will actually be running on. As described earlier, the NWS memoryhost needs to know the node and the port number where the nameserver will run and listen. The NWS sensorhost will also need the nameserver information, and additionally needs to be told which memoryhost, on which host and port number, it should send its data to, etc. Such configurations are formatted as command-line NWS component startup options to be executed on the respective nodes via some Grid execution service, in the order of nameserver, memoryhost, and sensorhosts.

5. Fault Handling and Recovery of NWS Components in the Prototype

The current prototype handles two types of faults in the NWS. One is simply when some component fails to execute, or terminates unexpectedly. The other is when the autonomic monitor manager loses network access to a node due to trouble in the node hardware, OS, or network. (We currently do not distinguish between node failure and network failure.)

In the first case, the autonomic monitor manager simply attempts to re-execute the failed component. If the component fails again after repeated trials in a very short interval, the node is deemed to have a problem and is regarded as a node failure (i.e., the second case).

In the second case, if the node that failed was executing either the NWS memoryhost or the nameserver, the autonomic monitor manager must designate, prepare, and restart the service on a replacement machine. Moreover, components running on other nodes must be notified as well; for example, when the nameserver is restarted on another node with a different IP address, all the sensorhosts and memoryhosts must be notified of this fact and re-registered with the new nameserver, based on this new configuration info. Similarly, if the memoryhost crashes and is restarted on another node, then all the sensorhosts must be told to redirect their data.
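The bottom-up grouping loop of Section 4 can be sketched as follows. The data structures and the choice of the "arbitrary non-member node" are our own simplifications of the prototype's procedure:

```python
# Sketch (our simplification) of the bottom-up clique grouping:
# each node reports its most proximal peer; groups grow and merge from that.

def group_nodes(proximal):
    """proximal: dict mapping each node to its minimum-RTT peer."""
    nodes = list(proximal)
    group_of = {}   # node -> group id
    groups = {}     # group id -> set of member nodes
    next_id = 0

    def new_current():
        """Pick an arbitrary ungrouped node and open a singleton group."""
        nonlocal next_id
        for n in nodes:
            if n not in group_of:
                groups[next_id] = {n}
                group_of[n] = next_id
                next_id += 1
                return n
        return None  # all nodes grouped: done

    current = new_current()
    while current is not None:
        p = proximal[current]
        if p not in group_of:
            gid = group_of[current]   # ungrouped: pull it into the
            groups[gid].add(p)        # current group and continue from it
            group_of[p] = gid
            current = p
        else:
            if group_of[p] != group_of[current]:  # different group: merge
                src, dst = group_of[current], group_of[p]
                for n in groups.pop(src):
                    group_of[n] = dst
                    groups[dst].add(n)
            current = new_current()   # restart from a fresh singleton
    return [sorted(g) for g in groups.values()]

# Two campuses: a1/a2/a3 are mutually closest, b1/b2 likewise.
proximal = {"a1": "a2", "a2": "a1", "a3": "a1", "b1": "b2", "b2": "b1"}
print(group_nodes(proximal))  # [['a1', 'a2', 'a3'], ['b1', 'b2']]
```

With proximal-node data resembling the Titech Grid's two campuses, chains of nearest neighbors collapse into one group per campus, which is the behavior reported in Section 6.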
For the first case, the current prototype simply monitors the component execution status using the ps command. Configuration info is held in a file on the node where the autonomic monitoring manager is running. To detect whether a node is running, we periodically check whether we can establish an ssh connection to the machine. To restart the components, a shell script for doing so is generated by the manager and executed appropriately on the replacement host.

6. Evaluation of the Prototype

For evaluation, we installed our prototype on the Campus Grid at the Tokyo Institute of Technology, or the "Titech Grid" for short. The Titech Grid has 15 PC clusters totaling over 800 processors spread throughout the two Titech campuses, which are situated approximately 30 km apart. The entire campus is covered by SuperTITANET, a multi-gigabit campus backbone. Each Titech Grid node is designed to be connected directly to the backbone via a managed switch, and can communicate peer-to-peer with any other node on the Grid. For the experiment, we employed the head login nodes of each PC cluster to run the NWS components.

6.1. Initial Setup

We show the time required for initial setup of the system in Table 2. We see that the time required is on the order of tens of seconds, but is proportional to the number of clusters. A breakdown shows that approximately 50% of the time is spent collecting RTT data, and the rest executing the components via SSH.

Table 2. Time for initial configuration of our prototype on the Titech Grid

Clusters                 3    6    10
RTT Measurement (sec)   21   39    76
NWS execution (sec)     19   30    52
Total (sec)             40   69   128

The grouping algorithm worked very well, splitting the clusters automatically into two groups, one for each campus. This is due to the fact that the average RTT between campuses is approximately 2-3 times greater than the RTT between the nodes located at the farthest ends of the same campus. (The groups are naturally cliques in this case, since they all have p2p connectivity.)

[Figure 2. Result for 10 clusters on the Titech Grid: the NWS nameserver (N), memoryhosts (M), and sensorhosts (S) as assigned across the Oookayama and Suzukakedai campuses; data flows from each sensor to its group's memoryhost, and network performance is measured between the group representatives.]

6.2. Fault Recovery

We first investigate the scenario in which the cluster node that executes the memoryhost crashes. The upper half of Figure 3 shows the configuration prior to the crash. There are two groups in the Grid, and tgn015001 and tgn005001 execute the memoryhost components (the cluster nodes are real ones from the Titech Grid). Sensors send the monitored data to the memoryhost of the same group. When tgn015001 crashes, recovery is performed automatically, and tgn013001 is designated as a replacement. The appropriate memoryhost restart as well as the sensor redirections are performed, as shown in the bottom half of Figure 3. Notice that only the cluster nodes in the right-hand group are affected.

Similarly, Figure 4 shows the case when the cluster node that runs the NWS nameserver in the Grid crashes. The whole recovery process took 39 seconds according to our measurements, 38 seconds of which was spent re-measuring the RTT and the status of the components. The actual configuration decision and restart took less than 1 second. Again, the entire process worked automatically as intended.

The results show that (1) our prototype can cope well with the limited fault scenarios under the current configuration, but (2) for the current system to scale beyond hundreds of nodes (clusters), the measurement time must be reduced drastically. Currently, much of the overhead is attributable to ssh execution of the measurement processes; we must devise a faster, more persistent measuring scheme to amortize the overhead.

7. Conclusion and Future Work

We proposed that Grid monitoring systems need to configure themselves automatically, and presented a prototype on top of NWS that deals with a limited but common class of faults. The prototype worked well in a limited setting on the Titech Grid, showing reasonable startup as well as recovery performance, and the system executed autonomously as expected.
[Figure 3. Recovery from a memoryhost crash: before the crash, tgn015001 and tgn005001 run the memoryhosts of the two groups; after tgn015001 crashes, tgn013001 is designated as the replacement memoryhost, and the sensors in its group redirect their data and re-register accordingly.]

[Figure 4. Recovery from a nameserver crash: after the node running the NWS nameserver fails, a replacement nameserver is designated, and all sensorhosts and memoryhosts re-register with it.]
Acknowledgments
References