
Autonomous Configuration of Grid Monitoring Systems

Ken’ichiro Shirose
Tokyo Institute of Technology
E-mail: Kenichiro.Shirose@is.titech.ac.jp

Satoshi Matsuoka
Tokyo Institute of Technology / National Institute of Informatics
E-mail: matsu@is.titech.ac.jp
Hidemoto Nakada
National Institute of Advanced Industrial Science and Technology
Tokyo Institute of Technology
E-mail: hide-nakada@aist.go.jp
Hirotaka Ogawa
National Institute of Advanced Industrial Science and Technology
E-mail: h-ogawa@aist.go.jp

Abstract

The problem with practical, large-scale deployment of Grid monitoring systems is that it takes considerable management cost and skill to maintain the level of quality required by production usage, since the monitoring system will be fundamentally distributed, needs to be running continuously, and will itself likely be affected by the various faults and dynamic reconfigurations of the Grid itself. Although their automated management would be desirable, there are several difficulties: distributed faults and reconfigurations, component interdependencies, and scaling to maintain performance while minimizing the probing effect. Given our goal to develop a generalized autonomous management framework for Grid monitoring, we have built a prototype on top of NWS, featuring automatic configuration of its "clique" groups as well as coping with single-node faults without user intervention. An experimental deployment on the Tokyo Institute of Technology's Campus Grid (the Titech Grid), consisting of over 15 sites and 800 processors, has shown the system to be robust in handling faults and reconfigurations, automatically deriving an ideal clique configuration for the head login nodes of each PC cluster in less than two minutes.

1. Introduction

The users of Grids access numbers of distributed computational resources in a concurrent fashion. As an example, an Operations Research researcher performing a large branch-and-bound parallel search on the Grid may need hundreds of CPUs on machines situated across several sites. Another example would be multiple physicists requiring terascale or even petascale storage and database access and processing of data across the globe in a large datagrid project.

In all such practical deployments of shared resources on the Grid, Grid monitoring systems are absolute musts at all levels: for users observing the status of the Grid and planning their runs accordingly, for applications that adapt to the available cycles and network bandwidth, for middleware that attempts to do the same, such as job brokers and schedulers, and not least for the administrators who observe the overall "health" of the Grid and react accordingly if any problem arises at their sites.

The problem with practical, large-scale monitoring deployment across the Grid is that it takes considerable management cost and skill to maintain the level of quality required by production usage, since the monitoring system will be fundamentally distributed, needs to be running continuously, and will itself likely be affected by the various faults and dynamic reconfigurations of the Grid itself. The requirement for such automated management of the monitoring system itself, in the spirit of industrial efforts such as IBM's "Autonomic Computing"[9], may seem apparent but has not been adequately addressed in existing Grid monitoring systems work such as the Network Weather Service (NWS)[11] or R-GMA[2].

The technical difficulties with automated management of Grid monitoring systems are severalfold. Firstly, their constituent components will be fundamentally distributed and subject to faults and reconfigurations, as mentioned earlier. Another challenge is that Grid monitoring systems usually consist of several functional components that are heavily dependent on each other, interconnected by physical networks. Such dependencies may hamper the handling of faults and reconfigurations, as the effect of a system alteration will propagate through the system and may not be isolated. Also, a single point of failure would not be desirable, given the system's distributed nature. Finally, the system should not add significant probing effect to the Grid system being monitored, as the monitoring itself already introduces some level of performance intrusiveness.

Given such a background, our goal is to develop a generalized autonomous management framework for Grid monitoring systems. Currently, a prototype has been built on top of NWS, which automatically configures its "clique" groups and copes with single-node faults without user intervention. An experimental deployment on the Tokyo Institute of Technology's Campus Grid (the Titech Grid), consisting of over 15 sites and 800 processors on campus interconnected by a multi-gigabit backbone, has shown the system to be robust in handling faults and reconfigurations, automatically deriving an ideal clique configuration for all the nodes in less than two minutes.
2. Overview of Existing Grid Monitoring Systems

We first briefly survey existing Grid monitoring systems to investigate their component-level architecture.

2.1. Grid Monitoring Architecture

The Grid Monitoring and Performance working group at the Global Grid Forum (GGF)[5] proposed the Grid Monitoring Architecture (GMA)[10] specification in one of its documents. The paper addresses the basic architecture of Grid monitoring systems, identifying the needed functionality of each component, as well as allowing for interoperability between different Grid monitoring systems. In particular, the GMA specification defines three components:

Producer retrieves performance data from various Sensors and makes them available to other GMA components. Producers can be regarded as a component class of data source. The GMA specification does not define how the Producer and the Sensors mutually interact.

Consumer receives performance data from Producers and processes them, for example by filtering or archiving their info. Consumers can be regarded as a component class of data sink.

Directory Service supports information publication and discovery of components as well as of monitored data.

The GMA components, those defined above as well as external ones such as Sensors, are not stand-alone. Rather, they are interdependent amongst themselves, and an autonomous monitoring system must identify and maintain such dependencies in a low-overhead, automated fashion. In particular, there are two kinds of dependencies amongst GMA components, including the sensors.

Data transfer dependency: Since Producers, Consumers, as well as "external" GMA components such as sensors or archives communicate with each other via networks to transfer monitored data, there are natural dependencies amongst them. For example, a Producer may assume that a certain sensor exists within its accessible network reach in an efficient fashion, so that monitored data can be pulled off the sensor. Any faults or reconfigurations that would invalidate such a dependence assumption will require the system to take some action to cope with the situation, based on the dependency information.

Registration dependency: In a similar fashion, Producers, Consumers, Sensors, etc. register themselves with the Directory Service, and as such there is a natural central dependence assumption that registrations will persist and remain accessible from the Directory Service. Any change in the system must have an automated effect on the component registrations, as well as cope with the failure of the directory service itself.
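As a concrete, if toy, illustration of these roles and the two dependency kinds, the sketch below models them in Python. The GMA specification does not prescribe any concrete API, so every class and method name here is a hypothetical placeholder of our own.

# Toy illustration of GMA roles and dependencies (hypothetical API, not from the GMA spec).

class DirectoryService:
    def __init__(self):
        self.entries = {}                      # the registration dependency lives here

    def register(self, name, location):
        self.entries[name] = location          # registrations must survive faults/reconfigurations

    def lookup(self, name):
        return self.entries.get(name)

class Sensor:
    def __init__(self, name):
        self.name = name

    def read(self):
        return {"cpu_load": 0.42}              # placeholder measurement

class Producer:
    def __init__(self, name, directory, sensors):
        self.name = name
        self.sensors = sensors                 # data transfer dependency: sensors must stay reachable
        directory.register(name, "host:port")  # registration dependency on the Directory Service

    def publish(self):
        return [s.read() for s in self.sensors]

class Consumer:
    def __init__(self, directory):
        self.directory = directory

    def consume(self, producer):
        return producer.publish()              # e.g., filter or archive the data

directory = DirectoryService()
producer = Producer("site-A", directory, [Sensor("node1"), Sensor("node2")])
print(Consumer(directory).consume(producer))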

Here are examples of actual implementations of Grid monitoring systems used in some form of production.

2.2. The Network Weather Service (NWS)

The Network Weather Service (NWS) is a wide-area distributed monitoring system developed at SDSC, and consists of the following four components:

Sensorhost measures CPU, memory and disk usage, and the performance of the network.

Memoryhost stores and manages the short-term monitored data temporarily, and provides them to client programs.

Nameserver Synonymous with the GMA Directory Service. The sensorhost and memoryhost register themselves and their relevant info (e.g., the IP address of the node they will be running on) with the nameserver.

Client program & Forecaster The client program extracts monitored data for its own use, while the forecaster is a special client that makes near-term predictions of monitored data.

NWS requires the Grid administrators to make configuration decisions upon starting up the system. For example, each sensorhost must be registered with the nameserver upon startup, at which point an administrator must decide which memoryhost this sensorhost sends data to. This also implies that the nameserver and memoryhost must be running prior to sensorhost startup.

Another issue with the older versions of NWS had been that bandwidth measurements between the Grid nodes were required for every valid pair of nodes in the Grid; this meant continuous bandwidth measurement of O(n^2) complexity (n: number of machines), which would have put considerable network traffic pressure on the entire Grid. To resolve this issue, later versions of NWS introduced the network "clique" feature, which greatly reduces the measurement cost. Each site administrator groups machines, typically located at local sites with full mutual connectivity and without significant wide-area bandwidth differences, into "clique" groups, and picks a representative node in each clique for performing O(m^2) bandwidth measurements, where m is now the number of cliques. The bandwidth measured between the representative nodes of two cliques is then regarded as the bandwidth between any pair of nodes in the respective pair of cliques.
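As a rough back-of-the-envelope illustration of the savings (the node and clique counts below are hypothetical, chosen only to make the O(n^2)-to-O(m^2) reduction concrete):

# Back-of-the-envelope comparison of all-pairs vs. clique-based bandwidth measurement.
# The node/clique counts are hypothetical; they only illustrate the O(n^2) -> O(m^2) savings.

def pairs(k):
    return k * (k - 1) // 2

n = 100                      # total number of nodes (hypothetical)
m = 10                       # number of cliques (hypothetical)
clique_size = n // m

all_pairs = pairs(n)                              # 4950 pairs measured continuously
clique_based = pairs(m) + m * pairs(clique_size)  # 45 inter-clique + 450 intra-clique = 495

print(all_pairs, clique_based)                    # 4950 495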
The problem then is that sometimes the clique grouping may not be obvious, nor is it obvious how the representative nodes should be chosen. In these days of high-bandwidth wide-area networks, some machines in the local area may have less mutual bandwidth than over the wide area, and it is not obvious, without performing proper bandwidth measurements, which Grid node would best serve as a representative clique node. Moreover, when faults and changes occur in the network, cliques as well as representative nodes may have to be changed to best suit the new network configuration. To expect the network administrators to coordinate to maintain such a structure in a very large Grid is a daunting task.

2.3. Globus MDS

The Globus Alliance[6] distributes the Monitoring and Discovery Service (MDS)[1] as part of the Globus Toolkit. The MDS focuses on being a scalable and reliable Grid information system, embodying various Grid configuration as well as monitoring information. The MDS consists of two components:

GRIS (Grid Resource Information Service) serves as a repository and collector of local resource information. For example, it collects and sends to the GIIS, on request, information on the local hardware (CPU, memory, storage, etc.) and software (OS version, installed applications and libraries, etc.). The Globus Toolkit itself uses the GIIS for various purposes, but arbitrary applications and middleware can also utilize the GRIS by registering their own private information.

GIIS (Grid Index Information Service) is part of an MDS tree or DAG that maintains a hierarchical organization of information services, with GIIS instances as the interior nodes and GRIS instances as the leaf nodes. A GIIS receives registrations from sibling GRIS and GIIS instances using a dedicated protocol called GRRP (Grid Registration Protocol).

Middleware and applications look up monitoring information stored in the MDS hierarchy via extraction commands, through a Globus proxy.

2.4. Hawkeye

Hawkeye[7] is a monitoring system developed as a part of the Condor Project[3]. It is based on Condor and ClassAd technology, in that it uses the Condor job execution hosting mechanism itself to run the monitor processes in a fault-tolerant way, and the protocol employed is basically the Condor ClassAd. Hawkeye modules, which can be binaries or scripts, collect data from individual machines; agents gather monitored data from the modules, and the Manager in turn gathers data remotely from the agents. These roughly correspond to Sensors, Producers and Consumers in the GMA model. There are command-line, GUI, and web front-end tools to observe the monitored data.

2.5. R-GMA

R-GMA is being developed as part of Work Package 3 of the EU DataGrid Project[4]. It is based on the GMA architecture but also combines relational database and Java servlet technology, in that R-GMA implements the producer, consumer and registry as Java servlets, and uses relational databases for registration. In addition, the Producer and Consumer each provide their own interface(s). Although R-GMA is flexible enough to accommodate different types of sensors, the EU DataGrid will employ Nagios for data collection.

2.6. Summary of Existing Monitoring Systems

As has been pointed out above, the Grid monitoring systems share strong architectural commonalities, although each system names the components differently and there are some subtle differences among them. Table 1 is a summary of the Grid monitoring components.

Figure 1. GMA Components and Sources of Data. (Figure: the GMA Consumer, Directory Service and Producer, linked by the registration and data transfer dependencies; the sources of data include sensors, applications, monitoring systems and databases.)

Table 1. Summary of Grid Monitoring Systems

                     NWS          MDS                Hawkeye           R-GMA
  Directory Service  nameserver   GIIS               Manager           Registry Servlet
  Producer           memoryhost   GRIS               Agent & Manager   Producer Servlet
  Consumer           Commands     Commands           Commands          Consumer Servlet
  Source of Data     sensorhost   GRIS & middleware  modules           some sensors
3. Autonomous Configuration of Grid Monitoring Systems

Based on the analysis of the previous section, we are currently designing and developing a general framework for autonomously configuring Grid monitoring systems. The system will be "aware" of the correct configuration of the Grid, based on various information including its own probing data, which the system uses to determine the configuration of the Grid, especially the status of the nodes and the network topologies. Here is the set of requirements we imposed on such a system:

• Applicability: support of multiple, existing Grid monitoring systems.

• Scalability: scalable to large numbers of nodes interconnected with complex network topologies.

• Autonomy: management of the Grid monitoring system, including coping with dynamic faults and reconfigurations, must be largely autonomous with very little user intervention.

• Extensibility: the framework should be extensible to incorporate various autonomic, self-management features.

The autonomic management of Grid monitoring systems largely divides into three issues. The first is the configuration of the monitoring system: identifying component dependencies, registering with the directory service, starting the sensor, producer, and consumer processes, preparing the storage for data collection, etc. This may not be done all at once; rather, it must be possible for (re-)configurations to occur gradually as new nodes enter and leave the Grid. Any groupings such as NWS cliques must also be handled here, by observing and choosing the appropriate groupings as well as the representative nodes via dynamic instrumentation.

The second issue is detecting and handling faults in the Grid monitoring system itself. There can be several types of faults:

• Monitoring process termination: a process of the monitoring system gets terminated, for example by accidental process signalling or an OS reboot.

• Node loss: a node is physically lost due to hardware failure, power loss, etc. In this case the system must recover what it can of the current monitoring info, and also configure an alternate node for running the component. The appropriate action depends on which component the node has been running.

• Network loss: although difficult to distinguish from node loss, sometimes a network may become disconnected while alternate paths remain available for indirect communication. In this case some proxies may be designated, or the monitoring system may be temporarily split up to operate as independent parts, which are later merged when communication recovers.

In all cases, the loss of communication between components is the first sign of failure. The system must then proceed to determine what fault has actually occurred, by probing the system dynamically to discover whether the subject node is dead or alive, whether alternate network paths exist, etc. The monitoring feature of the monitoring system itself may be used for this purpose when appropriate.

To satisfy the above goals, we apply the following component allocation and execution strategies in our current prototype:

• The system first forecasts the network topology of the Grid nodes, and diagnoses whether particular Grid monitoring components will execute correctly on each node.

• It then determines and forms node groups that serve as cliques. For newly added nodes it will edit and reform group memberships accordingly.

• It next decides which nodes the respective Grid monitoring components should execute on.

• Finally, it actually starts up the components on the assigned nodes, and registers them with the directory service(s).

The executability of each component could be determined by ready-made diagnostic utilities that come with each Grid monitoring system. Such a check is essential, since some components serve roles so important that all other monitoring components depend on them; one such example is the directory server component.

Once the system starts executing, it must support dynamic removal of faulty nodes from a group, and addition of machines which have recovered or been designated as replacements. If the group configuration changes in any way, such a change must be registered, advertised, and readily noted by other parts of the system. For example, when the representative node of a particular group changes, the change must be made known to all other representative nodes as well as to the other necessary components. Such an action must be done without incurring significant CPU or networking costs.
4. Prototype Implementation

We have implemented a prototype of an autonomously configuring Grid monitoring system on top of NWS. The current prototype configures all the NWS components automatically, and recovers from process and node failures of some, if not all, of the components. Currently, the autonomous management functions are executed on a single Grid node; this is not ideal, as it hampers scalability and makes that node a single point of failure within the system. We plan to replicate and distribute the management functionality to solve both problems; as a proof of concept for many of the features, the current system suffices.

The prototype configures NWS automatically in the following manner:

• Sensorhosts are executed on all the nodes that the administrator listed.

• The nameserver is executed on one of the nodes the administrator listed.

• Network distances between the nodes are determined by actively measuring and averaging the RTT between the nodes.

• Given the RTT info, nodes are grouped into cliques and a representative node is chosen. Also, a node is chosen to act as the memoryhost of each group.

To determine the RTTs used for the node grouping above, we systematically ICMP "ping" the nodes in parallel in an n-by-n fashion. In particular, from a list of machines provided by the Grid administrator, we generate two shell scripts: one that runs on the machine acting as the autonomic monitor manager, and another that runs on all machines and pings all other nodes in the network. The scripts are transferred to the necessary nodes and executed using some secure invocation mechanism provided by the Grid itself, or some other mechanism such as ssh.

For each Grid node, we measure and record the node with the minimum RTT, excluding the node itself, as the most proximal node. Then, each node calculates the average RTT over all other nodes that have responded.
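A minimal sketch of this per-node measurement step is shown below. It is our own illustration rather than the prototype's code: the prototype generates shell scripts and ships them over ssh, whereas this sketch simply shells out to ping from Python; the host list is a hypothetical example, and a Unix ping whose summary line contains "min/avg/max" is assumed.

# Sketch of the per-node RTT measurement: ping every other node, record the most
# proximal node (minimum average RTT) and the overall average RTT.
import re
import subprocess

HOSTS = ["tgn002001", "tgn013001", "tgn015001"]   # list provided by the Grid administrator

def avg_rtt(host, count=3):
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    m = re.search(r"= ([\d.]+)/([\d.]+)/", out)   # "rtt min/avg/max/... = a/b/c/... ms"
    return float(m.group(2)) if m else None       # None if the host did not respond

def measure(myself):
    rtts = {h: avg_rtt(h) for h in HOSTS if h != myself}
    rtts = {h: r for h, r in rtts.items() if r is not None}
    proximal = min(rtts, key=rtts.get)            # most proximal node
    average = sum(rtts.values()) / len(rtts)      # average RTT over responding nodes
    return proximal, average, rtts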
After all the RTT measurements have finished, the average RTT data are sent to the autonomic monitor manager by every node in the system. The autonomic monitor manager in turn organizes the nodes into (clique) groups in the following bottom-up fashion. A node is chosen as the "current node", and a singleton group is created with the current node as its only member, designated as the "current group". Then the following process is repeated:

• If the most proximal node from the current node belongs to another group, the two groups are merged. Then a new group is created with some arbitrary non-member node as its singleton member, and that node and group are designated as the new current node and current group, respectively.

• If the most proximal node belongs to the same group, then a new group is created with some arbitrary non-member node as its singleton member, and that node and group are designated as the new current node and current group, respectively.

• If the proximal node does not belong to any group, then it is added to the current group and becomes the new current node.

The autonomic monitor manager then designates as the NWS nameserver the Grid node with (a) the highest ping connectivity with other machines and (b) the minimum average RTT from the other nodes recorded above. Then, for each group, the node designated as most proximal by the largest number of nodes in the group becomes the NWS memoryhost, and that node is also chosen to be the group (clique) representative.
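The following sketch gives one possible reading of the grouping and role-assignment procedure described above. It is our own illustration: the proximal-node map, average RTTs, and connectivity counts are assumed to come from the measurement step, and tie-breaking is arbitrary.

# Sketch of the bottom-up clique grouping and NWS role assignment described above.
# 'proximal' maps each node to its minimum-RTT neighbour, 'avg_rtt' holds each node's
# average RTT, and 'reachable' its ping connectivity count. All inputs are hypothetical.

def build_groups(nodes, proximal):
    groups = []                                   # list of sets of nodes
    ungrouped = set(nodes)

    def group_of(n):
        return next((g for g in groups if n in g), None)

    def new_current():
        n = ungrouped.pop()                       # arbitrary non-member node
        g = {n}
        groups.append(g)
        return n, g

    current, group = new_current()
    while True:
        p = proximal[current]
        target = group_of(p)
        if target is None:                        # proximal node not yet grouped:
            ungrouped.discard(p)                  #   pull it into the current group
            group.add(p)                          #   and walk on from it
            current = p
        else:
            if target is not group:               # proximal node in another group: merge
                group |= target
                groups.remove(target)
            if not ungrouped:                     # no non-member node left to seed a new group
                break
            current, group = new_current()
    return groups

def assign_roles(groups, proximal, avg_rtt, reachable):
    nodes = [n for g in groups for n in g]
    # nameserver: highest ping connectivity, ties broken by lowest average RTT
    nameserver = min(nodes, key=lambda n: (-reachable[n], avg_rtt[n]))
    memoryhosts = {}                              # group representative doubles as memoryhost
    for i, g in enumerate(groups):
        votes = {n: sum(1 for m in g if proximal[m] == n) for n in g}
        memoryhosts[i] = max(votes, key=votes.get)
    return nameserver, memoryhosts

For instance, build_groups(["a", "b", "c", "d"], {"a": "b", "b": "a", "c": "d", "d": "c"}) yields the two groups {"a", "b"} and {"c", "d"}.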

Now that the system has sufficient configuration information, it determines the dependency information between the NWS components and the nodes the components will actually run on. As described earlier, the NWS memoryhost needs to know the node and the port number on which the nameserver will run and listen. The NWS sensorhost also needs the nameserver information, as well as being told which memoryhost, on which host and port number, it should send its data to, and so on. Such configurations are formatted as command-line NWS component startup options to be executed at the respective nodes via some Grid execution service, in the order of nameserver, memoryhost, and sensorhosts.
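The dependency-ordered startup could look roughly like the sketch below. This is our own illustration: the command strings, option names, and ports are placeholders rather than the verbatim NWS command-line interface, and the real prototype issues the equivalent commands through a Grid execution service or ssh.

# Sketch of dependency-ordered startup: nameserver first, then memoryhosts, then
# sensorhosts, each told where its upstream component listens. Command strings and
# option names are placeholders, not the actual NWS CLI.
import subprocess

def run_on(host, command):
    # The prototype uses a Grid execution service or ssh; ssh is shown here.
    subprocess.run(["ssh", host, command], check=True)

def start_nws(nameserver, memoryhosts, sensors, ns_port=8090, mem_port=8050):
    ns = f"{nameserver}:{ns_port}"
    run_on(nameserver, f"nws_nameserver -p {ns_port}")           # 1. directory service
    for mem in memoryhosts.values():
        run_on(mem, f"nws_memory -N {ns} -p {mem_port}")         # 2. per-group storage
    for host, group in sensors.items():
        mem = f"{memoryhosts[group]}:{mem_port}"
        run_on(host, f"nws_sensor -N {ns} -M {mem}")             # 3. sensors last

# e.g. start_nws("tgn015001", {0: "tgn015001", 1: "tgn005001"},
#                {"tgn002001": 0, "tgn013001": 1})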

5. Fault Handling and Recovery of NWS Components in the Prototype

The current prototype handles two types of faults in NWS. One is simply when some component fails to execute or terminates unexpectedly. The other is when the autonomic monitor manager loses network access to a node due to trouble in the node hardware, the OS, or the network. (We currently do not distinguish between node failure and network failure.)

In the first case, the autonomic monitor manager simply attempts to re-execute the failed component. If the component fails again after repeated trials within a very short interval, the node is deemed to have a problem and is regarded as a node failure (i.e., the second case).

In the second case, if the node that failed was executing either the NWS memoryhost or the nameserver, the autonomic monitor manager must designate, prepare, and restart the service on a replacement machine. Moreover, components running on other nodes must be notified as well; for example, when the nameserver is restarted on another node with a different IP address, all the sensorhosts and memoryhosts must be notified of this fact and re-registered with the new nameserver, based on the new configuration info. Similarly, if the memoryhost crashes and is restarted on another node, then all the sensorhosts must be told to redirect their data.

For the first case, the current prototype simply monitors the component execution status using the ps command. Configuration info is held in a file on the node where the autonomic monitoring manager is running. To detect whether a node is running, we periodically check whether we can establish an ssh connection to the machine. To restart the components, a shell script for doing so is generated by the manager and executed on the replacement host.
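A minimal sketch of these two checks, assuming plain ssh and ps as described above (host names, process names, and the restart hook are hypothetical; the actual prototype drives generated shell scripts rather than Python):

# Sketch of the prototype's two checks: "is the component process alive?" (ps over ssh)
# and "is the node reachable at all?" (can we open an ssh connection?).
import subprocess

def ssh(host, command, timeout=10):
    return subprocess.run(["ssh", "-o", "BatchMode=yes", host, command],
                          capture_output=True, text=True, timeout=timeout)

def node_alive(host):
    try:
        return ssh(host, "true").returncode == 0   # otherwise: node or network loss
    except subprocess.TimeoutExpired:
        return False

def component_running(host, process_name):
    result = ssh(host, "ps -e -o comm=")           # list running command names
    return process_name in result.stdout.split()

def check_and_recover(host, process_name, restart):
    if not node_alive(host):
        return "node failure"                      # reconfigure onto a replacement host
    if not component_running(host, process_name):
        restart(host)                              # re-execute the failed component
        return "process restarted"
    return "ok"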
6. Evaluation of the Prototype

For evaluation, we installed our prototype on the Campus Grid at the Tokyo Institute of Technology, or the "Titech Grid" for short. The Titech Grid has 15 PC clusters totaling over 800 processors, spread throughout the two Titech campuses, which are situated approximately 30 km apart. The entire campus is covered by SuperTITANET, a multi-gigabit campus backbone. Each Titech Grid node is designed to be connected directly to the backbone via a managed switch, and can communicate peer-to-peer with any other node on the Grid. For the experiment, we employed the head login nodes of each PC cluster to run the NWS components.

6.1. Initial Setup

We show the time required for the initial setup of the system in Table 2. We see that the time required is on the order of tens of seconds, but is proportional to the number of clusters. A breakdown shows that approximately 50% of the time is spent collecting RTT data, and the rest on executing the components via SSH.

Table 2. Time for initial configuration of our prototype on the Titech Grid

  Clusters                3    6    10
  RTT Measurement (sec)  21   39    76
  NWS execution (sec)    19   30    52
  Total (sec)            40   69   128

The grouping algorithm worked very well, splitting the clusters automatically into two groups, one for each campus. This is due to the fact that the average RTT between the campuses is approximately 2-3 times greater than the RTT between the nodes located at the farthest ends of the same campus. (The groups are naturally cliques in this case, since they all have peer-to-peer connectivity.)

Figure 2. Result for 10 clusters on the Titech Grid. (Figure: ten cluster head nodes split into two clique groups, one per campus, Oookayama and Suzukakedai; N: NWS nameserver, M: memoryhost, S: sensorhost; arrows show the data flow from sensors to memoryhosts and the network performance measurement between the group representatives.)

6.2. Fault Recovery

We first investigate the scenario in which the cluster node that executes a memoryhost crashes. The upper half of Figure 3 shows the configuration prior to the crash. There are two groups in the Grid, and tgn015001 and tgn005001 execute the memoryhost components (the cluster nodes are real ones from the Titech Grid). Sensors send the monitored data to the memoryhost of the same group. When tgn015001 crashes, recovery is performed automatically, and tgn013001 is designated as its replacement. The appropriate memoryhost restart as well as sensor redirections are performed, as shown in the bottom half of Figure 3. Notice that only the cluster nodes in the right-hand group are affected.

Similarly, Figure 4 shows the case when the cluster node that runs the NWS nameserver in the Grid crashes. The whole recovery process took 39 seconds according to our measurements, 38 seconds of which were spent re-measuring the RTTs and the status of the components. The actual configuration decision and restart took less than 1 second. Again, the entire process worked automatically as intended.

The results show that (1) our prototype can cope well with the limited fault scenarios under the current configuration, but (2) for the current system to scale beyond hundreds of nodes (clusters), the measurement time must be reduced drastically. Currently, much of the overhead is attributable to ssh execution of the measurement processes; we must devise a faster, more persistent measurement scheme to amortize the overhead.

Figure 3. Example of memoryhost fault recovery. tgn*** is the name of each node; N: nameserver, M: memoryhost, S: sensor; arrows indicate the data flow from sensors to the memoryhost.

Figure 4. Example of NWS nameserver recovery.
7. Conclusion and Future Work

We proposed that Grid monitoring systems need to configure themselves automatically, and presented a prototype on top of NWS that deals with a limited but common set of faults. The prototype worked well under a limited setting on the Titech Grid, showing reasonable startup as well as recovery performance, and the system executed autonomously as expected.

As future work, we must make the algorithm more distributed, for several reasons. One is performance, where we need to parallelize the measurements to attain scalability. The other is to distribute the functions of the autonomic monitoring manager so that no single point of failure exists, and so that the system can cope with "double" faults. The algorithms for determining the cliques must be made distributed and improved. Finally, we must test the effectiveness of our algorithm on a much larger Grid. One idea is to regard every cluster node of the Titech Grid as a Grid node, which would give more than 400 nodes throughout the entire Grid. We will also employ the Grid testbed being built by the NAREGI project[8], which will come into operation in the Spring of 2004.

Acknowledgments

This work is partly supported by the National Research Grid Initiative (NAREGI), initiated by the Ministry of Education, Culture, Sports, Science and Technology (MEXT).

References

[1] W. Allcock, J. Bester, J. Bresnahan, A. Chervenak, I. Foster, C. Kesselman, S. Meder, V. Nefedova, D. Quesnel, and S. Tuecke. Data management and transfer in high-performance computational grid environments, 2001.
[2] B. Byrom et al. DataGrid information and monitoring services architecture: design, requirements and evaluation criteria, 2002.
[3] Condor Project: http://www.cs.wisc.edu/condor/.
[4] EU DataGrid Project: http://eu-datagrid.web.cern.ch/eu-datagrid/.
[5] Global Grid Forum: http://www.ggf.org/.
[6] Globus Alliance: http://www.globus.org/.
[7] Hawkeye: http://www.cs.wisc.edu/condor/hawkeye/.
[8] National Research Grid Initiative: http://www.naregi.org/.
[9] P. Pattnaik, K. Ekanadham, and J. Jann. Autonomic computing and Grid. In Grid Computing (edited by F. Berman et al.), 2003.
[10] B. Tierney, R. Aydt, D. Gunter, W. Smith, V. Taylor, R. Wolski, and M. Swany. A Grid monitoring architecture, 2002.
[11] R. Wolski, N. T. Spring, and J. Hayes. The Network Weather Service: a distributed resource performance forecasting service for metacomputing. Future Generation Computer Systems, 15(5-6):757-768, 1999.
