
4th IEEE International Conference on Information Technology: New Generations (ITNG 2007), April 2-4, 2007, Las Vegas, Nevada, USA

Design and Implementation of a High Performance MPI for Large Scale Computing System

Prabu D, Vanamala V, Sanjeeb Kumar Deka, Sridharan R, Prahlada Rao BB†, Mohanram N†
Centre for Development of Advanced Computing (C-DAC), Bangalore, INDIA
(prabud, vanamalav, sanjeebd, rsridharan, prahladab, mohan)@cdacb.ernet.in

Abstract

The need to process large applications in reasonable time has led to the use of parallel processing and of programming standards that enable user applications to run in a parallel environment. This paper presents the C-DAC Message Passing Interface (C-MPI), an optimized MPI for Clusters of Multiprocessors (CLUMPS). C-MPI has been able to exhibit higher performance than public-domain MPI implementations such as MPICH. In this work we present the design flow of C-MPI and compare the performance of C-MPI and MPICH over the PARAMNet-II interconnect using KSHIPRA as the underlying protocol.

1. Introduction

The enormous advances in PCs and high-speed networks have led to low-cost clusters of personal computers that can provide computing power equivalent to that of supercomputers. As per Moore's law, processor speed doubles every 18 months [1]. Advances in hardware, CPU technologies, and high-speed communication networks enable parallel application developers to use implementations of the Message Passing Interface (MPI) standard. The MPI standard [2] was defined by a panel of parallel programming industry leaders, including representatives from national laboratories, universities, and key parallel system vendors, and it has led to the development of commercial MPI implementations with low latency and high bandwidth. The central idea of this paper is to provide the user with a high performance, optimized Message Passing Interface over a high-speed interconnect for solving complex applications in a parallel programming environment.

The rest of the paper is organized as follows. Section 2 provides a brief overview of C-MPI. Section 3 gives a brief overview of MPICH. Section 4 presents a brief description of PARAMNet-II. In Section 5, the experimental results are discussed and compared with those of MPICH. Finally, Section 6 presents the conclusions and future work.

______________________________________________________
†Any further correspondence regarding this publication can be addressed to (mohan, prahladab)@cdacb.ernet.in

2. C-DAC Message Passing Interface

C-MPI is a high performance implementation of the MPI standard for Clusters of Multiprocessors (CLUMPS). By adhering to the MPI standard, C-MPI supports the execution of the multitude of existing MPI applications with enhanced performance on CLUMPS. C-MPI can work both over TCP/IP using Ethernet and over KSHIPRA [3] on PARAMNet-II. It also leverages the fact that most high performance networks provide a large exchange communication bandwidth, which allows send and receive operations on messages to proceed simultaneously and consequently reduces the number of hops in the transmission path.

In C-MPI, the MPI collective communication calls have been optimized for the CLUMPS architecture. The C-MPI algorithms make effective use of shared memory communication on multiprocessor nodes, thereby reducing the computation time of the user application. C-MPI provides the following advantages over the public-domain MPI, MPICH [4]:

• It supports multiple protocols, using the network for remote communication and shared memory for local communication.
• Its collective communication routines have been optimized for CLUMPS to minimize execution time.

C-MPI is designed to achieve both high performance and portability. It is layered over an Abstract Device Interface (ADI) to maintain portability. On C-DAC's PARAM Padma [5], C-MPI employs both TCP/IP and KSHIPRA in the underlying ADI layer.
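The shared-memory-plus-network strategy behind these optimized collectives can be illustrated with a small sketch. The following C fragment is not C-MPI source code; under assumptions of ours (a per-node colour derived from the host name, a helper named clumps_bcast, and rank 0 as the root), it shows how a CLUMPS-aware broadcast can confine network traffic to one leader rank per node and finish the operation inside each node, where the MPI library can use shared memory.

/* clumps_bcast.c -- illustrative two-level broadcast (not C-MPI source code).
 * Ranks are first grouped per node (colour derived from the host name, an
 * assumption of this sketch); rank 0 broadcasts over the network to one
 * leader per node, and each leader re-broadcasts inside its own node. */
#include <mpi.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

static int node_colour(void)
{
    char host[256];
    unsigned h = 0;
    gethostname(host, sizeof(host));
    for (size_t i = 0; i < strlen(host); i++)     /* simple string hash     */
        h = 31u * h + (unsigned char)host[i];
    return (int)(h & 0x7fffffff);                 /* non-negative colour    */
}

/* Broadcast from rank 0 of comm; assumes rank 0 is the root. */
void clumps_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, node_rank;
    MPI_Comm node_comm, leader_comm;

    MPI_Comm_rank(comm, &rank);

    /* Intra-node communicator: all ranks that share a host name. */
    MPI_Comm_split(comm, node_colour(), rank, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Inter-node communicator: one leader (node_rank == 0) per node. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    if (leader_comm != MPI_COMM_NULL) {           /* network stage          */
        MPI_Bcast(buf, count, type, 0, leader_comm);
        MPI_Comm_free(&leader_comm);
    }
    MPI_Bcast(buf, count, type, 0, node_comm);    /* shared-memory stage    */
    MPI_Comm_free(&node_comm);
}

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) value = 42;                    /* data originates at rank 0 */
    clumps_bcast(&value, 1, MPI_INT, MPI_COMM_WORLD);
    printf("rank %d received %d\n", rank, value);
    MPI_Finalize();
    return 0;
}

In practice the sub-communicators would be built once and cached rather than rebuilt on every call; the sketch rebuilds them only to stay self-contained.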

The C-MPI functions are implemented in terms of macros and functions. Figure 1 shows the C-MPI control flow: the upper layer carries out the exchange of control information, and the lower layer performs the transfer of data from one process address space to another.

Figure 1. C-MPI control flow. A user task (C-MPI application) calls the API (upper layer), which provides collective and point-to-point communication; the protocol module (lower layer) maps these operations onto TCP/IP, KSHIPRA, or shared memory, with KSHIPRA carried over the PARAMNet-II/C-VIPL library.

To achieve optimal performance, C-MPI can work directly over the System Area Network (SAN) in user space using the lightweight communication protocol KSHIPRA. This reduces the communication time of the MPI point-to-point communication protocols. At the lowest level, communication takes place in a point-to-point manner, as shown in Fig. 1; hence, a reduction in the communication time of the point-to-point calls leads to a reduced communication time for the collective communication calls as well.
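Since everything above the protocol module in Fig. 1 is expressed through standard MPI calls, an application is written the same way whichever device the lower layer selects, and the point-to-point calls whose latency improvement propagates up to the collectives are ordinary MPI_Send/MPI_Recv pairs. A minimal example (illustrative standard MPI code, not taken from C-MPI; run with at least two ranks) is:

/* ptp.c -- minimal standard-MPI point-to-point exchange.  The same source
 * runs unchanged whichever device (TCP/IP, KSHIPRA, shared memory) the
 * lower layer of the MPI library selects. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, msg = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        msg = 100;
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* to rank 1  */
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); /* from rank 0 */
        printf("rank 1 received %d\n", msg);
    }

    MPI_Finalize();
    return 0;
}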
3. MPICH

MPICH [6] is an open-source, portable implementation of the MPI standard, which is a widely used parallel programming paradigm for distributed-memory applications in parallel computing. MPICH is available for different flavors of Unix (including Linux) and for Microsoft Windows, and it is jointly developed by Argonne National Laboratory and Mississippi State University. MPICH supports one-sided communication, dynamic processes, intercommunicator collective operations, and expanded MPI-IO functionality. Moreover, it can work over clusters consisting of both single-processor and SMP nodes. MPICH implementations are not thread safe. MPICH uses TCP/UDP socket interfaces to communicate messages between nodes, so considerable effort has gone into reducing the overhead incurred in processing the TCP/IP stack. To overcome this problem, MPICH has been enabled to support the Virtual Interface Architecture (VIA). VIA [7] defines mechanisms that bypass layers of the protocol stack and avoid intermediate copies of data while sending and receiving messages, which allows a significant increase in communication performance and a decrease in processor utilization by the communication subsystem.

4. PARAMNet-II

PARAMNet-II is a high-bandwidth, low-latency System Area Network (SAN) developed by C-DAC for its PARAM supercomputers. Its usability ranges from commodity compute servers to high performance scalable systems. The components of PARAMNet-II are a Network Interface Card (NIC), one or more routing switches, and the accompanying software. PARAMNet-II [8] is the second-generation System Area Network of the PARAMNet series.

Figure 2. PARAMNet-II card.

PARAMNet-II has a 16-port non-blocking crossbar-based switch that provides a bandwidth of 2.5 Gigabits per second. A multilevel switching strategy is used to support the 16 ports of the switch. The switch can operate in full-duplex mode with latency as low as 0.5 microseconds. A distributed scheduler controls the crossbar; distributed schedulers allow an individual routing table per port, so any network topology can be supported. Each scheduler gets its input from a flow-control block that handles the low-level hardware handshake between the end points. For uniform distribution of bandwidth within a group, the switch uses group adaptive routing based on a Least Recently Used (LRU) algorithm.

The PARAMNet-II CCPIII NIC is based on C-DAC's Communication Co-Processor III chip. It supports direct user-level access with protection and up to 1024 connections. Address translation, packetization, and re-assembly of messages are done in the hardware itself. PARAMNet-II is accessed through an Application Programming Interface (API) called C-DAC's Virtual Interface Provider Library (C-VIPL); the latter is the SAN interface that adheres to C-DAC's KSHIPRA.

5. Experimental Results

5.1. Experimental Setup

The experimental evaluation of C-MPI's performance was done on PARAM Padma at CTSF [9], C-DAC Knowledge Park, Bangalore, India. PARAM Padma, whose specification is given in Table 1, currently has a peak performance of one Teraflop. The performance testing of 32-bit C-MPI-1.0, compared against the public-domain 32-bit MPICH-1.2.4, was done on 2, 25, 32, and 62 four-way nodes running the AIX 5.1 operating system.

Table 1. Description of PARAM Padma

Specification                            Compute nodes                      File servers
Configuration                            62 nodes of 4-way SMP and          6 nodes of 4-way SMP
                                         one node of 32-way SMP
No. of processors                        248 (Power4 @ 1 GHz)               24 (UltraSPARC IV @ 900 MHz)
Aggregate memory                         0.5 Terabytes                      96 Gigabytes
Internal storage                         4.5 Terabytes                      0.4 Terabytes
Operating system                         AIX                                Solaris
Peak computing power for 62 AIX nodes    992 GF (~1 TF)                     --
File system                              --                                 QFS

5.2. Performance Evaluation

5.2.1. HPL: The HPL [10] benchmark is a numerically intensive test. It is a popular benchmark suite for evaluating the performance of supercomputers and clusters, and it involves solving a dense linear system in double-precision (64-bit) arithmetic on a distributed-memory system. The HPL benchmark is used here to test the efficiency of the PARAM Padma cluster. Fig. 3 depicts the results of running the HPL benchmark on 62 four-way nodes for C-MPI and MPICH. The sustained performance for C-MPI is found to be 532 Gigaflops against the calculated peak performance of 992 Gigaflops, whereas it is 495.80 Gigaflops for MPICH. As Fig. 3 shows, C-MPI clearly outperforms MPICH.

Figure 3. Performance comparison of C-MPI and MPICH for the HPL benchmark on PARAM Padma (block size Nb = 200; performance in Gflops versus problem size N from 10000 to 244000).
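For reference, the peak and efficiency figures quoted above can be reproduced with the following arithmetic; the assumption of four floating-point operations per clock cycle per Power4 processor is ours and is not stated in the paper.

\[
R_{\mathrm{peak}} = 248 \times 1\,\mathrm{GHz} \times 4\ \mathrm{flops/cycle} = 992\ \mathrm{Gflops}
\]
\[
\eta_{\mathrm{C-MPI}} = 532 / 992 \approx 53.6\,\%, \qquad
\eta_{\mathrm{MPICH}} = 495.8 / 992 \approx 50.0\,\%
\]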

5.2.2. NAS: The NAS Parallel Benchmarks suite [11] is widely used to compare the performance of different types of parallel computing platforms, since it contains computational kernels that are representative of several different algorithms used in real-world applications. NAS 2.4 comprises eight CFD problems, coded in MPI and standard FORTRAN 77/C. The performance, in Mflops, for selected benchmarks run on PARAM Padma is shown in Table 2. The optimization flags used are -qstrict -O3 -bmaxdata:0x80000000, and the RAND value is randi8.

Table 2. NAS performance of C-MPI and MPICH over PARAMNet-II

Benchmark    Class / # of processors    Performance (Mflops)
                                        MPICH        C-MPI
BT           C/25                       5223.28      5362.87
CG           D/32                        757.68       834.97
IS           C/32                        113.14       117.16
LU           C/32                       6325.97      9236.28
             D/32                       7226.80      7681.42
MG           C/32                       5159.88      5906.21
             D/32                       6677.41      7056.79
SP           C/32                       3139.60      3462.83

5.2.3. PMB: PMB (the Pallas MPI Benchmark) [12] is a benchmark suite used for measuring MPI performance. It consists of a concise set of benchmarks targeted at evaluating the most important MPI functions. The benchmarks under PMB are PingPong, PingPing, Sendrecv, Exchange, Allreduce, Reduce, Reduce-scatter, Allgather, Allgatherv, Alltoall, Bcast, and Barrier. Table 3 shows the PMB output: the latency obtained with the PingPong benchmark across two AIX nodes. The latency of C-MPI is found to be smaller than that of MPICH.

Table 3. Latency of C-MPI and MPICH

MPI Type    Latency for zero-byte message
MPICH       29.18 µs
C-MPI       21.38 µs
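The zero-byte PingPong latency reported in Table 3 is, in essence, obtained by timing repeated send/receive round trips between two ranks and halving the mean round-trip time. The following C sketch illustrates that measurement pattern (it is not the PMB source code, and the repetition count is an arbitrary choice of this sketch):

/* pingpong.c -- sketch of a zero-byte ping-pong latency measurement
 * between ranks 0 and 1, in the spirit of the PMB PingPong benchmark.
 * Run with two ranks, one per node, to measure inter-node latency. */
#include <mpi.h>
#include <stdio.h>

#define REPS 1000                      /* arbitrary repetition count        */

int main(int argc, char **argv)
{
    int rank;
    char dummy = 0;                    /* valid buffer for zero-byte sends  */
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);       /* start both ranks together         */
    double t0 = MPI_Wtime();

    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(&dummy, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&dummy, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(&dummy, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(&dummy, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)                     /* one-way latency = half round trip */
        printf("latency: %.2f us\n", 1e6 * (t1 - t0) / (2.0 * REPS));

    MPI_Finalize();
    return 0;
}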
5.2.4. P-COMS: P-COMS [13] comprises a set of MPI benchmarks used for measuring communication overheads on large message-passing clusters (such as PARAM 10000 and PARAM Padma). The benchmarks are implemented using the Message Passing Interface standard. The benchmarks under P-COMS are alltoall, ptp, advptp, cc, ccomp, gppong, roundtrip, allgring, oneway, and circularshift. Table 4 shows the performance of the two MPIs for the P-COMS benchmark. From the results it is evident that C-MPI has lower latency and higher bandwidth than MPICH over the PARAMNet-II interconnect.

Table 4. Latency and bandwidth of C-MPI and MPICH for the P-COMS benchmark over PARAMNet-II

Communication overhead parameter    MPICH          C-MPI
Latency for zero-byte message       29.82 µs       22.39 µs
Bandwidth for 10 MB                 101.13 MBps    120.00 MBps
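The 10 MB bandwidth figures in Table 4 correspond to the usual pattern of timing repeated large transfers. The sketch below illustrates the idea (it is not the P-COMS source code; the 10 MB message size matches Table 4, while the repetition count and the one-byte acknowledgement are arbitrary choices of this sketch):

/* bandwidth.c -- sketch of a one-way bandwidth measurement for a 10 MB
 * message between ranks 0 and 1.  Run with two ranks on different nodes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (10 * 1024 * 1024)   /* 10 MB, as in Table 4              */
#define REPS 50                        /* arbitrary repetition count        */

int main(int argc, char **argv)
{
    int rank;
    char ack = 0;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(MSG_BYTES);
    if (buf == NULL) MPI_Abort(MPI_COMM_WORLD, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&ack, 1, MPI_BYTE, 1, 1, MPI_COMM_WORLD, &st);  /* ack */
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(&ack, 1, MPI_BYTE, 0, 1, MPI_COMM_WORLD);       /* ack */
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0) {
        /* The 1-byte ack only keeps the ranks in step; its cost is
         * negligible against a 10 MB payload. */
        double mb = (double)MSG_BYTES * REPS / (1024.0 * 1024.0);
        printf("bandwidth: %.2f MBps\n", mb / (t1 - t0));
    }

    free(buf);
    MPI_Finalize();
    return 0;
}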


Performance tests using the HPL, NAS, Pallas, and P-COMS benchmarks clearly reveal that C-MPI performs better than MPICH over PARAMNet-II. (Note that all HPC benchmark results are taken for best performance.) From Table 4 we see that C-MPI provides much higher bandwidth and lower latency than MPICH.

6. Conclusions and Future Work

In this paper we presented the design flow and implementation of C-MPI over PARAMNet-II on PARAM Padma. Based on the experimental results shown in Section 5 using standard HPC benchmarks, we conclude that the performance of C-MPI is better than that of MPICH over the PARAMNet-II interconnect platform. This work can be extended to enable C-MPI for the InfiniBand Architecture interconnect and also for Grid environments [14].

7. References

[1] Carlo Kopp, "Moore's Law and its Implications for Information Warfare," The 3rd International Association of Old Crows (AOC) Electronic Warfare Conference Proceedings, Zurich, May 20-25, 2000. http://www.ausairpower.net/moore-iw.pdf

[2] The Message Passing Interface Standard. http://www-unix.mcs.anl.gov/mpi/

[3] KSHIPRA - Scalable Communication Substrate for Cluster of Multi Processors. http://www.cdac.in/HTmL/ssdgblr/kshipra.asp

[4] MPICH - A Portable Implementation of MPI. http://www-unix.mcs.anl.gov/mpi/mpich1/

[5] C-DAC, India, PARAM Padma Supercomputer. http://www.cdac.in/html/parampma.asp

[6] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A high-performance, portable implementation of the MPI Standard," Parallel Computing, 22: 789-828, 1996. http://www.globus.org/alliance/publications/papers/paper1.pdf

[7] Virtual Interface Architecture. www.intel.com/intelpress/via

[8] C-DAC, PARAMNet-II NIC. http://www.cdac.in/HTML/htdg/products/pnic.asp

[9] C-DAC's Tera Scale Supercomputing Facility, C-DAC Bangalore. www.cdac.in/html/ctsf/

[10] A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary, "HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers," Innovative Computing Laboratory, University of Tennessee, January 2004. http://www.netlib.org/benchmark/hpl/

[11] The Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks. www.nas.nasa.gov/Software/NPB/

[12] Pallas MPI Benchmark (PMB), Intel. http://www.pallas.com/e/products/index.htm

[13] PARAM Communication Overhead Measurement Suites (P-COMS), Centre for Development of Advanced Computing. http://www.cdac.in/html/betatest/hpc.asp

[14] GARUDA, C-DAC Bangalore, India, The National Grid Computing Initiative. http://www.garudaindia.in/tech_research.asp
