This work is sponsored by the High Performance Computing Modernization Office under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
[Diagram: embedded multicomputer — data communication among processors P0-P3 and with consoles and other computers; computation via VSIPL.]
Definitions:
- VSIPL = Vector, Signal, and Image Processing Library
- MPI = Message Passing Interface
- MPI/RT = MPI Real-Time
- DRI = Data Reorganization Interface
- CORBA = Common Object Request Broker Architecture
- HP-CORBA = High Performance CORBA
Goal: demonstrate Productivity (3x), Portability, and Performance (1.5x) improvements through Object-Oriented Standards.
- Portability: lines-of-code changed to port/scale to a new system
- Productivity: lines-of-code added to add new functionality
- Performance: computation and communication benchmarks
[Roadmap spanning Applied Research, Development, and Demonstration phases: Object-Oriented Standards, VSIPL++ and Parallel VSIPL++, a unified computation/communication library, and fault-tolerance and self-optimization prototypes, leading to demonstrations of 3x portability, scalability, and new functionality.]
Outline
[Chart: estimated DoD expenditures for embedded signal and image processing hardware and software ($B), from FY98, on a $0-$2.0B scale.]
COTS acquisition practices have shifted the burden from point-design hardware to point-design software (i.e., COTS HW requires COTS SW). Software costs for embedded systems could be reduced by one-third with improved programming models, methodologies, and standards.
[Chart: processing requirements for Medical, Radar, Sonar, Scientific, and Encoding applications versus COTS capability following Moore's Law, showing the gap between today's systems and the goal; annotation notes the need for faster networks.]
Requires high performance computing and networking.
EMBEDDED PROCESSING REQUIREMENTS WILL EXCEED 10 TFLOPS IN THE 2005-2010 TIME FRAME
Signal processing drives computing requirements. Rapid technology insertion is critical for sensor dominance.
Infrastructure Assessment
Highly distributed computing. Fewer very large data movements.
Parallel Pipeline
Signal Processing Algorithm: Filter (XOUT = FIR(XIN)) → Beamform (XOUT = w*XIN) → Detect (XOUT = |XIN| > c), mapped onto a Parallel Computer.
- Data parallel within stages
- Task/pipeline parallel across stages
Filtering
XOUT = FIR(XIN, h); input XIN is Nchannel x Nsamples, output XOUT is Nchannel x Nsamples/Ndecimation.
- Fundamental signal processing operation
- Converts data from wideband to narrowband via filter
- O(Nsamples Nchannel Nh / Ndecimation)
- Degrees of parallelism: Nchannel
Beamforming
XOUT = w*XIN; input XIN is Nchannel x Nsamples, output XOUT is Nbeams x Nsamples.
- Fundamental operation for all multi-channel receiver systems
- Converts data from channels to beams via matrix multiply
- O(Nsamples Nchannel Nbeams)
- Key: the weight matrix can be computed in advance
- Degrees of parallelism: Nsamples
Detection
XOUT = |XIN| > c; input XIN is Nbeams x Nsamples, output is a list of Ndetects detections.
- Fundamental operation for all processing chains
- Converts data from a stream to a list of detections via thresholding
- O(Nsamples Nbeams)
- Number of detections is data dependent
- Degrees of parallelism: Nbeams, Nchannels, or Ndetects
Types of Parallelism
[Diagram: Input feeds a Scheduler and a Pipeline of Beamformer 1 / Beamformer 2 and Detector 1 / Detector 2 stages.]
Outline
FIR Overview
FIR: y is the convolution of x with f
- y = filtered data [#samples]
- x = unfiltered data [#samples]
- f = filter [#coefficients]
Algorithm parameters: #channels, #samples, #coefficients, #decimation
Implementation parameters: direct sum or FFT based
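To make the direct-sum option concrete, here is a minimal C++ sketch of a decimating FIR filter for one channel (names and types are illustrative, not from VSIPL or PVL):

#include <vector>
#include <cstddef>

// Direct-sum FIR: y is the convolution of x with the filter f,
// keeping every decimation-th output sample.
std::vector<float> fir(const std::vector<float>& x,     // unfiltered data [#samples]
                       const std::vector<float>& f,     // filter [#coefficients]
                       std::size_t decimation)
{
    std::vector<float> y;
    for (std::size_t n = 0; n < x.size(); n += decimation) {
        float acc = 0.0f;
        // Convolution sum over the filter taps that overlap sample n.
        for (std::size_t k = 0; k < f.size() && k <= n; ++k)
            acc += f[k] * x[n - k];
        y.push_back(acc);
    }
    return y;   // roughly #samples / #decimation outputs per channel
}

Each channel can be filtered independently, which is where the channel-level parallelism comes from; an FFT-based implementation would trade this inner loop for forward/inverse FFTs.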
A Finite Impulse Response (FIR) filter allows a range of frequencies to be selected, O(N Nh). Example: a band-pass filter passing frequencies between f1 and f2.
[Diagrams: time- and frequency-domain views of the signal before and after filtering (via FFT), and the tapped delay line with coefficients h1, h2, h3, ..., hL.]
Outline
Beamforming Overview
Beamform: y = x w (matrix multiply)
- y = beamformed data [#samples x #beams]
- x = channel data [#samples x #channels]
- w = (tapered) steering vectors [#channels x #beams]
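A minimal sketch of the beamforming matrix multiply in plain C++ (the Matrix type and loop structure here are illustrative; a library implementation would use a VSIPL/PVL matrix product):

#include <vector>
#include <complex>
#include <cstddef>

using cfloat = std::complex<float>;

// Row-major matrix stored as a flat vector.
struct Matrix {
    std::size_t rows, cols;
    std::vector<cfloat> data;
    cfloat& at(std::size_t r, std::size_t c)       { return data[r * cols + c]; }
    cfloat  at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// y[#samples x #beams] = x[#samples x #channels] * w[#channels x #beams]
// The (tapered) steering vectors w can be computed in advance.
Matrix beamform(const Matrix& x, const Matrix& w)
{
    Matrix y{ x.rows, w.cols, std::vector<cfloat>(x.rows * w.cols) };
    for (std::size_t s = 0; s < x.rows; ++s)          // samples: the degrees of parallelism
        for (std::size_t b = 0; b < w.cols; ++b)       // beams
            for (std::size_t c = 0; c < x.cols; ++c)   // channels
                y.at(s, b) += x.at(s, c) * w.at(c, b);
    return y;
}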
[Diagram: plane wavefronts arriving at the array; arrow indicates the direction of propagation.]
Parallel Beamformer
Outline
x[n] = cell under test
T[n] = Sum(x[i]) / 2M, over Nguard < |i - n| < M + Nguard
Angle estimate: take the ratio of beams; do a lookup.
Algorithm parameters: #samples, #beams, steering vectors, #noise samples, #max detects
Implementation parameters: greatest-of, censored greatest-of, ordered statistics; averaging vs. sorting
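A simplified cell-averaging sketch of the detection step (C++; the guard-cell handling follows the definitions above, while the greatest-of/censored/ordered-statistic variants and the PFA-derived threshold factor are left out):

#include <vector>
#include <cstddef>

// Returns indices of range cells whose power exceeds the scaled noise estimate.
// M = noise cells averaged on each side, nguard = guard cells, t = threshold factor.
std::vector<std::size_t> cfarDetect(const std::vector<float>& power,
                                    std::size_t M, std::size_t nguard, float t)
{
    std::vector<std::size_t> detections;
    for (std::size_t n = 0; n < power.size(); ++n) {
        float noise = 0.0f;
        std::size_t count = 0;
        // Average up to 2M cells around n, skipping the guard cells next to it.
        for (std::size_t off = nguard + 1; off <= nguard + M; ++off) {
            if (n >= off)               { noise += power[n - off]; ++count; }
            if (n + off < power.size()) { noise += power[n + off]; ++count; }
        }
        if (count == 0) continue;
        noise /= static_cast<float>(count);
        if (power[n] > t * noise)
            detections.push_back(n);    // number of detections is data dependent
    }
    return detections;
}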
[Diagram: a noise-estimate buffer slides along range; the cell under test x[i] and its guard cells are excluded, the surrounding M cells are averaged (1/M), and |x[i]|^2 is compared against the resulting threshold to excise or retain the cell.]
Reference: S. L. Wilson, "Analysis of NRL's two-pass greatest-of excision CFAR," Internal Memorandum, MIT Lincoln Laboratory, October 5, 1998.
[Diagram: second pass — |x[i]|^2 is compared against a threshold formed from the retained noise samples z[n], n = 1, ..., M, and the cell is declared Target or Noise; the second threshold is T2 = f(M, T1, PFA).]
Outline
Example pipeline: Filter (0.5 seconds) → Beamform, XOUT = w*XIN (0.3 seconds) → Detect, XOUT = |XIN| > c (0.8 seconds), mapped onto a parallel computer.
Latency: total processing + communication time for one frame of data (the sum of the stage times). Throughput: rate at which frames can be input (set by the maximum stage time).
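A trivial C++ sketch of the two metrics, assuming the stage times in the figure are 0.5 s, 0.3 s, and 0.8 s:

#include <vector>
#include <numeric>
#include <algorithm>
#include <cstdio>

int main() {
    std::vector<double> stageSeconds = {0.5, 0.3, 0.8};   // one entry per pipeline stage
    double latency = std::accumulate(stageSeconds.begin(), stageSeconds.end(), 0.0);
    double slowest = *std::max_element(stageSeconds.begin(), stageSeconds.end());
    std::printf("latency    = %.2f seconds per frame\n", latency);       // 1.60
    std::printf("throughput = %.2f frames per second\n", 1.0 / slowest); // 1.25
}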
[Charts: component latency vs. filter hardware (Filter: latency = 2/N; Beamform: latency = 1/N) and system latency vs. hardware allocation, under the constraints hardware < 32 and latency < 8; the local optimum for each component differs from the global optimum for the system.]
System Graph
Filter → Beamform → Detect
The System Graph can store the hardware resource usage of every possible Task & Conduit.
Examples: Beamform (XIN → multiply by W3 → XOUT) and Matched Filter (XIN → FFT → multiply by W4 → IFFT → XOUT), each mappable to hardware such as a PowerPC cluster or an embedded board.
Need to automate the process of mapping the algorithm to hardware.
Outline
Input channels 1, 2, ..., N:
- Data enters the system via different channels
- Filtering is performed in a channel-parallel fashion
- Beamforming requires combining data from multiple channels
[Diagram: corner turn — data distributed across processors by channels is redistributed by samples.]
- Each processor sends data to each other processor
- Half the data moves across the bisection of the machine
[Diagram: between pulses, the data cube is redistributed from P1 processors to P2 processors.]
All-to-all communication where each of P1 processors sends a message of size B to each of P2 processors. Total data cube size is P1*P2*B.
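A minimal sketch of this kind of redistribution with plain MPI in C++ (it assumes P1 = P2 = the communicator size and an illustrative block size B; a real corner turn would also repack the data cube before and after the exchange):

#include <mpi.h>
#include <vector>
#include <cstddef>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int np = 0, rank = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int B = 1024;   // bytes sent to each destination processor
    std::vector<char> sendBuf(static_cast<std::size_t>(np) * B, static_cast<char>(rank));
    std::vector<char> recvBuf(static_cast<std::size_t>(np) * B);

    // Every processor sends a block of size B to every other processor.
    MPI_Alltoall(sendBuf.data(), B, MPI_BYTE,
                 recvBuf.data(), B, MPI_BYTE, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}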
Outline
Detection → Estimation
[Diagram: work per processor — measured in pixels (static) the load is balanced at 0.24 each; measured in detections (dynamic) the load is unbalanced (0.11, 0.15, 0.13, 0.08, 0.10, 0.97, 0.30).]
Static parallelism implementations lead to unbalanced loads.
[Chart: speedup vs. number of processors for static parallelism, parameterized by M = # units of work and f = allowed failure rate, with 50% and 15% efficiency reference points.]
- Random fluctuations bound performance
- Much worse if targets are correlated
- Sets the maximum number of targets in nearly every system
Static Derivation
Static speedup = Nd / Nf, where:
- Nd = total detections
- Nf = allowed detections (per processor) with failure rate f
- Np = number of processors
Nf is chosen so that a processor's detection count N, assumed Poisson with mean Nd/Np, exceeds Nf with probability at most f:
P(N <= Nf) = Sum over n = 0..Nf of (Nd/Np)^n e^(-Nd/Np) / n! = 1 - f
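A small C++ sketch of this calculation (the Nd, Np, and f values are illustrative): it finds the smallest Nf whose Poisson CDF reaches 1 - f, then reports the static speedup Nd/Nf.

#include <cmath>
#include <cstdio>

// Smallest Nf such that P(N <= Nf) >= 1 - f for N ~ Poisson(lambda).
int allowedDetections(double lambda, double f) {
    double term = std::exp(-lambda);   // P(N = 0)
    double cdf  = term;
    int n = 0;
    while (cdf < 1.0 - f) {
        ++n;
        term *= lambda / n;            // P(N = n) from P(N = n - 1)
        cdf  += term;
    }
    return n;
}

int main() {
    double Nd = 1000.0;   // total detections
    int    Np = 64;       // number of processors
    double f  = 0.01;     // allowed failure rate
    int Nf = allowedDetections(Nd / Np, f);
    std::printf("Nf = %d, static speedup = %.1f (ideal %d)\n", Nf, Nd / Nf, Np);
}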
Dynamic Parallelism
[Chart: speedup vs. number of processors for dynamic parallelism, parameterized by M = # units of work and f = allowed failure rate, with 50% and 94% efficiency reference points.]
- Assign work to processors as needed
- Large improvement even in the worst case
Dynamic Derivation
Worst-case dynamic speedup = Nd / (Nd/Np + g*Nd) = Np / (1 + g*Np), where:
- Nd = total detections
- Np = number of processors
- g = granularity of work
[Chart: worst-case speedup vs. number of processors (1 to 1024), with a 50% efficiency reference line.]
Dynamic parallelism delivers good performance even in the worst case. Static parallelism is limited by random fluctuations (up to 85% of processors are idle).
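A minimal sketch of dynamic work assignment (C++ threads pulling detections from a shared counter; this illustrates the idea rather than the tutorial's implementation):

#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    const int numDetections = 1000;
    const int numWorkers = 8;
    std::atomic<int> next{0};                     // index of the next unit of work
    std::vector<int> processedBy(numDetections, -1);

    auto worker = [&](int id) {
        // Each worker grabs the next detection as soon as it is free, so an
        // expensive detection does not leave the other workers idle.
        for (int i = next.fetch_add(1); i < numDetections; i = next.fetch_add(1))
            processedBy[i] = id;                  // stand-in for the real estimation work
    };

    std::vector<std::thread> pool;
    for (int id = 0; id < numWorkers; ++id) pool.emplace_back(worker, id);
    for (auto& t : pool) t.join();
    std::printf("all %d detections assigned dynamically\n", numDetections);
}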
Outline
- Industry standards (e.g., VSIPL, MPI) represent a significant improvement over coding with vendor-specific libraries
- The next generation of object-oriented standards will provide enough support to write truly portable, scalable applications
Scalable/portable code provides high productivity.
Code
Mapping stages to processor ranks (two processors per stage):

while (!done) {
  if ( rank()==1 || rank()==2 )
    stage1();
  else if ( rank()==3 || rank()==4 )
    stage2();
}

Adding two more processors to stage 2 forces a code change:

while (!done) {
  if ( rank()==1 || rank()==2 )
    stage1();
  else if ( rank()==3 || rank()==4 || rank()==5 || rank()==6 )
    stage2();
}
Algorithm and hardware mapping are linked. The resulting code is non-scalable and non-portable.
Scalable Approach
The same code runs under a single-processor or a multi-processor mapping; only the Map objects change:

#include <Vector.h>
#include <AddPvl.h>

void addVectors(const Map& aMap, const Map& bMap, const Map& cMap) {
  Vector< Complex<Float> > a("a", aMap, LENGTH);
  Vector< Complex<Float> > b("b", bMap, LENGTH);
  Vector< Complex<Float> > c("c", cMap, LENGTH);

  b = 1;
  c = 2;
  a = b + c;
}
[Diagrams: the same A = B + C expression mapped to one processor and to multiple processors.]
- Single processor and multi-processor code are the same
- Maps can be changed without changing software
- High level code is compact
PVL Evolution
[Timeline, 1988-2000: ScaLAPACK (Fortran, object-based), MPI (C, object-based), VSIPL (C, object-based), PETE (C++, object-oriented), STAPL (C++, object-oriented), and PVL, with a legend marking scientific (non-real-time) computing efforts; the vertical axis indicates applicability.]
Transition technology from scientific computing to real-time. Moving from procedural (Fortran) to object-oriented (C++).
All PVL objects contain maps.
A PVL Map contains:
- Grid
- List of nodes (e.g., {0, 2, 4, 6, 8, 10})
- Distribution
- Overlap
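A hedged sketch of what such a map might look like in code (the Map and Grid types below are hypothetical stand-ins for illustration, not the actual PVL API):

#include <vector>

// Hypothetical PVL-style map description -- illustrative only.
struct Grid { int rows, cols; };                         // 2D processor layout
enum class Distribution { Block, Cyclic, BlockCyclic };

struct Map {
    Grid             grid;      // how the processors are arranged
    std::vector<int> nodes;     // which nodes participate, e.g. {0, 2, 4, 6, 8, 10}
    Distribution     dist;      // how data is split across them
    int              overlap;   // elements shared between neighboring blocks
};

int main() {
    Map vecMap{ Grid{1, 6}, {0, 2, 4, 6, 8, 10}, Distribution::Block, 0 };
    // A container built with vecMap would be distributed block-wise over six
    // nodes; changing only the Map re-maps the code without editing it.
    (void)vecMap;
}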
Library Components
Classes (Signal Processing & Control):
- Vector/Matrix — data parallelism
- Computation — performs signal/image processing functions on matrices/vectors (e.g. FFT, FIR, QR); data & task parallelism
- Task — supports algorithm decomposition (i.e. the boxes in a signal flow diagram); task & pipeline parallelism
- Conduit — supports data movement between tasks (i.e. the arrows in a signal flow diagram); task & pipeline parallelism
Classes (Mapping):
- Map — specifies how Tasks, Vectors/Matrices, and Computations are distributed on processors
- Grid — organizes processors into a 2D layout; data, task & pipeline parallelism
Simple mappable components support data, task, and pipeline parallelism.
[Diagram: layered architecture — a user interface layer (Vector/Matrix, Task, Map, Grid, Distribution), a hardware interface layer, and hardware (workstation, PowerPC cluster, embedded board, embedded multi-computer).]
Layers enable simple interfaces between the application, the library, and the hardware.
Outline
[Diagram: how an expression is evaluated with expression templates]
1. Main passes B and C references to operator +
2. Operator + creates an expression parse tree
3. Operator + returns the expression parse tree
4. The expression tree reference is passed to operator =
Expression type: BinaryNode<OpAssign, Vector, BinaryNode<OpAdd, Vector, BinaryNode<OpMultiply, Vector, Vector>>>
Parse trees, not vectors, are created. Expression templates enhance performance by allowing temporary variables to be avoided.
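A toy version of the technique in generic C++ (not the PETE/PVL implementation), showing how operator+ builds a lightweight node and operator= evaluates it in a single loop with no temporaries:

#include <vector>
#include <cstddef>
#include <cstdio>

template <typename L, typename R>
struct AddNode {                 // parse-tree node representing lhs + rhs
    const L& lhs;
    const R& rhs;
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

struct Vec {
    std::vector<float> data;
    explicit Vec(std::size_t n, float v = 0.0f) : data(n, v) {}
    float  operator[](std::size_t i) const { return data[i]; }
    float& operator[](std::size_t i)       { return data[i]; }

    // Assignment from any expression node: one element-wise loop, no temporaries.
    template <typename Expr>
    Vec& operator=(const Expr& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

template <typename L, typename R>
AddNode<L, R> operator+(const L& l, const R& r) { return AddNode<L, R>{l, r}; }

int main() {
    Vec a(8), b(8, 1.0f), c(8, 2.0f), d(8, 3.0f);
    a = b + c + d;                        // builds AddNode<AddNode<Vec,Vec>,Vec>, then one loop
    std::printf("a[0] = %g\n", a[0]);     // prints 6
}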
Experimental Platform
Communication: Gigabit Ethernet, 8-port switch, isolated network
Software: Linux kernel release 2.2.14, GNU C++ compiler, MPICH communication library over TCP/IP
[Charts: relative execution time vs. vector length for A=B+C, A=B+C*D, and A=B+C*D/E+fft(F), comparing VSIPL, PVL/VSIPL, and PVL/PETE.]
PVL with VSIPL has a small overhead. PVL with PETE can surpass VSIPL.
Experiment 2: Multi-Processor
(simple communication)
[Charts: relative execution time vs. vector length (2 to 32768 and beyond) for A=B+C, A=B+C*D, and A=B+C*D/E+fft(F), comparing C, C++/VSIPL, and C++/PETE.]
PVL with VSIPL has a small overhead. PVL with PETE can surpass VSIPL.
Experiment 3: Multi-Processor
(complex communication)
[Charts: relative execution time vs. vector length for A=B+C, A=B+C*D, and A=B+C*D/E+fft(F), comparing C, C++/VSIPL, and C++/PETE.]
Outline
Tasks connected by Conduits: Filter (XOUT = FIR(XIN)) → Conduit → Detect (XOUT = |XIN| > c).
Each compute stage can be mapped to different sets of hardware and timed.
S3P Engine
Inputs to the S3P Engine: application program, algorithm information, system constraints, and hardware information.
- Map Generator constructs the system graph for all candidate mappings
- Map Timer times each node and edge of the system graph
- Map Selector searches the system graph for the optimal set of maps
[Diagram: example system graph for Input → Beamform → Matched Filter, with measured times annotated on each node and edge for every candidate mapping.]
Dynamic Programming
The graph construct is very general and widely used for optimization problems. Many efficient techniques exist for choosing the best path under constraints, such as dynamic programming.
N = total hardware units
M = number of tasks
Pj = number of mappings for task j
t = M
pathTable[M][N] = all infinite-weight paths

for (j : 1..M) {
  for (k : 1..Pj) {
    for (i : t..N) {
      if (i - size[k] >= j) {
        if (j > 1) {
          w = weight[pathTable[j-1][i - size[k]]] + weight[k]
              + weight[edge[last[pathTable[j-1][i - size[k]]], k]]
          p = addVertex(pathTable[j-1][i - size[k]], k)
        } else {
          w = weight[k]
          p = makePath(k)
        }
        if (weight[pathTable[j][i]] > w) {
          pathTable[j][i] = p
        }
      }
    }
  }
  t = t - 1
}
[Charts: predicted vs. achieved latency (seconds) and throughput (frames/sec) vs. #CPUs for the mappings found by S3P (e.g., 1-1-1-1, 1-1-2-1, 1-2-2-1, 1-2-2-2, 1-3-2-2, 2-2-2-2).]
S3P selects the correct optimal mapping. There is excellent agreement between S3P predicted and achieved latencies and throughputs.
Outline
[Diagram: parallel library layers (Vector/Matrix, Task) built on a Messaging Kernel, which runs on hardware such as a workstation or PowerPC cluster.]
Any parallel application or library can be built on top of a few basic messaging capabilities; MatlabMPI provides this messaging kernel.
[Diagram: sender variable → save → data file; create → lock file → detect by receiver.]
The sender saves the variable in a data file, then creates a lock file. The receiver detects the lock file, then loads the data file.
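A hedged C++ sketch of the same handshake (illustrative only; MatlabMPI itself does this with MATLAB save/load over a shared file system):

#include <fstream>
#include <filesystem>
#include <iterator>
#include <string>
#include <thread>
#include <chrono>

namespace fs = std::filesystem;

// Sender: write the payload, then create the lock file to signal "ready".
void sendMessage(const std::string& tag, const std::string& payload) {
    std::ofstream(tag + ".data") << payload;   // data file
    std::ofstream(tag + ".lock");              // empty lock file
}

// Receiver: poll for the lock file, then load the data file.
std::string receiveMessage(const std::string& tag) {
    while (!fs::exists(tag + ".lock"))
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::ifstream in(tag + ".data");
    return std::string(std::istreambuf_iterator<char>(in), {});
}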
- Uses standard message passing techniques
- Will run anywhere Matlab runs
- Only requires a common file system
[Chart: bandwidth (bytes/sec) vs. message size for MatlabMPI and native C MPI.]
Bandwidth matches native C MPI at large message sizes. The primary difference is latency (35 milliseconds vs. 30 microseconds).
[Charts: speedup vs. number of processors (1-64) for a fixed problem, and gigaflops vs. number of processors (1-1000) for a scaled problem.]
- Achieved classic super-linear speedup on the fixed problem
- Achieved a speedup of ~300 on 304 processors on the scaled problem
[Chart: lines of code for image filtering programmed several ways (Matlab, VSIPL, VSIPL/OpenMPI, VSIPL/MPI, PVL, MatlabMPI), grouped by language (Matlab, C++), with a region marked Current Research.]
Summary
Exploiting parallel processing for streaming applications presents unique software challenges. The community is developing software libraries to address many of these challenges:
- Exploits C++ to easily express data/task parallelism
- Separates parallel hardware dependencies from software
- Allows a variety of strategies for implementing dynamic applications (e.g. for fault tolerance)
- Delivers high performance execution comparable to or better than standard approaches
Our future efforts will focus on adding to and exploiting the features of this technology to:
- Exploit dynamic parallelism
- Integrate high performance parallel software underneath mainstream programming environments (e.g. Matlab, IDL, ...)
- Use self-optimizing techniques to maintain performance