This work is sponsored by the High Performance Computing Modernization Office under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
[Diagram: embedded multicomputer — data communication among processors P0-P3 and with consoles and other computers; computation via VSIPL.]
Definitions:
- VSIPL = Vector, Signal, and Image Processing Library
- MPI = Message Passing Interface
- MPI/RT = MPI Real-Time
- DRI = Data Reorganization Interface
- CORBA = Common Object Request Broker Architecture
- HP-CORBA = High Performance CORBA
Goal: demonstrate Productivity (3x), Portability, and Performance (1.5x) improvements through Object-Oriented Standards.
- Portability: lines-of-code changed to port/scale to a new system
- Productivity: lines-of-code added to add new functionality
- Performance: computation and communication benchmarks
[Roadmap spanning Applied Research, Development, and Demonstration phases: Object-Oriented Standards, VSIPL++ and Parallel VSIPL++, a unified computation/communication library, and fault-tolerance and self-optimization prototypes, leading to demonstrations of 3x portability, scalability, and new functionality.]
Outline
[Chart: estimated DoD expenditures for embedded signal and image processing hardware and software ($B), from FY98, on a $0-$2.0B scale.]
COTS acquisition practices have shifted the burden from point-design hardware to point-design software (i.e., COTS HW requires COTS SW). Software costs for embedded systems could be reduced by one-third with improved programming models, methodologies, and standards.
[Chart: processing requirements for Medical, Radar, Sonar, Scientific, and Encoding applications versus COTS capability following Moore's Law, showing the gap between today's systems and the goal; annotation notes the need for faster networks.]
Requires high performance computing and networking.
EMBEDDED PROCESSING REQUIREMENTS WILL EXCEED 10 TFLOPS IN THE 2005-2010 TIME FRAME
Signal processing drives computing requirements. Rapid technology insertion is critical for sensor dominance.
Infrastructure Assessment
Highly distributed computing. Fewer very large data movements.
Parallel Pipeline
Signal Processing Algorithm: Filter (XOUT = FIR(XIN)) → Beamform (XOUT = w*XIN) → Detect (XOUT = |XIN| > c), mapped onto a Parallel Computer.
- Data parallel within stages
- Task/pipeline parallel across stages
Filtering
XOUT = FIR(XIN, h); input XIN is Nchannel x Nsamples, output XOUT is Nchannel x Nsamples/Ndecimation.
- Fundamental signal processing operation
- Converts data from wideband to narrowband via filter
- O(Nsamples Nchannel Nh / Ndecimation)
- Degrees of parallelism: Nchannel
Beamforming
XOUT = w*XIN; input XIN is Nchannel x Nsamples, output XOUT is Nbeams x Nsamples.
- Fundamental operation for all multi-channel receiver systems
- Converts data from channels to beams via matrix multiply
- O(Nsamples Nchannel Nbeams)
- Key: the weight matrix can be computed in advance
- Degrees of parallelism: Nsamples
Detection
XOUT = |XIN| > c; input XIN is Nbeams x Nsamples, output is a list of Ndetects detections.
- Fundamental operation for all processing chains
- Converts data from a stream to a list of detections via thresholding
- O(Nsamples Nbeams)
- Number of detections is data dependent
- Degrees of parallelism: Nbeams, Nchannels, or Ndetects
Types of Parallelism
[Diagram: Input feeds a Scheduler and a Pipeline of Beamformer 1 / Beamformer 2 and Detector 1 / Detector 2 stages.]
Outline
FIR Overview
FIR: y is the convolution of x with f
- y = filtered data [#samples]
- x = unfiltered data [#samples]
- f = filter [#coefficients]
Algorithm parameters: #channels, #samples, #coefficients, #decimation
Implementation parameters: direct sum or FFT based
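To make the direct-sum option concrete, here is a minimal C++ sketch of a decimating FIR filter for one channel (names and types are illustrative, not from VSIPL or PVL):

#include <vector>
#include <cstddef>

// Direct-sum FIR: y is the convolution of x with the filter f,
// keeping every decimation-th output sample.
std::vector<float> fir(const std::vector<float>& x,     // unfiltered data [#samples]
                       const std::vector<float>& f,     // filter [#coefficients]
                       std::size_t decimation)
{
    std::vector<float> y;
    for (std::size_t n = 0; n < x.size(); n += decimation) {
        float acc = 0.0f;
        // Convolution sum over the filter taps that overlap sample n.
        for (std::size_t k = 0; k < f.size() && k <= n; ++k)
            acc += f[k] * x[n - k];
        y.push_back(acc);
    }
    return y;   // roughly #samples / #decimation outputs per channel
}

Each channel can be filtered independently, which is where the channel-level parallelism comes from; an FFT-based implementation would trade this inner loop for forward/inverse FFTs.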
A Finite Impulse Response (FIR) filter allows a range of frequencies to be selected, O(N Nh). Example: a band-pass filter passing frequencies between f1 and f2.
[Diagrams: time- and frequency-domain views of the signal before and after filtering (via FFT), and the tapped delay line with coefficients h1, h2, h3, ..., hL.]
Outline
Beamforming Overview
Beamform: y = x w (matrix multiply)
- y = beamformed data [#samples x #beams]
- x = channel data [#samples x #channels]
- w = (tapered) steering vectors [#channels x #beams]
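A minimal sketch of the beamforming matrix multiply in plain C++ (the Matrix type and loop structure here are illustrative; a library implementation would use a VSIPL/PVL matrix product):

#include <vector>
#include <complex>
#include <cstddef>

using cfloat = std::complex<float>;

// Row-major matrix stored as a flat vector.
struct Matrix {
    std::size_t rows, cols;
    std::vector<cfloat> data;
    cfloat& at(std::size_t r, std::size_t c)       { return data[r * cols + c]; }
    cfloat  at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// y[#samples x #beams] = x[#samples x #channels] * w[#channels x #beams]
// The (tapered) steering vectors w can be computed in advance.
Matrix beamform(const Matrix& x, const Matrix& w)
{
    Matrix y{ x.rows, w.cols, std::vector<cfloat>(x.rows * w.cols) };
    for (std::size_t s = 0; s < x.rows; ++s)          // samples: the degrees of parallelism
        for (std::size_t b = 0; b < w.cols; ++b)       // beams
            for (std::size_t c = 0; c < x.cols; ++c)   // channels
                y.at(s, b) += x.at(s, c) * w.at(c, b);
    return y;
}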
[Diagram: plane wavefronts arriving at the array; arrow indicates the direction of propagation.]
Parallel Beamformer
Outline
x[n] = cell under test
T[n] = Sum(x[i]) / 2M, over Nguard < |i - n| < M + Nguard
Angle estimate: take the ratio of beams; do a lookup.
Algorithm parameters: #samples, #beams, steering vectors, #noise samples, #max detects
Implementation parameters: greatest-of, censored greatest-of, ordered statistics; averaging vs. sorting
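A simplified cell-averaging sketch of the detection step (C++; the guard-cell handling follows the definitions above, while the greatest-of/censored/ordered-statistic variants and the PFA-derived threshold factor are left out):

#include <vector>
#include <cstddef>

// Returns indices of range cells whose power exceeds the scaled noise estimate.
// M = noise cells averaged on each side, nguard = guard cells, t = threshold factor.
std::vector<std::size_t> cfarDetect(const std::vector<float>& power,
                                    std::size_t M, std::size_t nguard, float t)
{
    std::vector<std::size_t> detections;
    for (std::size_t n = 0; n < power.size(); ++n) {
        float noise = 0.0f;
        std::size_t count = 0;
        // Average up to 2M cells around n, skipping the guard cells next to it.
        for (std::size_t off = nguard + 1; off <= nguard + M; ++off) {
            if (n >= off)               { noise += power[n - off]; ++count; }
            if (n + off < power.size()) { noise += power[n + off]; ++count; }
        }
        if (count == 0) continue;
        noise /= static_cast<float>(count);
        if (power[n] > t * noise)
            detections.push_back(n);    // number of detections is data dependent
    }
    return detections;
}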
[Diagram: a noise-estimate buffer slides along range; the cell under test x[i] and its guard cells are excluded, the surrounding M cells are averaged (1/M), and |x[i]|^2 is compared against the resulting threshold to excise or retain the cell.]
Reference: S. L. Wilson, "Analysis of NRL's two-pass greatest-of excision CFAR," Internal Memorandum, MIT Lincoln Laboratory, October 5, 1998.
[Diagram: second pass — |x[i]|^2 is compared against a threshold formed from the retained noise samples z[n], n = 1, ..., M, and the cell is declared Target or Noise; the second threshold is T2 = f(M, T1, PFA).]
Outline
Example pipeline: Filter (0.5 seconds) → Beamform, XOUT = w*XIN (0.3 seconds) → Detect, XOUT = |XIN| > c (0.8 seconds), mapped onto a parallel computer.
Latency: total processing + communication time for one frame of data (the sum of the stage times). Throughput: rate at which frames can be input (set by the maximum stage time).
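A trivial C++ sketch of the two metrics, assuming the stage times in the figure are 0.5 s, 0.3 s, and 0.8 s:

#include <vector>
#include <numeric>
#include <algorithm>
#include <cstdio>

int main() {
    std::vector<double> stageSeconds = {0.5, 0.3, 0.8};   // one entry per pipeline stage
    double latency = std::accumulate(stageSeconds.begin(), stageSeconds.end(), 0.0);
    double slowest = *std::max_element(stageSeconds.begin(), stageSeconds.end());
    std::printf("latency    = %.2f seconds per frame\n", latency);       // 1.60
    std::printf("throughput = %.2f frames per second\n", 1.0 / slowest); // 1.25
}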
[Charts: component latency vs. filter hardware (Filter: latency = 2/N; Beamform: latency = 1/N) and system latency vs. hardware allocation, under the constraints hardware < 32 and latency < 8; the local optimum for each component differs from the global optimum for the system.]
System Graph
Filter → Beamform → Detect
The System Graph can store the hardware resource usage of every possible Task & Conduit.
Examples: Beamform (XIN → multiply by W3 → XOUT) and Matched Filter (XIN → FFT → multiply by W4 → IFFT → XOUT), each mappable to hardware such as a PowerPC cluster or an embedded board.
Need to automate the process of mapping the algorithm to hardware.
Outline
Input channels 1, 2, ..., N:
- Data enters the system via different channels
- Filtering is performed in a channel-parallel fashion
- Beamforming requires combining data from multiple channels
[Diagram: corner turn — data distributed across processors by channels is redistributed by samples.]
- Each processor sends data to each other processor
- Half the data moves across the bisection of the machine
[Diagram: between pulses, the data cube is redistributed from P1 processors to P2 processors.]
All-to-all communication where each of P1 processors sends a message of size B to each of P2 processors. Total data cube size is P1*P2*B.
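A minimal sketch of this kind of redistribution with plain MPI in C++ (it assumes P1 = P2 = the communicator size and an illustrative block size B; a real corner turn would also repack the data cube before and after the exchange):

#include <mpi.h>
#include <vector>
#include <cstddef>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int np = 0, rank = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int B = 1024;   // bytes sent to each destination processor
    std::vector<char> sendBuf(static_cast<std::size_t>(np) * B, static_cast<char>(rank));
    std::vector<char> recvBuf(static_cast<std::size_t>(np) * B);

    // Every processor sends a block of size B to every other processor.
    MPI_Alltoall(sendBuf.data(), B, MPI_BYTE,
                 recvBuf.data(), B, MPI_BYTE, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}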
Outline
Detection → Estimation
[Diagram: work per processor — measured in pixels (static) the load is balanced at 0.24 each; measured in detections (dynamic) the load is unbalanced (0.11, 0.15, 0.13, 0.08, 0.10, 0.97, 0.30).]
Static parallelism implementations lead to unbalanced loads.
[Chart: speedup vs. number of processors for static parallelism, parameterized by M = # units of work and f = allowed failure rate, with 50% and 15% efficiency reference points.]
- Random fluctuations bound performance
- Much worse if targets are correlated
- Sets the maximum number of targets in nearly every system
Static Derivation
Static speedup = Nd / Nf, where:
- Nd = total detections
- Nf = allowed detections (per processor) with failure rate f
- Np = number of processors
Nf is chosen so that a processor's detection count N, assumed Poisson with mean Nd/Np, exceeds Nf with probability at most f:
P(N <= Nf) = Sum over n = 0..Nf of (Nd/Np)^n e^(-Nd/Np) / n! = 1 - f
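A small C++ sketch of this calculation (the Nd, Np, and f values are illustrative): it finds the smallest Nf whose Poisson CDF reaches 1 - f, then reports the static speedup Nd/Nf.

#include <cmath>
#include <cstdio>

// Smallest Nf such that P(N <= Nf) >= 1 - f for N ~ Poisson(lambda).
int allowedDetections(double lambda, double f) {
    double term = std::exp(-lambda);   // P(N = 0)
    double cdf  = term;
    int n = 0;
    while (cdf < 1.0 - f) {
        ++n;
        term *= lambda / n;            // P(N = n) from P(N = n - 1)
        cdf  += term;
    }
    return n;
}

int main() {
    double Nd = 1000.0;   // total detections
    int    Np = 64;       // number of processors
    double f  = 0.01;     // allowed failure rate
    int Nf = allowedDetections(Nd / Np, f);
    std::printf("Nf = %d, static speedup = %.1f (ideal %d)\n", Nf, Nd / Nf, Np);
}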
Dynamic Parallelism
[Chart: speedup vs. number of processors for dynamic parallelism, parameterized by M = # units of work and f = allowed failure rate, with 50% and 94% efficiency reference points.]
- Assign work to processors as needed
- Large improvement even in the worst case
Dynamic Derivation
Worst-case dynamic speedup = Nd / (Nd/Np + g*Nd) = Np / (1 + g*Np), where:
- Nd = total detections
- Np = number of processors
- g = granularity of work
[Chart: worst-case speedup vs. number of processors (1 to 1024), with a 50% efficiency reference line.]
Dynamic parallelism delivers good performance even in the worst case. Static parallelism is limited by random fluctuations (up to 85% of processors are idle).
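A minimal sketch of dynamic work assignment (C++ threads pulling detections from a shared counter; this illustrates the idea rather than the tutorial's implementation):

#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    const int numDetections = 1000;
    const int numWorkers = 8;
    std::atomic<int> next{0};                     // index of the next unit of work
    std::vector<int> processedBy(numDetections, -1);

    auto worker = [&](int id) {
        // Each worker grabs the next detection as soon as it is free, so an
        // expensive detection does not leave the other workers idle.
        for (int i = next.fetch_add(1); i < numDetections; i = next.fetch_add(1))
            processedBy[i] = id;                  // stand-in for the real estimation work
    };

    std::vector<std::thread> pool;
    for (int id = 0; id < numWorkers; ++id) pool.emplace_back(worker, id);
    for (auto& t : pool) t.join();
    std::printf("all %d detections assigned dynamically\n", numDetections);
}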
Outline
- Industry standards (e.g., VSIPL, MPI) represent a significant improvement over coding with vendor-specific libraries
- The next generation of object-oriented standards will provide enough support to write truly portable, scalable applications
Scalable/portable code provides high productivity.
Code
Mapping stages to processor ranks (two processors per stage):

while (!done) {
  if ( rank()==1 || rank()==2 )
    stage1();
  else if ( rank()==3 || rank()==4 )
    stage2();
}

Adding two more processors to stage 2 forces a code change:

while (!done) {
  if ( rank()==1 || rank()==2 )
    stage1();
  else if ( rank()==3 || rank()==4 || rank()==5 || rank()==6 )
    stage2();
}
Algorithm and hardware mapping are linked. The resulting code is non-scalable and non-portable.
Scalable Approach
The same code runs under a single-processor or a multi-processor mapping; only the Map objects change:

#include <Vector.h>
#include <AddPvl.h>

void addVectors(const Map& aMap, const Map& bMap, const Map& cMap) {
  Vector< Complex<Float> > a("a", aMap, LENGTH);
  Vector< Complex<Float> > b("b", bMap, LENGTH);
  Vector< Complex<Float> > c("c", cMap, LENGTH);

  b = 1;
  c = 2;
  a = b + c;
}
[Diagrams: the same A = B + C expression mapped to one processor and to multiple processors.]
- Single processor and multi-processor code are the same
- Maps can be changed without changing software
- High level code is compact
PVL Evolution
[Timeline, 1988-2000: ScaLAPACK (Fortran, object-based), MPI (C, object-based), VSIPL (C, object-based), PETE (C++, object-oriented), STAPL (C++, object-oriented), and PVL, with a legend marking scientific (non-real-time) computing efforts; the vertical axis indicates applicability.]
Transition technology from scientific computing to real-time. Moving from procedural (Fortran) to object-oriented (C++).
All PVL objects contain maps.
A PVL Map contains:
- Grid
- List of nodes (e.g., {0, 2, 4, 6, 8, 10})
- Distribution
- Overlap
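A hedged sketch of what such a map might look like in code (the Map and Grid types below are hypothetical stand-ins for illustration, not the actual PVL API):

#include <vector>

// Hypothetical PVL-style map description -- illustrative only.
struct Grid { int rows, cols; };                         // 2D processor layout
enum class Distribution { Block, Cyclic, BlockCyclic };

struct Map {
    Grid             grid;      // how the processors are arranged
    std::vector<int> nodes;     // which nodes participate, e.g. {0, 2, 4, 6, 8, 10}
    Distribution     dist;      // how data is split across them
    int              overlap;   // elements shared between neighboring blocks
};

int main() {
    Map vecMap{ Grid{1, 6}, {0, 2, 4, 6, 8, 10}, Distribution::Block, 0 };
    // A container built with vecMap would be distributed block-wise over six
    // nodes; changing only the Map re-maps the code without editing it.
    (void)vecMap;
}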
Library Components
Classes (Signal Processing & Control):
- Vector/Matrix — data parallelism
- Computation — performs signal/image processing functions on matrices/vectors (e.g. FFT, FIR, QR); data & task parallelism
- Task — supports algorithm decomposition (i.e. the boxes in a signal flow diagram); task & pipeline parallelism
- Conduit — supports data movement between tasks (i.e. the arrows in a signal flow diagram); task & pipeline parallelism
Classes (Mapping):
- Map — specifies how Tasks, Vectors/Matrices, and Computations are distributed on processors
- Grid — organizes processors into a 2D layout; data, task & pipeline parallelism
Simple mappable components support data, task, and pipeline parallelism.
[Diagram: layered architecture — a user interface layer (Vector/Matrix, Task, Map, Grid, Distribution), a hardware interface layer, and hardware (workstation, PowerPC cluster, embedded board, embedded multi-computer).]
Layers enable simple interfaces between the application, the library, and the hardware.
Outline
[Diagram: how an expression is evaluated with expression templates]
1. Main passes B and C references to operator +
2. Operator + creates an expression parse tree
3. Operator + returns the expression parse tree
4. The expression tree reference is passed to operator =
Expression type: BinaryNode<OpAssign, Vector, BinaryNode<OpAdd, Vector, BinaryNode<OpMultiply, Vector, Vector>>>
Parse trees, not vectors, are created. Expression templates enhance performance by allowing temporary variables to be avoided.
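A toy version of the technique in generic C++ (not the PETE/PVL implementation), showing how operator+ builds a lightweight node and operator= evaluates it in a single loop with no temporaries:

#include <vector>
#include <cstddef>
#include <cstdio>

template <typename L, typename R>
struct AddNode {                 // parse-tree node representing lhs + rhs
    const L& lhs;
    const R& rhs;
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

struct Vec {
    std::vector<float> data;
    explicit Vec(std::size_t n, float v = 0.0f) : data(n, v) {}
    float  operator[](std::size_t i) const { return data[i]; }
    float& operator[](std::size_t i)       { return data[i]; }

    // Assignment from any expression node: one element-wise loop, no temporaries.
    template <typename Expr>
    Vec& operator=(const Expr& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

template <typename L, typename R>
AddNode<L, R> operator+(const L& l, const R& r) { return AddNode<L, R>{l, r}; }

int main() {
    Vec a(8), b(8, 1.0f), c(8, 2.0f), d(8, 3.0f);
    a = b + c + d;                        // builds AddNode<AddNode<Vec,Vec>,Vec>, then one loop
    std::printf("a[0] = %g\n", a[0]);     // prints 6
}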
Experimental Platform
Communication: Gigabit Ethernet, 8-port switch, isolated network
Software: Linux kernel release 2.2.14, GNU C++ compiler, MPICH communication library over TCP/IP
[Charts: relative execution time vs. vector length for A=B+C, A=B+C*D, and A=B+C*D/E+fft(F), comparing VSIPL, PVL/VSIPL, and PVL/PETE.]
PVL with VSIPL has a small overhead. PVL with PETE can surpass VSIPL.
Experiment 2: Multi-Processor
(simple communication)
[Charts: relative execution time vs. vector length (2 to 32768 and beyond) for A=B+C, A=B+C*D, and A=B+C*D/E+fft(F), comparing C, C++/VSIPL, and C++/PETE.]
PVL with VSIPL has a small overhead. PVL with PETE can surpass VSIPL.
Experiment 3: Multi-Processor
(complex communication)
[Charts: relative execution time vs. vector length for A=B+C, A=B+C*D, and A=B+C*D/E+fft(F), comparing C, C++/VSIPL, and C++/PETE.]
Outline
Tasks connected by Conduits: Filter (XOUT = FIR(XIN)) → Conduit → Detect (XOUT = |XIN| > c).
Each compute stage can be mapped to different sets of hardware and timed.
S3P Engine
Inputs to the S3P Engine: application program, algorithm information, system constraints, and hardware information.
- Map Generator constructs the system graph for all candidate mappings
- Map Timer times each node and edge of the system graph
- Map Selector searches the system graph for the optimal set of maps
[Diagram: example system graph for Input → Beamform → Matched Filter, with measured times annotated on each node and edge for every candidate mapping.]
Dynamic Programming
The graph construct is very general and widely used for optimization problems. Many efficient techniques exist for choosing the best path under constraints, such as dynamic programming.
N = total hardware units
M = number of tasks
Pj = number of mappings for task j
t = M
pathTable[M][N] = all infinite-weight paths

for (j : 1..M) {
  for (k : 1..Pj) {
    for (i : t..N) {
      if (i - size[k] >= j) {
        if (j > 1) {
          w = weight[pathTable[j-1][i - size[k]]] + weight[k]
              + weight[edge[last[pathTable[j-1][i - size[k]]], k]]
          p = addVertex(pathTable[j-1][i - size[k]], k)
        } else {
          w = weight[k]
          p = makePath(k)
        }
        if (weight[pathTable[j][i]] > w) {
          pathTable[j][i] = p
        }
      }
    }
  }
  t = t - 1
}
[Charts: predicted vs. achieved latency (seconds) and throughput (frames/sec) vs. #CPUs for the mappings found by S3P (e.g., 1-1-1-1, 1-1-2-1, 1-2-2-1, 1-2-2-2, 1-3-2-2, 2-2-2-2).]
S3P selects the correct optimal mapping. There is excellent agreement between S3P predicted and achieved latencies and throughputs.
Outline
[Diagram: parallel library layers (Vector/Matrix, Task) built on a Messaging Kernel, which runs on hardware such as a workstation or PowerPC cluster.]
Any parallel application or library can be built on top of a few basic messaging capabilities; MatlabMPI provides this messaging kernel.
[Diagram: sender variable → save → data file; create → lock file → detect by receiver.]
The sender saves the variable in a data file, then creates a lock file. The receiver detects the lock file, then loads the data file.
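A hedged C++ sketch of the same handshake (illustrative only; MatlabMPI itself does this with MATLAB save/load over a shared file system):

#include <fstream>
#include <filesystem>
#include <iterator>
#include <string>
#include <thread>
#include <chrono>

namespace fs = std::filesystem;

// Sender: write the payload, then create the lock file to signal "ready".
void sendMessage(const std::string& tag, const std::string& payload) {
    std::ofstream(tag + ".data") << payload;   // data file
    std::ofstream(tag + ".lock");              // empty lock file
}

// Receiver: poll for the lock file, then load the data file.
std::string receiveMessage(const std::string& tag) {
    while (!fs::exists(tag + ".lock"))
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::ifstream in(tag + ".data");
    return std::string(std::istreambuf_iterator<char>(in), {});
}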
- Uses standard message passing techniques
- Will run anywhere Matlab runs
- Only requires a common file system
[Chart: bandwidth (bytes/sec) vs. message size for MatlabMPI and native C MPI.]
Bandwidth matches native C MPI at large message sizes. The primary difference is latency (35 milliseconds vs. 30 microseconds).
[Charts: speedup vs. number of processors (1-64) for a fixed problem, and gigaflops vs. number of processors (1-1000) for a scaled problem.]
- Achieved classic super-linear speedup on the fixed problem
- Achieved a speedup of ~300 on 304 processors on the scaled problem
[Chart: lines of code for image filtering programmed several ways (Matlab, VSIPL, VSIPL/OpenMPI, VSIPL/MPI, PVL, MatlabMPI), grouped by language (Matlab, C++), with a region marked Current Research.]
Summary
Exploiting parallel processing for streaming applications presents unique software challenges. The community is developing software libraries to address many of these challenges:
- Exploits C++ to easily express data/task parallelism
- Separates parallel hardware dependencies from software
- Allows a variety of strategies for implementing dynamic applications (e.g. for fault tolerance)
- Delivers high performance execution comparable to or better than standard approaches
Our future efforts will focus on adding to and exploiting the features of this technology to:
- Exploit dynamic parallelism
- Integrate high performance parallel software underneath mainstream programming environments (e.g. Matlab, IDL, ...)
- Use self-optimizing techniques to maintain performance