
DoD Sensor Processing: Applications and Supporting Software Technology

Dr. Jeremy Kepner MIT Lincoln Laboratory

This work is sponsored by the High Performance Computing Modernization Office under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

MIT Lincoln Laboratory


Slide-1 SC2002 Tutorial

Preamble: Existing Standards


Figure: a parallel embedded processor (system controller, node controllers, processors P0-P3, consoles, other computers) and the standards applied at each interface:

Data Communication: MPI, MPI/RT, DRI
Control Communication: CORBA, HP-CORBA, SCA
Computation: VSIPL

A variety of software standards support existing DoD signal processing systems

Definitions:
VSIPL = Vector, Signal, and Image Processing Library
MPI = Message Passing Interface
MPI/RT = MPI Real-Time
DRI = Data Reorganization Interface
CORBA = Common Object Request Broker Architecture
HP-CORBA = High Performance CORBA

Slide-2 SC2002 Tutorial

MIT Lincoln Laboratory

Preamble: Next Generation Standards


Software Initiative Goal: transition research into commercial standards

Figure: the HPEC Software Initiative cycle (Demonstrate, Develop, Applied Research) targets Portability (3x), Productivity (3x), and Performance (1.5x) through open standards, object-oriented design, and interoperable and scalable software.

Slide-3 SC2002 Tutorial

Portability: lines-of-code changed to port/scale to a new system
Productivity: lines-of-code added to add new functionality
Performance: computation and communication benchmarks

MIT Lincoln Laboratory

HPEC-SI: VSIPL++ and Parallel VSIPL


Figure: HPEC-SI roadmap (Functionality vs. Time), with each phase moving technology from applied research to development to demonstration:

Phase 1: Demonstration: Existing Standards (VSIPL, MPI); Development: Object-Oriented Standards (VSIPL++); Applied Research: Unified Comp/Comm Lib. Goals: demonstrate insertions into fielded systems (e.g., CIP), demonstrate 3x portability.

Phase 2: Demonstration: Object-Oriented Standards (VSIPL++); Development: Unified Comp/Comm Lib (Parallel VSIPL++); Applied Research: fault tolerance prototype. Goals: high-level code abstraction, reduce code size 3x.

Phase 3: Demonstration: Unified Comp/Comm Lib (Parallel VSIPL++); Development: fault tolerance; Applied Research: self-optimization. Goals: unified embedded computation/communication standard, demonstrate scalability.

Slide-4 SC2002 Tutorial

MIT Lincoln Laboratory

Preamble: The Links


High Performance Embedded Computing Workshop  http://www.ll.mit.edu/HPEC
High Performance Embedded Computing Software Initiative  http://www.hpec-si.org/
Vector, Signal, and Image Processing Library  http://www.vsipl.org/
MPI Software Technologies, Inc.  http://www.mpi-softtech.com/
Data Reorganization Initiative  http://www.data-re.org/
CodeSourcery, LLC  http://www.codesourcery.com/
MatlabMPI  http://www.ll.mit.edu/MatlabMPI

MIT Lincoln Laboratory
Slide-5 SC2002 Tutorial

Outline

Introduction Processing Algorithms Parallel System Analysis Software Frameworks Summary

DoD Needs Parallel Stream Computing Basic Pipeline Processing

Slide-6 SC2002 Tutorial

MIT Lincoln Laboratory

Why Is DoD Concerned with Embedded Software?


Figure: estimated DoD expenditures for embedded signal and image processing hardware and software ($B), by fiscal year from FY98 (Source: HPEC Market Study, March 2001).

COTS acquisition practices have shifted the burden from point-design hardware to point-design software (i.e., COTS HW requires COTS SW)
Software costs for embedded systems could be reduced by one-third with improved programming models, methodologies, and standards
MIT Lincoln Laboratory

Slide-7 SC2002 Tutorial

Embedded Stream Processing


Figure: peak bisection bandwidth (GB/s) vs. peak processor power (Gflop/s) for embedded stream processing applications (video, wireless, medical, radar, sonar, scientific computing, encoding). The desired region of performance lies well beyond today's COTS systems; reaching the goal requires both Moore's Law processor growth and faster networks.

Requires high performance computing and networking
Slide-8 SC2002 Tutorial

MIT Lincoln Laboratory

Military Embedded Processing


REQUIREMENTS INCREASING BY AN ORDER OF MAGNITUDE EVERY 5 YEARS

EMBEDDED PROCESSING REQUIREMENTS WILL EXCEED 10 TFLOPS IN THE 2005-2010 TIME FRAME

Slide-9 SC2002 Tutorial

Signal processing drives computing requirements
Rapid technology insertion is critical for sensor dominance

MIT Lincoln Laboratory

Military Query Processing


Figure: military query processing chain. Sensors (wide area imaging, SAR/GMTI, hyperspectral imaging) feed high speed networks (BoSSNET) and parallel computing running distributed multi-sensor software algorithms, supporting missions such as targeting, force location, and infrastructure assessment.

Slide-10 SC2002 Tutorial

Highly distributed computing
Fewer, very large data movements

MIT Lincoln Laboratory

Parallel Pipeline
Signal Processing Algorithm
Filter
XOUT = FIR(XIN )

Beamform
XOUT = w *XIN

Detect
XOUT = |XIN|>c

Mapping

Parallel Computer
Data parallel within stages
Task/pipeline parallel across stages
MIT Lincoln Laboratory

Slide-11 SC2002 Tutorial

Filtering

Xin
Nchannel

Xout
Nsamples

XOUT = FIR(XIN,h)
Nchannel Nsamples/Ndecimation

Fundamental signal processing operation
Converts data from wideband to narrowband via filtering
O(Nsamples · Nchannel · Nh / Ndecimation)
Degrees of parallelism: Nchannel
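A minimal C++ sketch of this stage (illustrative only, not VSIPL code; the function name, container types, and loop structure are assumptions) shows where the O(Nsamples · Nchannel · Nh / Ndecimation) work comes from and why channels are the natural unit of parallelism:

#include <cstddef>
#include <vector>

// Decimating FIR: each output sample is a dot product of the filter taps
// with a window of input samples. Channels are independent, so whole
// channels can be assigned to different processors.
std::vector<std::vector<float>> firFilter(
    const std::vector<std::vector<float>>& xin,  // [Nchannel][Nsamples]
    const std::vector<float>& h,                 // [Nh] filter coefficients
    std::size_t ndec)                            // decimation factor (>= 1)
{
    std::vector<std::vector<float>> xout(xin.size());
    for (std::size_t c = 0; c < xin.size(); ++c) {            // parallel over channels
        const std::vector<float>& x = xin[c];
        for (std::size_t n = 0; n + h.size() <= x.size(); n += ndec) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < h.size(); ++k)         // O(Nh) per output sample
                acc += h[k] * x[n + k];
            xout[c].push_back(acc);
        }
    }
    return xout;                                 // [Nchannel][~Nsamples/Ndecimation]
}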
MIT Lincoln Laboratory

Slide-12 SC2002 Tutorial

Beamforming

Xin
Nchannel Nsamples

Xout

XOUT = w *XIN
Nbeams Nsamples

Fundamental operation for all multi-channel receiver systems
Converts data from channels to beams via matrix multiply
O(Nsamples · Nchannel · Nbeams)
Key: weight matrix can be computed in advance
Degrees of parallelism: Nsamples
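A minimal C++ sketch of the beamforming step (illustrative only, not VSIPL or PVL code; names and types are assumptions). Because the weight matrix is precomputed, every sample is an independent channels-to-beams combination, which is why Nsamples supplies the degrees of parallelism:

#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// XOUT = w^H * XIN: combine Nchannel inputs into Nbeams outputs per sample.
std::vector<std::vector<cfloat>> beamform(
    const std::vector<std::vector<cfloat>>& xin,  // [Nchannel][Nsamples]
    const std::vector<std::vector<cfloat>>& w)    // [Nchannel][Nbeams] steering weights
{
    const std::size_t nchan = xin.size();
    const std::size_t nsamp = xin[0].size();
    const std::size_t nbeam = w[0].size();
    std::vector<std::vector<cfloat>> xout(nbeam, std::vector<cfloat>(nsamp));
    for (std::size_t s = 0; s < nsamp; ++s)          // parallel over samples
        for (std::size_t b = 0; b < nbeam; ++b)
            for (std::size_t c = 0; c < nchan; ++c)  // O(Nchannel * Nbeams) per sample
                xout[b][s] += std::conj(w[c][b]) * xin[c][s];
    return xout;                                     // [Nbeams][Nsamples]
}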
Slide-13 SC2002 Tutorial

MIT Lincoln Laboratory

Detection
Xin
Nbeams Ndetects Nsamples

Xout

XOUT = |XIN|>c

Fundamental operation for all processing chains
Converts data from a stream to a list of detections via thresholding
O(Nsamples · Nbeams)
Number of detections is data dependent
Degrees of parallelism: Nbeams, Nchannels, or Ndetects
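A minimal C++ sketch of the thresholding step (illustrative; the Detection struct and names are assumptions). Note that the length of the output list is data dependent, which is what later motivates dynamic load balancing:

#include <complex>
#include <cstddef>
#include <vector>

struct Detection { std::size_t beam, sample; float power; };

// XOUT = |XIN| > c: convert a stream of beamformed samples into a list of
// detections. Beams (or detections) are the natural units of parallelism.
std::vector<Detection> detect(
    const std::vector<std::vector<std::complex<float>>>& xin,  // [Nbeams][Nsamples]
    float threshold)
{
    std::vector<Detection> out;
    for (std::size_t b = 0; b < xin.size(); ++b)        // parallel over beams
        for (std::size_t s = 0; s < xin[b].size(); ++s)
            if (std::abs(xin[b][s]) > threshold)        // O(Nsamples * Nbeams) tests
                out.push_back({b, s, std::norm(xin[b][s])});
    return out;                                         // length is data dependent
}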
Slide-14 SC2002 Tutorial

MIT Lincoln Laboratory

Types of Parallelism

Figure: types of parallelism. An input stream feeds a scheduler that distributes work across FIR filters (task parallel); the pipeline then fans out round-robin to Beamformer 1 and Beamformer 2 and on to Detector 1 and Detector 2 (data parallel).

Slide-15 SC2002 Tutorial

MIT Lincoln Laboratory

Outline

Introduction Processing Algorithms Parallel System Analysis Software Frameworks Summary


Filtering Beamforming Detection

Slide-16 SC2002 Tutorial

MIT Lincoln Laboratory

FIR Overview
FIR

Uses: pulse compression, equalization, ...
Formulation: y = h o x

y = filtered data [#samples]
x = unfiltered data [#samples]
h = filter [#coefficients]
o = convolution operator

Algorithm Parameters: #channels, #samples, #coefficients, #decimation
Implementation Parameters: direct summation or FFT based
MIT Lincoln Laboratory

Slide-17 SC2002 Tutorial

Basic Filtering via FFT

Fourier Transform (FFT) allows specific frequencies to be selected O(N log N)

Figure: the FFT converts a time-domain signal into its frequency-domain representation (and back), where specific frequencies can be selected.

Slide-18 SC2002 Tutorial

MIT Lincoln Laboratory

Basic Filtering via FIR

Finite Impulse Response (FIR) allows a range of frequencies to be selected O(N Nh)
(Example: Band-Pass Filter)

Figure: band-pass filter example. The input x contains power at any frequency; FIR(x,h) retains power only between f1 and f2. The FIR filter is implemented as a tapped delay line with coefficients h1, h2, h3, ..., hL.

Slide-19 SC2002 Tutorial

MIT Lincoln Laboratory

Multi-Channel Parallel FIR filter

Channel 1 Channel 2 Channel 3 Channel 4

FIR FIR FIR FIR

Parallel Mapping Constraints:


#channels MOD #processors = 0
1st: parallelize across channels
2nd: parallelize within a channel, based on #samples and #coefficients
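A small illustrative helper (hypothetical, not part of any library discussed here) for the first constraint: when #channels is divisible by #processors, each processor owns a contiguous block of channels and runs the FIR only on those:

#include <cstddef>
#include <vector>

// Block mapping of channels to processors, assuming nChannels % nProcs == 0.
std::vector<std::size_t> myChannels(std::size_t rank,
                                    std::size_t nProcs,
                                    std::size_t nChannels)
{
    const std::size_t perProc = nChannels / nProcs;    // exact by assumption
    std::vector<std::size_t> mine;
    for (std::size_t c = rank * perProc; c < (rank + 1) * perProc; ++c)
        mine.push_back(c);                             // channels owned by this rank
    return mine;
}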

Slide-20 SC2002 Tutorial

MIT Lincoln Laboratory

Outline

Introduction Processing Algorithms Parallel System Analysis Software Frameworks Summary


Filtering Beamforming Detection

Slide-21 SC2002 Tutorial

MIT Lincoln Laboratory

Beamforming Overview
Beamform

Uses: angle estimation
Formulation: y = w^H x

y = beamformed data [#samples x #beams]
x = channel data [#samples x #channels]
w = (tapered) steering vectors [#channels x #beams]

Algorithm Parameters: #channels, #samples, #beams, (tapered) steering vectors

Slide-22 SC2002 Tutorial

MIT Lincoln Laboratory

Basic Beamforming Physics


Source
Received phasefront creates
complex exponential across array with frequency directly related to direction of propagation Estimating frequency of impinging phasefront indicates direction of propagation Direction of propagation is also known as angle-of-arrival (AOA) or direction-of arrival (DOA)

Figure: wavefronts arriving from the direction of propagation produce a received phasefront e^{jφ1}, e^{jφ2}, ..., e^{jφ7} across the array elements.

Slide-23 SC2002 Tutorial

MIT Lincoln Laboratory

Parallel Beamformer

Segment 1 Segment 2 Segment 3 Segment 4

Beamform Beamform Beamform Beamform

Parallel Mapping Constraints:


#segments MOD #processors = 0
1st: parallelize across segments
2nd: parallelize across beams

Slide-24 SC2002 Tutorial

MIT Lincoln Laboratory

Outline

Introduction Processing Algorithms Parallel System Analysis Software Frameworks Summary


Filtering Beamforming Detection

Slide-25 SC2002 Tutorial

MIT Lincoln Laboratory

CFAR Detection Overview


CFAR

Constant False Alarm Rate (CFAR)
Formulation: x[n] > T[n]

x[n] = cell under test
T[n] = Sum(x_i)/(2M), for Nguard < |i - n| ≤ M + Nguard
Angle estimate: take ratio of beams; do lookup

Algorithm Parameters: #samples, #beams, steering vectors, #noise samples, #max detects
Implementation Parameters: Greatest Of, Censored Greatest Of, Ordered Statistics, Averaging vs. Sorting
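A minimal C++ sketch of the basic cell-averaging formulation above (illustrative only; it is not the two-pass greatest-of excision variant detailed on the next slides, and the names and threshold scale are assumptions):

#include <cstddef>
#include <vector>

// Cell-averaging CFAR along one beam: the noise estimate for cell n is the
// mean of 2*M training cells, skipping nGuard guard cells on each side of
// the cell under test; a detection is declared when x[n] exceeds the scaled
// noise estimate.
std::vector<std::size_t> cfar(const std::vector<float>& x,  // |x[n]|^2 per range cell
                              std::size_t M, std::size_t nGuard, float scale)
{
    std::vector<std::size_t> hits;
    const std::size_t w = M + nGuard;                  // half-width of the window
    for (std::size_t n = w; n + w < x.size(); ++n) {
        float noise = 0.0f;
        for (std::size_t k = nGuard + 1; k <= w; ++k)  // leading + trailing training cells
            noise += x[n - k] + x[n + k];
        noise /= static_cast<float>(2 * M);
        if (x[n] > scale * noise)
            hits.push_back(n);
    }
    return hits;
}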
MIT Lincoln Laboratory

Slide-26 SC2002 Tutorial

Two-Pass Greatest-Of Excision CFAR


(First Pass)
Figure: first pass of the two-pass greatest-of excision CFAR. A window slides along range through the input data x[i]: M trailing training cells zT[n] (n = 1, ..., M) and M leading training cells zL[n] (n = 1, ..., M), each averaged with weight 1/M, surround the range cell under test x[i] and are separated from it by guard cells (G); the result is written to a noise estimate buffer b[i].

First-pass excision test (excise or retain the cell under test):

|x[i]|^2 ≷ T1 · (2/M) · max( Σ_{n=1..M} zT[n], Σ_{n=1..M} zL[n] )
Reference: S. L. Wilson, "Analysis of NRL's two-pass greatest-of excision CFAR," Internal Memorandum, MIT Lincoln Laboratory, October 5, 1998.
Slide-27 SC2002 Tutorial

MIT Lincoln Laboratory

Two-Pass Greatest-Of Excision CFAR


(Second Pass)
Figure: second pass of the two-pass greatest-of excision CFAR. The same window of M trailing training cells zT[n] and M leading training cells zL[n] around the cell under test x[i] (separated by guard cells) is applied using the noise estimate buffer b[i] from the first pass.

Second-pass detection test (target vs. noise):

|x[i]|^2 ≷ T2 · (2/M) · max( Σ_{n=1..M} zT[n], Σ_{n=1..M} zL[n] )

T2 = f(M, T1, PFA)
Slide-28 SC2002 Tutorial

MIT Lincoln Laboratory

Parallel CFAR Detection

Segment 1 Segment 2 Segment 3 Segment 4

CFAR CFAR CFAR CFAR

Parallel Mapping Constraints:


#segments MOD #processors = 0
1st: parallelize across segments
2nd: parallelize across beams

Slide-29 SC2002 Tutorial

MIT Lincoln Laboratory

Outline

Introduction Processing Algorithms Parallel System Analysis Software Frameworks Summary


Latency vs. Throughput Corner Turn Dynamic Load Balancing

Slide-30 SC2002 Tutorial

MIT Lincoln Laboratory

Latency and throughput


Signal Processing Algorithm
Filter
XOUT = FIR(XIN)

0.5 seconds

0.5 seconds

Beamform
XOUT = w *XIN

1.0 seconds

Latency = 0.5 + 0.5 + 1.0 + 0.3 + 0.8 = 3.1 seconds
Throughput = 1/max(0.5, 0.5, 1.0, 0.3, 0.8) = 1 frame/second

0.3 seconds

Detect
XOUT = |XIN|>c

0.8 seconds

Parallel Computer
Latency: total processing + communication time for one frame of data (sum of times)
Throughput: rate at which frames can be input (max of times)
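Both quantities are simple to compute from the per-stage times; the sketch below (plain C++, illustrative) reproduces the 3.1 second latency and 1 frame/second throughput of the example above:

#include <algorithm>
#include <numeric>
#include <vector>

// Latency: sum of per-stage times for one frame.
double latency(const std::vector<double>& stageTimes) {
    return std::accumulate(stageTimes.begin(), stageTimes.end(), 0.0);
}

// Throughput: limited by the slowest stage in the pipeline.
double throughput(const std::vector<double>& stageTimes) {
    return 1.0 / *std::max_element(stageTimes.begin(), stageTimes.end());
}

// Example: latency({0.5, 0.5, 1.0, 0.3, 0.8}) == 3.1
//          throughput({0.5, 0.5, 1.0, 0.3, 0.8}) == 1.0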
Slide-31 SC2002 Tutorial

MIT Lincoln Laboratory

Example: Optimum System Latency


Simple two-component system
Local optimum fails to satisfy global constraints
Need a system view to find the global optimum

Filter
Latency = 2/N

Beamform
Latency = 1/N

Figure (left): component latency vs. hardware units N for the filter (latency = 2/N) and the beamformer (latency = 1/N); applying the constraints (hardware < 32, latency < 8) to each component separately yields only a local optimum. Figure (right): system latency over the filter-hardware / beamform-hardware trade space (8 to 32 units each); the global optimum satisfying hardware < 32 and latency < 8 for the system as a whole lies elsewhere.

Slide-32 SC2002 Tutorial

MIT Lincoln Laboratory

System Graph

Filter

Beamform

Detect

Node is a unique parallel mapping of a computation task

Edge is the conduit between a pair of parallel mappings

Slide-33 SC2002 Tutorial

The System Graph can store the hardware resource usage of every possible Task and Conduit

MIT Lincoln Laboratory

Optimal Mapping of Complex Algorithms


Application
Input
XIN

Low Pass Filter: XIN, FIR1, W1, FIR2, W2, XOUT

Beamform: XIN, mult, W3, XOUT

Matched Filter: XIN, W4, FFT, IFFT, XOUT

Different Optimal Maps


Intel Cluster Workstation Embedded Multi-computer

PowerPC Cluster

Embedded Board

Hardware
MIT Lincoln Laboratory

Need to automate the process of mapping the algorithm to hardware
Slide-34 SC2002 Tutorial

Outline

Introduction Processing Algorithms Parallel System Analysis Software Frameworks Summary


Latency vs. Throughput Corner Turn Dynamic Load Balancing

Slide-35 SC2002 Tutorial

MIT Lincoln Laboratory

Channel Space -> Beam Space

Figure: data from input channels 1 through N is combined through weights to form beams 1, 2, ..., M.

Data enters the system via different channels
Filtering is performed in a channel-parallel fashion
Beamforming requires combining data from multiple channels

Slide-36 SC2002 Tutorial

MIT Lincoln Laboratory

Corner Turn Operation


Filter Beamform

Processor Samples

Channels

Original Data Matrix

Channels

Cornerturned Data Matrix

Samples

Each processor sends data to each other processor
Half the data moves across the bisection of the machine
Slide-37 SC2002 Tutorial

MIT Lincoln Laboratory

Corner Turn for Signal Processing


Corner turn changes matrix distribution to exploit parallelism in successive pipeline stages
Sample Channel Channel Sample

Pulse

Pulse

Corner Turn Model: TCT = P1 · P2 · (α + B/β) / Q

B = bytes per message
Q = parallel paths
α = message startup cost
β = link bandwidth

P1 Processors

P2 Processors

All-to-all communication where each of P1 processors sends a message of size B to each of P2 processors
Total data cube size is P1 · P2 · B
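A direct transcription of the corner-turn model into C++ (illustrative; the formula is as reconstructed above, with alpha the message startup cost and beta the link bandwidth):

// T_CT = P1 * P2 * (alpha + B/beta) / Q : P1*P2 messages of B bytes each,
// each costing a startup alpha plus B/beta transfer time, spread over Q
// parallel paths. Units: seconds, bytes, bytes/second.
double cornerTurnTime(int P1, int P2, double B,
                      double alpha, double beta, int Q)
{
    return static_cast<double>(P1) * P2 * (alpha + B / beta) / Q;
}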
Slide-38 SC2002 Tutorial

MIT Lincoln Laboratory

Outline

Introduction Processing Algorithms Parallel System Analysis Software Frameworks Summary


Latency vs. Throughput Corner Turn Dynamic Load Balancing

Slide-39 SC2002 Tutorial

MIT Lincoln Laboratory

Dynamic Load Balancing


Figure: an image processing pipeline (detection followed by estimation). When work is divided statically by pixels, the per-processor load is balanced; when the same static partition is applied to the detections (which are data dependent), the per-processor load (0.11, 0.15, 0.13, 0.08, 0.10, 0.97, 0.30, 0.24) is badly unbalanced.

Slide-40 SC2002 Tutorial

Static parallelism implementations lead to unbalanced loads

MIT Lincoln Laboratory

Static Parallelism and Poisson's Wall


(i.e., balls into bins)

Figure: static parallel speedup vs. number of processors for M units of work and allowed failure rate f; the achievable speedup is only about 50% efficient at modest processor counts and falls to roughly 15% efficient at large counts.

Random fluctuations bound performance
Much worse if targets are correlated
Sets the maximum number of targets in nearly every system

M = # units of work, f = allowed failure rate
Slide-41 SC2002 Tutorial

MIT Lincoln Laboratory

Static Derivation
speedup = Nd / Nf

Nd = total detections
Nf = allowed detections (per processor) at failure rate f
Np = number of processors

Nf is chosen so that [ Σ_{n=0..Nf} P(n) ]^Np = 1 - f, where P(n) = λ^n e^{-λ} / n! is the Poisson distribution with mean λ = Nd / Np.

Slide-42 SC2002 Tutorial

MIT Lincoln Laboratory

Dynamic Parallelism
Figure: dynamic parallel speedup vs. number of processors for M units of work and allowed failure rate f; efficiency stays near 94%, versus roughly 50% for the comparable static case.

Assign work to processors as needed
Large improvement even in the worst case

M = # units of work, f = allowed failure rate
Slide-43 SC2002 Tutorial

MIT Lincoln Laboratory

Dynamic Derivation

Worst-case speedup = Nd / (Nd/Np + g · Nd) = Np / (1 + g · Np)

Nd = total detections
Np = number of processors
g = granularity of work
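Evaluating the worst-case formula (illustrative C++; the granularity g = 1/1000 is an assumed example value) gives a speedup of about 60 on 64 processors, consistent with the roughly 94% efficiency quoted on the previous slide:

#include <cstdio>

// Worst-case dynamic speedup: Np / (1 + g*Np), where g is the granularity
// of work (fraction of the total work handed out per assignment).
double dynamicSpeedup(double Np, double g) {
    return Np / (1.0 + g * Np);
}

int main() {
    std::printf("%.1f\n", dynamicSpeedup(64.0, 0.001));  // ~60.2 (about 94% of 64)
    return 0;
}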

Slide-44 SC2002 Tutorial

MIT Lincoln Laboratory

Static vs Dynamic Parallelism


Figure: parallel speedup vs. number of processors (1 to 1024) for linear, dynamic, and static parallelism; dynamic parallelism stays between roughly 94% and 50% efficient, while static parallelism falls to about 15% efficient at high processor counts.

Dynamic parallelism delivers good performance even in the worst case
Static parallelism is limited by random fluctuations (up to 85% of processors are idle)

Slide-45 SC2002 Tutorial

MIT Lincoln Laboratory

Outline

Introduction Processing Algorithms Parallel System Analysis Software Frameworks Summary


PVL PETE S3P MatlabMPI

Slide-46 SC2002 Tutorial

MIT Lincoln Laboratory

Current Standards for Parallel Coding

Vendor Supplied Libraries
Current Industry Standards
Parallel OO Standards

Industry standards (e.g., VSIPL, MPI) represent a significant improvement over coding with vendor-specific libraries
The next generation of object-oriented standards will provide enough support to write truly portable, scalable applications

Slide-47 SC2002 Tutorial

MIT Lincoln Laboratory

Goal: Write Once/Run Anywhere/Anysize


Develop code on a workstation (Matlab-like): A = B + C; D = FFT(A); ...
Demo real-time on a cluster (no code changes; roll-on/roll-off)
Deploy on an embedded system (no code changes)

Scalable/portable code provides high productivity
Slide-48 SC2002 Tutorial

MIT Lincoln Laboratory

Current Approach to Parallel Code


Algorithm + Mapping
Stage 1: Proc 1, Proc 2    Stage 2: Proc 3, Proc 4

Code
while(!done) {
  if ( rank()==1 || rank()==2 )
    stage1();
  else if ( rank()==3 || rank()==4 )
    stage2();
}

while(!done) {
  if ( rank()==1 || rank()==2 )
    stage1();
  else if ( rank()==3 || rank()==4 || rank()==5 || rank()==6 )
    stage2();
}

Proc 5, Proc 6

Slide-49 SC2002 Tutorial

Algorithm and hardware mapping are linked
Resulting code is non-scalable and non-portable

MIT Lincoln Laboratory

Scalable Approach
Single Processor Mapping / Multi Processor Mapping (same source code):

#include <Vector.h>
#include <AddPvl.h>

void addVectors(aMap, bMap, cMap) {
  Vector< Complex<Float> > a("a", aMap, LENGTH);
  Vector< Complex<Float> > b("b", bMap, LENGTH);
  Vector< Complex<Float> > c("c", cMap, LENGTH);

  b = 1;
  c = 2;
  a = b + c;
}

A =B +C

A =B +C


Slide-50 SC2002 Tutorial

Single processor and multi-processor code are the same
Maps can be changed without changing software
High level code is compact
MIT Lincoln Laboratory

PVL Evolution
Figure: timeline (1988-2000) of library evolution by applicability (scientific, non-real-time computing vs. real-time signal processing):
Parallel processing libraries: ScaLAPACK (Fortran, object-based); STAPL (C, object-based) leading to PVL (C++, object-oriented)
Parallel communications: MPI (C, object-based) leading to MPI/RT (C, object-based)
Single-processor libraries: LAPACK (Fortran) leading to VSIPL (C, object-based); PETE (C++, object-oriented)


Slide-51 SC2002 Tutorial

Transition technology from scientific computing to real-time
Moving from procedural (Fortran) to object-oriented (C++)
MIT Lincoln Laboratory

Anatomy of a PVL Map


Vector/Matrix Computation Conduit Task

All PVL objects contain maps

Map

PVL Maps contain: a Grid, a List of nodes, a Distribution, and an Overlap

Grid

{0,2,4,6,8,10}
List of Nodes
Slide-52 SC2002 Tutorial

Distribution

Overlap
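A hypothetical C++ stand-in for such a map (the actual PVL class and constructor signatures are not shown in this tutorial and may differ); it simply groups the four ingredients listed above:

#include <cstddef>
#include <vector>

// Hypothetical sketch of a PVL-style map: a processor grid, the list of
// node IDs it is laid out on, a distribution per dimension, and an overlap.
enum class Dist { Block, Cyclic };

struct MapSketch {
    std::size_t gridRows, gridCols;   // 2D grid of processors
    std::vector<int> nodes;           // e.g. {0, 2, 4, 6, 8, 10}
    Dist rowDist, colDist;            // how each dimension is distributed
    std::size_t overlap;              // boundary elements shared between nodes
};

int main() {
    // A vector or matrix built with this map would be split block-wise
    // across the six listed nodes, with no overlapping boundary elements.
    MapSketch m{1, 6, {0, 2, 4, 6, 8, 10}, Dist::Block, Dist::Block, 0};
    (void)m;
    return 0;
}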

MIT Lincoln Laboratory

Library Components
Class: Vector/Matrix (Signal Processing & Control)
Description: Used to perform matrix/vector algebra on data spanning multiple processors
Parallelism: Data

Class: Computation (Signal Processing & Control)
Description: Performs signal/image processing functions on matrices/vectors (e.g. FFT, FIR, QR)
Parallelism: Data & Task

Class: Task (Signal Processing & Control)
Description: Supports algorithm decomposition (i.e. the boxes in a signal flow diagram)
Parallelism: Task & Pipeline

Class: Conduit (Signal Processing & Control)
Description: Supports data movement between tasks (i.e. the arrows in a signal flow diagram)
Parallelism: Task & Pipeline

Class: Map (Mapping)
Description: Specifies how Tasks, Vectors/Matrices, and Computations are distributed on processors
Parallelism: Data, Task & Pipeline

Class: Grid (Mapping)
Description: Organizes processors into a 2D layout
Parallelism: Data, Task & Pipeline

Slide-53 SC2002 Tutorial

Simple mappable components support data, task and pipeline parallelism

MIT Lincoln Laboratory

PVL Layered Architecture


Application
Productivity Analysis Output

Input

Vector/Matrix Vector/Matrix

Comp Comp Conduit

Task

Parallel Vector Performance Library


Portability

User Interface

Grid

Map Distribution

Math Kernel (VSIPL) Intel Cluster

Messaging Kernel (MPI)

Hardware Interface

Hardware
Workstation PowerPC Cluster Embedded Board Embedded Multi-computer

Slide-54 SC2002 Tutorial

Layers enable simple interfaces between the application, the library, and the hardware

MIT Lincoln Laboratory

Outline

Introduction Processing Algorithms Parallel System Analysis Software Frameworks Summary


PVL PETE S3P MatlabMPI

Slide-55 SC2002 Tutorial

MIT Lincoln Laboratory

C++ Expression Templates and PETE


Expression: A = B + C * D

Expression type: BinaryNode<OpAssign, Vector, BinaryNode<OpAdd, Vector, BinaryNode<OpMultiply, Vector, Vector>>>

How the expression is evaluated (using B + C as the sub-expression):
1. Main passes B and C references to operator +
2. Operator + creates an expression parse tree
3. The expression parse tree is returned (copied) to main
4. Main passes the expression tree reference to operator =
5. Operator = calculates the result and performs the assignment

Parse trees, not vectors, are created.

Expression templates enhance performance by allowing temporary variables to be avoided
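A minimal expression-template sketch in the spirit of PETE (not the actual PETE or PVL source; class names are assumptions): operator+ returns a lightweight parse-tree node instead of a temporary vector, and a single evaluation loop runs only when the expression is assigned to a vector:

#include <cstddef>
#include <vector>

// Parse-tree node for an addition; stores references, performs no work yet.
template <typename L, typename R>
struct AddNode {
    const L& lhs;
    const R& rhs;
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

struct Vec {
    std::vector<float> data;
    explicit Vec(std::size_t n, float v = 0.0f) : data(n, v) {}
    float operator[](std::size_t i) const { return data[i]; }
    template <typename Expr>
    Vec& operator=(const Expr& e) {              // one loop, no temporaries
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Builds the parse tree; called for Vec+Vec and AddNode+Vec alike.
template <typename L, typename R>
AddNode<L, R> operator+(const L& a, const R& b) { return {a, b}; }

int main() {
    Vec a(8), b(8, 1.0f), c(8, 2.0f), d(8, 3.0f);
    a = b + c + d;   // builds AddNode<AddNode<Vec,Vec>,Vec>, then one loop
    return 0;
}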
Slide-56 SC2002 Tutorial

MIT Lincoln Laboratory

Experimental Platform

Network of 8 Linux workstations


800 MHz Pentium III processors

Communication
Gigabit ethernet, 8-port switch Isolated network

Software
Linux kernel release 2.2.14 GNU C++ Compiler MPICH communication library over TCP/IP
MIT Lincoln Laboratory

Slide-57 SC2002 Tutorial

Experiment 1: Single Processor

Figure: relative execution time vs. vector length (2 to 131072) on a single processor for A=B+C, A=B+C*D, and A=B+C*D/E+fft(F), comparing VSIPL, PVL/VSIPL, and PVL/PETE.

PVL with VSIPL has a small overhead
PVL with PETE can surpass VSIPL
MIT Lincoln Laboratory

Slide-58 SC2002 Tutorial

Experiment 2: Multi-Processor
(simple communication)
Figure: relative execution time vs. vector length (2 to 131072) on multiple processors with simple communication, for A=B+C, A=B+C*D, and A=B+C*D/E+fft(F), comparing C, C++/VSIPL, and C++/PETE.

PVL with VSIPL has a small overhead
PVL with PETE can surpass VSIPL
MIT Lincoln Laboratory

Slide-59 SC2002 Tutorial

Experiment 3: Multi-Processor
(complex communication)

Figure: relative execution time vs. vector length (2 to 131072) on multiple processors with complex communication, for A=B+C, A=B+C*D, and A=B+C*D/E+fft(F), comparing C, C++/VSIPL, and C++/PETE.

Communication dominates performance

Slide-60 SC2002 Tutorial

MIT Lincoln Laboratory

Outline

Introduction Processing Algorithms Parallel System Analysis Software Frameworks Summary


PVL PETE S3P MatlabMPI

Slide-61 SC2002 Tutorial

MIT Lincoln Laboratory

S3P Framework Requirements


Decomposable into Tasks (comp) and Conduits (comm)
Mappable to different sets of hardware
Measurable resource usage of each mapping

Figure: a pipeline of Tasks, Filter (XOUT = FIR(XIN)), Beamform (XOUT = w*XIN), and Detect (XOUT = |XIN|>c), connected by Conduits.


Slide-62 SC2002 Tutorial

Each compute stage can be mapped to different sets of hardware and timed
MIT Lincoln Laboratory

S3P Engine
Figure: the S3P Engine takes the Application Program, Algorithm Information, System Constraints, and Hardware Information as inputs and produces the Best System Mapping. Internally it consists of a Map Generator, a Map Timer, and a Map Selector.

Slide-63 SC2002 Tutorial

The Map Generator constructs the system graph for all candidate mappings
The Map Timer times each node and edge of the system graph
The Map Selector searches the system graph for the optimal set of maps

MIT Lincoln Laboratory

Test Case: Min(#CPU | Throughput)



Vary the number of processors used on each stage
Time each computation stage and communication conduit
Find the path with the minimum bottleneck

Figure: measured computation times for the Input, Low Pass Filter, Beamform, and Matched Filter stages mapped onto 1, 2, 3, or 4 CPUs, together with the communication times of the conduits between them; the selected mappings achieve 33 frames/sec (1.6 MHz BW) and 66 frames/sec (3.2 MHz BW).

Slide-64 SC2002 Tutorial

MIT Lincoln Laboratory

Dynamic Programming
The graph construct is very general and is widely used for optimization problems
Many efficient techniques exist for choosing the best path under constraints, such as dynamic programming
N = total hardware units
M = number of tasks
Pj = number of mappings for task j
t = M
pathTable[M][N] = all infinite-weight paths
for (j : 1..M) {
  for (k : 1..Pj) {
    for (i : t..N) {
      if (i - size[k] >= j) {
        if (j > 1) {
          w = weight[pathTable[j-1][i-size[k]]] + weight[k]
              + weight[edge[last[pathTable[j-1][i-size[k]]], k]]
          p = addVertex(pathTable[j-1][i-size[k]], k)
        } else {
          w = weight[k]
          p = makePath(k)
        }
        if (weight[pathTable[j][i]] > w) {
          pathTable[j][i] = p
        }
      }
    }
  }
  t = t - 1
}
Slide-65 SC2002 Tutorial

MIT Lincoln Laboratory

Predicted and Achieved Latency and Throughput


Figure: predicted vs. achieved latency (seconds) and throughput (frames/sec) as a function of #CPU, for a small (48x4K) and a large (48x128K) problem size; candidate mappings are labeled by the number of processors assigned to each stage (1-1-1-1, 1-1-2-1, 1-2-2-2, 1-3-2-2, 2-2-2-2, ...). The searches find Min(latency | #CPU) and Max(throughput | #CPU).

S3P selects the correct optimal mapping
Excellent agreement between S3P-predicted and achieved latencies and throughputs

Slide-66 SC2002 Tutorial

MIT Lincoln Laboratory

Outline

Introduction Processing Algorithms Parallel System Analysis Software Frameworks Summary


PVL PETE S3P MatlabMPI

Slide-67 SC2002 Tutorial

MIT Lincoln Laboratory

Modern Parallel Software Layers


Application Parallel Library
Math Kernel Intel Cluster Input Analysis Output

Vector/Matrix Vector/Matrix

Comp Comp Conduit

Task

User Interface Hardware Interface

Messaging Kernel

Hardware
Workstation PowerPC Cluster

Can build any parallel application/library on top of a few basic messaging capabilities
MatlabMPI provides this Messaging Kernel
MIT Lincoln Laboratory

Slide-68 SC2002 Tutorial

MatlabMPI Core Lite

Parallel computing requires eight capabilities


MPI_Run launches a Matlab script on multiple processors
MPI_Comm_size returns the number of processors
MPI_Comm_rank returns the id of each processor
MPI_Send sends Matlab variable(s) to another processor
MPI_Recv receives Matlab variable(s) from another processor
MPI_Init called at beginning of program
MPI_Finalize called at end of program

Slide-69 SC2002 Tutorial

MIT Lincoln Laboratory

MatlabMPI: Point-to-point Communication


MPI_Send (dest, tag, comm, variable);

Sender variable

Shared File System

Receiver load variable

save

Data file

create

Lock file

detect

variable = MPI_Recv (source, tag, comm);

The sender saves the variable in a Data file, then creates a Lock file
The receiver detects the Lock file, then loads the Data file
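The same protocol can be sketched in a few lines of C++ (illustrative only; MatlabMPI itself is implemented in Matlab and its actual file naming differs):

#include <chrono>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <thread>

namespace fs = std::filesystem;

// Sender: write the payload to a data file, then create an empty lock file,
// so the receiver never sees a partially written message.
void sendMsg(const std::string& msgName, const std::string& payload) {
    std::ofstream data(msgName + ".dat");
    data << payload;                               // 1. save variable in data file
    data.close();
    std::ofstream lock(msgName + ".lock");         // 2. create lock file
}

// Receiver: poll for the lock file, then load the data file.
std::string recvMsg(const std::string& msgName) {
    while (!fs::exists(msgName + ".lock"))         // 3. detect lock file
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::ifstream in(msgName + ".dat");            // 4. load data file
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}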
Slide-70 SC2002 Tutorial

MIT Lincoln Laboratory

Example: Basic Send and Receive


Initialize
Get processor ranks
MPI_Init;                               % Initialize MPI.
comm = MPI_COMM_WORLD;                  % Create communicator.
comm_size = MPI_Comm_size(comm);        % Get size.
my_rank = MPI_Comm_rank(comm);          % Get rank.
source = 0;                             % Set source.
dest = 1;                               % Set destination.
tag = 1;                                % Set message tag.
if (comm_size == 2)                     % Check size.
  if (my_rank == source)                % If source.
    data = 1:10;                        % Create data.
    MPI_Send(dest, tag, comm, data);    % Send data.
  end
  if (my_rank == dest)                  % If destination.
    data = MPI_Recv(source, tag, comm); % Receive data.
  end
end
MPI_Finalize;                           % Finalize Matlab MPI.
exit;                                   % Exit Matlab.

Execute send
Execute receive

Finalize
Exit


Slide-71 SC2002 Tutorial

Uses standard message passing techniques
Will run anywhere Matlab runs
Only requires a common file system
MIT Lincoln Laboratory

MatlabMPI vs MPI bandwidth


Figure: bandwidth (bytes/sec) vs. message size (1 KB to 32 MB) on an SGI Origin2000 for C MPI and MatlabMPI.

Slide-72 SC2002 Tutorial

Bandwidth matches native C MPI at large message sizes
The primary difference is latency (35 milliseconds vs. 30 microseconds)

MIT Lincoln Laboratory

Image Filtering Parallel Performance


Figure (left): parallel speedup vs. number of processors (1 to 64) for a fixed problem size on an SGI Origin2000, compared against linear speedup. Figure (right): gigaflops vs. number of processors (1 to ~1000) for a scaled problem size on an IBM SP2, compared against linear scaling.

Achieved classic super-linear speedup on the fixed problem
Achieved a speedup of ~300 on 304 processors on the scaled problem
Slide-73 SC2002 Tutorial

MIT Lincoln Laboratory

Productivity vs. Performance


Figure: lines of code vs. fraction of peak performance for image filtering programmed several ways: Matlab, MatlabMPI, parallel Matlab*, VSIPL, VSIPL/OpenMP (single processor, shared memory), VSIPL/MPI (distributed memory), PVL, C, and C++, spanning current practice and current research.

Programmed image filtering several ways: Matlab, VSIPL, VSIPL/OpenMP, VSIPL/MPI, PVL, MatlabMPI

MatlabMPI provides high productivity and high performance

Slide-74 SC2002 Tutorial

MIT Lincoln Laboratory

Summary

Exploiting parallel processing for streaming applications presents unique software challenges. The community is developing software libraries to address many of these challenges; these libraries:
Exploit C++ to easily express data/task parallelism
Separate parallel hardware dependencies from software
Allow a variety of strategies for implementing dynamic applications (e.g., for fault tolerance)
Deliver high performance execution comparable to or better than standard approaches

Our future efforts will focus on adding to and exploiting the features of this technology to:
Exploit dynamic parallelism
Integrate high performance parallel software underneath mainstream programming environments (e.g., Matlab, IDL, ...)
Use self-optimizing techniques to maintain performance

Slide-75 SC2002 Tutorial

MIT Lincoln Laboratory
