You are on page 1of 45

ARRAY PROCESSORS

SIMD Computer Organization

SIMD Computer Organization:


I/O Instruction and data
CU memory

CU

CP

Control

Data bus

PE0
PEM0

PE1
PEM1

PEN-1
PEMN-1

Interconnection N/W CONFIG I

I/O Config II: CU memory CU

PE0

PE1

PEN-1

Alignment Network Data Bus

M0

M1

Mp-1

Formally, an SIMD computer is characterized by the following set of parameters: C = ( N, F, I, M)

N: Number of PEs in the system. Illiac IV ___ 64 MPP __ 16,384 (Massively parallel processing) F: Set of data routing functions provided by the interconnection network. Example: Mesh, star, Omega and butterfly.
data routing and network manipulation operations.

I: The set of instructions for scalar, vector,

M: Set of masking schemes, where each mask partitions the PEs into 2 disjoint subnets of enabled PEs & disable PEs.

Components in a PE:
Index Register Destination Register Ai Di Bi Ii Ci Data Routing Register

Ri

Si
ALU

Status Register

Ai, Bi and Ci are general purpose registers.


Only the contents of Ri are transformed to other PEs during data transfer.

If N = 2^m ( m = no. of bits required to identify a PE) PEs are


there, then Di will hold m bit address of the destined PE.

Each PEi is either active or inactive during instruction cycle:

Si = 1 Active

Si = 0 Inactive

Necessity of Data Routing:

Consider an Array: A = (A0, A1,..,An-1)


n-1

Now, for computing: i=0 S (n) = Ai For n = 8 with N = 8, addition is performed in log2 N steps i.e., 3 steps.

In SISD, the same thing would take 8 steps/loops by formula:

sum = sum + A[i]


If you want to make it faster, youll burn this loop into hardware steps.

But the SAME thing would take ONLY 3 steps in SIMD!!

SIMD:
1 2 3 1 1 + 2=3 2 + 3=5 3 + 4=7 4 + 5=9 5 + 6=11 6 + 7=13 7 + 8=15 1 3 5(1+5=6) 7(3+7=10) 9(5+9=14) 11(7+11=18) 13(9+13=22) 15(11+15=26) 1 3 6 10 1+!4 3+18

S(0) S(1)

S(2)
S(3) S(4) S(5) S(6) S(7)

4
5 6 7 8

6+22
10+26=36

Step 1

Step 2

Step 3

Step # 1: Ai would transfer data in Ri Ai Ri Ri Ri+1 Ai + Ri Step # 2: Ai Ri Ri Ri+2

Algorithm:

i = 0-6 i = 0-6

Ai

i = 1-7

i = 0-5 i = 0-5

Ai + Ri
Step # 3: Ai Ri Ri Ri+4 Ai + Ri

Ai

i = 2-7

i = 0-3 i = 0-3 Ai i = 4-7

During Data Routing:

Masking Scheme:

Step # 1: PE7 is disabled. Step # 2: PE6 and PE7 are disabled. Step # 3: PE4 PE7 are disabled.

During Addition:
Step # 1: PE0 is not involved.

Step # 2: PE0, PE1 are not involved.


Step # 3: PE0-PE3 are not involved.

SIMD Interconnection network


Interconnection networks

Static network

Dynamic network

1-D

2-D

3-D and hypercube

Bus Based

Switch Based

Single Stage

Multistage

crossbar

Static Networks (1D) Linear Array


N nodes connected by n-1 links; Internal nodes have degree 2 End nodes have degree 1. Diameter = n-1

2D Ring Network

Like a linear array, but the two end nodes are connected by an n
th

link.

Can be unidirectional or bi-directional.


Node degree (d)= 2 Diameter (D)

Unidirectional = n-1 Bidirectional = n/2

Star network

Internal node degree = n-1


External nodes have degree = 1 Network diameter = 2

Static Interconnection Networks (2D)


A n-dimensional

mesh [torus or wraparound mesh] is an extension of the linear array [ring]. Degree: 2-4 Examples: Intel Paragon (2D mesh),

Mesh interconnection network (ILLIAC IV N/W)

It is a chordal ring network.


a e 0 b 1 c 2 d 3 f

10

11

12 a

13 b

14 c

15 d

In the Illiac IV, each processor i was connected to processors: Ex: N=16 {i+1, i1, i+4, and i4} (mod 16). Here are the routing functions:

R+1(i) (i + 1) mod N R1(i) (i 1) mod N R+r (i) (i + r) mod N Rr (i) (i r) mod N


where r = N Where 0 i N-1. The diameter of an Illiac-IV mesh is N 1.

Chordal Ring N/W is ILLIAC-IV n/w. Also called as partially connected n/w. (Diag.)
15 14 13 12 11 10 9

1 2 3 4 5 6

By adding additional links (e.g. chords in a circle), the


node degree is increased, and the network diameter is reduced.

Barrel shifting network:


A barrel shifter is sometimes called a plus-minus- 2i network. Routing functions:

B+i (j ) = (j + 2i) mod N Bi (j ) = (j 2i) mod N

where 0 j N 1 and 0 i < log2N.

B+0, B-0, B+1, B-1, B+2, B-2, B+3 and B-3

In general, the diameter of a barrel shifter is (log2N)/2.

How Barrel Shifter Network is an enhancement over Mesh Interconnection network?


In Mesh Network: R+1=(0 1 2 3 .15) R-1 =(15 14 13.0) R+4=(0 4 8 12) (1 5 9 12) (2 6 10 14) (3 7 11 15) R-4 =(15 11 7 3) (14 10 6 2) (12 9 5 1) (12 8 4 0) In Barrel Shifting Network: B+0=(0 1 2 3 .15) B-0=(15 14 13.0) B+1=(0 2 4 6 8..14)(1 3 5 .15) B-1=(15 13 11.1) (14 12 ..0) B+2=(0 4 8 12) (1 5 9 13) (2 6 10 14) (3 7 11 15) B-2=(15 11 7 3) (14 10 6 2) (13 9 5 1) (12 8 4 0) B+3= (0 8) (1 9) (2 10) (3 11) (4 12) (5 13) (6 14) (7 15) B-3= (15 7) (14 6) (13 5) (12 4) (11 3) (10 2) (9 1) (8 0)

Contd

Comparing the two networks


B+0= R+1 B-0 = R-1 B+2= R+4 B-2 =R-4

Which means in general B+n/2 = R+r B-n/2 = R-r where n=log 2N

3D- Fully Connected


In the limit, we obtain a fully-connected network, with a node degree of n -1 and a diameter of 1.

3D networks :
000

3- cube
001 010 011 101

100 110

111

A hypercube is a generalized cube. In a hypercube, there are 2n nodes, for some n. Each node is connected to all other nodes whose

numbers differ from it in only one bit position. The node degree of n cube equals n and so does the

network diameter.

3-cube connected cyclic (CCC) network:


Is obtained from 3-cube n/w. The idea is to cutoff corner nodes of 3-cube and replace each by a cycle of 3 nodes. CCC n/w is constructed from k-cube with n = 2k cycles

nodes.
Hence a 3-cube can be transformed to a 3-CCC with k x 2k nodes.

4D hypercube
4D hypercube = two 3D hypercubes with an additional link connecting corresponding processors

Multistage Interconnection Network


Switch Modules

A x B switch module A inputs and B outputs In practice, A = B = power of 2 Each input is connected to one or more outputs

(conflicts must be avoided)

One-to-one (permutation) and one-to-many are allowed

Multistage Interconnection Network


Switch Modules

A 2 2 switch can be configured for


Straight-through Crossover Upper broadcast (upper input to both outputs) Lower broadcast (lower input to both outputs)

Binary Switch
2x2 Switch Legitimate States = 4

Perfect-shuffle interconnection:
This interconnection network is defined by the routing function S (an1 a1a0)2 = (an2 a1a0 an1)2
7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0

Perfect Shuffle

Inverse Shuffle

a shuffle network is not a complete interconnection network. This can be seen by looking at what happens as data is reci rculated through the network.

An exchange permutation can be added to a shuffle network to make


it into a complete interconnection structure.

E (an1 a1 a0)2 = an1 a1 0

with a shuffle-exchange network, arbitrary cyclic shifts of an N-element array can be performed in log N steps. Here

is a diagram of a multistage omega network for N = 8.

Exch. 1
0 1 2 3 4 5 6 7 0 4 1 5 2 6 3 7

Exch. 2
0 2 4 6 1 3 5 7

Exch. 3
0 1 2 3 4 5 6 7

Shuffle 1

Shuffle 2

Shuffle 3

Omega network features

There are log p stages each with p/2 switching elements each = p/2 * log p total Simple routing algorithm At each stage, look at the corresponding bit (starting with the msb) of the

source and destination address


If the bits are the same, messages passes through, otherwise is crossedover

Path Contention
0 1 4 2 3 5 4 0 1 2 3 4

5
6

5
6 7

Path Contention
0 1 2 3 4 4 0 1 2 3 4

5
6 5

5
6 7

Path Contention
0 1 2 3 4 0 1 2

3 4

5
6

5
6 5 7

Path Contention
0 1 2 3 4 0 1 2 3 4

5
6 4 5

5
6 7

Path Contention
0 1 2 3 4 0 1 2 3 4

5
6

5
6 7

Path Contention
0 1 2 3 4 0 1 2 3 4

5
6 5

5
6 7

Path Contention
0 1 2 3 4 0 1 2 3 4 5

5
6

5
6 7

Path Contention
0 1 2 3 4 0 1 2 3 4 5 5 6 7

5
6

Extra Problems

. Find the following in a 16x16 omega network


a) b)

Number of stages. Number of 2 x 2 switches needed in each stage?.

c)

Draw a 16-input Omega network using 2 x 2 switches as building


blocks.

d)

Show switch settings for routing a message

from node 1101 to node 0101 and from node 0111 to node 1001
simultaneously.

From node 2 to 7 and 6 to 4 simultaneously.

e)

Does blocking exist in above two case?

Show the necessity of data routing and masking mechanisms during


the addition of 15 numbers. Assume each PE holds one element.

Find the number of steps required to add 16 elements. Calculate the different routing functions. Show the routing and addition in each step.

You might also like