
ON-CHIP OPTICAL COMMUNICATION

Presented by

Aman Chitransh
“Moore’s Gap”

[Chart: performance in GOPS vs. time, 1992-2010. Transistor counts keep
growing while single-CPU performance gains from pipelining, OOO superscalar,
and SMT/FGMT/CGMT designs flatten out, opening “The GOPS Gap” that tiled
multicores aim to close.]

§ Diminishing returns from single-CPU mechanisms (pipelining, caching, etc.)
§ Wire delays
§ Power envelopes
2
Multicore Scaling Trends

Today:
§ A few large cores on each chip
§ Only option for future scaling is to add more cores

Tomorrow [S. Borkar, Intel, 2007]:
§ 100’s to 1000’s of simpler cores
§ Simple cores are more power- and area-efficient
§ Global structures do not scale; all resources must be distributed

[Diagram: a bus-based multicore (processors and private caches sharing a BUS
and L2 cache) contrasted with a tiled multicore (a grid of tiles, each
containing a processor, memory, and switch).]

3
The Future of Multicore

Number of cores doubles every 18 months: parallelism replaces clock-frequency
scaling and core complexity.

Resulting challenges:
Scalability
Programming
Power

[Chip photos: MIT Raw, IBM XCell 8i, Sun UltraSPARC T2, Tilera TILE64]
4
Multicore Challenges
 Scalability
  How do we turn additional cores into additional performance?
  Must accelerate single apps, not just run more apps in parallel
  Efficient core-to-core communication is crucial
  Architectures must grow easily with each new technology generation

 Programming
  Traditional parallel programming techniques are hard
  Parallel machines were rare and used only by rocket scientists
  Multicores are ubiquitous and must be programmable by anyone

 Power
  Already a first-order design constraint
  More cores and more communication → more power
  Previous tricks (e.g. lowering Vdd) are running out of steam

5
Multicore Communication Today
Bus-based Interconnect

[Diagram: processors and caches share a single BUS to the L2 cache and DRAM.]

§ Single shared resource
§ Uniform communication cost
§ Communication through memory
§ Doesn’t scale to many cores due to contention and long wires
§ Scalable up to about 8 cores

6
Multicore Communication Tomorrow
Point-to-Point Mesh Network

[Diagram: a 4x4 grid of tiles, each containing a processor, memory, and
switch; switches link neighboring tiles, and DRAM attaches at the edges.]

Examples: MIT Raw, Tilera TILEPro64, Intel Terascale Prototype

Neighboring tiles are connected
Distributed communication resources
Non-uniform costs: latency depends on distance (see the sketch below)
Encourages direct communication
More energy efficient than a bus
Scalable to hundreds of cores
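To make the non-uniform cost concrete, here is a minimal sketch (not from the
slides) of how latency grows with distance on a 2D mesh, assuming
dimension-ordered XY routing and a hypothetical one cycle per switch hop:

# Minimal mesh-latency model (illustrative assumptions: XY routing,
# 1 cycle per hop, 64 cores arranged as an 8x8 grid of tiles).
MESH_WIDTH = 8

def tile_coords(core_id: int) -> tuple[int, int]:
    """Map a linear core id to (x, y) tile coordinates."""
    return core_id % MESH_WIDTH, core_id // MESH_WIDTH

def mesh_latency_cycles(src: int, dst: int) -> int:
    """Hops = Manhattan distance between tiles (1 cycle/hop assumed)."""
    sx, sy = tile_coords(src)
    dx, dy = tile_coords(dst)
    return abs(sx - dx) + abs(sy - dy)

print(mesh_latency_cycles(0, 63))  # corner to corner: 14 hops
print(mesh_latency_cycles(0, 1))   # neighbors: 1 hop

This distance dependence is exactly what the slide contrasts with ATAC’s
uniform one-hop optical broadcast.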

7
ATAC Architecture

[Diagram: a 4x4 grid of tiles (processor, memory, switch) joined by an
electrical mesh interconnect and overlaid with an optical broadcast WDM
interconnect that reaches every tile.]


8
Optical Broadcast Network

§ Waveguide passes through every core
§ Multiple wavelengths (WDM) eliminate contention
§ Signal reaches all cores in <2 ns
§ Same signal can be received by all cores

[Diagram: an optical waveguide snaking past every core on the chip.]

9
Optical Broadcast Network

[Diagram: N cores attached to a shared optical waveguide loop.]

§ Electronic-photonic integration using standard CMOS process
§ Cores communicate via an optical WDM broadcast-and-select network
§ Each core sends on its own dedicated wavelength using modulators
§ Cores can receive from some set of senders using optical filters
(a broadcast-and-select sketch follows below)
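As an illustration only (not from the slides), broadcast-and-select can be
modeled as each sender writing to its own wavelength channel while each
receiver filters the subset of channels it has tuned to; all names here are
hypothetical:

# Toy model of a WDM broadcast-and-select network.
class WdmBroadcastNetwork:
    def __init__(self, n_cores: int):
        self.channels = {}                                  # wavelength -> last value
        self.filters = {c: set() for c in range(n_cores)}   # receiver -> senders

    def tune(self, receiver: int, sender: int) -> None:
        """Receiver adds an optical filter for the sender's wavelength."""
        self.filters[receiver].add(sender)

    def send(self, sender: int, value) -> None:
        """One send on a dedicated wavelength: no contention, and every
        tuned receiver sees the same signal (efficient broadcast)."""
        self.channels[sender] = value

    def receive(self, receiver: int) -> dict:
        """Select: read only the wavelengths this core filters."""
        return {s: self.channels[s]
                for s in self.filters[receiver] if s in self.channels}

net = WdmBroadcastNetwork(64)
net.tune(receiver=5, sender=0)
net.send(sender=0, value=0xCAFE)
print(net.receive(5))  # {0: 51966}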

10
Optical bit transmission

§ Each core sends data using a different wavelength → no contention
§ Data is sent once; any or all cores can receive it → efficient broadcast

[Diagram: a multi-wavelength source feeds the waveguide. In the sending core,
a flip-flop feeds a modulator driver and modulator that put data onto the
waveguide; in the receiving core, an optical filter drops the sender’s
wavelength onto a photodetector, a transimpedance amplifier recovers the
signal, and a flip-flop latches the bit.]


11
ATAC Bandwidth

64 cores, 32 lines, 1 Gb/s per line

Transmit BW: 64 cores x 1 Gb/s x 32 lines = 2 Tb/s
Receive-weighted BW: 2 Tb/s x 63 receivers = 126 Tb/s
 Good metric for broadcast networks - reflects WDM

ATAC allows better utilization of computational resources because less time is
spent performing communication. (A quick check of these numbers follows.)
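As a sanity check (illustrative only), the slide’s bandwidth figures follow
directly from its stated parameters:

# Recompute the slide's bandwidth figures.
CORES = 64
LINES = 32            # wavelength lines (bits) sent in parallel per core
LINE_RATE_GBPS = 1    # 1 Gb/s per line

transmit_bw_tbps = CORES * LINES * LINE_RATE_GBPS / 1000
print(transmit_bw_tbps)        # 2.048 -> the slide's ~2 Tb/s

# Receive-weighted bandwidth: every bit can be usefully received by the
# other 63 cores, which is what makes WDM broadcast cheap.
receive_weighted_tbps = transmit_bw_tbps * (CORES - 1)
print(receive_weighted_tbps)   # ~129 -> the slide's 126 Tb/s (from rounded 2 Tb/s)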


System Capabilities and Performance

                          Baseline: Raw Multicore Chip    ATAC Multicore Chip
Description               Leading-edge tiled multicore    Future optical-interconnect multicore
Cores / process           64 cores (65nm)                 64 cores (65nm)
Peak performance          64 GOPS                         64 GOPS
Chip power                24 W                            25.5 W
Theoretical power eff.    2.7 GOPS/W                      2.5 GOPS/W
Effective performance     7.3 GOPS                        38.0 GOPS
Effective power eff.      0.3 GOPS/W                      1.5 GOPS/W
Total system power        150 W                           153 W

Optical communications require a small amount of additional system power but
allow for much better utilization of computational resources. (A quick check
of the efficiency figures follows below.)
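A small check (illustrative) that the effective power-efficiency rows are
consistent with the stated effective performance and chip power:

# Effective power efficiency = effective performance / chip power.
chips = {
    "Raw baseline": {"effective_gops": 7.3,  "chip_power_w": 24.0},
    "ATAC":         {"effective_gops": 38.0, "chip_power_w": 25.5},
}
for name, c in chips.items():
    print(f"{name}: {c['effective_gops'] / c['chip_power_w']:.2f} GOPS/W")
# Raw baseline: 0.30 GOPS/W, ATAC: 1.49 GOPS/W -> matches the 0.3 and 1.5 rows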

13
Programming ATAC

 Cores can directly communicate with any other core in one hop (<2 ns)
 Broadcasts require just one send
 No complicated routing on network required
 Cheap broadcast enables frequent global communications
 Broadcast-based cache update / remote store protocol:
all “subscribers” are notified when a writing core issues a store (“publish”)
(a sketch of this protocol follows below)
 Uniform communication latency simplifies scheduling
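The publish/subscribe idea can be sketched as follows; this is a simplified
illustration on top of the broadcast network, not ATAC’s actual protocol
implementation, and all class and method names are hypothetical:

# Sketch of a broadcast-based cache-update ("publish/subscribe") protocol.
# A store to a shared address is broadcast once; every subscribed core
# updates its local copy, avoiding invalidate/refetch round-trips.
class Core:
    def __init__(self, core_id: int):
        self.core_id = core_id
        self.local_cache = {}          # address -> value

    def on_publish(self, address: int, value) -> None:
        self.local_cache[address] = value   # update in place

class BroadcastDirectory:
    def __init__(self):
        self.subscribers = {}          # address -> set of Cores

    def subscribe(self, core: Core, address: int) -> None:
        self.subscribers.setdefault(address, set()).add(core)

    def publish(self, address: int, value) -> None:
        """One optical broadcast notifies every subscriber."""
        for core in self.subscribers.get(address, ()):
            core.on_publish(address, value)

bus = BroadcastDirectory()
readers = [Core(i) for i in range(4)]
for r in readers:
    bus.subscribe(r, address=0x100)
bus.publish(address=0x100, value=42)   # a writing core issues a store
print(readers[3].local_cache)          # {256: 42}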

14
Communication-centric Computing

§ ATAC reduces off-chip memory calls, and hence energy and latency
§ A view of extended global memory can be enabled cheaply with on-chip
distributed cache memory and the ATAC network

Operation              Energy    Latency
Network transfer       3 pJ      3 cycles
ALU add operation      2 pJ      1 cycle
32KB cache read        50 pJ     1 cycle
Off-chip memory read   500 pJ    250 cycles

[Diagram: a bus-based multicore pays ~500 pJ per off-chip memory access, while
ATAC keeps the data on-chip at ~3 pJ per network transfer. An
energy-comparison sketch follows below.]
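To illustrate the slide’s point with its own table, here is a
back-of-the-envelope comparison (the access scenario is an assumption for
illustration) of fetching a shared word from off-chip memory versus from a
remote on-chip cache over the ATAC network:

# Energy/latency comparison using the slide's per-operation costs.
NETWORK_TRANSFER = {"energy_pj": 3,   "latency_cycles": 3}
CACHE_READ_32KB  = {"energy_pj": 50,  "latency_cycles": 1}
OFFCHIP_READ     = {"energy_pj": 500, "latency_cycles": 250}

# Bus-based: shared data goes through memory -> an off-chip read.
bus_energy  = OFFCHIP_READ["energy_pj"]
bus_latency = OFFCHIP_READ["latency_cycles"]

# ATAC: read the remote on-chip cache, ship the word over the network.
atac_energy  = CACHE_READ_32KB["energy_pj"] + NETWORK_TRANSFER["energy_pj"]
atac_latency = CACHE_READ_32KB["latency_cycles"] + NETWORK_TRANSFER["latency_cycles"]

print(f"bus:  {bus_energy} pJ, {bus_latency} cycles")   # 500 pJ, 250 cycles
print(f"ATAC: {atac_energy} pJ, {atac_latency} cycles") # 53 pJ, 4 cycles
# Roughly 9x less energy and 60x lower latency for this illustrative access.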

15
ATAC is an Efficient Network
•Modulators are the primary source of power consumption
–Receive power: requires only ~2 fJ/bit even with -5 dB link loss
–Modulator power: Ge-Si EA design ~75 fJ/bit (assumes 50 fJ/bit for the
modulator driver)

•Example: 64-core communication
(i.e. N = 64 cores = 64 wavelengths; for a 32-bit word: 2048 drops/core and
32 adds/core)
–Receive power: 2 fJ/bit x 1 Gb/s x 32 bits x N^2 = 262 mW
–Modulator power: 75 fJ/bit x 1 Gb/s x 32 bits x N = 153 mW
–Total energy/bit = 75 fJ/bit + 2 fJ/bit x (N-1) = 201 fJ/bit

•Comparison: electrical broadcast across 64 cores
–Requires 64 x 150 fJ/bit ≈ 10 pJ/bit (~50x more power)
(Assumes 150 fJ/mm/bit, 1-mm tile spacing)
(A recomputation of these figures follows below.)
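These figures can be reproduced directly from the slide’s stated per-bit
energies (a sketch; the rounding notes are mine):

# Recompute the slide's optical-network power figures.
N = 64                            # cores = wavelengths
BITS = 32                         # word width (parallel wavelength lines)
RATE = 1e9                        # 1 Gb/s per line
RX_FJ, MOD_FJ = 2e-15, 75e-15     # J/bit: receive and modulator energy

receive_w   = RX_FJ  * RATE * BITS * N**2   # every core drops all N senders
modulator_w = MOD_FJ * RATE * BITS * N
print(f"receive:   {receive_w * 1e3:.0f} mW")    # 262 mW
print(f"modulator: {modulator_w * 1e3:.0f} mW")  # 154 mW (slide rounds to 153)

# Energy per broadcast bit: one modulation plus N-1 receives.
optical_fj = 75 + 2 * (N - 1)                    # 201 fJ/bit
electrical_fj = 64 * 150                         # ~10 pJ/bit electrically
print(optical_fj, electrical_fj / optical_fj)    # 201, ~48x (the slide's ~50x)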

16
Summary
 ATAC uses optical networks to enable multicore programming and performance
scaling

 ATAC encourages a communication-centric architecture, which helps multicore
performance and power scalability

 ATAC simplifies programming with a contention-free all-to-all broadcast
network

 ATAC is enabled by recent advances in CMOS integration of optical components

17
What Does the Future Look Like?
Corollary of Moore’s law: the number of cores will double every 18 months

            ‘02   ‘05   ‘08   ‘11   ‘14
Research     16    64   256  1024  4096
Industry      4    16    64   256  1024

1K cores by 2014! Are we ready?
(Cores minimally big enough to run a self-respecting OS.)
(A doubling-trend check follows below.)
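The table is just the doubling rule applied from the 2002 starting points; a
quick sketch (illustrative) reproduces it:

# Core counts doubling every 18 months: two doublings (x4) per 3-year column.
START = {"Research": 16, "Industry": 4}   # counts in 2002, per the slide
years = range(2002, 2015, 3)              # '02, '05, '08, '11, '14

for track, cores_2002 in START.items():
    print(track, [cores_2002 * 2 ** (2 * i) for i in range(len(years))])
# Research [16, 64, 256, 1024, 4096]
# Industry [4, 16, 64, 256, 1024]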
18
Scaling to 1000 Cores

[Diagram: a cluster of 16 cores (each with processor, cache, and directory)
connects through electrical networks (ENet, BNet) to a shared optical hub;
64 such optically-connected clusters cover 1024 cores, with memory attached
per cluster.]

Purely optical design scales to about 64 cores
After that, clusters of cores share optical hubs
 ENet and BNet move data to/from the optical hub
 Dedicated, special-purpose electrical networks
(a cluster-addressing sketch follows below)
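A minimal sketch (illustrative; function and constant names are mine) of the
resulting hierarchical addressing, where each core belongs to a 16-core
cluster served by one optical hub:

# Hierarchical addressing for 1024 cores: 64 optical hubs x 16 cores each.
CORES_PER_CLUSTER = 16
NUM_CLUSTERS = 64

def route(core_id: int) -> tuple[int, int]:
    """Return (cluster/hub id, local core id within the cluster)."""
    return divmod(core_id, CORES_PER_CLUSTER)

hub, local = route(1000)
print(hub, local)   # cluster 62, local core 8
# Intra-cluster traffic stays on the electrical ENet/BNet; inter-cluster
# traffic goes local core -> hub -> optical broadcast -> remote hub -> core.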

19
