COMMUNICATION
Presented by Aman Chitransh
“Moore’s Gap”

[Figure: performance (GOPS) vs. time, 1992–2010. Transistor counts keep climbing while single-core performance flattens, opening a widening “gap”. Techniques along the curve: pipelining, superscalar, OOO, SMT/FGMT/CGMT, multicore, tiled multicore.]

§ Diminishing returns from single-CPU mechanisms (pipelining, caching, etc.)
§ Wire delays
§ Power envelopes
Multicore Scaling Trends

Today:
§ A few large cores on each chip
§ Only option for future scaling is to add more cores

Tomorrow [S. Borkar, Intel, 2007]:
§ 100’s to 1000’s of simpler cores
§ Simple cores are more power- and area-efficient
§ Global structures do not scale; all resources must be distributed

[Figure: a bus-based multicore with shared L2 cache evolving into a tiled array of processor + memory tiles connected by switches.]
The Future of Multicore

§ Number of cores doubles every 18 months
§ Parallelism replaces clock-frequency scaling and core complexity

Resulting challenges: scalability, programming, power.

Examples: IBM XCell 8i, MIT RAW, Sun UltraSPARC T2, Tilera TILE64
Multicore Challenges

Scalability
§ How do we turn additional cores into additional performance?
§ Must accelerate single apps, not just run more apps in parallel
§ Efficient core-to-core communication is crucial
§ Architectures must grow easily with each new technology generation

Programming
§ Traditional parallel programming techniques are hard
§ Parallel machines were rare and used only by rocket scientists
§ Multicores are ubiquitous and must be programmable by anyone

Power
§ Already a first-order design constraint
§ More cores and more communication → more power
§ Previous tricks (e.g. lowering Vdd) are running out of steam
Multicore Communication Today

Bus-based interconnect:
§ Single shared resource
§ Uniform communication cost
§ Communication through memory
§ Doesn’t scale to many cores due to contention and long wires
§ Scalable up to about 8 cores

[Figure: cores with private caches sharing a bus, an L2 cache, and DRAM.]
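The bus-scaling claim above can be illustrated with a toy contention model. This is a sketch, not from the slides: the per-core request rate `P_REQ` is an assumed value, and the model simply caps the shared bus at one serviced request per cycle.

```python
# Toy bus-contention model (illustrative): each of N cores issues a
# bus request with probability P_REQ per cycle; the single shared bus
# services at most one request per cycle.

P_REQ = 0.2  # assumed per-core request rate (hypothetical)

def bus_throughput(n_cores, p_req=P_REQ):
    """Aggregate requests served per cycle, capped by the single bus."""
    offered = n_cores * p_req          # total offered load
    return min(offered, 1.0)           # bus serves <= 1 request/cycle

def per_core_speedup(n_cores, p_req=P_REQ):
    """Useful work per core relative to an uncontended core."""
    return bus_throughput(n_cores, p_req) / (n_cores * p_req)

for n in (2, 4, 8, 16, 32):
    print(n, round(bus_throughput(n), 2), round(per_core_speedup(n), 2))
```

With these assumed numbers the bus saturates around 5–8 cores, after which each added core only dilutes per-core performance, matching the “scalable up to about 8 cores” point above.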
Multicore Communication Tomorrow

Point-to-point mesh network:
§ Neighboring tiles are connected
§ Distributed communication resources
§ Non-uniform costs: latency depends on distance
§ Encourages direct communication
§ More energy efficient than bus

Examples: MIT Raw, Tilera TILEPro64, Intel Terascale Prototype

[Figure: 4×4 grid of tiles, each with a processor, local memory, and a switch linked to its neighbors; DRAM attached at the edges.]
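The non-uniform latency of a mesh can be sketched in a few lines. The grid size and the one-cycle-per-hop cost are illustrative assumptions, not figures from the slides:

```python
# Hop-count latency on a 2D mesh with dimension-ordered (XY) routing:
# latency grows with Manhattan distance between tiles, unlike a bus's
# uniform cost.

HOP_CYCLES = 1  # assumed cost per switch-to-switch hop (hypothetical)

def mesh_latency(src, dst, hop_cycles=HOP_CYCLES):
    (sx, sy), (dx, dy) = src, dst
    hops = abs(sx - dx) + abs(sy - dy)   # Manhattan distance
    return hops * hop_cycles

# Neighboring tiles are cheap; opposite corners of an 8x8 grid are not.
print(mesh_latency((0, 0), (0, 1)))   # 1 hop
print(mesh_latency((0, 0), (7, 7)))   # 14 hops
```

This is why a mesh “encourages direct communication”: placing communicating tasks on adjacent tiles keeps the hop count, and hence latency and energy, low.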
ATAC Architecture

Electrical Mesh Interconnect

[Figure: 4×4 grid of processor + memory tiles connected by an electrical mesh, as in a conventional tiled multicore.]
Optical Broadcast Network

§ Electronic-photonic integration using standard CMOS process
§ Cores communicate via an optical WDM broadcast-and-select network
§ Each core sends on its own dedicated wavelength using modulators
§ Cores can receive from some set of senders using optical filters

[Figure: an optical waveguide loop connecting N cores.]
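The broadcast-and-select scheme above can be modeled as a tiny simulation. The class and method names are hypothetical; the point is only the structure: one dedicated wavelength per sender, and receivers that select senders by filtering wavelengths.

```python
# Sketch of a WDM broadcast-and-select network: each core modulates
# its own dedicated wavelength onto a shared waveguide; a receiver
# "selects" senders by tuning optical filters to their wavelengths.

class OpticalBroadcastNet:
    def __init__(self, n_cores):
        self.n_cores = n_cores
        # wavelength i is permanently assigned to sender core i
        self.waveguide = {}             # wavelength -> last word sent

    def send(self, core_id, word):
        self.waveguide[core_id] = word  # modulate the core's own wavelength

    def receive(self, filter_set):
        # optical filters drop only the selected wavelengths
        return {lam: w for lam, w in self.waveguide.items()
                if lam in filter_set}

net = OpticalBroadcastNet(64)
net.send(3, 0xCAFE)
net.send(7, 0xBEEF)
print(net.receive({3, 5}))   # only wavelength 3 carries data here
```

Because sending is a broadcast onto the waveguide, any number of receivers can listen to a sender without extra cost to the sender, which is what makes the later publish/subscribe protocol cheap.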
Optical bit transmission

[Figure: transmit path — flip-flop, modulator driver, and modulator place a bit onto the data waveguide; receive path — optical filter, photodetector, transimpedance amplifier, and flip-flop recover it.]

Tb/s is a good metric for broadcast networks – it reflects WDM system capabilities and performance.
Programming ATAC

§ Cores can communicate directly with any other core
§ Broadcast-based cache-update / remote-store protocol
§ All “subscribers” are notified when a writing core issues a store (“publish”)
§ Uniform communication latency simplifies scheduling
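The publish/subscribe store protocol above can be sketched as follows. This is an illustrative model, not ATAC’s actual implementation: the class and method names are invented, and the broadcast is simulated as a loop over subscribers.

```python
# Sketch of a broadcast cache-update ("publish/subscribe") protocol:
# a writing core publishes a store on the broadcast network, and every
# subscribed core's locally cached copy is updated in one step.

from collections import defaultdict

class BroadcastStore:
    def __init__(self):
        self.subscribers = defaultdict(set)   # address -> subscribed core ids
        self.caches = defaultdict(dict)       # core id -> {address: value}

    def subscribe(self, core_id, addr):
        self.subscribers[addr].add(core_id)

    def publish(self, writer_id, addr, value):
        # one broadcast reaches all subscribers' caches at once
        for core_id in self.subscribers[addr]:
            self.caches[core_id][addr] = value

bs = BroadcastStore()
bs.subscribe(1, 0x100)
bs.subscribe(2, 0x100)
bs.publish(0, 0x100, 42)
print(bs.caches[1][0x100], bs.caches[2][0x100])   # both cores see 42
```

On an electrical network each subscriber would cost a separate message; on ATAC’s optical broadcast the single “publish” reaches all of them, which is why the protocol maps naturally onto the hardware.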
Communication-centric Computing

§ ATAC reduces off-chip memory calls, and hence energy and latency

Operation              Energy    Latency
ALU add                2 pJ      1 cycle
32 KB cache read       50 pJ     1 cycle
Off-chip memory read   500 pJ    250 cycles

[Figure: bus-based multicore vs. ATAC; on-chip communication operations are marked at ~3 pJ each.]
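The table above can be turned into a small cost calculator to make the gap concrete. The numbers are the slide’s; the `cost` helper and the two example instruction sequences are illustrative.

```python
# Energy/latency arithmetic from the table: a load served on-chip vs.
# one that misses to off-chip DRAM differs by roughly 10x in energy
# and 100x in latency.

ENERGY_PJ      = {"alu_add": 2, "cache_read": 50, "offchip_read": 500}
LATENCY_CYCLES = {"alu_add": 1, "cache_read": 1,  "offchip_read": 250}

def cost(ops):
    """Total (energy in pJ, latency in cycles) for a sequence of ops."""
    pj = sum(ENERGY_PJ[o] for o in ops)
    cycles = sum(LATENCY_CYCLES[o] for o in ops)
    return pj, cycles

print(cost(["cache_read", "alu_add"]))     # (52, 2)
print(cost(["offchip_read", "alu_add"]))   # (502, 251)
```

Keeping data on-chip through core-to-core communication, rather than round-tripping it through DRAM, is exactly the saving ATAC targets.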
ATAC is an Efficient Network

• Modulators are the primary source of power consumption
  – Receive power: requires only ~2 fJ/bit even with -5 dB link loss
  – Modulator power: Ge-Si EA design ~75 fJ/bit (assumes 50 fJ/bit for the modulator driver)

• Example: 64-core communication
  (i.e. N = 64 cores = 64 wavelengths; for a 32-bit word: 2048 drops/core and 32 adds/core)
  – Receive power: 2 fJ/bit × 1 Gbit/s × 32 bits × N² = 262 mW
  – Modulator power: 75 fJ/bit × 1 Gbit/s × 32 bits × N = 153 mW
  – Total energy/bit = 75 fJ/bit + 2 fJ/bit × (N-1) = 201 fJ/bit

• Comparison: electrical broadcast across 64 cores
  – Requires 64 × 150 fJ/bit ≈ 10 pJ/bit (~50x more energy per bit)
  (assumes 150 fJ/mm/bit, 1-mm-spaced tiles)
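The power and energy figures above check out arithmetically; reproducing them takes only a few lines. All constants come from the slide; the variable names are mine.

```python
# Reproducing the 64-core power/energy arithmetic from the slide.
fJ, Gbps = 1e-15, 1e9     # unit conversions
N, WORD = 64, 32          # cores (= wavelengths) and word width

recv_power = 2 * fJ * Gbps * WORD * N**2   # every core drops every wavelength
mod_power  = 75 * fJ * Gbps * WORD * N     # one modulator per sender
energy_per_bit = 75 * fJ + 2 * fJ * (N - 1)  # modulate once, received N-1 times

electrical = N * 150 * fJ                  # electrical broadcast, 1 fJ-ish per mm-hop

print(round(recv_power, 3), round(mod_power, 3))   # 0.262 0.154 (W)
print(round(energy_per_bit / fJ))                  # 201 fJ/bit
print(round(electrical / energy_per_bit))          # ~48x advantage
```

Note the quadratic N² term sits on the cheap 2 fJ/bit receivers while the expensive 75 fJ/bit modulators scale only linearly in N, which is what makes the broadcast network affordable.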
Summary

ATAC uses optical networks to enable multicore programming and performance scaling.
What Does the Future Look Like?

§ Corollary of Moore’s law: number of cores will double every 18 months

[Figure: ATAC hierarchy — 64 optically connected clusters; within each cluster, electrical networks (ENet/BNet) connect 16 cores (each with processor, cache, and directory) to an optical hub on the ONet.]
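The clustered topology in the figure can be sketched as a routing function. The segment names (ENet, ONet, BNet) and the 64×16 sizing come from the slide, but the routing logic itself is a hypothetical illustration, not ATAC’s documented protocol.

```python
# Hypothetical routing over ATAC's two-level hierarchy: 64 clusters,
# each connecting 16 cores to an optical hub via local electrical nets.

CLUSTERS, CORES_PER_CLUSTER = 64, 16

def route(src_core, dst_core):
    """Network segments a message crosses (illustrative model)."""
    src_cluster = src_core // CORES_PER_CLUSTER
    dst_cluster = dst_core // CORES_PER_CLUSTER
    if src_cluster == dst_cluster:
        return ["ENet"]                  # stays on the local electrical net
    return ["ENet", "ONet", "BNet"]      # up to hub, optical hop, back down

print(CLUSTERS * CORES_PER_CLUSTER)   # 1024 cores total
print(route(0, 5))                    # intra-cluster
print(route(0, 1000))                 # crosses the optical network
```

The hierarchy keeps the expensive optical hub count fixed at 64 while the cheap electrical networks absorb intra-cluster traffic, so total core count can keep doubling without redesigning the optical layer.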