COMMUNICATION
Presented by Aman Chitransh
“Moore’s Gap”

[Figure: performance (GOPS) vs. time, 1992–2010. Transistor counts keep climbing while single-core performance flattens, opening a widening “gap”. Techniques along the curve: pipelining, superscalar, OOO, SMT/FGMT/CGMT, multicore, tiled multicore.]

§ Diminishing returns from single-CPU mechanisms (pipelining, caching, etc.)
§ Wire delays
§ Power envelopes
Multicore Scaling Trends

Today:
§ A few large cores on each chip
§ Only option for future scaling is to add more cores

Tomorrow [S. Borkar, Intel, 2007]:
§ 100’s to 1000’s of simpler cores
§ Simple cores are more power- and area-efficient
§ Global structures do not scale; all resources must be distributed

[Figure: a bus-based multicore with shared L2 cache evolving into a tiled array of processor + memory tiles connected by switches.]
The Future of Multicore

§ Number of cores doubles every 18 months
§ Parallelism replaces clock-frequency scaling and core complexity

Resulting challenges: scalability, programming, power.

Examples: IBM XCell 8i, MIT RAW, Sun UltraSPARC T2, Tilera TILE64
Multicore Challenges

Scalability
§ How do we turn additional cores into additional performance?
§ Must accelerate single apps, not just run more apps in parallel
§ Efficient core-to-core communication is crucial
§ Architectures must grow easily with each new technology generation

Programming
§ Traditional parallel programming techniques are hard
§ Parallel machines were rare and used only by rocket scientists
§ Multicores are ubiquitous and must be programmable by anyone

Power
§ Already a first-order design constraint
§ More cores and more communication → more power
§ Previous tricks (e.g. lowering Vdd) are running out of steam
Multicore Communication Today

Bus-based interconnect:
§ Single shared resource
§ Uniform communication cost
§ Communication through memory
§ Doesn’t scale to many cores due to contention and long wires
§ Scalable up to about 8 cores

[Figure: cores with private caches sharing a bus, an L2 cache, and DRAM.]
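The bus-scaling claim above can be illustrated with a toy contention model. This is a sketch, not from the slides: the per-core request rate `P_REQ` is an assumed value, and the model simply caps the shared bus at one serviced request per cycle.

```python
# Toy bus-contention model (illustrative): each of N cores issues a
# bus request with probability P_REQ per cycle; the single shared bus
# services at most one request per cycle.

P_REQ = 0.2  # assumed per-core request rate (hypothetical)

def bus_throughput(n_cores, p_req=P_REQ):
    """Aggregate requests served per cycle, capped by the single bus."""
    offered = n_cores * p_req          # total offered load
    return min(offered, 1.0)           # bus serves <= 1 request/cycle

def per_core_speedup(n_cores, p_req=P_REQ):
    """Useful work per core relative to an uncontended core."""
    return bus_throughput(n_cores, p_req) / (n_cores * p_req)

for n in (2, 4, 8, 16, 32):
    print(n, round(bus_throughput(n), 2), round(per_core_speedup(n), 2))
```

With these assumed numbers the bus saturates around 5–8 cores, after which each added core only dilutes per-core performance, matching the “scalable up to about 8 cores” point above.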
Multicore Communication Tomorrow

Point-to-point mesh network:
§ Neighboring tiles are connected
§ Distributed communication resources
§ Non-uniform costs: latency depends on distance
§ Encourages direct communication
§ More energy efficient than bus

Examples: MIT Raw, Tilera TILEPro64, Intel Terascale Prototype

[Figure: 4×4 grid of tiles, each with a processor, local memory, and a switch linked to its neighbors; DRAM attached at the edges.]
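The non-uniform latency of a mesh can be sketched in a few lines. The grid size and the one-cycle-per-hop cost are illustrative assumptions, not figures from the slides:

```python
# Hop-count latency on a 2D mesh with dimension-ordered (XY) routing:
# latency grows with Manhattan distance between tiles, unlike a bus's
# uniform cost.

HOP_CYCLES = 1  # assumed cost per switch-to-switch hop (hypothetical)

def mesh_latency(src, dst, hop_cycles=HOP_CYCLES):
    (sx, sy), (dx, dy) = src, dst
    hops = abs(sx - dx) + abs(sy - dy)   # Manhattan distance
    return hops * hop_cycles

# Neighboring tiles are cheap; opposite corners of an 8x8 grid are not.
print(mesh_latency((0, 0), (0, 1)))   # 1 hop
print(mesh_latency((0, 0), (7, 7)))   # 14 hops
```

This is why a mesh “encourages direct communication”: placing communicating tasks on adjacent tiles keeps the hop count, and hence latency and energy, low.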
ATAC Architecture

Electrical Mesh Interconnect

[Figure: 4×4 grid of processor + memory tiles connected by an electrical mesh, as in a conventional tiled multicore.]
Optical Broadcast Network

§ Electronic-photonic integration using standard CMOS process
§ Cores communicate via an optical WDM broadcast-and-select network
§ Each core sends on its own dedicated wavelength using modulators
§ Cores can receive from some set of senders using optical filters

[Figure: an optical waveguide loop connecting N cores.]
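The broadcast-and-select scheme above can be modeled as a tiny simulation. The class and method names are hypothetical; the point is only the structure: one dedicated wavelength per sender, and receivers that select senders by filtering wavelengths.

```python
# Sketch of a WDM broadcast-and-select network: each core modulates
# its own dedicated wavelength onto a shared waveguide; a receiver
# "selects" senders by tuning optical filters to their wavelengths.

class OpticalBroadcastNet:
    def __init__(self, n_cores):
        self.n_cores = n_cores
        # wavelength i is permanently assigned to sender core i
        self.waveguide = {}             # wavelength -> last word sent

    def send(self, core_id, word):
        self.waveguide[core_id] = word  # modulate the core's own wavelength

    def receive(self, filter_set):
        # optical filters drop only the selected wavelengths
        return {lam: w for lam, w in self.waveguide.items()
                if lam in filter_set}

net = OpticalBroadcastNet(64)
net.send(3, 0xCAFE)
net.send(7, 0xBEEF)
print(net.receive({3, 5}))   # only wavelength 3 carries data here
```

Because sending is a broadcast onto the waveguide, any number of receivers can listen to a sender without extra cost to the sender, which is what makes the later publish/subscribe protocol cheap.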
Optical bit transmission

[Figure: transmit path — flip-flop, modulator driver, and modulator place a bit onto the data waveguide; receive path — optical filter, photodetector, transimpedance amplifier, and flip-flop recover it.]

Tb/s is a good metric for broadcast networks – it reflects WDM system capabilities and performance.
Programming ATAC

§ Cores can communicate directly with any other core
§ Broadcast-based cache-update / remote-store protocol
§ All “subscribers” are notified when a writing core issues a store (“publish”)
§ Uniform communication latency simplifies scheduling
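The publish/subscribe store protocol above can be sketched as follows. This is an illustrative model, not ATAC’s actual implementation: the class and method names are invented, and the broadcast is simulated as a loop over subscribers.

```python
# Sketch of a broadcast cache-update ("publish/subscribe") protocol:
# a writing core publishes a store on the broadcast network, and every
# subscribed core's locally cached copy is updated in one step.

from collections import defaultdict

class BroadcastStore:
    def __init__(self):
        self.subscribers = defaultdict(set)   # address -> subscribed core ids
        self.caches = defaultdict(dict)       # core id -> {address: value}

    def subscribe(self, core_id, addr):
        self.subscribers[addr].add(core_id)

    def publish(self, writer_id, addr, value):
        # one broadcast reaches all subscribers' caches at once
        for core_id in self.subscribers[addr]:
            self.caches[core_id][addr] = value

bs = BroadcastStore()
bs.subscribe(1, 0x100)
bs.subscribe(2, 0x100)
bs.publish(0, 0x100, 42)
print(bs.caches[1][0x100], bs.caches[2][0x100])   # both cores see 42
```

On an electrical network each subscriber would cost a separate message; on ATAC’s optical broadcast the single “publish” reaches all of them, which is why the protocol maps naturally onto the hardware.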
Communication-centric Computing

§ ATAC reduces off-chip memory calls, and hence energy and latency

Operation              Energy    Latency
ALU add                2 pJ      1 cycle
32 KB cache read       50 pJ     1 cycle
Off-chip memory read   500 pJ    250 cycles

[Figure: bus-based multicore vs. ATAC; on-chip communication operations are marked at ~3 pJ each.]
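The table above can be turned into a small cost calculator to make the gap concrete. The numbers are the slide’s; the `cost` helper and the two example instruction sequences are illustrative.

```python
# Energy/latency arithmetic from the table: a load served on-chip vs.
# one that misses to off-chip DRAM differs by roughly 10x in energy
# and 100x in latency.

ENERGY_PJ      = {"alu_add": 2, "cache_read": 50, "offchip_read": 500}
LATENCY_CYCLES = {"alu_add": 1, "cache_read": 1,  "offchip_read": 250}

def cost(ops):
    """Total (energy in pJ, latency in cycles) for a sequence of ops."""
    pj = sum(ENERGY_PJ[o] for o in ops)
    cycles = sum(LATENCY_CYCLES[o] for o in ops)
    return pj, cycles

print(cost(["cache_read", "alu_add"]))     # (52, 2)
print(cost(["offchip_read", "alu_add"]))   # (502, 251)
```

Keeping data on-chip through core-to-core communication, rather than round-tripping it through DRAM, is exactly the saving ATAC targets.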
ATAC is an Efficient Network

• Modulators are the primary source of power consumption
  – Receive power: requires only ~2 fJ/bit even with -5 dB link loss
  – Modulator power: Ge-Si EA design ~75 fJ/bit (assumes 50 fJ/bit for the modulator driver)

• Example: 64-core communication
  (i.e. N = 64 cores = 64 wavelengths; for a 32-bit word: 2048 drops/core and 32 adds/core)
  – Receive power: 2 fJ/bit × 1 Gbit/s × 32 bits × N² = 262 mW
  – Modulator power: 75 fJ/bit × 1 Gbit/s × 32 bits × N = 153 mW
  – Total energy/bit = 75 fJ/bit + 2 fJ/bit × (N-1) = 201 fJ/bit

• Comparison: electrical broadcast across 64 cores
  – Requires 64 × 150 fJ/bit ≈ 10 pJ/bit (~50x more energy per bit)
  (assumes 150 fJ/mm/bit, 1-mm-spaced tiles)
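The power and energy figures above check out arithmetically; reproducing them takes only a few lines. All constants come from the slide; the variable names are mine.

```python
# Reproducing the 64-core power/energy arithmetic from the slide.
fJ, Gbps = 1e-15, 1e9     # unit conversions
N, WORD = 64, 32          # cores (= wavelengths) and word width

recv_power = 2 * fJ * Gbps * WORD * N**2   # every core drops every wavelength
mod_power  = 75 * fJ * Gbps * WORD * N     # one modulator per sender
energy_per_bit = 75 * fJ + 2 * fJ * (N - 1)  # modulate once, received N-1 times

electrical = N * 150 * fJ                  # electrical broadcast, 1 fJ-ish per mm-hop

print(round(recv_power, 3), round(mod_power, 3))   # 0.262 0.154 (W)
print(round(energy_per_bit / fJ))                  # 201 fJ/bit
print(round(electrical / energy_per_bit))          # ~48x advantage
```

Note the quadratic N² term sits on the cheap 2 fJ/bit receivers while the expensive 75 fJ/bit modulators scale only linearly in N, which is what makes the broadcast network affordable.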
Summary

ATAC uses optical networks to enable multicore programming and performance scaling.
What Does the Future Look Like?

§ Corollary of Moore’s law: number of cores will double every 18 months

[Figure: ATAC hierarchy — 64 optically connected clusters; within each cluster, electrical networks (ENet/BNet) connect 16 cores (each with processor, cache, and directory) to an optical hub on the ONet.]
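The clustered topology in the figure can be sketched as a routing function. The segment names (ENet, ONet, BNet) and the 64×16 sizing come from the slide, but the routing logic itself is a hypothetical illustration, not ATAC’s documented protocol.

```python
# Hypothetical routing over ATAC's two-level hierarchy: 64 clusters,
# each connecting 16 cores to an optical hub via local electrical nets.

CLUSTERS, CORES_PER_CLUSTER = 64, 16

def route(src_core, dst_core):
    """Network segments a message crosses (illustrative model)."""
    src_cluster = src_core // CORES_PER_CLUSTER
    dst_cluster = dst_core // CORES_PER_CLUSTER
    if src_cluster == dst_cluster:
        return ["ENet"]                  # stays on the local electrical net
    return ["ENet", "ONet", "BNet"]      # up to hub, optical hop, back down

print(CLUSTERS * CORES_PER_CLUSTER)   # 1024 cores total
print(route(0, 5))                    # intra-cluster
print(route(0, 1000))                 # crosses the optical network
```

The hierarchy keeps the expensive optical hub count fixed at 64 while the cheap electrical networks absorb intra-cluster traffic, so total core count can keep doubling without redesigning the optical layer.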