
NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE

To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture. Its scalable parallel array of processors is massively multithreaded and programmable in C or via graphics APIs.

Erik Lindholm
John Nickolls
Stuart Oberman
John Montrym
NVIDIA

The modern 3D graphics processing unit (GPU) has evolved from a fixed-function graphics pipeline to a programmable parallel processor with computing power exceeding that of multicore CPUs. Traditional graphics pipelines consist of separate programmable stages of vertex processors executing vertex shader programs and pixel-fragment processors executing pixel shader programs. (Montrym and Moreton provide additional background on the traditional graphics processor architecture.1)

NVIDIA's Tesla architecture, introduced in November 2006 in the GeForce 8800 GPU, unifies the vertex and pixel processors and extends them, enabling high-performance parallel computing applications written in the C language using the Compute Unified Device Architecture (CUDA)2-4 parallel programming model and development tools. The Tesla unified graphics and computing architecture is available in a scalable family of GeForce 8-series GPUs and Quadro GPUs for laptops, desktops, workstations, and servers. It also provides the processing architecture for the Tesla GPU computing platforms introduced in 2007 for high-performance computing.

In this article, we discuss the requirements that drove the unified graphics and parallel computing processor architecture, describe the Tesla architecture, and show how it is enabling widespread deployment of parallel computing and graphics applications.

The road to unification
The first GPU was the GeForce 256, introduced in 1999. It contained a fixed-function 32-bit floating-point vertex transform and lighting processor and a fixed-function integer pixel-fragment pipeline, which were programmed with OpenGL and the Microsoft DX7 API.5 In 2001, the GeForce 3 introduced the first programmable vertex processor executing vertex shaders, along with a configurable 32-bit floating-point fragment pipeline, programmed with DX8 and OpenGL.5,6 The Radeon 9700, introduced in 2002, featured a programmable 24-bit floating-point pixel-fragment processor programmed with DX9 and OpenGL.7,8 The GeForce FX added 32-bit floating-point pixel-fragment processors. The Xbox 360 introduced an early unified GPU in 2005, allowing vertices and pixels to execute on the same processor.9
Vertex processors operate on the vertices of primitives such as points, lines, and triangles. Typical operations include transforming coordinates into screen space, which are then fed to the setup unit and the rasterizer, and setting up lighting and texture parameters to be used by the pixel-fragment processors. Pixel-fragment processors operate on rasterizer output, which fills the interior of primitives, along with the interpolated parameters.

Vertex and pixel-fragment processors have evolved at different rates: Vertex processors were designed for low-latency, high-precision math operations, whereas pixel-fragment processors were optimized for high-latency, lower-precision texture filtering. Vertex processors have traditionally supported more-complex processing, so they became programmable first. For the last six years, the two processor types have been functionally converging as the result of a need for greater programming generality. However, the increased generality also increased the design complexity, area, and cost of developing two separate processors.

Because GPUs typically must process more pixels than vertices, pixel-fragment processors traditionally outnumber vertex processors by about three to one. However, typical workloads are not well balanced, leading to inefficiency. For example, with large triangles, the vertex processors are mostly idle, while the pixel processors are fully busy. With small triangles, the opposite is true. The addition of more-complex primitive processing in DX10 makes it much harder to select a fixed processor ratio.10 All these factors influenced the decision to design a unified architecture.

A primary design objective for Tesla was to execute vertex and pixel-fragment shader programs on the same unified processor architecture. Unification would enable dynamic load balancing of varying vertex- and pixel-processing workloads and permit the introduction of new graphics shader stages, such as geometry shaders in DX10. It also let a single team focus on designing a fast and efficient processor and allowed the sharing of expensive hardware such as the texture units. The generality required of a unified processor opened the door to a completely new GPU parallel-computing capability. The downside of this generality was the difficulty of efficient load balancing between different shader types.

Other critical hardware design requirements were architectural scalability, performance, power, and area efficiency.

The Tesla architects developed the graphics feature set in coordination with the development of the Microsoft Direct3D DirectX 10 graphics API.10 They developed the GPU's computing feature set in coordination with the development of the CUDA C parallel programming language, compiler, and development tools.

Tesla architecture
The Tesla architecture is based on a scalable processor array. Figure 1 shows a block diagram of a GeForce 8800 GPU with 128 streaming-processor (SP) cores organized as 16 streaming multiprocessors (SMs) in eight independent processing units called texture/processor clusters (TPCs). Work flows from top to bottom, starting at the host interface with the system PCI-Express bus. Because of its unified-processor design, the physical Tesla architecture doesn't resemble the logical order of graphics pipeline stages. However, we will use the logical graphics pipeline flow to explain the architecture.

Figure 1. Tesla unified graphics and computing GPU architecture. TPC: texture/processor cluster; SM: streaming multiprocessor; SP: streaming processor; Tex: texture; ROP: raster operation processor.

At the highest level, the GPU's scalable streaming processor array (SPA) performs all the GPU's programmable calculations. The scalable memory system consists of external DRAM control and fixed-function raster operation processors (ROPs) that perform color and depth frame buffer operations directly on memory. An interconnection network carries computed pixel-fragment colors and depth values from the SPA to the ROPs. The network also routes texture memory read requests from the SPA to DRAM and read data from DRAM through a level-2 cache back to the SPA.

The remaining blocks in Figure 1 deliver input work to the SPA. The input assembler collects vertex work as directed by the input command stream.
The vertex work distribution block distributes vertex work packets to the various TPCs in the SPA. The TPCs execute vertex shader programs and (if enabled) geometry shader programs. The resulting output data is written to on-chip buffers. These buffers then pass their results to the viewport/clip/setup/raster/zcull block to be rasterized into pixel fragments. The pixel work distribution unit distributes pixel fragments to the appropriate TPCs for pixel-fragment processing. Shaded pixel fragments are sent across the interconnection network for processing by depth and color ROP units. The compute work distribution block dispatches compute thread arrays to the TPCs. The SPA accepts and processes work for multiple logical streams simultaneously. Multiple clock domains for GPU units, processors, DRAM, and other units allow independent power and performance optimizations.

Command processing
The GPU host interface unit communicates with the host CPU, responds to commands from the CPU, fetches data from system memory, checks command consistency, and performs context switching.

The input assembler collects geometric primitives (points, lines, triangles, line strips, and triangle strips) and fetches associated vertex input attribute data. It has peak rates of one primitive per clock and eight scalar attributes per clock at the GPU core clock, which is typically 600 MHz.

The work distribution units forward the input assembler's output stream to the array of processors, which execute vertex, geometry, and pixel shader programs, as well as computing programs. The vertex and compute work distribution units deliver work to processors in a round-robin scheme.
Pixel work distribution is based on the pixel location.

Streaming processor array
The SPA executes graphics shader thread programs and GPU computing programs and provides thread control and management. Each TPC in the SPA roughly corresponds to a quad-pixel unit in previous architectures.1 The number of TPCs determines a GPU's programmable processing performance and scales from one TPC in a small GPU to eight or more TPCs in high-performance GPUs.

Figure 2. Texture/processor cluster (TPC).

Texture/processor cluster
As Figure 2 shows, each TPC contains a geometry controller, an SM controller (SMC), two streaming multiprocessors (SMs), and a texture unit. Figure 3 expands each SM to show its eight SP cores. To balance the expected ratio of math operations to texture operations, one texture unit serves two SMs. This architectural ratio can vary as needed.

Geometry controller
The geometry controller maps the logical graphics vertex pipeline into recirculation on the physical SMs by directing all primitive and vertex attribute and topology flow in the TPC. It manages dedicated on-chip input and output vertex attribute storage and forwards contents as required.

DX10 has two stages dealing with vertex and primitive processing: the vertex shader and the geometry shader. The vertex shader processes one vertex's attributes independently of other vertices. Typical operations are position space transforms and color and texture coordinate generation. The geometry shader follows the vertex shader and deals with a whole primitive and its vertices. Typical operations are edge extrusion for
stencil shadow generation and cube map texture generation. Geometry shader output primitives go to later stages for clipping, viewport transformation, and rasterization into pixel fragments.

Figure 3. Streaming multiprocessor (SM).

Streaming multiprocessor
The SM is a unified graphics and computing multiprocessor that executes vertex, geometry, and pixel-fragment shader programs and parallel computing programs. As Figure 3 shows, the SM consists of eight streaming processor (SP) cores, two special-function units (SFUs), a multithreaded instruction fetch and issue unit (MT Issue), an instruction cache, a read-only constant cache, and a 16-Kbyte read/write shared memory.

The shared memory holds graphics input buffers or shared data for parallel computing. To pipeline graphics workloads through the SM, vertex, geometry, and pixel threads have independent input and output buffers. Workloads can arrive and depart independently of thread execution. Geometry threads, which generate variable amounts of output per thread, use separate output buffers.

Each SP core contains a scalar multiply-add (MAD) unit, giving the SM eight MAD units. The SM uses its two SFU units for transcendental functions and attribute interpolation—the interpolation of pixel attributes from vertex attributes defining a primitive. Each SFU also contains four floating-point multipliers. The SM uses the TPC texture unit as a third execution unit and uses the SMC and ROP units to implement external memory load, store, and atomic accesses. A low-latency interconnect network between the SPs and the shared-memory banks provides shared-memory access.

The GeForce 8800 Ultra clocks the SPs and SFU units at 1.5 GHz, for a peak of 36 Gflops per SM. To optimize power and area efficiency, some SM non-data-path units operate at half the SP clock rate.

SM multithreading. A graphics vertex or pixel shader is a program for a single thread that describes how to process a vertex or a pixel. Similarly, a CUDA kernel is a C program for a single thread that describes how one thread computes a result. Graphics and computing applications instantiate many parallel threads to render complex images and compute large result arrays. To dynamically balance shifting vertex and pixel shader thread workloads, the unified SM concurrently executes different thread programs and different types of shader programs.

To efficiently execute hundreds of threads in parallel while running several different programs, the SM is hardware multithreaded. It manages and executes up to 768 concurrent threads in hardware with zero scheduling overhead.

To support the independent vertex, primitive, pixel, and thread programming model of graphics shading languages and the CUDA C/C++ language, each SM thread has its own thread execution state and can execute an independent code path. Concurrent threads of computing programs can synchronize at a barrier with a single SM instruction. Lightweight thread creation, zero-overhead thread scheduling, and fast barrier synchronization support very fine-grained parallelism efficiently.

Single-instruction, multiple-thread. To manage and execute hundreds of threads running
several different programs efficiently, the Tesla SM uses a new processor architecture we call single-instruction, multiple-thread (SIMT). The SM's SIMT multithreaded instruction unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. The term warp originates from weaving, the first parallel-thread technology. Figure 4 illustrates SIMT scheduling. The SIMT warp size of 32 parallel threads provides efficiency on plentiful fine-grained pixel threads and computing threads.

Figure 4. Single-instruction, multiple-thread (SIMT) warp scheduling.

Each SM manages a pool of 24 warps, with a total of 768 threads. Individual threads composing a SIMT warp are of the same type and start together at the same program address, but they are otherwise free to branch and execute independently. At each instruction issue time, the SIMT multithreaded instruction unit selects a warp that is ready to execute and issues the next instruction to that warp's active threads. A SIMT instruction is broadcast synchronously to a warp's active parallel threads; individual threads can be inactive due to independent branching or predication.

The SM maps the warp threads to the SP cores, and each thread executes independently with its own instruction address and register state. A SIMT processor realizes full efficiency and performance when all 32 threads of a warp take the same execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads reconverge to the original execution path. The SM uses a branch synchronization stack to manage independent threads that diverge and converge. Branch divergence only occurs within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths. As a result, Tesla architecture GPUs are dramatically more efficient and flexible on branching code than previous generation GPUs, as their 32-thread warps are much narrower than the SIMD width of prior GPUs.1
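To make this behavior concrete, here is a minimal CUDA C sketch (not from the article; the kernel, data, and threshold are illustrative) of the kind of data-dependent branch described above.

// Sketch of a data-dependent branch that can cause SIMT divergence. If some
// threads of a 32-thread warp take the "if" path and others take the "else"
// path, the warp executes the two paths serially, disabling the threads not
// on the current path, and reconverges afterward.
__global__ void clampOrScale(float *data, float threshold, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n) {
        if (data[i] > threshold)
            data[i] = threshold;           // path A: taken by some threads
        else
            data[i] = data[i] * 2.0f;      // path B: taken by the others
    }
}

Warps whose 32 threads all evaluate the condition the same way execute only one of the two paths, which is why code that rarely diverges within a warp runs at full SIMT efficiency.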
SIMT architecture is similar to single-instruction, multiple-data (SIMD) design, which applies one instruction to multiple data lanes. The difference is that SIMT applies one instruction to multiple independent threads in parallel, not just multiple data lanes. A SIMD instruction controls a vector of multiple data lanes together and exposes the vector width to the software, whereas a SIMT instruction controls the execution and branching behavior of one thread.

In contrast to SIMD vector architectures, SIMT enables programmers to write thread-level parallel code for independent threads as well as data-parallel code for coordinated threads. For program correctness, programmers can essentially ignore SIMT execution attributes such as warps; however, they can achieve substantial performance improvements by writing code that seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in
traditional codes: Programmers can safely ignore cache line size when designing for correctness but must consider it in the code structure when designing for peak performance. SIMD vector architectures, on the other hand, require the software to manually coalesce loads into vectors and to manually manage divergence.

SIMT warp scheduling. The SIMT approach of scheduling independent warps is simpler than previous GPU architectures' complex scheduling. A warp consists of up to 32 threads of the same type—vertex, geometry, pixel, or compute. The basic unit of pixel-fragment shader processing is the 2 × 2 pixel quad. The SM controller groups eight pixel quads into a warp of 32 threads. It similarly groups vertices and primitives into warps and packs 32 computing threads into a warp. The SIMT design shares the SM instruction fetch and issue unit efficiently across 32 threads but requires a full warp of active threads for full performance efficiency.

As a unified graphics processor, the SM schedules and executes multiple warp types concurrently—for example, concurrently executing vertex and pixel warps. The SM warp scheduler operates at half the 1.5-GHz processor clock rate. At each cycle, it selects one of the 24 warps to execute a SIMT warp instruction, as Figure 4 shows. An issued warp instruction executes as two sets of 16 threads over four processor cycles. The SP cores and SFU units execute instructions independently, and by issuing instructions between them on alternate cycles, the scheduler can keep both fully occupied.

Implementing zero-overhead warp scheduling for a dynamic mix of different warp programs and program types was a challenging design problem. A scoreboard qualifies each warp for issue each cycle. The instruction scheduler prioritizes all ready warps and selects the one with highest priority for issue. Prioritization considers warp type, instruction type, and "fairness" to all warps executing in the SM.

SM instructions. The Tesla SM executes scalar instructions, unlike previous GPU vector instruction architectures. Shader programs are becoming longer and more scalar, and it is increasingly difficult to fully occupy even two components of the prior four-component vector architecture. Previous architectures employed vector packing—combining sub-vectors of work to gain efficiency—but that complicated the scheduling hardware as well as the compiler. Scalar instructions are simpler and compiler friendly. Texture instructions remain vector based, taking a source coordinate vector and returning a filtered color vector.

High-level graphics and computing-language compilers generate intermediate instructions, such as DX10 vector or PTX scalar instructions,10,2 which are then optimized and translated to binary GPU instructions. The optimizer readily expands DX10 vector instructions to multiple Tesla SM scalar instructions. PTX scalar instructions optimize to Tesla SM scalar instructions about one to one. PTX provides a stable target ISA for compilers and provides compatibility over several generations of GPUs with evolving binary instruction set architectures. Because the intermediate languages use virtual registers, the optimizer analyzes data dependencies and allocates real registers. It eliminates dead code, folds instructions together when feasible, and optimizes SIMT branch divergence and convergence points.

Instruction set architecture. The Tesla SM has a register-based instruction set including floating-point, integer, bit, conversion, transcendental, flow control, memory load/store, and texture operations.

Floating-point and integer operations include add, multiply, multiply-add, minimum, maximum, compare, set predicate, and conversions between integer and floating-point numbers. Floating-point instructions provide source operand modifiers for negation and absolute value. Transcendental function instructions include cosine, sine, binary exponential, binary logarithm, reciprocal, and reciprocal square root. Attribute interpolation instructions provide efficient generation of pixel attributes. Bitwise operators include shift left, shift right, logic operators, and move. Control
flow includes branch, call, return, trap, and barrier synchronization.

The floating-point and integer instructions can also set per-thread status flags for zero, negative, carry, and overflow, which the thread program can use for conditional branching.

Memory access instructions. The texture instruction fetches and filters texture samples from memory via the texture unit. The ROP unit writes pixel-fragment output to memory.

To support computing and C/C++ language needs, the Tesla SM implements memory load/store instructions in addition to graphics texture fetch and pixel output. Memory load/store instructions use integer byte addressing with register-plus-offset address arithmetic to facilitate conventional compiler code optimizations.

For computing, the load/store instructions access three read/write memory spaces:

- local memory for per-thread, private, temporary data (implemented in external DRAM);
- shared memory for low-latency access to data shared by cooperating threads in the same SM; and
- global memory for data shared by all threads of a computing application (implemented in external DRAM).

The memory instructions load-global, store-global, load-shared, store-shared, load-local, and store-local access global, shared, and local memory. Computing programs use the fast barrier synchronization instruction to synchronize threads within the SM that communicate with each other via shared and global memory.

To improve memory bandwidth and reduce overhead, the local and global load/store instructions coalesce individual parallel thread accesses from the same warp into fewer memory block accesses. The addresses must fall in the same block and meet alignment criteria. Coalescing memory requests boosts performance significantly over separate requests. The large thread count, together with support for many outstanding load requests, helps cover load-to-use latency for local and global memory implemented in external DRAM.
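As an illustration of the coalescing rule, the following CUDA C sketch contrasts a warp-friendly access pattern with a strided one. The kernels and array names are hypothetical, and the comments only paraphrase the same-block and alignment criteria stated above.

// Adjacent threads read adjacent 32-bit words, so a warp's loads can fall in
// the same aligned memory block and be coalesced into a few block accesses.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                  // thread k of a warp touches word k
}

// Adjacent threads read words that are 'stride' elements apart, so the warp's
// loads land in many different blocks and cannot be combined.
// (Assumes in[] holds at least n * stride elements.)
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride];         // scattered addresses defeat coalescing
}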
The latest Tesla architecture GPUs provide efficient atomic memory operations, including integer add, minimum, maximum, logic operators, swap, and compare-and-swap operations. Atomic operations facilitate parallel reductions and parallel data structure management.
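A hedged sketch of one common use of these atomics follows: a histogram whose bins are updated with atomicAdd. The kernel is illustrative only and assumes a Tesla-family GPU that supports global-memory atomic operations.

// Build a 256-bin global histogram with atomic adds. Many threads may hit the
// same bin concurrently; atomicAdd serializes only the colliding updates, so
// this parallel reduction needs no locks and no inter-CTA synchronization.
__global__ void histogram256(const unsigned char *input, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[input[i]], 1u);  // read-modify-write in global memory
}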
Streaming processor. The SP core is the primary thread processor in the SM. It performs the fundamental floating-point operations, including add, multiply, and multiply-add. It also implements a wide variety of integer, comparison, and conversion operations. The floating-point add and multiply operations are compatible with the IEEE 754 standard for single-precision FP numbers, including not-a-number (NaN) and infinity values. The unit is fully pipelined, and latency is optimized to balance delay and area.

The add and multiply operations use IEEE round-to-nearest-even as the default rounding mode. The multiply-add operation performs a multiplication with truncation, followed by an add with round-to-nearest-even. The SP flushes denormal source operands to sign-preserved zero and flushes results that underflow the target output exponent range to sign-preserved zero after rounding.

Special-function unit. The SFU supports computation of both transcendental functions and planar attribute interpolation.11 A traditional vertex or pixel shader design contains a functional unit to compute transcendental functions. Pixels also need an attribute-interpolating unit to compute the per-pixel attribute values at the pixel's x, y location, given the attribute values at the primitive's vertices.

For functional evaluation, we use quadratic interpolation based on enhanced minimax approximations to approximate the reciprocal, reciprocal square root, log2 x, 2^x, and sin/cos functions. Table 1 shows the accuracy of the function estimates. The SFU unit generates one 32-bit floating-point result per cycle.
Table 1. Function approximation statistics.

Function     Input interval   Accuracy (good bits)   ULP* error   % exactly rounded   Monotonic
1/x          [1, 2)           24.02                  0.98         87                  Yes
1/sqrt(x)    [1, 4)           23.40                  1.52         78                  Yes
2^x          [0, 1)           22.51                  1.41         74                  Yes
log2 x       [1, 2)           22.57                  N/A**        N/A                 Yes
sin/cos      [0, π/2)         22.47                  N/A          N/A                 No

* ULP: unit-in-the-last-place.
** N/A: not applicable.
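To convey the flavor of this evaluation scheme (reference 11 describes the actual SFU design), here is a hedged C sketch of segment-wise quadratic interpolation for 1/x on [1, 2). The segment count, coefficient source, and floating-point arithmetic are illustrative assumptions; the hardware operates on mantissa bits with minimax-optimized fixed-point coefficients.

// Hypothetical per-segment coefficients for f(x) = 1/x on [1, 2); a real table
// would come from an enhanced minimax fit over each segment.
typedef struct { float c0, c1, c2; } SegCoeffs;

// Evaluate f(x) ~= c0 + c1*xl + c2*xl*xl, where the leading fraction bits of x
// select a segment and xl is the offset of x within that segment.
static float quad_interp(float x, const SegCoeffs *table, int num_segments)
{
    float frac = x - 1.0f;                           /* x in [1, 2) -> [0, 1)   */
    int   seg  = (int)(frac * num_segments);         /* leading bits pick entry */
    float xl   = frac - (float)seg / num_segments;   /* remaining fraction bits */
    const SegCoeffs *c = &table[seg];
    return c->c0 + c->c1 * xl + c->c2 * xl * xl;     /* quadratic interpolation */
}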

The SFU also supports attribute interpolation, to enable accurate interpolation of attributes such as color, depth, and texture coordinates. The SFU must interpolate these attributes in the (x, y) screen space to determine the values of the attributes at each pixel location. We express the value of a given attribute U in an (x, y) plane in plane equations of the following form:

U(x, y) = (A_U * x + B_U * y + C_U) / (A_W * x + B_W * y + C_W)

where A, B, and C are interpolation parameters associated with each attribute U, and W is related to the distance of the pixel from the viewer for perspective projection. The attribute interpolation hardware in the SFU is fully pipelined, and it can interpolate four samples per cycle.

In a shader program, the SFU can generate perspective-corrected attributes as follows (see the sketch after this list):

- Interpolate 1/W, and invert to form W.
- Interpolate U/W.
- Multiply U/W by W to form perspective-correct U.
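The following C sketch traces those three steps for one pixel using the plane-equation form given above; the struct layout and function names are illustrative rather than the SFU's actual interface.

// Plane-equation coefficients for one screen-space-linear quantity:
// value(x, y) = A*x + B*y + C.
typedef struct { float A, B, C; } Plane;

static float eval_plane(Plane p, float x, float y)
{
    return p.A * x + p.B * y + p.C;
}

// Perspective-correct interpolation of attribute U at pixel (x, y):
// 1/W and U/W are linear in screen space, so each is a plane equation.
static float interp_perspective(Plane inv_w_plane, Plane u_over_w_plane,
                                float x, float y)
{
    float inv_w    = eval_plane(inv_w_plane, x, y);    /* step 1: interpolate 1/W */
    float w        = 1.0f / inv_w;                     /* ...and invert to form W */
    float u_over_w = eval_plane(u_over_w_plane, x, y); /* step 2: interpolate U/W */
    return u_over_w * w;                               /* step 3: U = (U/W) * W   */
}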
SM controller. The SMC controls multiple SMs, arbitrating the shared texture unit, load/store path, and I/O path. The SMC serves three graphics workloads simultaneously: vertex, geometry, and pixel. It packs each of these input types into the warp width, initiating shader processing, and unpacks the results.

Each input type has independent I/O paths, but the SMC is responsible for load balancing among them. The SMC supports static and dynamic load balancing based on driver-recommended allocations, current allocations, and relative difficulty of additional resource allocation. Load balancing of the workloads was one of the more challenging design problems due to its impact on overall SPA efficiency.

Texture unit
The texture unit processes one group of four threads (vertex, geometry, pixel, or compute) per cycle. Texture instruction sources are texture coordinates, and the outputs are filtered samples, typically a four-component (RGBA) color. Texture is a separate unit external to the SM connected via the SMC. The issuing SM thread can continue execution until a data dependency stall.

Each texture unit has four texture address generators and eight filter units, for a peak GeForce 8800 Ultra rate of 38.4 gigabilerps/s (a bilerp is a bilinear interpolation of four samples). Each unit supports full-speed 2:1 anisotropic filtering, as well as high-dynamic-range (HDR) 16-bit and 32-bit floating-point data format filtering.

The texture unit is deeply pipelined. Although it contains a cache to capture filtering locality, it streams hits mixed with misses without stalling.
Rasterization
Geometry primitives output from the SMs go in their original round-robin input order to the viewport/clip/setup/raster/zcull block. The viewport and clip units clip the primitives to the standard view frustum and to any enabled user clip planes. They transform postclipping vertices into screen (pixel) space and reject whole primitives outside the view volume as well as back-facing primitives.

Surviving primitives then go to the setup unit, which generates edge equations for the rasterizer. Attribute plane equations are also generated for linear interpolation of pixel attributes in the pixel shader. A coarse-rasterization stage generates all pixel tiles that are at least partially inside the primitive.

The zcull unit maintains a hierarchical z surface, rejecting pixel tiles if they are conservatively known to be occluded by previously drawn pixels. The rejection rate is up to 256 pixels per clock. The screen is subdivided into tiles; each TPC processes a predetermined subset. The pixel tile address therefore selects the destination TPC. Pixel tiles that survive zcull then go to a fine-rasterization stage that generates detailed coverage information and depth values for the pixels.

OpenGL and Direct3D require that a depth test be performed after the pixel shader has generated final color and depth values. When possible, for certain combinations of API state, the Tesla GPU performs the depth test and update ahead of the fragment shader, possibly saving thousands of cycles of processing time, without violating the API-mandated semantics.

The SMC assembles surviving pixels into warps to be processed by an SM running the current pixel shader. When the pixel shader has finished, the pixels are optionally depth tested if this was not done ahead of the shader. The SMC then sends surviving pixels and associated data to the ROP.

Raster operations processor
Each ROP is paired with a specific memory partition. The TPCs feed data to the ROPs via an interconnection network. ROPs handle depth and stencil testing and updates and color blending and updates. The memory controller uses lossless color compression (up to 8:1) and depth compression (up to 8:1) to reduce bandwidth. Each ROP has a peak rate of four pixels per clock and supports 16-bit floating-point and 32-bit floating-point HDR formats. ROPs support double-rate-depth processing when color writes are disabled.

Each memory partition is 64 bits wide and supports double-data-rate DDR2 and graphics-oriented GDDR3 protocols at up to 1 GHz, yielding a bandwidth of about 16 Gbytes/s.

Antialiasing support includes up to 16× multisampling and supersampling. HDR formats are fully supported. Both algorithms support 1, 2, 4, 8, or 16 samples per pixel and generate a weighted average of the samples to produce the final pixel color. Multisampling executes the pixel shader once to generate a color shared by all pixel samples, whereas supersampling runs the pixel shader once per sample. In both cases, depth values are correctly evaluated for each sample, as required for correct interpenetration of primitives.

Because multisampling runs the pixel shader once per pixel (rather than once per sample), multisampling has become the most popular antialiasing method. Beyond four samples, however, storage cost increases faster than image quality improves, especially with HDR formats. For example, a single 1,600 × 1,200 pixel surface, storing 16 four-component, 16-bit floating-point samples, requires 1,600 × 1,200 × 16 × (64 bits color + 32 bits depth) = 368 Mbytes.

For the vast majority of edge pixels, two colors are enough; what matters is more-detailed coverage information. The coverage-sampling antialiasing (CSAA) algorithm provides low-cost-per-coverage samples, allowing upward scaling. By computing and storing Boolean coverage at up to 16 samples and compressing redundant color and depth and stencil information into the memory footprint and bandwidth of four or eight samples, 16× antialiasing quality can be achieved at 4× antialiasing performance. CSAA is compatible with existing rendering
techniques including HDR and stencil algorithms. Edges defined by the intersection of interpenetrating polygons are rendered at the stored sample count quality (4× or 8×). Table 2 summarizes the storage requirements of the three algorithms.

Table 2. Comparison of antialiasing modes.

                               Brute-force supersampling   Multisampling        Coverage sampling
Quality level                  1×      4×      16×         1×     4×     16×    1×     4×     16×
Texture and shader samples     1       4       16          1      1      1      1      1      1
Stored color and z samples     1       4       16          1      4      16     1      4      4
Coverage samples               1       4       16          1      4      16     1      4      16

Memory and interconnect
The DRAM memory data bus width is 384 pins, arranged in six independent partitions of 64 pins each. Each partition owns 1/6 of the physical address space. The memory partition units directly enqueue requests. They arbitrate among hundreds of in-flight requests from the parallel stages of the graphics and computation pipelines. The arbitration seeks to maximize total DRAM transfer efficiency, which favors grouping related requests by DRAM bank and read/write direction, while minimizing latency as far as possible. The memory controllers support a wide range of DRAM clock rates, protocols, device densities, and data bus widths.

Interconnection network. A single hub unit routes requests to the appropriate partition from the nonparallel requesters (PCI-Express, host and command front end, input assembler, and display). Each memory partition has its own depth and color ROP units, so ROP memory traffic originates locally. Texture and load/store requests, however, can occur between any TPC and any memory partition, so an interconnection network routes requests and responses.

Memory management unit. All processing engines generate addresses in a virtual address space. A memory management unit performs virtual to physical translation. Hardware reads the page tables from local memory to respond to misses on behalf of a hierarchy of translation look-aside buffers spread out among the rendering engines.

Parallel computing architecture
The Tesla scalable parallel computing architecture enables the GPU processor array to excel in throughput computing, executing high-performance computing applications as well as graphics applications. Throughput applications have several properties that distinguish them from CPU serial applications:

- extensive data parallelism—thousands of computations on independent data elements;
- modest task parallelism—groups of threads execute the same program, and different groups can run different programs;
- intensive floating-point arithmetic;
- latency tolerance—performance is the amount of work completed in a given time;
- streaming data flow—requires high memory bandwidth with relatively little data reuse;
- modest inter-thread synchronization and communication—graphics threads do not communicate, and parallel computing applications require limited synchronization and communication.

GPU parallel performance on throughput problems has doubled every 12 to 18 months, pulled by the insatiable demands of the 3D game market. Now, Tesla GPUs in laptops, desktops, workstations,
and systems are programmable in C with CUDA tools, using a simple parallel programming model.

Data-parallel problem decomposition
To map a large computing problem effectively to a highly parallel processing architecture, the programmer or compiler decomposes the problem into many small problems that can be solved in parallel. For example, the programmer partitions a large result data array into blocks and further partitions each block into elements, so that the result blocks can be computed independently in parallel, and the elements within each block can be computed cooperatively in parallel. Figure 5 shows the decomposition of a result data array into a 3 × 2 grid of blocks, in which each block is further decomposed into a 5 × 3 array of elements.

Figure 5. Decomposing result data into a grid of blocks partitioned into elements to be computed in parallel.

The two-level parallel decomposition maps naturally to the Tesla architecture: Parallel SMs compute result blocks, and parallel threads compute result elements.

The programmer or compiler writes a program that computes a sequence of result grids, partitioning each result grid into coarse-grained result blocks that are computed independently in parallel. The program computes each result block with an array of fine-grained parallel threads, partitioning the work among threads that compute result elements.

Cooperative thread array or thread block
Unlike the graphics programming model, which executes parallel shader threads independently, parallel-computing programming models require that parallel threads synchronize, communicate, share data, and cooperate to efficiently compute a result. To manage large numbers of concurrent threads that can cooperate, the Tesla computing architecture introduces the cooperative thread array (CTA), called a thread block in CUDA terminology.

A CTA is an array of concurrent threads that execute the same thread program and can cooperate to compute a result. A CTA consists of 1 to 512 concurrent threads, and each thread has a unique thread ID (TID), numbered 0 through m. The programmer declares the 1D, 2D, or 3D CTA shape and dimensions in threads. The TID has one, two, or three dimension indices. Threads of a CTA can share data in global or shared memory and can synchronize with the barrier instruction. CTA thread programs use their TIDs to select work and index shared data arrays. Multidimensional TIDs can eliminate integer divide and remainder operations when indexing arrays.
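A hedged CUDA C sketch of that point follows: the first kernel recovers 2D coordinates from a flat thread ID with a divide and a remainder, while the second declares a 2D CTA shape so the TID components index the array directly. The kernel names and image layout are illustrative.

// 1D CTA: the flat thread ID must be converted to (x, y) with / and %.
__global__ void scale1D(float *img, int width, int height, float s)
{
    int flat = blockIdx.x * blockDim.x + threadIdx.x;
    int x = flat % width;                        // remainder
    int y = flat / width;                        // integer divide
    if (y < height)
        img[y * width + x] *= s;
}

// 2D CTA: the two TID components index the 2D array with no divide/remainder.
__global__ void scale2D(float *img, int width, int height, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] *= s;
}

As a resource example (the block size is an assumption, not a figure from the article), a 16 × 16 CTA has 256 threads, so an SM's 768-thread capacity can hold at most three such CTAs at once, fewer if registers or shared memory run out first.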
Each SM executes up to eight CTAs concurrently, depending on CTA resource demands. The programmer or compiler declares the number of threads, registers, shared memory, and barriers required by the CTA program. When an SM has sufficient available resources, the SMC creates the CTA and assigns TID numbers to each thread. The SM executes the CTA threads concurrently as SIMT warps of 32 parallel threads.
Figure 6. Nested granularity levels: thread (a), cooperative thread array (b), and grid (c).
These have corresponding memory-sharing levels: local per-thread, shared per-CTA, and
global per-application.

CTA grids
To implement the coarse-grained block and grid decomposition of Figure 5, the GPU creates CTAs with unique CTA ID and grid ID numbers. The compute work distributor dynamically balances the GPU workload by distributing a stream of CTA work to SMs with sufficient available resources.

To enable a compiled binary program to run unchanged on large or small GPUs with any number of parallel SM processors, CTAs execute independently and compute result blocks independently of other CTAs in the same grid. Sequentially dependent application steps map to two sequentially dependent grids. The dependent grid waits for the first grid to complete; then the CTAs of the dependent grid read the result blocks written by the first grid.
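A hedged host-side sketch of two sequentially dependent grids follows; the kernels and data are hypothetical. It relies on the CUDA behavior that kernel launches issued to the same (default) stream execute in order, which is how the intergrid dependence described above is expressed in a program.

// First application step: each CTA writes its result block to global memory.
__global__ void step1(const float *in, float *mid, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) mid[i] = in[i] * in[i];
}

// Dependent step: its CTAs read the result blocks written by the first grid.
__global__ void step2(const float *mid, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = mid[i] + 1.0f;
}

void run_two_grids(const float *d_in, float *d_mid, float *d_out, int n)
{
    int nThreads = 256;
    int nBlocks  = (n + nThreads - 1) / nThreads;
    // Launches in the same stream run in order: the second grid does not start
    // until the first grid completes, so its reads see step1's writes.
    step1<<<nBlocks, nThreads>>>(d_in, d_mid, n);
    step2<<<nBlocks, nThreads>>>(d_mid, d_out, n);
}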
Parallel granularity
Figure 6 shows levels of parallel granularity in the GPU computing model. The three levels are

- thread—computes result elements selected by its TID;
- CTA—computes result blocks selected by its CTA ID;
- grid—computes many result blocks, and sequential grids compute sequentially dependent application steps.

Higher levels of parallelism use multiple GPUs per CPU and clusters of multi-GPU nodes.

Parallel memory sharing
Figure 6 also shows levels of parallel read/write memory sharing:
- local—each executing thread has a private per-thread local memory for register spill, stack frame, and addressable temporary variables;
- shared—each executing CTA has a per-CTA shared memory for access to data shared by threads in the same CTA;
- global—sequential grids communicate and share large data sets in global memory.

Threads communicating in a CTA use the fast barrier synchronization instruction to wait for writes to shared or global memory to complete before reading data written by other threads in the CTA. The load/store memory system uses a relaxed memory order that preserves the order of reads and writes to the same address from the same issuing thread and from the viewpoint of CTA threads coordinating with the barrier synchronization instruction. Sequentially dependent grids use a global intergrid synchronization barrier between grids to ensure global read/write ordering.
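To illustrate per-CTA shared memory and the barrier, here is a hedged CUDA C sketch of a block-level reduction; __syncthreads() is the CUDA form of the barrier synchronization instruction. The kernel name and the 256-thread block size are illustrative assumptions.

// Each CTA (launched with 256 threads) sums 256 input elements in shared
// memory and writes one partial sum to global memory. __syncthreads() is the
// CTA-wide barrier that makes each round of writes visible before any thread
// reads them.
__global__ void blockSum(const float *in, float *blockSums, int n)
{
    __shared__ float tile[256];                 // per-CTA shared memory

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                            // all loads into tile[] complete

    // Tree reduction within the CTA; each round halves the active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();                        // wait for this round's writes
    }

    if (tid == 0)
        blockSums[blockIdx.x] = tile[0];        // one result per CTA
}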
Transparent scaling of GPU computing
Parallelism varies widely over the range of GPU products developed for various market segments. A small GPU might have one SM with eight SP cores, while a large GPU might have many SMs totaling hundreds of SP cores.

The GPU computing architecture transparently scales parallel application performance with the number of SMs and SP cores. A GPU computing program executes on any size of GPU without recompiling, and is insensitive to the number of SM multiprocessors and SP cores. The program does not know or care how many processors it uses.

The key is decomposing the problem into independently computed blocks as described earlier. The GPU compute work distribution unit generates a stream of CTAs and distributes them to available SMs to compute each independent block. Scalable programs do not communicate among CTA blocks of the same grid; the same grid result is obtained if the CTAs execute in parallel on many cores, sequentially on one core, or partially in parallel on a few cores.

CUDA programming model
CUDA is a minimal extension of the C and C++ programming languages. A programmer writes a serial program that calls parallel kernels, which can be simple functions or full programs. The CUDA program executes serial code on the CPU and executes parallel kernels across a set of parallel threads on the GPU. The programmer organizes these threads into a hierarchy of thread blocks and grids as described earlier. (A CUDA thread block is a GPU CTA.)

Figure 7 shows a CUDA program executing a series of parallel kernels on a heterogeneous CPU–GPU system. KernelA and KernelB execute on the GPU as grids of nBlkA and nBlkB thread blocks (CTAs), which instantiate nTidA and nTidB threads per CTA.

Figure 7. CUDA program sequence of kernel A followed by kernel B on a heterogeneous CPU–GPU system.

The CUDA compiler nvcc compiles an integrated application C/C++ program containing serial CPU code and parallel GPU kernel code. The CUDA runtime API manages the GPU as a computing device that acts as a coprocessor to the host CPU with its own memory system.

The CUDA programming model is similar in style to a single-program multiple-data (SPMD) software model—it expresses parallelism explicitly, and each kernel executes on a fixed number of threads. However, CUDA is more flexible than most SPMD implementations because each kernel call dynamically creates a new grid with the right number of thread blocks and threads for that application step.

CUDA extends C/C++ with the declaration specifier keywords __global__ for kernel entry functions, __device__ for global variables, and __shared__ for shared-memory variables. A CUDA kernel's text is simply a C function for one sequential thread. The built-in variables threadIdx.{x, y, z} and blockIdx.{x, y, z} provide the thread ID within a thread block (CTA) and the CTA ID within a grid, respectively. The extended function call syntax kernel<<<nBlocks, nThreads>>>(args);

invokes a parallel kernel function on a grid of nBlocks, where each block instantiates nThreads concurrent threads, and args are ordinary arguments to function kernel().

Figure 8 shows an example serial C program and a corresponding CUDA C program. The serial C program uses two nested loops to iterate over each array index and compute c[idx] = a[idx] + b[idx] each trip. The parallel CUDA C program has no loops. It uses parallel threads to compute the same array indices in parallel, and each thread computes only one sum.

Figure 8. Serial C (a) and CUDA C (b) examples of programs that add arrays.
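Figure 8 itself is not reproduced here, but the following hedged sketch shows the kind of code it contrasts: a serial C function with two nested loops, and a CUDA C kernel in which each thread computes one sum, launched over a grid of nBlocks blocks of nThreads threads. The function names and the 16 × 16 block shape are illustrative; the exact code in the figure may differ.

// (a) Serial C sketch: two nested loops visit every element of the result.
void addMatrix(const float *a, const float *b, float *c, int N)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) {
            int idx = j * N + i;
            c[idx] = a[idx] + b[idx];
        }
}

// (b) CUDA C sketch: no loops; each of the N*N parallel threads computes the
// one sum selected by its block and thread IDs.
__global__ void addMatrixKernel(const float *a, const float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N) {
        int idx = j * N + i;
        c[idx] = a[idx] + b[idx];
    }
}

// Host-side launch: a grid of nBlocks thread blocks, each of nThreads threads.
void launchAddMatrix(const float *dA, const float *dB, float *dC, int N)
{
    dim3 nThreads(16, 16);
    dim3 nBlocks((N + 15) / 16, (N + 15) / 16);
    addMatrixKernel<<<nBlocks, nThreads>>>(dA, dB, dC, N);
}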

Scalability and performance
The Tesla unified architecture is designed for scalability. Varying the number of SMs, TPCs, ROPs, caches, and memory partitions provides the right mix for different performance and cost targets in the value, mainstream, enthusiast, and professional market segments. NVIDIA's Scalable Link Interconnect (SLI) enables multiple GPUs to act together as one, providing further scalability.

CUDA C/C++ applications executing on Tesla computing platforms, Quadro workstations, and GeForce GPUs deliver compelling computing performance on a range of large problems, including more than 100× speedups on molecular modeling, more than 200 Gflops on n-body problems, and real-time 3D magnetic-resonance imaging.12-14 For graphics, the GeForce 8800 GPU delivers high performance and image quality for the most demanding games.15

Figure 9. GeForce 8800 Ultra die layout.

Figure 9 shows the GeForce 8800 Ultra physical die layout implementing the Tesla architecture shown in Figure 1. Implementation specifics include

- 681 million transistors, 470 mm²;
- TSMC 90-nm CMOS;
- 128 SP cores in 16 SMs;
- 12,288 processor threads;
- 1.5-GHz processor clock rate;
- peak 576 Gflops in processors;
- 768-Mbyte GDDR3 DRAM;
- 384-pin DRAM interface;
- 1.08-GHz DRAM clock;
- 104-Gbyte/s peak bandwidth; and
- typical power of 150 W at 1.3 V.

The Tesla architecture is the first ubiquitous supercomputing platform. NVIDIA has shipped more than 50 million Tesla-based systems. This wide availability, coupled with C programmability and the CUDA software development environment, enables broad deployment of demanding parallel-computing and graphics applications. With future increases in transistor density, the architecture will readily scale processor parallelism, memory partitions, and overall performance. Increased numbers of multiprocessors and memory partitions will support larger data sets and richer graphics and computing, without a change to the programming model.

We continue to investigate improved scheduling and load-balancing algorithms for the unified processor. Other areas of improvement are enhanced scalability for derivative products, reduced synchronization and communication overhead for compute programs, new graphics features, increased realized memory bandwidth, and improved power efficiency. MICRO

Acknowledgments
We thank the entire NVIDIA GPU development team for their extraordinary effort in bringing Tesla-based GPUs to market.

References
1. J. Montrym and H. Moreton, "The GeForce 6800," IEEE Micro, vol. 25, no. 2, Mar./Apr. 2005, pp. 41-51.
2. CUDA Technology, NVIDIA, 2007; http://www.nvidia.com/CUDA.
3. CUDA Programming Guide 1.1, NVIDIA, 2007; http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf.
4. J. Nickolls, I. Buck, K. Skadron, and M. Garland, "Scalable Parallel Programming with CUDA," ACM Queue, vol. 6, no. 2, Mar./Apr. 2008, pp. 40-53.
5. DX Specification, Microsoft; http://msdn.microsoft.com/directx.
6. E. Lindholm, M.J. Kilgard, and H. Moreton, "A User-Programmable Vertex Engine," Proc. 28th Ann. Conf. Computer Graphics and Interactive Techniques (Siggraph 01), ACM Press, 2001, pp. 149-158.
7. G. Elder, "Radeon 9700," Eurographics/Siggraph Workshop on Graphics Hardware, Hot 3D Session, 2002; http://www.graphicshardware.org/previous/www_2002/presentations/Hot3D-RADEON9700.ppt.
8. Microsoft DirectX 9 Programmable Graphics Pipeline, Microsoft Press, 2003.
9. J. Andrews and N. Baker, "Xbox 360 System Architecture," IEEE Micro, vol. 26, no. 2, Mar./Apr. 2006, pp. 25-37.
10. D. Blythe, "The Direct3D 10 System," ACM Trans. Graphics, vol. 25, no. 3, July 2006, pp. 724-734.
11. S.F. Oberman and M.Y. Siu, "A High-Performance Area-Efficient Multifunction Interpolator," Proc. 17th IEEE Symp. Computer Arithmetic (Arith-17), IEEE Press, 2005, pp. 272-279.
12. J.E. Stone et al., "Accelerating Molecular Modeling Applications with Graphics Processors," J. Computational Chemistry, vol. 28, no. 16, 2007, pp. 2618-2640.
13. L. Nyland, M. Harris, and J. Prins, "Fast N-Body Simulation with CUDA," GPU Gems 3, H. Nguyen, ed., Addison-Wesley, 2007, pp. 677-695.
14. S.S. Stone et al., "How GPUs Can Improve the Quality of Magnetic Resonance Imaging," Proc. 1st Workshop on General Purpose Processing on Graphics Processing Units, 2007; http://www.gigascale.org/pubs/1175.html.
15. A.L. Shimpi and D. Wilson, "NVIDIA's GeForce 8800 (G80): GPUs Re-architected for DirectX 10," AnandTech, Nov. 2006; http://www.anandtech.com/video/showdoc.aspx?i=2870.

Erik Lindholm is a distinguished engineer at NVIDIA, working in the architecture group. His research interests include graphics processor design and parallel graphics architectures. Lindholm has an MS in electrical engineering from the University of British Columbia.

John Nickolls is director of GPU computing architecture at NVIDIA. His interests include parallel processing systems, languages, and architectures. Nickolls has a BS in electrical engineering and computer science from the University of Illinois and MS and PhD degrees in electrical engineering from Stanford University.

Stuart Oberman is a design manager in the GPU hardware group at NVIDIA. His research interests include computer arithmetic, processor design, and parallel architectures. Oberman has a BS in electrical engineering from the University of Iowa and MS and PhD degrees in electrical engineering from Stanford University. He is a senior member of the IEEE.

John Montrym is a chief architect at NVIDIA, where he has worked in the development of several GPU product families. His research interests include graphics processor design, parallel graphics architectures, and hardware-software interfaces. Montrym has a BS in electrical engineering from the Massachusetts Institute of Technology.

Direct questions and comments about this article to Erik Lindholm or John Nickolls, NVIDIA, 2701 San Tomas Expressway, Santa Clara, CA 95050; elindholm@nvidia.com or jnickolls@nvidia.com.

For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/csdl.
