TO ENABLE FLEXIBLE, PROGRAMMABLE GRAPHICS AND HIGH-PERFORMANCE COMPUTING, NVIDIA HAS DEVELOPED THE TESLA SCALABLE UNIFIED GRAPHICS AND PARALLEL COMPUTING ARCHITECTURE. ITS SCALABLE PARALLEL ARRAY OF PROCESSORS IS MASSIVELY MULTITHREADED AND PROGRAMMABLE IN C OR VIA GRAPHICS APIS.
Vertex processors operate on the vertices of primitives such as points, lines, and triangles. Typical operations include transforming coordinates into screen space, which are then fed to the setup unit and the rasterizer, and setting up lighting and texture parameters to be used by the pixel-fragment processors. Pixel-fragment processors operate on rasterizer output, which fills the interior of primitives, along with the interpolated parameters.

Vertex and pixel-fragment processors have evolved at different rates: Vertex processors were designed for low-latency, high-precision math operations, whereas pixel-fragment processors were optimized for high-latency, lower-precision texture filtering. Vertex processors have traditionally supported more-complex processing, so they became programmable first. For the last six years, the two processor types have been functionally converging as the result of a need for greater programming generality. However, the increased generality also increased the design complexity, area, and cost of developing two separate processors.

Because GPUs typically must process more pixels than vertices, pixel-fragment processors traditionally outnumber vertex processors by about three to one. However, typical workloads are not well balanced, leading to inefficiency. For example, with large triangles, the vertex processors are mostly idle, while the pixel processors are fully busy. With small triangles, the opposite is true. The addition of more-complex primitive processing in DX10 makes it much harder to select a fixed processor ratio.10 All these factors influenced the decision to design a unified architecture.

A primary design objective for Tesla was to execute vertex and pixel-fragment shader programs on the same unified processor architecture. Unification would enable dynamic load balancing of varying vertex- and pixel-processing workloads and permit the introduction of new graphics shader stages, such as geometry shaders in DX10. It also let a single team focus on designing a fast and efficient processor and allowed the sharing of expensive hardware such as the texture units. The generality required of a unified processor opened the door to a completely new GPU parallel-computing capability. The downside of this generality was the difficulty of efficient load balancing between different shader types.

Other critical hardware design requirements were architectural scalability, performance, power, and area efficiency.

The Tesla architects developed the graphics feature set in coordination with the development of the Microsoft Direct3D DirectX 10 graphics API.10 They developed the GPU's computing feature set in coordination with the development of the CUDA C parallel programming language, compiler, and development tools.

Tesla architecture
The Tesla architecture is based on a scalable processor array. Figure 1 shows a block diagram of a GeForce 8800 GPU with 128 streaming-processor (SP) cores organized as 16 streaming multiprocessors (SMs) in eight independent processing units called texture/processor clusters (TPCs). Work flows from top to bottom, starting at the host interface with the system PCI-Express bus. Because of its unified-processor design, the physical Tesla architecture doesn't resemble the logical order of graphics pipeline stages. However, we will use the logical graphics pipeline flow to explain the architecture.

At the highest level, the GPU's scalable streaming processor array (SPA) performs all the GPU's programmable calculations. The scalable memory system consists of external DRAM control and fixed-function raster operation processors (ROPs) that perform color and depth frame buffer operations directly on memory. An interconnection network carries computed pixel-fragment colors and depth values from the SPA to the ROPs. The network also routes texture memory read requests from the SPA to DRAM and read data from DRAM through a level-2 cache back to the SPA.

The remaining blocks in Figure 1 deliver input work to the SPA. The input assembler collects vertex work as directed by the input command stream. The vertex work distribution unit distributes vertex work packets to the TPCs in the SPA.
Figure 1. Tesla unified graphics and computing GPU architecture. TPC: texture/processor cluster; SM: streaming multiprocessor; SP: streaming processor; Tex: texture; ROP: raster operation processor.
Pixel work distribution is based on the pixel location.

Streaming processor array
The SPA executes graphics shader thread programs and GPU computing programs and provides thread control and management. Each TPC in the SPA roughly corresponds to a quad-pixel unit in previous architectures.1 The number of TPCs determines a GPU's programmable processing performance and scales from one TPC in a small GPU to eight or more TPCs in high-performance GPUs.

Texture/processor cluster
As Figure 2 shows, each TPC contains a geometry controller, an SM controller (SMC), two streaming multiprocessors (SMs), and a texture unit. Figure 3 expands each SM to show its eight SP cores. To balance the expected ratio of math operations to texture operations, one texture unit serves two SMs. This architectural ratio can vary as needed.

Geometry controller
The geometry controller maps the logical graphics vertex pipeline into recirculation on the physical SMs by directing all primitive and vertex attribute and topology flow in the TPC. It manages dedicated on-chip input and output vertex attribute storage and forwards contents as required.

DX10 has two stages dealing with vertex and primitive processing: the vertex shader and the geometry shader. The vertex shader processes one vertex's attributes independently of other vertices. Typical operations are position space transforms and color and texture coordinate generation. The geometry shader follows the vertex shader and deals with a whole primitive and its vertices. Typical operations are edge extrusion for stencil shadow generation and cube map texture generation.
Streaming multiprocessor
Each SM contains eight SP cores and two special-function units (SFUs) for transcendental functions and attribute interpolation—the interpolation of pixel attributes from vertex attributes defining a primitive. Each SFU also contains four floating-point multipliers. The SM uses the TPC texture unit as a third execution unit and uses the SMC and ROP units to implement external memory load, store, and atomic accesses. A low-latency interconnect network between the SPs and the shared-memory banks provides shared-memory access.

The GeForce 8800 Ultra clocks the SPs and SFU units at 1.5 GHz, for a peak of 36 Gflops per SM. To optimize power and area efficiency, some SM non-data-path units operate at half the SP clock rate.
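A plausible accounting of that peak rate (our reading, not a breakdown given in the article) counts a dual-issued multiply-add on each SP alongside one multiply on each of the SFU multipliers:

8 SPs × 2 flops (multiply-add) × 1.5 GHz = 24 Gflops
2 SFUs × 4 multipliers × 1 flop × 1.5 GHz = 12 Gflops
24 + 12 = 36 Gflops per SM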
In this respect, SIMT execution is analogous to cache lines in traditional codes: Programmers can safely ignore cache line size when designing for correctness but must consider it in the code structure when designing for peak performance. SIMD vector architectures, on the other hand, require the software to manually coalesce loads into vectors and to manually manage divergence.

SIMT warp scheduling. The SIMT approach of scheduling independent warps is simpler than previous GPU architectures' complex scheduling. A warp consists of up to 32 threads of the same type—vertex, geometry, pixel, or compute. The basic unit of pixel-fragment shader processing is the 2 × 2 pixel quad. The SM controller groups eight pixel quads into a warp of 32 threads. It similarly groups vertices and primitives into warps and packs 32 computing threads into a warp. The SIMT design shares the SM instruction fetch and issue unit efficiently across 32 threads but requires a full warp of active threads for full performance efficiency.

As a unified graphics processor, the SM schedules and executes multiple warp types concurrently—for example, concurrently executing vertex and pixel warps. The SM warp scheduler operates at half the 1.5-GHz processor clock rate. At each cycle, it selects one of the 24 warps to execute a SIMT warp instruction, as Figure 4 shows. An issued warp instruction executes as two sets of 16 threads over four processor cycles. The SP cores and SFU units execute instructions independently, and by issuing instructions between them on alternate cycles, the scheduler can keep both fully occupied.

Implementing zero-overhead warp scheduling for a dynamic mix of different warp programs and program types was a challenging design problem. A scoreboard qualifies each warp for issue each cycle. The instruction scheduler prioritizes all ready warps and selects the one with highest priority for issue. Prioritization considers warp type, instruction type, and "fairness" to all warps executing in the SM.
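The cost of losing a full warp of active threads is easy to see in CUDA C. The following minimal sketch (our illustration, not the article's; the kernel names and WARP_SIZE constant are assumptions) contrasts a branch that diverges inside every warp with one whose condition is uniform per warp, so each warp takes a single path:

#define WARP_SIZE 32

// Threads in the same warp take different paths, so the SM must
// execute both paths serially for every warp.
__global__ void divergentKernel(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)
        data[tid] *= 2.0f;   // even threads
    else
        data[tid] += 1.0f;   // odd threads
}

// The condition is constant across each 32-thread warp, so every
// warp executes only one path at full SIMT efficiency.
__global__ void warpUniformKernel(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((tid / WARP_SIZE) % 2 == 0)
        data[tid] *= 2.0f;   // even warps
    else
        data[tid] += 1.0f;   // odd warps
}

Both kernels compute the same results; only the mapping of the branch condition onto warps differs.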
SM instructions. The Tesla SM executes scalar instructions, unlike previous GPU vector instruction architectures. Shader programs are becoming longer and more scalar, and it is increasingly difficult to fully occupy even two components of the prior four-component vector architecture. Previous architectures employed vector packing—combining sub-vectors of work to gain efficiency—but that complicated the scheduling hardware as well as the compiler. Scalar instructions are simpler and compiler friendly. Texture instructions remain vector based, taking a source coordinate vector and returning a filtered color vector.

High-level graphics and computing-language compilers generate intermediate instructions, such as DX10 vector or PTX scalar instructions,10,2 which are then optimized and translated to binary GPU instructions. The optimizer readily expands DX10 vector instructions to multiple Tesla SM scalar instructions. PTX scalar instructions optimize to Tesla SM scalar instructions about one to one. PTX provides a stable target ISA for compilers and provides compatibility over several generations of GPUs with evolving binary instruction set architectures. Because the intermediate languages use virtual registers, the optimizer analyzes data dependencies and allocates real registers. It eliminates dead code, folds instructions together when feasible, and optimizes SIMT branch divergence and convergence points.
Instruction set architecture. The Tesla SM has a register-based instruction set including floating-point, integer, bit, conversion, transcendental, flow control, memory load/store, and texture operations.

Floating-point and integer operations include add, multiply, multiply-add, minimum, maximum, compare, set predicate, and conversions between integer and floating-point numbers. Floating-point instructions provide source operand modifiers for negation and absolute value. Transcendental function instructions include cosine, sine, binary exponential, binary logarithm, reciprocal, and reciprocal square root. Attribute interpolation instructions provide efficient generation of pixel attributes. Bitwise operators include shift left, shift right, logic operators, and move. Control flow includes branch, call, return, trap, and barrier synchronization.

The floating-point and integer instructions can also set per-thread status flags for zero, negative, carry, and overflow, which the thread program can use for conditional branching.

Memory access instructions. The texture instruction fetches and filters texture samples from memory via the texture unit. The ROP unit writes pixel-fragment output to memory.

To support computing and C/C++ language needs, the Tesla SM implements memory load/store instructions in addition to graphics texture fetch and pixel output. Memory load/store instructions use integer byte addressing with register-plus-offset address arithmetic to facilitate conventional compiler code optimizations.

For computing, the load/store instructions access three read/write memory spaces:

- local memory for per-thread, private, temporary data (implemented in external DRAM);
- shared memory for low-latency access to data shared by cooperating threads in the same SM; and
- global memory for data shared by all threads of a computing application (implemented in external DRAM).

The memory instructions load-global, store-global, load-shared, store-shared, load-local, and store-local access global, shared, and local memory. Computing programs use the fast barrier synchronization instruction to synchronize threads within the SM that communicate with each other via shared and global memory.

To improve memory bandwidth and reduce overhead, the local and global load/store instructions coalesce individual parallel thread accesses from the same warp into fewer memory block accesses. The addresses must fall in the same block and meet alignment criteria. Coalescing memory requests boosts performance significantly over separate requests. The large thread count, together with support for many outstanding load requests, helps cover load-to-use latency for local and global memory implemented in external DRAM.

The latest Tesla architecture GPUs provide efficient atomic memory operations, including integer add, minimum, maximum, logic operators, swap, and compare-and-swap operations. Atomic operations facilitate parallel reductions and parallel data structure management.
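A brief CUDA C sketch of these mechanisms working together (our example, not the article's; the kernel name and the 256-thread block size are assumptions): adjacent threads load adjacent global addresses so each warp's loads can coalesce, a shared-memory tree reduction synchronized by barriers produces one partial sum per block, and an integer atomic add combines the partials in global memory.

__global__ void blockSum(const int *in, int *total, int n)
{
    __shared__ int partial[256];            // shared memory, per block

    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Adjacent threads in a warp read adjacent addresses, letting the
    // hardware coalesce each warp's loads into few block accesses.
    partial[threadIdx.x] = (tid < n) ? in[tid] : 0;
    __syncthreads();                        // fast barrier within the block

    // Tree reduction in low-latency shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // One integer atomic add per block combines results in global memory.
    if (threadIdx.x == 0)
        atomicAdd(total, partial[0]);
}

Launched as blockSum<<<(n + 255) / 256, 256>>>(d_in, d_total, n), each block issues a single atomic update rather than 256 separate ones.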
Streaming processor. The SP core is the primary thread processor in the SM. It performs the fundamental floating-point operations, including add, multiply, and multiply-add. It also implements a wide variety of integer, comparison, and conversion operations. The floating-point add and multiply operations are compatible with the IEEE 754 standard for single-precision FP numbers, including not-a-number (NaN) and infinity values. The unit is fully pipelined, and latency is optimized to balance delay and area.

The add and multiply operations use IEEE round-to-nearest-even as the default rounding mode. The multiply-add operation performs a multiplication with truncation, followed by an add with round-to-nearest-even. The SP flushes denormal source operands to sign-preserved zero and flushes results that underflow the target output exponent range to sign-preserved zero after rounding.

Special-function unit. The SFU supports computation of both transcendental functions and planar attribute interpolation.11 A traditional vertex or pixel shader design contains a functional unit to compute transcendental functions. Pixels also need an attribute-interpolating unit to compute the per-pixel attribute values at the pixel's x, y location, given the attribute values at the primitive's vertices.

For functional evaluation, we use quadratic interpolation based on enhanced minimax approximations to approximate the reciprocal, reciprocal square root, log2 x, 2^x, and sin/cos functions. Table 1 shows the accuracy of the function estimates. The SFU unit generates one 32-bit floating-point result per cycle.
Table 1. Function approximation statistics.
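The article does not detail the evaluation scheme (see Oberman and Siu11), but table-driven quadratic interpolation of this kind is commonly structured as follows. This sketch is schematic: the table size, bit splits, and names are illustrative, not the SFU's actual parameters.

#include <stdint.h>
#include <string.h>

typedef struct { float c0, c1, c2; } Coeffs;  // enhanced-minimax coefficients

// Approximate f(x) for x in [1, 2): the high mantissa bits select a
// coefficient table entry; the quadratic is evaluated at the low bits.
float quadraticApprox(float x, const Coeffs table[64])
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);                   // reinterpret float bits
    uint32_t m   = bits & 0x007FFFFFu;                // 23-bit mantissa
    uint32_t idx = m >> 17;                           // top 6 bits: table index
    float    xl  = (float)(m & 0x1FFFFu) / 131072.0f; // low 17 bits in [0, 1)
    Coeffs   c   = table[idx];
    return c.c0 + xl * (c.c1 + xl * c.c2);            // Horner-form quadratic
}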
The SFU also supports attribute interpolation, to enable accurate interpolation of attributes such as color, depth, and texture coordinates. The SFU must interpolate these attributes in the (x, y) screen space to determine the values of the attributes at each pixel location. We express the value of a given attribute U in an (x, y) plane in plane equations of the following form:

U(x, y) = (A_U × x + B_U × y + C_U) / (A_W × x + B_W × y + C_W)
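Evaluated in software, the plane equation is a handful of multiply-adds and one division per attribute; the struct and function names below are illustrative:

typedef struct { float A, B, C; } Plane;  // one attribute's plane coefficients

// Perspective-correct value of attribute U at screen position (x, y),
// given the attribute plane (A_U, B_U, C_U) and the primitive's shared
// w plane (A_W, B_W, C_W).
float interpolateAttribute(Plane u, Plane w, float x, float y)
{
    float num   = u.A * x + u.B * y + u.C;
    float denom = w.A * x + w.B * y + w.C;
    return num / denom;
}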
SM controller
The SMC serves three input types simultaneously: vertex, geometry, and pixel. It packs each of these input types into the warp width, initiating shader processing, and unpacks the results.

Each input type has independent I/O paths, but the SMC is responsible for load balancing among them. The SMC supports static and dynamic load balancing based on driver-recommended allocations, current allocations, and relative difficulty of additional resource allocation. Load balancing of the workloads was one of the more challenging design problems due to its impact on overall SPA efficiency.
Table 2. Comparison of antialiasing modes.
Figure 6. Nested granularity levels: thread (a), cooperative thread array (b), and grid (c).
These have corresponding memory-sharing levels: local per-thread, shared per-CTA, and
global per-application.
Figure 7. CUDA program sequence of kernel A followed by kernel B on a heterogeneous
CPU–GPU system.
The statement kernel<<<nBlocks, nThreads>>>(args); invokes a parallel kernel function on a grid of nBlocks, where each block instantiates nThreads concurrent threads, and args are ordinary arguments to function kernel().

Figure 8 shows an example serial C program and a corresponding CUDA C program. The serial C program uses two nested loops to iterate over each array index and compute c[idx] = a[idx] + b[idx] each trip. The parallel CUDA C program has no loops. It uses parallel threads to compute the same array indices in parallel, and each thread computes only one sum.

Scalability and performance
The Tesla unified architecture is designed for scalability. Varying the number of SMs, TPCs, ROPs, caches, and memory partitions provides the right mix for different performance and cost targets in the value, mainstream, enthusiast, and professional market segments.
Figure 8. Serial C (a) and CUDA C (b) examples of programs that add arrays.
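The figure's listings are not reproduced in this text; the following sketch is our reconstruction from the description above (the function names, square matrix shape, and 16 × 16 block size are assumptions):

// (a) Serial C: two nested loops visit every array index.
void addMatrix(const float *a, const float *b, float *c, int N)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int idx = i * N + j;
            c[idx] = a[idx] + b[idx];
        }
}

// (b) CUDA C: no loops; each thread of the grid computes one sum.
__global__ void addMatrixKernel(const float *a, const float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N) {
        int idx = i * N + j;
        c[idx] = a[idx] + b[idx];
    }
}

// Invocation on a grid of nBlocks blocks, each of nThreads threads:
//   dim3 nThreads(16, 16);
//   dim3 nBlocks((N + 15) / 16, (N + 15) / 16);
//   addMatrixKernel<<<nBlocks, nThreads>>>(a, b, c, N);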
6. E. Lindholm, M.J. Kilgard, and H. Moreton, "A User-Programmable Vertex Engine," Proc. 28th Ann. Conf. Computer Graphics and Interactive Techniques (Siggraph 01), ACM Press, 2001, pp. 149-158.
7. G. Elder, "Radeon 9700," Eurographics/Siggraph Workshop Graphics Hardware, Hot 3D Session, 2002, http://www.graphicshardware.org/previous/www_2002/presentations/Hot3D-RADEON9700.ppt.
8. Microsoft DirectX 9 Programmable Graphics Pipeline, Microsoft Press, 2003.
9. J. Andrews and N. Baker, "Xbox 360 System Architecture," IEEE Micro, vol. 26, no. 2, Mar./Apr. 2006, pp. 25-37.
10. D. Blythe, "The Direct3D 10 System," ACM Trans. Graphics, vol. 25, no. 3, July 2006, pp. 724-734.
11. S.F. Oberman and M.Y. Siu, "A High-Performance Area-Efficient Multifunction Interpolator," Proc. 17th IEEE Symp. Computer Arithmetic (Arith-17), IEEE Press, 2005, pp. 272-279.
12. J.E. Stone et al., "Accelerating Molecular Modeling Applications with Graphics Processors," J. Computational Chemistry, vol. 28, no. 16, 2007, pp. 2618-2640.
13. L. Nyland, M. Harris, and J. Prins, "Fast N-Body Simulation with CUDA," GPU Gems 3, H. Nguyen, ed., Addison-Wesley, 2007, pp. 677-695.
14. S.S. Stone et al., "How GPUs Can Improve the Quality of Magnetic Resonance Imaging," Proc. 1st Workshop on General Purpose Processing on Graphics Processing Units, 2007; http://www.gigascale.org/pubs/1175.html.
15. A.L. Shimpi and D. Wilson, "NVIDIA's GeForce 8800 (G80): GPUs Re-architected for DirectX 10," AnandTech, Nov. 2006; http://www.anandtech.com/video/showdoc.aspx?i=2870.

Erik Lindholm is a distinguished engineer at NVIDIA, working in the architecture group. His research interests include graphics processor design and parallel graphics architectures. Lindholm has an MS in electrical engineering from the University of British Columbia.

John Nickolls is director of GPU computing architecture at NVIDIA. His interests include parallel processing systems, languages, and architectures. Nickolls has a BS in electrical engineering and computer science from the University of Illinois and MS and PhD degrees in electrical engineering from Stanford University.

Stuart Oberman is a design manager in the GPU hardware group at NVIDIA. His research interests include computer arithmetic, processor design, and parallel architectures. Oberman has a BS in electrical engineering from the University of Iowa and MS and PhD degrees in electrical engineering from Stanford University. He is a senior member of the IEEE.

John Montrym is a chief architect at NVIDIA, where he has worked in the development of several GPU product families. His research interests include graphics processor design, parallel graphics architectures, and hardware-software interfaces. Montrym has a BS in electrical engineering from the Massachusetts Institute of Technology.

Direct questions and comments about this article to Erik Lindholm or John Nickolls, NVIDIA, 2701 San Tomas Expressway, Santa Clara, CA 95050; elindholm@nvidia.com or jnickolls@nvidia.com.

For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/csdl.