
VLIW architectures: a survey of some popular VLIW processors

Sreekesh S, Kiran Shinde, Abuturab Muhhammadi, Shubham Jain, Abhinav Gupta
BITS-Pilani K.K. Birla Goa Campus

Abstract—VLIW architecture was conceived in the late 70s and early 80s as a solution for achieving high-performance computing. Whereas other architectures, like superscalar processors, rely on complicated hardware for scheduling instructions on processors, VLIW leverages the compiler to do the same. The state of the art in VLIW can be seen in several DSP processors. However, VLIW did not stand the test of time, as several better architectures were eventually proposed. This paper presents the milestones that led to VLIW and explores the architectural features of the popular processors ELI-512, Cydra-5, and Intel Itanium.

Index Terms—VLIW, IA-64, ELI-512, Cydra-5, software pipelining, trace scheduling

I. INTRODUCTION

The computer architecture community in the late 1970s and early 1980s believed that the intrinsic parallelism in object code was limited due to branches and dependencies in the code. This was well established by the studies of Tjaden and Flynn. Parallelism was limited because after a few lines of code (called a basic block) a branch instruction would be hit, and within the parallel block there may be true data dependencies.

John Fisher, a professor at Yale University, was instrumental in developing the VLIW architecture by not sticking to this fundamental assumption. One of Fisher's key insights was that if those basic blocks are too short to afford much parallelism, perhaps there was a way to artificially lengthen them by hoisting code from beyond those conditional branches and then undoing what should not have been done when execution went the wrong way. To achieve this, he developed a technique called trace scheduling, a compiler, and a machine called the ELI-512 to run the code generated by the compiler [1].

II. GENERAL PRINCIPLES OF VLIW

In VLIW, many statically scheduled, tightly coupled, fine-grained operations execute in parallel within a single large instruction stream. This differs from vector processors in that a vector processor performs logically similar operations in a long instruction (long in terms of time of execution), whereas a VLIW performs logically unrelated operations at the same time in different functional units. Furthermore, a vector processor only speeds up loops. A VLIW has one central control unit issuing a single long instruction per cycle. Many independent and tightly coupled operations are carried out per instruction, and each operation takes a statically predictable number of cycles to execute, with all the instructions being pipelined in the machine.

In VLIW architectures, most of the performance comes from the efficiency of the compiler at taking long streams of code and scheduling them effectively. Leveraging the compiler for instruction scheduling results in less complex hardware for a given performance metric. However, time has proved that this method is more suitable for data-dominated applications like scientific computing and digital signal processing, and not for general computing applications, which are control driven. Some of the general principles used in VLIW are discussed below.

A. Trace Scheduling

Trace scheduling finds sufficient parallelism in ordinary code to justify thinking about a highly parallel VLIW. In scientific code, it finds parallelism beyond basic blocks. A basic block of code has no jumps in except at the beginning and no jumps out except at the end. Before Fisher, no one looked for parallelism beyond the basic block, as there was no way to tell beforehand about the flow of control. Trace scheduling replaces block-by-block compaction of code with compaction of long streams of code. Dynamic information about jump prediction is used at compile time to select the streams with the highest probability of execution.

The compiler builds a graph with edges representing control flows and nodes representing the basic blocks. From this graph, the trace scheduler selects the most frequently executed group of blocks and makes it a trace; dynamic information is used for this. This happens in the preprocessor stage.

After pre-processing, the scheduler has made many code motions that do not correctly preserve the jumps from the trace to the other blocks of the graph. The post-processor inserts new code at the trace exits and entrances to recover the correct machine state. The process continues on the entire graph, identifying traces and compacting them, the most frequently executed trace first. Fig. 1 shows an example of trace scheduling.

Fig. 1. (a) A flow graph, with each block representing a basic block of code. (b) A trace picked from the flow graph. (c) The trace has been scheduled but not yet relinked to the rest of the code. (d) The sections of unscheduled code that allow relinking. (Joseph Fisher)
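To make the hoist-and-undo idea concrete, the following minimal C sketch (all variable names, values, and operations are hypothetical) shows an operation from the frequently executed path being moved above the conditional branch, with compensation code on the rarely taken path to undo the speculative work:

/* Before trace scheduling: t is computed only after the branch resolves. */
int before(int *p, int x, int likely_flag)
{
    int t;
    if (likely_flag)        /* profile data says this is almost always true */
        t = x * 4;          /* on-trace work */
    else
        t = x - 1;          /* rarely executed */
    return t + *p;
}

/* After: the multiply is hoisted above the branch so it can be packed into
 * an earlier long instruction; the off-trace path gets compensation code
 * that recomputes the correct value when execution "went the wrong way". */
int after(int *p, int x, int likely_flag)
{
    int t = x * 4;          /* hoisted: executed speculatively */
    if (!likely_flag)
        t = x - 1;          /* compensation on the rare path */
    return t + *p;
}

Both functions compute the same result; the transformed version simply gives the scheduler a longer straight-line stretch of code to compact.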

B. Software Pipelining and Loop Unrolling

Software pipelining can be thought of as a special case of trace scheduling. In many scientific and signal processing applications, there will be several loops running one after another to perform operations on large arrays. In many cases, two such loops can be combined, as they operate on separate arrays which have no true data dependency between them. Many VLIW compilers combine such independent loops into a single loop, which helps reduce computation. Loop unrolling helps reduce the number of iterations. Both techniques effectively reduce the number of conditional jumps in the code and increase the length of a trace.
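As a minimal C sketch of these two transformations (the arrays, constants, and trip count are hypothetical), the two independent loops below are fused and then unrolled by a factor of two:

/* Illustrative only: two independent loops over separate arrays a[] and b[]
 * (no true data dependency between them) are fused into one loop, and the
 * fused loop is unrolled by two. Assumes N is even. */
#define N 1024

void fuse_and_unroll(float *a, float *b, const float *x, const float *y)
{
    /* Original form: two loops, 2*N iterations, 2*N conditional jumps.
     *   for (int i = 0; i < N; i++) a[i] = x[i] * 2.0f;
     *   for (int i = 0; i < N; i++) b[i] = y[i] + 1.0f;
     */

    /* Fused and unrolled by 2: N/2 iterations, N/2 conditional jumps.
     * The longer loop body gives the trace scheduler more independent
     * operations to pack into each long instruction. */
    for (int i = 0; i < N; i += 2) {
        a[i]     = x[i]     * 2.0f;
        a[i + 1] = x[i + 1] * 2.0f;   /* unrolled copy */
        b[i]     = y[i]     + 1.0f;
        b[i + 1] = y[i + 1] + 1.0f;   /* unrolled copy */
    }
}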
III. MULTIFLOW ENORMOUSLY LONGWORD INSTRUCTION-512

The ELI-512 uses a 512-bit instruction word and can perform the following operations per cycle:

- 16 ALU operations: 8 carried out on 32-bit integer ALUs and 8 on 64-bit ALUs with a varied repertoire, including pipelined floating-point calculations.
- 8 pipelined memory references.
- 32 register accesses.
- A multiway conditional jump based on several independent tests.

To carry out these operations, the ELI has 8 M-clusters and 8 F-clusters arranged in a circular fashion and interconnected as shown in Fig. 2. Each M-cluster has a local memory module, a fixed-point ALU, a multiport register bank, and a crossbar switch which can select from 8 buses. Each F-cluster has a floating-point ALU, an FP register bank, and a crossbar switch. TTL ICs were used in implementing the ELI-512 [2].

Fig. 2. Architecture of the Multiflow ELI-512, showing the interconnections of the functional blocks (Joseph Fisher)

Trace scheduling requires a mechanism to jump to one of n+1 locations based on the results of n tests. However, encoding n+1 addresses in an instruction takes far more space. To circumvent this problem, only one jump address is specified in the VLIW instruction, and the machine calculates the actual jump address by combining this address with some bits of the test results. The ELI also supports delayed branches, by which flushing of the pipeline can be avoided when jumps occur in a program. Delayed branches work by moving code from ahead of a branch to below it, so that while the branch decision is being taken, these moved instructions, already in the pipeline, get executed (rather than creating bubbles in the pipeline), and the fetch from the jump location happens after these instructions are executed.

Many of the operations packed into one instruction may be memory references. This causes trouble, as many functional units try to access the memory bus at the same time, and the arbitration scheme creates further delay. To reduce this, the ELI has a banked memory scheme, effectively increasing the bandwidth of the memory system. The compiler takes care that the memory accesses occurring in a single instruction go to different banks of memory, avoiding conflicts. There are still problems with arrays and the loops operating on them; the anti-aliasing system used when unrolling loops takes care of this. However, pointer arithmetic and its use for accessing memory can lead the machine to access a location unpredictable by the compiler. This is dealt with by introducing additional shadow ports in the memory, dedicated to such unpredictable accesses. With all banks combined, the bandwidth of the memory system in the ELI-512 is 400 Mbytes/second.
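A small C model of the multiway jump's address formation, under stated assumptions (the slot spacing, widths, and priority rule below are illustrative guesses, not the ELI's actual encoding): only one base address is encoded, and bits derived from the test results select among the n+1 targets.

#include <stdint.h>

/* Model of an (n+1)-way jump driven by n independent tests: the machine
 * combines the single encoded base address with bits derived from the
 * test results. Here the derived bits are the index of the first true
 * test (or n if none fired), scaled by an assumed fixed slot size. */
enum { SLOT_BYTES = 64 };   /* assumed spacing of the n+1 jump targets */

uint32_t multiway_target(uint32_t encoded_base, const int *tests, int n)
{
    int index = n;                      /* default: fall-through slot */
    for (int i = 0; i < n; i++) {
        if (tests[i]) { index = i; break; }
    }
    return encoded_base + (uint32_t)index * SLOT_BYTES;
}

The point of the scheme is visible in the arithmetic: one address field plus a few test bits replaces n+1 full address fields in the instruction word.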
IV. CYDROME CYDRA-5 DIRECTED DATAFLOW ARCHITECTURE

Cydrome merged several innovative technologies into a balanced and functionally complete mini-supercomputer called the Cydra 5 departmental computer. The Cydra 5 architecture is a numeric processor and the compiler technology that goes hand-in-hand with it. The chief virtue of this combination is its ability to excel on a broad spectrum of computations. It enables users to achieve large performance gains over super-minis without re-arranging their existing application software and algorithms. Any serious attempt at high-performance computing involves the concurrent execution of multiple operations.

Fig. 3. Architecture of Cydra-5

A. The Directed Dataflow Architecture

There are two major forms of parallelism: coarse-grained and fine-grained. Coarse-grained parallelism, popularly referred to as parallel processing, means multiple processes running on multiple processors in a cooperative fashion to perform the job of a single program. In contrast, fine-grained parallelism exists within a process at the level of the individual operations that constitute the program. Vector, SIMD, and VLIW architectures are examples of architectures that use fine-grained parallelism.

The emphasis in a supercomputer is on the execution of arithmetic (particularly floating-point) operations. The starting point for every supercomputer architecture is multiple, pipelined, floating-point functional units, in addition to any needed for integer operations and memory accesses. The fundamental objective is to keep them as busy as possible. Assuming this will be achieved, the hardware must be equipped to provide two input operands per functional unit and to accept one result per functional unit per cycle.

The VLIW architecture is properly viewed as the logical extension of the scalar RISC architecture. The underlying model is one of scalar code to be executed, but with more than one operation issued per cycle. Using the trace scheduling technique, a VLIW can provide good speedup on scalar code. However, given the lack of architectural emphasis on loops, a VLIW does not do as well on vectorizable computations as a vector processor does. If greater generality is required in the capabilities of the architecture, a more general model of computation is needed: one that does well not only on scalar and vector code, like the VLIW and vector architectures, but also on the important class of loops that possess parallelism but are not vectorizable.

The directed dataflow architecture is also significantly influenced by another concept, that of moving complexity and functionality out of hardware and into software whenever possible. The benefits of this concept are reduced hardware cost and, often, the ability to make better decisions at compile time. In the directed dataflow architecture, the compiler makes most decisions regarding the scheduling of operations at compile time rather than at runtime.

The compiler takes a program and first creates the corresponding dataflow graph. It then enforces the rules of dataflow execution, with full knowledge of the execution latency of each operation, to produce a schedule that indicates exactly when and where each operation will be performed. While scheduling at compile time, the compiler can examine the whole program, in effect looking forward into the execution. It thus creates a better schedule than might have been possible with runtime scheduling. An instruction for a directed dataflow machine consists of a time slice out of this schedule, that is, the operations that the schedule specifies for initiation at the same time. Such an instruction causes multiple operations to be issued in a single cycle.
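A minimal sketch of this compile-time rule in C, using a made-up four-operation dataflow graph and latencies: earliest start cycles are derived from the dataflow edges, and each cycle's group of operations corresponds to one wide instruction.

#include <stdio.h>

/* With the latency of every operation known, the compiler walks the
 * dataflow graph and fixes the exact cycle at which each operation
 * starts. The graph below (ops listed in topological order) is a
 * made-up example: two loads feed an add, which feeds a store. */
enum { NOPS = 4 };

int main(void)
{
    /* pred[i][j] != 0 means op j feeds op i. */
    int pred[NOPS][NOPS] = {
        {0, 0, 0, 0},   /* op0: load a       */
        {0, 0, 0, 0},   /* op1: load b       */
        {1, 1, 0, 0},   /* op2: add a, b     */
        {0, 0, 1, 0},   /* op3: store result */
    };
    int latency[NOPS] = {2, 2, 1, 1};
    int start[NOPS];

    for (int i = 0; i < NOPS; i++) {        /* topological order assumed */
        start[i] = 0;
        for (int j = 0; j < i; j++)
            if (pred[i][j] && start[j] + latency[j] > start[i])
                start[i] = start[j] + latency[j];  /* dataflow rule */
    }

    /* Emit the schedule as time slices: one line = one wide instruction. */
    for (int cycle = 0; cycle <= start[NOPS - 1]; cycle++) {
        printf("cycle %d:", cycle);
        for (int i = 0; i < NOPS; i++)
            if (start[i] == cycle) printf(" op%d", i);
        printf("\n");
    }
    return 0;
}

Running this prints the two loads grouped into cycle 0's instruction, the add at cycle 2, and the store at cycle 3: exactly the "time slice" view of an instruction described above.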

B. The Architecture

The general-purpose subsystem consists of up to six interactive processors, up to 64 Mbytes of support memory, one or two I/O processors, and the service processor/system console, connected over a 100-Mbyte/s system bus. Each I/O processor handles up to three VME buses, to which the peripheral controllers are attached. Also connected to the system bus, via a 100-Mbyte/s port, is the pseudo-randomly interleaved main memory. The numeric processor has three dedicated ports into the main memory, each providing 100-Mbyte/s bandwidth. One of these is for instructions; the other two are for data. The main memory and support memory share a common address space and are both accessible from any processor. Fig. 3 shows the architecture of the Cydra 5.

V. TM3270 MEDIA PROCESSOR

The TM3270 media-processor is the latest TriMedia VLIW processor, tuned to address the performance demands of standard-definition video processing combined with embedded-processor requirements for the consumer market. The processor incorporates instruction set architecture (ISA) extensions and a load/store unit optimized for the video-processing domain. The TM3270 is a VLIW-based media-processor which is backward source-code compatible with other processors in the TriMedia family, i.e., C code written for previous TriMedia processors can be recompiled to run on the TM3270. Fig. 5 shows the key architectural features of the TM3270.

Fig. 5. Architecture summary of the TM3270

A. Operation Encoding

A VLIW instruction may contain up to five operations, which are template-based encoded in a compressed format to limit code size. Every VLIW instruction starts with a 10-bit template field, which specifies the compression of the operations in the next VLIW instruction. As a result, an instruction's compression template is available one cycle before the instruction's compressed encoding, which relaxes the timing requirements of the decoding process. Jump-target VLIW instructions are not compressed and do not require an explicit template field in the preceding instruction. The 10-bit template field has five 2-bit compression sub-fields, which are related to the processor's issue slots 1 through 5. An issue slot's 2-bit compression field specifies the size of the operation encoding. Fig. 4 gives an example of a VLIW instruction containing three operations in slots 2, 3, and 5. Issue slots 1 and 4 are not used, as specified by the '11' encoding of the related compression fields. Since issue slot 1 is not used, the first encoded operation is for issue slot 2. A VLIW instruction without any operations is efficiently encoded in 2 bytes, with a 11:11:11:11:11 template field. A VLIW instruction with all operations of the maximum size of 42 bits is encoded in 28 bytes, with a 10:10:10:10:10 template field and 5 × 42 bits for the operation encoding. This compression scheme allows for an efficient encoding of code with a low amount of instruction-level parallelism. Fig. 4 shows the instruction encoding scheme.

Fig. 4. VLIW instruction encoding of the TM3270
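The numbers above pin down the size arithmetic, so a small C sketch can reproduce it. The '11' (empty slot) and maximum-size 42-bit codes follow the text; the two smaller operation sizes are placeholders, since the text does not give them. With these rules, an empty instruction is ceil(10/8) = 2 bytes and a full one is ceil((10 + 5 × 42)/8) = 28 bytes, matching the figures quoted above.

#include <stdio.h>

/* Placeholder size table indexed by a slot's 2-bit compression code.
 * Codes 0b11 (empty) and 0b10 (42-bit maximum) follow the text; the
 * two smaller sizes are assumptions for illustration only. */
static const int op_bits[4] = {
    26,  /* 0b00: small encoding (assumed)  */
    34,  /* 0b01: medium encoding (assumed) */
    42,  /* 0b10: maximum-size operation    */
    0,   /* 0b11: issue slot unused         */
};

/* Total instruction size in bytes for a given 10-bit template. */
int instruction_bytes(unsigned template10)
{
    int bits = 10;                              /* the template field itself */
    for (int slot = 0; slot < 5; slot++) {
        unsigned code = (template10 >> (2 * slot)) & 3u;
        bits += op_bits[code];
    }
    return (bits + 7) / 8;                      /* round up to whole bytes */
}

int main(void)
{
    printf("%d\n", instruction_bytes(0x3FF));   /* 11:11:11:11:11 -> 2  */
    printf("%d\n", instruction_bytes(0x2AA));   /* 10:10:10:10:10 -> 28 */
    return 0;
}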

VI. INTEL ITANIUM ARCHITECTURE

The IA-64 instruction set architecture provides hardware support for instruction-level parallelism (ILP) and is very much different from superscalar architectures because of its support for predicated execution, control and data speculation, and software pipelining. Teaming up with Hewlett-Packard (HP), Intel developed a new 64-bit architecture, called IA-64, which exploits the vast circuitry and high speeds available on the newest generations of microchips through a systematic use of parallelism.

The basic concepts used in IA-64 are: instruction-level parallelism that is explicit in the machine instructions rather than being handled at run time by the processor; long or very long instruction words (LIW/VLIW); branch predication (not the same as branch prediction); and speculative loading. Intel and HP refer to this combination of concepts as explicitly parallel instruction computing (EPIC). The first Intel product based on this architecture was the Itanium.

With the Pentium, Intel made an attempt to use superscalar techniques, allowing two CISC instructions to execute at the same time. Then the Pentium Pro and the Pentium II through Pentium 4 included a mapping from CISC instructions to RISC-like micro-ops and used superscalar techniques, enabling the effective use of a chip with millions of transistors. We can increase processing speed by dumping those extra transistors into bigger on-chip caches, which increases performance but eventually reaches a point of diminishing returns, or by increasing the superscaling with more execution units, but we finally hit the complexity wall: branch prediction must be improved, out-of-order processing must be used, and longer pipelines must be employed. With wider and deeper pipelines, there is a greater penalty for branch misprediction, and out-of-order execution requires a large number of renaming registers and complex interlock circuitry to handle dependencies.

To address these problems, Intel and HP came up with an overall design that enables the effective use of a processor with many parallel execution units. The key concept of this new approach is explicit parallelism. With this approach, the compiler statically schedules the instructions at compile time, rather than having the processor dynamically schedule them at run time. The compiler determines which instructions can execute in parallel and includes this information with the machine instructions. The advantage of this approach is that an EPIC processor does not need circuitry as complex as that of an out-of-order superscalar processor.

A. General Organization

IA-64 can be implemented in a variety of organizations, with the following key features:

i. Large number of registers: The IA-64 instruction format assumes the use of 256 registers: 128 64-bit registers for integer, logical, and general-purpose use, and 128 82-bit registers for floating-point and graphics use. There are also 64 1-bit predicate registers used for predicated execution, as explained subsequently.

ii. Multiple execution units: A typical commercial superscalar machine today may support four parallel pipelines, using four parallel execution units in both the integer and floating-point portions of the processor. It is expected that IA-64 will be implemented on systems with eight or more parallel units.

The register file is quite large compared with those of most RISC and superscalar machines. Because we wish to make parallelism explicit and relieve the processor of the burden of register renaming and dependency analysis, we need a large number of explicit registers. Four types of execution unit are defined in the IA-64 architecture:

- I-unit: integer arithmetic, shift-and-add, logical, compare, and integer multimedia instructions
- M-unit: loads and stores between register and memory, plus some integer ALU operations
- B-unit: branch instructions
- F-unit: floating-point instructions

Fig. 6. Predication as implemented in Itanium (Intel)

B. IA-64 Instruction Format

IA-64 defines a 128-bit bundle that contains three instructions, called syllables, and a template field. The processor can fetch one or more bundles at a time; each bundle fetch brings in three instructions. The template field contains information that indicates which instructions can be executed in parallel. The interpretation of the template field is not confined to a single bundle; rather, the processor can look at multiple bundles to determine which instructions may be executed in parallel. For example, the instruction stream may be such that eight instructions can be executed in parallel. The compiler will reorder instructions so that these eight instructions span contiguous bundles and set the template bits so that the processor knows they are independent. Because of the flexibility of the template field, the compiler can mix independent and dependent instructions in the same bundle. Unlike some previous VLIW designs, IA-64 does not need to insert null-operation (NOP) instructions to fill the bundles.

Each instruction has a fixed-length 41-bit format. This is somewhat longer than the traditional 32-bit length found on RISC and RISC-superscalar machines. Two factors lead to the additional bits. First, IA-64 makes use of more registers than a typical RISC machine: 128 integer and 128 floating-point registers. Second, to accommodate the predicated-execution technique, an IA-64 machine includes 64 predicate registers. All instructions include a 4-bit major opcode and a reference to a predicate register. Although the major opcode field can only discriminate among 16 possibilities, its interpretation depends on the template value and the location of the instruction within a bundle, thus affording more possible opcodes. Typical instructions also include three fields to reference registers, leaving 10 bits for the other information needed to fully specify the instruction.
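Since a bundle is 128 bits and holds three 41-bit syllables, the template field must be 128 − 3 × 41 = 5 bits. The following C sketch pulls a bundle apart; the exact bit placement (template in the low-order bits, syllables packed above it) is an assumption for illustration.

#include <stdint.h>
#include <stdio.h>

/* A 128-bit IA-64 bundle represented as two 64-bit halves
 * (lo holds bits 0..63, hi holds bits 64..127). */
typedef struct { uint64_t lo, hi; } bundle_t;

/* Extract a bit field [pos, pos+len) that may straddle the two halves. */
static uint64_t bits(bundle_t b, int pos, int len)
{
    uint64_t v;
    if (pos + len <= 64)  v = b.lo >> pos;
    else if (pos >= 64)   v = b.hi >> (pos - 64);
    else                  v = (b.lo >> pos) | (b.hi << (64 - pos));
    return v & (~0ull >> (64 - len));
}

int main(void)
{
    bundle_t b = { 0x0123456789abcdefull, 0xfedcba9876543210ull };

    uint64_t template5 = bits(b, 0, 5);    /* 5-bit template field  */
    uint64_t syllable0 = bits(b, 5, 41);   /* instruction slot 0    */
    uint64_t syllable1 = bits(b, 46, 41);  /* instruction slot 1    */
    uint64_t syllable2 = bits(b, 87, 41);  /* instruction slot 2    */

    /* Within a 41-bit syllable: 4-bit major opcode at the top and a
     * 6-bit predicate register number at the bottom (64 predicate
     * registers need 6 bits; 4 + 3*7 + 6 = 31 bits, leaving 10 for
     * other uses, matching the field counts in the text). */
    unsigned major_op = (unsigned)(syllable0 >> 37) & 0xF;
    unsigned qp       = (unsigned)(syllable0 & 0x3F);

    printf("template=%llx slot0 major_op=%u qp=%u\n",
           (unsigned long long)template5, major_op, qp);
    (void)syllable1; (void)syllable2;
    return 0;
}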

C. Predicated Execution

Predication is a technique whereby the compiler determines which instructions may execute in parallel. In the process, the compiler eliminates branches from the program by using conditional execution. A typical example in a high-level language is an if-then-else construct. A traditional compiler inserts a conditional branch at the if point of this construct. If the condition has one logical outcome, the branch is not taken and the next block of instructions is executed, representing the then path; at the end of this path is an unconditional branch around the next block, representing the else path. If the condition has the other logical outcome, the branch is taken around the then block of instructions and execution continues at the else block. The two instruction streams join together after the end of the else block. An IA-64 compiler instead does the following. As an example, consider the following source code:

if (a && b)
    j = j + 1;
else if (c)
    k = k + 1;
else
    k = k - 1;
i = i + 1;

Two if statements jointly select one of three possible execution paths. This can be compiled into the following code, using the Pentium assembly language. The program has three conditional branch and two unconditional branch instructions:

Assembly Code:

        cmp a, 0    ; compare a with 0
        je  L1      ; branch to L1 if a = 0
        cmp b, 0
        je  L1
        add j, 1    ; j = j + 1
        jmp L3
L1:     cmp c, 0
        je  L2
        add k, 1    ; k = k + 1
        jmp L3
L2:     sub k, 1    ; k = k - 1
L3:     add i, 1    ; i = i + 1

In the Pentium assembly language, a semicolon is used to delimit a comment. Fig. 6 shows a flow diagram of this assembly code; the diagram breaks the assembly-language program into separate blocks of code. For each block that executes conditionally, the compiler can assign a predicate; these predicates are indicated in the figure. Assuming that all of these predicates have been initialized to false, the resulting IA-64 assembly code is as follows:

Predicated Code:

(1)      cmp.eq p1, p2 = 0, a ;;
(2) (p2) cmp.eq p1, p3 = 0, b
(3) (p3) add j = 1, j
(4) (p1) cmp.ne p4, p5 = 0, c
(5) (p4) add k = 1, k
(6) (p5) add k = -1, k
(7)      add i = 1, i

Fig. 7. IA-64 predication and speculative loading (Intel)

Instruction (1) compares the contents of symbolic register a with 0; it sets the value of predicate register p1 to 1 (true) and p2 to 0 (false) if the relation is true, and sets p1 to 0 and p2 to 1 if the relation is false. Instruction (2) is to be executed only if predicate p2 is true (i.e., if a is true, which is equivalent to a ≠ 0). The processor will fetch, decode, and begin executing this instruction, but only makes a decision as to whether to commit the result after it determines whether the value of predicate register p2 is 1 or 0. Note that instruction (2) is a predicate-generating instruction and is itself predicated; such an instruction requires three predicate-register fields in its format. Returning to our Pentium program, the first two conditional branches in the Pentium assembly code are translated into two IA-64 predicated compare instructions. If instruction (1) sets p2 to false, instruction (2) is not executed. After instruction (2) in the IA-64 program, p3 is true only if the outer if statement in the source code is true; that is, predicate p3 is true only if the expression (a AND b) is true (i.e., a ≠ 0 AND b ≠ 0). The then part of the outer if statement is predicated on p3 for this reason. Instruction (4) of the IA-64 code decides whether the addition or the subtraction instruction in the outer else part is performed. Finally, the increment of i is performed unconditionally. Looking at the source code and then at the predicated code, we see that only one of instructions (3), (5), and (6) is to be executed. In an ordinary superscalar processor, we would use branch prediction to guess which of the three is to be executed and go down that path. If the processor guesses wrong, the pipeline must be flushed. An IA-64 processor can begin execution of all three of these instructions and, once the values of the predicate registers are known, commit only the results of the valid instruction. Thus, we make use of additional parallel execution units to avoid the delays due to pipeline flushing.

D. Control Speculation

Another key innovation in IA-64 is control speculation, also known as speculative loading. This enables the processor to load data from memory before the program needs it, to avoid memory latency delays; the processor also postpones the reporting of exceptions until it becomes necessary to report them. The term hoist is used to refer to the movement of a load instruction to a point earlier in the instruction stream. Since load latency is crucial to performance, the delays in obtaining data from memory become a bottleneck. To minimize this, we would like to rearrange the code so that loads are done as early as possible.

A load cannot be unconditionally moved above a branch, because the load may not actually occur. We could move the load conditionally, using predicates, so that the data could be retrieved from memory but not committed to an architectural register until the outcome of the predicate is known, or we can use branch prediction techniques. If an exception occurs, an invalid address or a page fault could be generated; the processor would then have to deal with the exception or fault, causing a delay. Instead, a load instruction in the original program is replaced by two instructions:

- A speculative load (ld.s) executes the memory fetch and performs exception detection, but does not deliver the exception (call the OS routine that handles the exception). This ld.s instruction is hoisted to an appropriate point earlier in the program.
- A checking instruction (chk.s) remains in the place of the original load and delivers exceptions. This chk.s instruction may be predicated so that it will only execute if the predicate is true.

If the ld.s detects an exception, it sets a token bit associated with the target register, known as the Not a Thing (NaT) bit. If the corresponding chk.s instruction is executed and the NaT bit is set, the chk.s instruction branches to an exception-handling routine.

Let us look at a simple example, taken from [INTE00a, Volume 1]. Here is the original program:

(p1) br some_label     // Cycle 0
     ld8 r1 = [r5] ;;  // Cycle 1
     add r2 = r1, r3   // Cycle 3

The compiler can rewrite this code using a control-speculative load and a check:

     ld8.s r1 = [r5] ;;  // Cycle -2
     // Other instructions
(p1) br some_label       // Cycle 0
     chk.s r1, recovery  // Cycle 0
     add r2 = r1, r3     // Cycle 0

The speculative load does not immediately signal an exception when one is detected; it just records that fact by setting the NaT bit for the target register (in this case, r1). The speculative load now executes unconditionally at least two cycles prior to the branch. The chk.s instruction then checks whether the NaT bit is set on r1; if not, execution simply falls through to the next instruction.

E. Data Speculation

In data speculation, a load is moved before a store instruction that might alter the memory location that is the source of the load. A subsequent check is made to assure that the load received the proper memory value. Consider the following program fragment:

st8 [r4] = r12      // Cycle 0
ld8 r6 = [r8] ;;    // Cycle 0
add r5 = r6, r7 ;;  // Cycle 2
st8 [r18] = r5      // Cycle 3

As written, the code requires four instruction cycles to execute. If registers r4 and r8 do not contain the same memory address, then the store through r4 cannot affect the value at the address contained in r8; under this circumstance, it is safe to reorder the load and store to more quickly bring the value into r6, which is needed subsequently. However, because the addresses in r4 and r8 may be the same or may overlap, such a swap is not safe. IA-64 overcomes this problem with the use of a technique known as the advanced load:

ld8.a r6 = [r8] ;;  // Cycle -2 or earlier; advanced load
// other instructions
st8 [r4] = r12      // Cycle 0
ld8.c r6 = [r8]     // Cycle 0; check load
add r5 = r6, r7 ;;  // Cycle 0
st8 [r18] = r5      // Cycle 1

Here we have moved the ld instruction earlier and converted it into an advanced load. In addition to performing the specified load, the ld8.a instruction writes its source address (the address contained in r8) to a hardware data structure known as the Advanced Load Address Table (ALAT).

Each IA-64 store instruction checks the ALAT for entries that overlap with its target address; if a match is found, the ALAT entry is removed. When the original ld8 is converted to an ld8.a instruction and moved, its original position is replaced with a check-load instruction, ld8.c. When the check load is executed, it checks the ALAT for a matching address. If one is found, no store instruction between the advanced load and the check load has altered the source address of the load, and no action is taken. However, if the check-load instruction does not find a matching ALAT entry, the load operation is performed again to assure the correct result.

We may also want to speculatively execute instructions that are data dependent on a load instruction, together with the load itself. Starting with the same original program, suppose we move up both the load and the subsequent add instruction:

ld8.a r6 = [r8] ;;  // Cycle -3 or earlier; advanced load
// other instructions
add r5 = r6, r7     // Cycle -1; add that uses r6
// other instructions
st8 [r4] = r12      // Cycle 0
chk.a r6, recover   // Cycle 0; check
// return point from jump to recover
back:
st8 [r18] = r5      // Cycle 0

Here we use a chk.a instruction rather than an ld8.c instruction to validate the advanced load. If the chk.a instruction determines that the load has failed, it cannot simply re-execute the load; instead, it branches to a recovery routine to clean up:

recover:
ld8 r6 = [r8] ;;    // reload r6 from [r8]
add r5 = r6, r7 ;;  // re-execute the add
br  back            // jump back to main code

This technique is effective only if the loads and stores involved have little chance of overlapping.
F. Software Pipelining

Software pipelining (SWP) is the term for overlapping the execution of consecutive loop iterations. SWP is a performance technique that can be applied in just about every computer architecture, and it is closely related to loop unrolling.

Fig. 8 shows a conceptual block diagram of a software pipeline. The loop code is separated into four pipeline stages, and six iterations of the loop are shown (i = 1 to 6). Notice how the pipeline stages overlap: the second iteration (i = 2) can begin while the first iteration (i = 1) begins its second stage. This overlap of iterations increases the number of parallel operations that can be executed in the processor, and more parallel operations help increase performance. Many, but not all, loops may be software pipelined.

Fig. 8. Software pipelining in VLIW (Intel)
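As an illustration of the idea (ordinary C, not IA-64 code), here is a hypothetical loop split by hand into two overlapped stages, with an explicit prolog and epilog:

/* Hand-software-pipelined copy-and-scale loop, illustrative only.
 * Stage 1 of iteration i (the load) runs in the same trip through the
 * loop body as stage 2 of iteration i-1 (the multiply/store), mimicking
 * how SWP overlaps consecutive iterations. Assumes n >= 1. */
void scale(float *dst, const float *src, int n)
{
    float loaded = src[0];              /* prolog: stage 1 of iteration 0 */
    for (int i = 1; i < n; i++) {
        float next = src[i];            /* stage 1 of iteration i   */
        dst[i - 1] = loaded * 2.0f;     /* stage 2 of iteration i-1 */
        loaded = next;                  /* "rotate" the register    */
    }
    dst[n - 1] = loaded * 2.0f;         /* epilog: drain the pipeline */
}

On a wide machine, the load of iteration i and the multiply of iteration i-1 are independent and can be packed into the same long instruction.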
G. Modulo Loop Scheduling

Modulo loop scheduling is just one software pipelining technique. There is direct support in the IA-64 architecture for modulo loop scheduling: rotating registers, predicates, special branch instructions, and loop-count registers. This architectural support makes it easy to create SWP loops, and the resulting code has high performance and is compact. Modulo scheduling involves developing a schedule for one loop iteration such that, when the schedule is repeated at regular intervals, no intra- or inter-iteration dependency is violated and no resource-usage conflict arises. The initiation interval (ii) is essentially the length of one pipeline stage: the number of clock cycles it takes to execute one pass through the loop instructions, or the number of clock cycles between starting iteration i and starting iteration i+1. When comparing SWP loops, compare the ii and the number of stages. For high loop counts, the ii is more important than the number of stages: with high loop counts, more time is spent in the loop's kernel than in the prolog and epilog, and the kernel time is determined by the number of iterations and the ii (a small numeric sketch of this follows the list below).

The key features that support software pipelining are:

- Automatic register renaming: A fixed-sized area of the predicate and floating-point register files (p16 to p63; fr32 to fr127) and a programmable-sized area of the general register file (maximum range of r32 to r127) are capable of rotation. This means that during each iteration of a software-pipelined loop, register references within these ranges are automatically incremented. Thus, if a loop makes use of general register r32 on the first iteration, it automatically makes use of r33 on the second iteration, and so on.
- Predication: Each instruction in the loop is predicated on a rotating predicate register. The purpose of this is to determine whether the pipeline is in the prolog, kernel, or epilog phase, as explained subsequently.
- Special loop-terminating instructions: These are branch instructions that cause the registers to rotate and the loop count to decrement.
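A quick numeric sketch of the kernel-dominance claim (the ii, stage count, and trip counts below are arbitrary): a pipelined loop with S stages and initiation interval ii finishes n iterations in roughly (n + S − 1) × ii cycles, so as n grows, the kernel term n × ii dominates the prolog/epilog term (S − 1) × ii.

#include <stdio.h>

/* Total cycles for a software-pipelined loop: one new iteration starts
 * every ii cycles, and the last iteration needs S stages to drain. */
long swp_cycles(long n, long stages, long ii)
{
    return (n + stages - 1) * ii;
}

int main(void)
{
    long ii = 2, stages = 4;
    for (long n = 10; n <= 1000; n *= 10) {
        long total  = swp_cycles(n, stages, ii);
        long kernel = n * ii;           /* time attributable to the kernel */
        printf("n=%4ld total=%5ld kernel=%5ld (%.0f%%)\n",
               n, total, kernel, 100.0 * kernel / total);
    }
    return 0;
}

At n = 10 the kernel accounts for about 77% of the time; at n = 1000 it is over 99%, which is why ii matters more than the stage count for high loop counts.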
VII. CONCLUSION

VLIW has found enormous success in DSP and signal processing applications over the years, although its use in general-purpose computing has been limited (IA-64 is one example). The earlier VLIW architectures did not stand the test of time, partly due to the technology they were built on, and partly due to better architectures combined with the lower cost of silicon real estate.

REFERENCES

[1] J. A. Fisher, "Very Long Instruction Word Architectures and the ELI-512," ACM, 1983.
[2] R. P. Colwell, "VLIW: The Unlikeliest Computer Architecture," IEEE Solid-State Circuits Magazine, Spring 2009.
[3] B. R. Rau et al., "The Cydra 5 Departmental Supercomputer: Design Philosophies, Decisions, and Trade-offs," IEEE, 1989.
[4] B. R. Rau, "Cydra 5 Directed Dataflow Architecture," IEEE, 1988.
[5] J.-W. van de Waerdt et al., "The TM3270 Media-Processor," Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), IEEE, 2005.
[6] J. Bharadwaj et al., "The Intel IA-64 Compiler Code Generator," IEEE Micro, September/October 2000.
[7] A. Chasin, "Predication, Speculation, and Modern CPUs," Dr. Dobb's Journal, May 2000.
[8] C. Dulong, "The IA-64 Architecture at Work," Computer, July 1998.
[9] J. Evans and G. Trimper, Itanium Architecture for Programmers, Upper Saddle River, NJ: Prentice Hall, 2003.
[10] J. Huck et al., "Introducing the IA-64 Architecture," IEEE Micro, September/October 2000.
[11] W. Hwu, "Introduction to Predicated Execution," Computer, January 1998.
[12] W. Hwu, D. August, and J. Sias, "Program Decision Logic Optimization Using Predication and Control Speculation," Proceedings of the IEEE, November 2001.
[13] S. Jarp, "Optimizing IA-64 Performance," Dr. Dobb's Journal, July 2001.
[14] V. Kathail, M. Schlansker, and B. Rau, "Compiling for EPIC Architectures," Proceedings of the IEEE, November 2001.
[15] H. Sharangpani and K. Arora, "Itanium Processor Microarchitecture," IEEE Micro, September/October 2000.
[16] W. Triebel, Itanium Architecture for Software Developers, Intel Press, 2001.
