Fig. 1. (a) A flow graph, with each block representing a basic block of code. (b) A trace picked from the flow graph. (c) The trace has been scheduled but has not yet been relinked to the rest of the code. (d) The sections of unscheduled code that allow relinking. (Joseph Fisher)
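The trace-scheduling flow in Fig. 1 (pick the most frequently executed path through the basic blocks as a trace, schedule it as straight-line code, then relink the off-trace code) can be sketched as follows. This is a hedged illustration, not the actual compiler: the block names, branch probabilities, and the greedy most-likely-successor heuristic are assumptions for the example.

```python
# Sketch of trace picking (Fig. 1a-b): follow the most probable successor
# of each basic block until a block already on some trace is reached.
# Block names and branch probabilities are invented for illustration.

def pick_trace(cfg, entry, picked):
    """cfg maps a block to a list of (successor, probability) pairs."""
    trace = []
    block = entry
    while block is not None and block not in picked:
        trace.append(block)
        picked.add(block)
        succs = [s for s in cfg.get(block, []) if s[0] not in picked]
        block = max(succs, key=lambda s: s[1])[0] if succs else None
    return trace

cfg = {
    "B1": [("B2", 0.9), ("B3", 0.1)],  # B1 usually falls through to B2
    "B2": [("B4", 1.0)],
    "B3": [("B4", 1.0)],
    "B4": [],
}
picked = set()
main_trace = pick_trace(cfg, "B1", picked)  # most likely path
side_trace = pick_trace(cfg, "B3", picked)  # leftover off-trace code
print(main_trace)  # ['B1', 'B2', 'B4']
print(side_trace)  # ['B3']
```

The main trace is scheduled as if it were one basic block; the leftover traces (here just B3) are scheduled afterwards and relinked, as in Fig. 1(c)-(d).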
applications, there will be several loops running one after another to perform operations on large arrays. In many cases, two such loops can be combined, as they operate on separate arrays which don't have a true data dependency between them. Many VLIW compilers combine such independent loops into a single loop, which helps reduce computation. Loop unrolling helps reduce the number of iterations. Both these techniques effectively reduce the number of conditional jumps in the code and increase the length of the trace.

II. MULTIFLOW ENORMOUSLY LONG INSTRUCTION WORD-512

The ELI-512 uses a 512-bit instruction word and can perform the following operations per cycle:
- 16 ALU operations: 8 will be 32-bit integer operations, and 8 will be done using 64-bit ALUs with a varied repertoire, including pipelined floating-point calculations.
- 8 pipelined memory references.
- 32 register accesses.
- A multiway conditional jump based on several independent tests.

To carry out these operations, the ELI has 8 M-clusters and 8 F-clusters arranged in a circular fashion and interconnected as shown in Fig. 2. Each M-cluster has a local memory module, a fixed-point ALU, a multiport register bank, and a crossbar switch which can select from 8 buses. Each F-cluster has a floating-point ALU, an FP register bank, and a crossbar switch. TTL ICs were used in implementing the ELI-512 [2].

Trace scheduling requires a mechanism to jump to one of n+1 locations based on the results of n tests. However, producing n+1 addresses in an instruction takes a lot more space. To circumvent this problem, only one jump address is specified in the VLIW instruction, and the machine calculates the jump address by combining this address with some bits of the test results. The machine also supports delayed branches, by which flushing of the pipeline can be avoided when jumps occur in a program. Delayed branches work by moving code from ahead of a branch to below the branch, such that when the branch decision is taken, the moved instructions already in the pipeline get executed (rather than creating bubbles in the pipeline), and the fetch from the jump location happens after these instructions are executed.

Many of the operations packed in one instruction may be memory references. This causes trouble, as many functional units try to access the memory bus at the same time, and the arbitration scheme creates further delay. To reduce this, the ELI has a banked memory scheme, effectively increasing the bandwidth of the memory system. The compiler takes care that memory accesses which occur in a single instruction go to different banks in memory, avoiding conflicts. But there are still problems with arrays and the loops operating on them. The anti-aliasing system used when unrolling loops takes care of this. However, pointer arithmetic and its use for accessing memory can lead the machine to access a location unpredictable by the compiler. This is dealt with by introducing additional shadow ports in the memory, dedicated to such unpredictable operations. With all banks combined, the bandwidth of the memory system in the ELI-512 is 400 Mbytes/second.

III. CYDROME CYDRA-5 DIRECTED DATAFLOW ARCHITECTURE

Cydrome has merged several innovative technologies into a balanced and functionally complete mini-supercomputer called the Cydra 5 departmental computer. The architecture of the Cydra 5 is a numeric processor and the compiler technology that goes hand-in-hand with it. The chief virtue of this combination is its ability to excel on a broad spectrum of computations. It enables users to achieve larger performance gains over superminis without rearranging their existing application software and algorithms. Any serious attempt at high-performance computing involves the concurrent execution of multiple operations.

A. The Directed dataflow architecture

BITS-Pilani K.K. Birla Goa Campus
There are two major forms of parallelism: coarse-grained and fine-grained. Coarse-grained parallelism, popularly referred to as parallel processing, means multiple processes running on multiple processors in a cooperative fashion to perform the job of a single program. In contrast, fine-grained parallelism exists within a process, at the level of the individual operations that constitute the program. Vector, SIMD, and VLIW architectures are examples of architectures that use fine-grained parallelism.

The emphasis in a supercomputer is on the execution of arithmetic (particularly floating-point) operations. The starting point for all supercomputer architectures is multiple, pipelined, floating-point functional units, in addition to any needed for integer operations and memory accesses. The fundamental objective is to keep them as busy as possible. Assuming this will be achieved, the hardware must be equipped to provide two input operands per functional unit and to accept one result per functional unit per cycle.

The VLIW architecture is properly viewed as the logical extension of the scalar RISC architecture. The underlying model is one of scalar code to be executed, but with more than one operation issued per cycle. Using the trace scheduling technique, VLIW can provide good speedup on scalar code. However, given the lack of architectural emphasis on loops, VLIW does not do as well on vectorizable computations as a vector processor does. If greater generality is required in the capabilities of the architecture, a more general model of computation is needed: one that does well not only on scalar and vector code, like the VLIW and vector architectures, but also on an important class of loops that possess parallelism but are not vectorizable.
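As a toy illustration of issuing more than one operation per cycle, the following sketch greedily packs a scalar operation sequence into wide instructions, subject only to data dependences and an assumed issue width. The three-address operation format and the width of 4 are assumptions for the example, not the Cydra 5's or any real compiler's actual scheduler.

```python
# Greedy packing of scalar operations into wide (VLIW-style) instructions.
# Each op is (dest, src1, src2); an op may issue once both of its sources
# are available from the initial registers or an earlier wide instruction.

def pack(ops, initial, width=4):
    avail = set(initial)          # values produced so far
    remaining = list(ops)
    words = []                    # each word = ops issued in the same cycle
    while remaining:
        word = [op for op in remaining
                if op[1] in avail and op[2] in avail][:width]
        if not word:
            raise ValueError("cyclic or unsatisfiable dependences")
        words.append(word)
        remaining = [op for op in remaining if op not in word]
        avail |= {op[0] for op in word}
    return words

ops = [
    ("t1", "a", "b"),    # t1 = a + b
    ("t2", "c", "d"),    # t2 = c + d   (independent of t1)
    ("t3", "t1", "t2"),  # t3 = t1 + t2 (depends on both)
    ("t4", "e", "f"),    # t4 = e + f   (independent)
]
words = pack(ops, initial={"a", "b", "c", "d", "e", "f"})
print(len(words))  # 2: t1, t2 and t4 issue together, then t3
```

Four sequential operations complete in two wide instructions; the dependent operation t3 alone forces the second cycle.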
The directed dataflow architecture is also significantly influenced by another concept: moving complexity and functionality out of hardware and into software whenever possible. The benefits of this concept are reduced hardware cost and, often, the ability to make better decisions at compile time. In the directed dataflow architecture, the compiler makes most decisions regarding the scheduling of operations at compile time rather than at runtime.

The compiler takes a program and first creates the corresponding dataflow graph. It then enforces the rules of dataflow execution, with full knowledge of the execution latency of each operation, to produce a schedule that indicates exactly when and where each operation will be performed. While scheduling at compile time, the compiler can examine the whole program, in effect looking forward into the execution. It thus creates a better schedule than might have been possible with runtime scheduling. An instruction for a directed dataflow machine consists of a time
slice out of this schedule, that is, the operations that the schedule specifies for initiation at the same time. Such an

Fig. 2. Architecture of the Multiflow ELI-512 showing interconnections of the functional blocks (Joseph Fisher)
instruction causes multiple operations to be issued in a single instruction.

B. The architecture

The general-purpose subsystem consists of up to six interactive processors, up to 64 Mbytes of support memory, one or two I/O processors, and the service processor/system console, connected over a 100-Mbyte/s system bus. Each I/O processor handles up to three VME buses, to which the peripheral controllers are attached. Also connected to the system bus, via a 100-Mbyte/s port, is the pseudorandomly interleaved main memory. The numeric processor has three dedicated ports into the main memory, each providing 100-Mbyte/s bandwidth. One of these is for instructions; the other two are for data. The main memory and support memory share a common address space and are both accessible from any processor. Fig. 3 shows the architecture of the Cydra 5.

IV. TM3270 MEDIA PROCESSOR

The TM3270 media-processor is the latest TriMedia VLIW processor, tuned to address the performance demands of standard-definition video processing, combined with embedded processor requirements for the consumer market. The processor incorporates instruction set architectural (ISA)

A. Operation Encoding

A VLIW instruction may contain up to five operations, which are template-based encoded in a compressed format to limit code size. Every VLIW instruction starts with a 10-bit template field, which specifies the compression of the operations in the next VLIW instruction. As a result, an instruction's compression template is available one cycle before the instruction's compressed encoding, which relaxes the timing requirements of the decoding process. Jump-target VLIW instructions are not compressed and do not require an explicit template field in the preceding instruction. The 10-bit template field has five 2-bit compression sub-fields, which are related to the processor's issue slots 1 through 5. An issue slot's 2-bit compression field specifies the size of the operation encoding. Fig. 4 gives an example of a VLIW instruction containing three operations in slots 2, 3, and 5. Issue slots 1 and 4 are not used, as specified by the 11 encoding of the related compression fields. Since issue slot 1 is not used, the first encoded operation is for issue slot 2. A VLIW instruction without any operations is efficiently encoded in 2 bytes, with a 11:11:11:11:11 template field. A VLIW with all operations of the maximum size of 42 bits is encoded in 28 bytes, with a 10:10:10:10:10 template field and 5 * 42 bits for the operation encoding. This compression scheme allows for an efficient encoding of code with a low amount of instruction-level parallelism. Fig. 4 shows the instruction encoding scheme.

Fig. 4. VLIW Instruction Encoding of TM3270

Fig. 5. Architecture summary of TM3270

V. INTEL ITANIUM ARCHITECTURE

The IA-64 instruction set architecture provides hardware support for instruction-level parallelism (ILP) and is very different from superscalar architectures because of its support for predicated execution, control and data speculation, and software pipelining. Teaming up with Hewlett-Packard (HP), Intel developed a new 64-bit architecture, called IA-64, which exploits the vast circuitry and high speeds available on the newest generations of microchips through a systematic use of parallelism.

The basic concepts used in IA-64 are: instruction-level parallelism that is explicit in the machine instructions rather than being handled at run time by the processor; long or very long instruction words (LIW/VLIW); branch predication (not the same as branch prediction); and speculative loading. Intel and HP refer to this combination of concepts as explicitly parallel instruction computing (EPIC). The first Intel product based on this architecture was Itanium.

With the Pentium, Intel made an attempt to use superscalar
techniques, allowing two CISC instructions to execute at the same time. Then, the Pentium Pro and Pentium II through Pentium 4 included a mapping from CISC instructions to RISC-like micro-ops and the use of superscalar techniques, enabling the effective use of a chip with millions of transistors. We can increase the processing speed by dumping those extra transistors into bigger on-chip caches, which can increase performance but eventually reaches a point of diminishing returns, or by increasing the superscaling with more execution units, which finally hits the complexity wall: branch prediction must be improved, out-of-order processing must be used, and longer pipelines must be employed. But with wider and deeper pipelines, there is a greater penalty for branch misprediction, and out-of-order execution requires a large number of renaming registers and complex interlock circuitry to handle dependencies.

To address these problems, Intel and HP came up with an overall design that enables the effective use of a processor with many parallel execution units. The key concept of this new approach is explicit parallelism. With this approach, the compiler statically schedules the instructions at compile time, rather than having the processor dynamically schedule them at run time. The compiler determines which instructions can execute in parallel and includes this information with the machine instructions. The advantage of this approach is that the EPIC processor does not need circuitry as complex as that of an out-of-order superscalar processor.

A. General Organization

IA-64 can be implemented in a variety of organizations, with these key features:

i. Large number of registers: The IA-64 instruction format assumes the use of 256 registers: 128 64-bit registers for integer, logical, and general-purpose use, and 128 82-bit registers for floating-point and graphics use. There are also 64 1-bit predicate registers used for predicated execution, as explained subsequently.

ii. Multiple execution units: A typical commercial superscalar machine today may support four parallel pipelines, using four parallel execution units in both the integer and floating-point portions of the processor. It is expected that IA-64 will be implemented on systems with eight or more parallel units.

The register file is quite large compared with most RISC and superscalar machines. Because we wish to make parallelism explicit and relieve the processor of the burden of register renaming and dependency analysis, we need a large number of explicit registers.

Four types of execution unit are defined in the IA-64 architecture:
I-unit: For integer arithmetic, shift-and-add, logical, compare, and integer multimedia instructions
M-unit: Load and store between register and memory, plus some integer ALU operations
B-unit: Branch instructions
F-unit: Floating-point instructions

Fig. 6. Predication as implemented in Itanium (Intel)

B. IA-64 instruction format

IA-64 defines a 128-bit bundle that contains three instructions, called syllables, and a template field. The processor can fetch instructions one or more bundles at a time; each bundle fetch brings in three instructions. The template field contains information that indicates which instructions can be executed in parallel. The interpretation of the template field is not confined to a single bundle: the processor can look at multiple bundles to determine which instructions may be executed in parallel. For example, the instruction stream may be such that eight instructions can be executed in parallel. The compiler will reorder instructions so that these eight instructions span contiguous bundles and set the template bits so that the processor knows that these eight instructions are independent. Because of the flexibility of the template field, the compiler can mix independent and dependent instructions in the same bundle. Unlike some previous VLIW designs, IA-64 does not need to insert null-operation (NOP) instructions to fill in the bundles.

Each instruction has a fixed-length 41-bit format. This is somewhat longer than the traditional 32-bit length found on RISC and RISC superscalar machines. Two factors lead to the additional bits. First, IA-64 makes use of more registers than a typical RISC machine: 128 integer and 128 floating-point registers. Second, to accommodate the predicated execution technique, an IA-64 machine includes 64 predicate registers. All instructions include a 4-bit major op code and a reference to a predicate register. Although the major op code field can only discriminate among 16 possibilities, the interpretation of the major op code field depends on the template value and the location of the instruction within a bundle, thus affording more possible op codes. Typical instructions also include three fields to reference registers, leaving 10 bits for other information needed to fully specify the instruction.
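The bundle layout just described can be checked with a small sketch: three 41-bit syllables occupy 3 * 41 = 123 bits, leaving 128 - 123 = 5 bits for the template field. The exact bit placement below (template in the low 5 bits, syllable 0 immediately above it) is an assumption for illustration; the actual field positions are defined in Intel's manuals.

```python
# A 128-bit IA-64 bundle: three 41-bit syllables plus a template field.
# 3 * 41 = 123 bits, so 128 - 123 = 5 bits remain for the template.
# The bit ordering used here is an assumed layout for illustration only.

SYL_BITS, TMPL_BITS = 41, 5

def pack_bundle(template, syllables):
    assert template < (1 << TMPL_BITS) and len(syllables) == 3
    bundle = template
    for i, syl in enumerate(syllables):
        assert syl < (1 << SYL_BITS)          # each syllable fits in 41 bits
        bundle |= syl << (TMPL_BITS + i * SYL_BITS)
    return bundle

def unpack_bundle(bundle):
    template = bundle & ((1 << TMPL_BITS) - 1)
    syls = [(bundle >> (TMPL_BITS + i * SYL_BITS)) & ((1 << SYL_BITS) - 1)
            for i in range(3)]
    return template, syls

b = pack_bundle(0x10, [0x1FFFFFFFFFF, 0x123, 0])  # 0x1FFFFFFFFFF = 2**41 - 1
assert b < (1 << 128)                             # the bundle fits in 128 bits
assert unpack_bundle(b) == (0x10, [0x1FFFFFFFFFF, 0x123, 0])
```

The round-trip assertion confirms the arithmetic: three maximum-size syllables and a template together exactly fill the 128-bit bundle.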
L1: cmp c, 0
je L2
add k, 1 ; k = k + 1
jmp L3
L2: sub k, 1 ; k = k - 1
L3: add i, 1 ; i = i + 1
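The fragment above is the classic if-then-else candidate for predication: rather than branching, both arms can execute under complementary predicates set by the compare. A minimal sketch of this if-conversion follows; the predicate flags p1/p2 are illustrative stand-ins for predicate registers, not literal Itanium output.

```python
# If-conversion of the branchy fragment above: the compare produces
# complementary predicates, and each predicated operation takes effect
# only when its predicate is true. Names p1/p2 are illustrative.

def run_branchy(c, k, i):
    """The original control flow: branch to one arm or the other."""
    if c == 0:
        k -= 1              # L2: sub k, 1
    else:
        k += 1              # add k, 1
    return k, i + 1         # L3: add i, 1

def run_predicated(c, k, i):
    """Branch-free version: both arms issued, guarded by predicates."""
    p1 = (c != 0)           # cmp c, 0 sets complementary predicates
    p2 = (c == 0)
    k = k + 1 if p1 else k  # (p1) add k, 1
    k = k - 1 if p2 else k  # (p2) sub k, 1
    i = i + 1               # unpredicated: add i, 1
    return k, i

for c in (0, 5):
    assert run_branchy(c, 10, 0) == run_predicated(c, 10, 0)
```

Both versions compute the same result, but the predicated form has no branch to mispredict, at the cost of issuing both arms.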
If the ld.s detects an exception, it sets a token bit associated with the target register, known as the Not a Thing (NaT) bit. If the corresponding chk.s instruction is executed and the NaT bit is set, the chk.s instruction branches to an exception-handling routine.

Here we have moved the ld instruction earlier and converted it into an advanced load. In addition to performing the specified load, the ld8.a instruction writes its source address (the address contained in r8) to a hardware data structure known as the Advanced Load Address Table (ALAT). Each IA-64 store instruction checks the ALAT for entries that overlap with
its target address; if a match is found, the ALAT entry is removed. When the original ld8 is converted to an ld8.a instruction and moved, the original position of that instruction is replaced with a check load instruction, ld8.c. When the check load is executed, it checks the ALAT for a matching address. If one is found, no store instruction between the advanced load and the check load has altered the source address of the load, and no action is taken. However, if the check load instruction does not find a matching ALAT entry, then the load operation is performed again to assure the correct result.

We may also want to speculatively execute instructions that are data dependent on a load instruction, together with the load itself. Starting with the same original program, suppose we move up both the load and the subsequent add instruction:

ld8.a r6 = [r8] ;;    // Cycle -3 or earlier; advanced load
// other instructions
add r5 = r6, r7       // Cycle -1; add that uses r6
// other instructions
st8 [r4] = r12        // Cycle 0
chk.a r6, recover     // Cycle 0; check
back:                 // return point from jump to recover
st8 [r18] = r5        // Cycle 0

Here we use a chk.a instruction rather than an ld8.c instruction to validate the advanced load. If the chk.a instruction determines that the load has failed, it cannot simply re-execute the load; instead, it branches to a recovery routine to clean up:

recover:
ld8 r6 = [r8] ;;      // reload r6 from [r8]
add r5 = r6, r7 ;;    // re-execute the add
br back               // jump back to main code

This technique is effective only if the loads and stores involved have little chance of overlapping.

F. Software Pipelining

Software Pipelining (SWP) is the term for overlapping the execution of consecutive loop iterations. SWP is a performance technique that can be applied in just about every computer architecture, and it is closely related to loop unrolling.

Here is a conceptual block diagram of a software pipeline. The loop code is separated into four pipeline stages. Six iterations of the loop are shown (i = 1 to 6). Notice how the pipeline stages overlap: the second iteration (i=2) can begin while the first iteration (i=1) begins its second stage. This overlap of iterations increases the number of parallel operations that can be executed in the processor, and more parallel operations help increase performance. Many, but not all, loops may be software pipelined.

G. Modulo Loop Scheduling

Modulo Loop Scheduling is just one software pipelining technique. There is direct support in the IA-64 architecture for modulo loop scheduling: rotating registers, predicates, special branch instructions, and loop count registers. This architectural support makes it easy to create SWP loops, and the resulting code has high performance and is compact. Modulo scheduling involves developing a schedule for one loop iteration such that, when the schedule is repeated at regular intervals, no intra- or inter-iteration dependency is violated and no resource usage conflict arises. The Initiation Interval (ii) is essentially the length of one pipeline stage: the number of clock cycles it takes to execute one pass through the loop instructions, or the number of clock cycles between starting iteration i and starting iteration i+1. When comparing SWP loops, compare the ii and the number of stages. For high loop counts, ii is more important than the number of stages: more time is spent in the loop's kernel rather than the prolog and epilog, and the kernel time is determined by the number of iterations and the ii.

The key features that support software pipelining are:

Automatic register renaming: A fixed-sized area of the predicate and floating-point register files (p16 to p63; fr32 to fr127) and a programmable-sized area of the general register file (maximum range of r32 to r127) are capable of rotation. This means that during each iteration of a software-pipelined loop, register references within these ranges are automatically incremented. Thus, if a loop makes use of general register r32 on the first iteration, it automatically makes use of r33 on the second iteration, and so on.

Predication: Each instruction in the loop is predicated on a rotating predicate register. The purpose of this is to determine whether the pipeline is in the prolog, kernel, or epilog phase, as explained subsequently.

Special loop-terminating instructions: These are branch instructions that cause the registers to rotate and the loop count to decrement.

VI. CONCLUSION

VLIW has found enormous success in DSP and signal processing applications over the years, although its use is limited in general-purpose computing (IA-32 is one example). The earlier VLIW architectures didn't stand the test of time, partly due to the technology they were built on, and partly due to better architectures combined with the lower cost of silicon real estate.

REFERENCES

[1] Joseph A. Fisher, "Very Long Instruction Word Architectures and the ELI-512," ACM, 1983.
[2] Robert P. Colwell, "VLIW: The Unlikeliest Computer Architecture," IEEE Solid-State Circuits Magazine, Spring 2009.
[3] B. Ramakrishna Rau, "The Cydra 5 Departmental Supercomputer: Design Philosophies, Decisions, and Trade-offs," IEEE, 1989.
[4] B. Ramakrishna Rau, "Cydra 5 Directed Dataflow Architecture," IEEE, 1988.
[5] Jan-Willem van de Waerdt, "The TM3270 Media-Processor," 2005.