Why parallel architectures? Single cores show diminishing performance gains. Power wall: increasing the clock speed (the classic way to improve speed, along with packing more, smaller transistors on a die) leads to higher dissipation and to a power density (W/cm2) beyond inexpensive cooling techniques. Multicore: keep the clock speed lower for each core (simpler, slower, less power, more power efficient) but use more cores on a single chip. Flynn taxonomy: 4 categories depending on data/instruction parallelism. SISD (single instruction single data): classic uniprocessor design that does not exploit any parallelism. MISD: multiple instructions performed in parallel on the same stream of data (not common). SIMD: perform the same instruction on multiple data in parallel; 1 control unit + multiple data paths (e.g. GPUs). SIMD exploits data-level (hw) parallelism, for example in calculations involving arrays. MIMD: multiple autonomous processors executing different operations on different streams of data (e.g. multicore). MIMD is usually programmed as SPMD (single program multiple data): a single program runs on all processors of a MIMD machine, and execution of its parts by one processor or the other is coordinated through conditional expressions. SIMD architectures: the control unit tells each
processing unit what to do and coordinates exchange/share of data between them. Fetch one instruction, do
work on multiple data. SIMD wants adjacent data in memory that can be operated on in parallel; this is usually done with for loops. Loop unrolling reveals data parallelism: instead of doing one iteration of the loop at a time we can do 4 (if they are independent) on a 4-processor machine. We just unroll the loop in order to have 4 times fewer iterations, but each iteration does 4 times the work. Then 4 single-data instructions can be grouped in a single SIMD instruction.
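A minimal sketch in plain C (function and array names are illustrative): the four independent additions per unrolled iteration are what a vectorizing compiler can map to one SIMD instruction.

    /* original loop: one addition per iteration */
    void add(const int *a, const int *b, int *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* unrolled by 4 (assuming n is a multiple of 4): the four additions per
       iteration are independent and can be grouped into one SIMD instruction */
    void add_unrolled(const int *a, const int *b, int *c, int n) {
        for (int i = 0; i < n; i += 4) {
            c[i]     = a[i]     + b[i];
            c[i + 1] = a[i + 1] + b[i + 1];
            c[i + 2] = a[i + 2] + b[i + 2];
            c[i + 3] = a[i + 3] + b[i + 3];
        }
    }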
MIMD architectures: 2 or more processors connected through a communication network. They exploit thread-level parallelism: multiple program counters, each processor executes a different thread (grain size = amount of computation given to each thread). Memory: how can
multiple cores access the same memory and how does it have to be designed? Memory wall: memory
access times decrease much more slowly than logic delays in processors. The way to solve this in both uni- and multi-core systems is to build a memory hierarchy of progressively slower but larger memory levels (CPU registers, cache levels, RAM, non-volatile memory, disk). But new issues arise: coherency, consistency, scalability. Multiple processors: pros (effective use of millions of transistors, easy scaling by adding cores, use of less powerful processors that are cheaper and more energy efficient); cons (parallelization cannot increase performance indefinitely, algorithms and hw limit performance, one task must be shared among many processors, there has to be coordination).
AMP (asymmetric multiprocessor): each processor has local memory, tasks statically allocated to each processor. SMP (symmetric): processors share memory, tasks dynamically scheduled to each processor.
Heterogeneous: different specialized processors, usually with different ISA, usually AMP. Homogeneous: all
processors have the same ISA, any processor can run any task, usually SMP. MP can be locally
homogeneous and globally heterogeneous (ex. Multicore CPU + GPU). Shared memory: one copy of data
shared by many cores, if one processor asks for X in memory there's only one place to look.
Communication is achieved via shared global variables, but synchronization (or using different variables) is needed to prevent race conditions (two processors accessing the same variable in a non-deterministic order, which can lead to faults). Ex. data parallelism (for loop): each processor performs the same instruction on different data; a single process can fork multiple concurrent threads, each one with its own execution path, local state and access to the global shared resources. At the end the forked threads are joined and synchronized. Task/control parallelism: perform different functions on different data.
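A sketch of explicit fork/join in C with POSIX threads (the array, its split into two halves, and the worker function are illustrative assumptions): each thread keeps its own local state and writes a separate slot, so no race arises; the join is the synchronization point.

    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    static int data[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    static long partial[2];            /* one slot per thread: no sharing, no race */

    /* each thread sums its own half of the array (data parallelism) */
    static void *worker(void *arg) {
        long id = (long)arg;
        long sum = 0;
        for (int i = id * (N / 2); i < (id + 1) * (N / 2); i++)
            sum += data[i];
        partial[id] = sum;
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long id = 0; id < 2; id++)        /* fork */
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (int id = 0; id < 2; id++)         /* join: synchronization point */
            pthread_join(t[id], NULL);
        printf("sum = %ld\n", partial[0] + partial[1]);
        return 0;
    }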
Distributed memory: if one processor asks for X, X can be in any of the processor memories, and each processor has a different X, so there are N different places to look for it. This type of architecture relies on explicit message exchange to exchange data: P1 has to explicitly request P2's X, and P2 in turn sends a copy to P1. Coverage: all programs have a parallel part and a sequential part that cannot be parallelized in any way (e.g. because of data dependences). Amdahl's law: the performance improvement in a multicore is limited by the fraction of the code that can be parallelized. Ex.: #instructions is the instruction count, p is the parallelizable fraction, N the number of processors, Tck the clock period. If clocks per instruction = 1, ex_time = #instructions*Tck; ex_time_parallel = (p*#instructions/N + (1-p)*#instructions)*Tck; speedup = ex_time/ex_time_parallel = 1/(p/N + (1-p)). As N increases the speedup asymptotically reaches the value 1/(1-p): if only a small part of the code is parallelizable the performance is not going to improve much, so parallel programming is worth it for programs that have a large parallel fraction.
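A quick numeric check of the formula (the values of p and N are arbitrary):

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / (p/N + (1 - p)) */
    static double speedup(double p, double n) { return 1.0 / (p / n + (1.0 - p)); }

    int main(void) {
        printf("p=0.90, N=4:   %.2f\n", speedup(0.90, 4));    /* ~3.08          */
        printf("p=0.90, N=inf: %.2f\n", 1.0 / (1.0 - 0.90));  /* asymptote: 10  */
        printf("p=0.50, N=4:   %.2f\n", speedup(0.50, 4));    /* only 1.6       */
        return 0;
    }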
Overhead of parallelism: given enough parallel work, this is the biggest barrier to speedup: cost of starting a
thread, cost of message passing, cost of synchronizing, cost of redundant computation. Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (without overhead), but not so large that there is not enough parallel work to do. Granularity is a qualitative measure of the ratio of computation to communication. In parallel computing there are typically computation stages separated by communication/synch ones. Fine-grain parallelism: low computation-to-communication ratio, less computation between communication events, less opportunity for performance enhancement, high communication overhead (more frequent). Coarse: high
computation to communication, more computation between communication events, more opportunity for
performance enhancement, low comm overhead, difficult load balancing. Load balancing: processors that finish early have to wait for the processor with the biggest amount of work at the synchronization point, leading to a lot of idle time. Static load balancing: the programmer gives a fixed amount of work to each processor a priori. Works ok for homogeneous systems (all cores the same, same amount of work), bad for heterogeneous ones (they need an uneven distribution for performance). Dynamic load balancing: when one core finishes its work, it takes work from the core with the heaviest load, making this good for heterogeneous systems or uneven workloads. Comm&Synch: in
parallel programming processors need to communicate partial results on data or synchronize for correct
processing. In shared memory: communication takes place implicitly by operating concurrently on shared variables, and synchronization primitives must be explicitly included in the code. In distributed memory:
communication primitives must be inserted in the code, synchronization is implicitly achieved through
messages. Cores exchange data (longer) or control (shorter) messages. Concurrency between
communication and computation can be achieved through pipelining: while one processor is computing, the
other is communicating with the memory. Memory access: uniform (UMA), all processors have equal access times; non-uniform (NUMA), processors have the same address space, memory is accessible by all but different parts have different access times for each PU (placement of data affects performance). Distributed shared memory PGAS (partitioned global address space): each processor has a memory node, globally visible by all the other processors. A processor's access to its own memory node is fast, to the others slow (the application has to exploit locality but can be coded as regular shared memory and has the same synch
requirements). Decomposition is fundamental for parallel programs: break computation into tasks to be
divided between processors; the number of tasks can vary in time and new tasks may become available
(after others have completed); choose enough tasks to keep processors busy. Algorithms start with a good
understanding of the problem and usually lend themselves to easy decomposition (e.g. function calls and
loops). Tasks should be big enough in order to avoid comm/synch overhead and need to have very few (if
any) dependencies to eliminate bottlenecks. Data decomposition is usually part of task decomposition, it's
useful when computation revolves around a huge data structure and similar operations are applied to
different parts of the data (e.g. parts of matrices/arrays, or splitting trees). Then we have to decide which tasks
can run in parallel and which have a defined order and formalize this in a task dependency graph (there can
be different ways, and tasks can have different dimensions).
OpenMP: it's the standard shared memory programming model. It's a collection of compiler directives, library routines and environment variables that can be easily included in a serial program flow. Fork-join parallelism: initially just one thread is executing sequential code; fork: the master thread creates/awakens a team of additional threads to execute parallel code; join: at the end of the parallel code the threads are suspended upon synchronization.
#pragma omp parallel num_threads(4) { }: the code within the scope is replicated among threads, 4 in this case or the default number of threads otherwise. Code within the scope of the pragma is outlined to a new function by the compiler. The pragma is replaced with calls to the runtime library (omp_parallel_start(&function_1,..)), the new function function_1() is called, then the threads are synchronized with a barrier (omp_parallel_end()). omp_get_thread_num() is a runtime call that returns a different id for each running thread; omp_get_num_threads() returns the number of active threads.
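A minimal sketch of a parallel region using those two runtime calls (choosing 4 threads is arbitrary):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* fork: the master thread creates a team of 4 threads */
        #pragma omp parallel num_threads(4)
        {
            int id = omp_get_thread_num();    /* different for each thread */
            int nt = omp_get_num_threads();   /* number of active threads  */
            printf("hello from thread %d of %d\n", id, nt);
        }   /* join: implicit barrier at the end of the parallel region */
        return 0;
    }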
#pragma omp parallel private(a) shared(b) { }: shared means that every thread sees the same address in the shared memory for b; private means that every thread has a separate copy of variable a. A variable declared inside a parallel pragma is automatically private to each thread (one instance for each thread). firstprivate is needed to give each thread a private copy that is initialized to the value the variable had before the parallel code (with just private the copies are uninitialized). The lastprivate clause defines a variable private as in firstprivate or private, but causes the value from the last task (the value assigned by the thread handling the last iteration of a for, or by the thread handling the last section) to be copied back to the original variable after the end of the loop/sections construct.
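A sketch of firstprivate and lastprivate (variable names and values are illustrative):

    #include <stdio.h>

    int main(void) {
        int a = 10;        /* initial value seen by every thread via firstprivate */
        int last = -1;     /* receives the value from the last iteration          */

        #pragma omp parallel for firstprivate(a) lastprivate(last)
        for (int i = 0; i < 8; i++) {
            last = a + i;  /* each thread works on its private copy of 'last' */
        }
        /* after the loop: last == 10 + 7 == 17 (value from the last iteration),
           while 'a' outside the loop is still 10 */
        printf("a=%d last=%d\n", a, last);
        return 0;
    }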
#pragma omp parallel for { } (short for #pragma omp parallel { #pragma omp for for(..){ } }) splits the for loop among the issued threads. Each thread executes a consecutive part of the iterations, from a lower bound to an upper bound, e.g. thread 1 executes iterations 1 to 5, t2 executes 6 to 10, and so on. The for loop is split into different functions which run a shorter for with different boundaries.
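A sketch of such a worksharing loop (the vector-add kernel and names are illustrative):

    #define N 1000

    void vector_add(const double *a, const double *b, double *c) {
        /* the iteration space 0..N-1 is split among the threads of the team */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }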
#pragma omp for schedule(static) schedules #iterations/#threads iterations to each thread, e.g. 12 iterations, 4 threads = 3 iterations/thread. This
is suitable if the workload is balanced between iterations and has a very small overhead because bounds
can be computed just knowing the thread ids. If iterations have different durations this leads to huge inefficiency, so schedule(dynamic) can be used. A task is generated for each iteration, and work is fetched
by the threads from the runtime environment through synchronized access to the work queue. After a thread
completes a task it fetches another one from the work queue until the queue is empty, leading to a balance of
workload and a reduction in total parallel time. Fine grain parallelism: more opportunity for load balancing,
small amounts of computation between parallel fetching, huge parallelization overhead (more switching
between tasks and scheduling). Coarse grain: harder to load balance, large amounts of computation
between scheduling, low parallelization overhead. schedule(dynamic,1) each iteration is a task, runtime
scheduling primitive is invoked at each iteration, schedule(dynamic,2) every 2 iterations is a task, runtime
scheduling primitive is invoked every two iterations. schedule(guided[,chunk]): threads dynamically grab blocks of iterations; blocks start big and get progressively smaller (down to the chunk size). #pragma omp for collapse(2) is used for 2 nested loops, whose iterations are then split among threads.
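A sketch of the scheduling and collapse clauses (the uneven work function and the sizes are illustrative assumptions):

    #include <math.h>

    #define N 64

    /* iterations with very different durations: static chunks would be unbalanced */
    static double heavy(int i) {
        double x = 0.0;
        for (int k = 0; k < i * 1000; k++)
            x += sin((double)k);
        return x;
    }

    void uneven_loop(double out[N]) {
        /* dynamic: each idle thread fetches the next chunk of 2 iterations */
        #pragma omp parallel for schedule(dynamic, 2)
        for (int i = 0; i < N; i++)
            out[i] = heavy(i);
    }

    void nested_loops(double m[N][N]) {
        /* collapse(2): the two nested loops form one iteration space of N*N */
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i][j] = (double)(i + j);
    }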
#pragma omp barrier is the basic synchronization directive: all threads in a parallel region wait there until the other threads finish computing before going on; it prevents later stages of the program from working with inconsistent data. It is implied at the end of parallel, for and sections (unless a nowait clause is specified).
#pragma omp critical { }: a critical section is a part of the program that can be executed by a single thread at a time, and it's identified by the preceding pragma. For example, if many threads are updating a shared value, e.g. sum+=x[i] (inside a for loop with a for pragma), each thread will fetch the value, update it and then store the new value back. In the meantime another thread could have updated sum, so the value stored back by the first thread would be wrong. This is a race condition, in which the result of sum depends in an unpredictable way on the actual order of execution and update. The code in the critical section is executed by all threads, but only one thread at a time can execute it, while the others have to wait. It's a form of serialization, which inevitably hurts performance but is required for the correctness of the code. To reduce the impact on performance we should keep the critical section as short as possible. #pragma omp atomic is the same as critical in that the action in atomic cannot be interrupted, but it has to be a simple update that maps to a single opcode at assembly level (e.g. +, and, xor), with a lot less overhead than critical. A programming pattern in
which a variable is fetched, updated and then stored back (ex. sum+=x[i] ) is called a reduction. It can be
handled with critical but there is also a reduction clause reduction(+:sum). Each thread computes partial
sums on a private version of the reduction variable; the shared variable is updated with the partial sums at the
end of the loop (so each thread runs in parallel without critical blocks until the end).
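A sketch of the three ways to handle the shared sum (x and n are illustrative):

    double sum_critical(const double *x, int n) {
        double sum = 0.0;
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            #pragma omp critical
            sum += x[i];            /* one thread at a time: correct but serialized */
        }
        return sum;
    }

    double sum_atomic(const double *x, int n) {
        double sum = 0.0;
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            #pragma omp atomic
            sum += x[i];            /* single uninterruptible update, less overhead */
        }
        return sum;
    }

    double sum_reduction(const double *x, int n) {
        double sum = 0.0;
        /* each thread accumulates a private partial sum; the partials are
           combined into the shared variable at the end of the loop */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += x[i];
        return sum;
    }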
#pragma omp master is executed just by the master thread; the others simply ignore it, with no synch implied.
(barrier has to be added if necessary). #pragma omp single is a block executed just by one thread (not
necessarily the master) with a barrier implied at the end. The for pragma exploits data parallelism (do the
same stuff over different data), there are also constructs that exploit task parallelism (do different stuff on
different things). #pragma omp parallel sections delimits a block of code whose sections can be executed in parallel, with an implied barrier at the end: #pragma omp parallel sections { #pragma omp section alpha(); #pragma omp section beta(); } — alpha and beta are code blocks that can be run in parallel (visible from the dependency graph), as made explicit by the pragma. Sections allow a limited form of task parallelism, in which the parallelism is statically outlined in the code.
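A sketch (alpha and beta are the placeholder functions from the notes):

    void alpha(void);
    void beta(void);

    void run_in_parallel(void) {
        #pragma omp parallel sections
        {
            #pragma omp section
            alpha();            /* may run on one thread...        */
            #pragma omp section
            beta();             /* ...while this runs on another   */
        }   /* implied barrier: both sections are done here */
    }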
Tasks are units of work whose execution may be deferred or executed immediately.
They can be nested and used in recursive structures such as trees. #pragma omp task outlines these units; each thread that encounters the directive creates a task. The main programming model is that one thread of the team (generated in the parallel section) creates tasks, whereas the other threads in the team execute them in any order, even suspending (to wait for child task results) and resuming them. So first we use a parallel pragma and then immediately a single pragma to create the tasks (otherwise all threads would create tasks), which are then serviced by all the threads. Another way to do this is inside a regular loop with a for pragma, with the task construct inside the loop body; in this way each thread creates tasks for different iterations of the loop, which are then executed by all the threads. #pragma omp taskwait suspends the task that encounters it until its child tasks are completed. A barrier guarantees that all tasks created by the current thread team are completed before moving on.
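A sketch of the parallel + single + task pattern on a linked list (the node type and process() are illustrative assumptions):

    typedef struct node { struct node *next; int value; } node_t;

    void process(node_t *n);      /* illustrative per-node work */

    void traverse(node_t *head) {
        #pragma omp parallel          /* fork the team */
        {
            #pragma omp single        /* only one thread walks the list... */
            {
                for (node_t *n = head; n != NULL; n = n->next) {
                    #pragma omp task firstprivate(n)
                    process(n);       /* ...and creates one task per node,
                                         serviced by any thread of the team */
                }
                #pragma omp taskwait  /* wait for all the tasks created here */
            }
        }
    }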
Pdyn = a*C*V^2*fck. a: activity factor, how often gates switch on average; C: total capacitance; a*C is a kind of effective capacitance (the average switched capacitance); V is the supply voltage; fck is the clock frequency. Ileak = k1*(1 - exp(-k2*Vds/T))*exp(k3*(Vgs - Vth - Voff)/T): it increases with supply voltage and temperature and decreases with Vth. Delay = C*Vdd/Ion = C*Vdd/(u(T)*(Vdd - Vth(T))), where u is the mobility, u(T) = u(T0)*(T/T0)^m, and Vth(T) = Vth(T0) - k*(T - T0). For wires, resistivity depends linearly on T, so interconnect delay increases as
T increases. For low Vth (Vdd>>Vth) main temperature effect is the mobility one, delay increasing with
temperature. For high Vth main temperature effect is the threshold one, delay decreasing with temperature.
Reduce dynamic power: DVFS (dynamic voltage and frequency scaling), reduce effective capacitance
(exploit idle or underutilized networks). Static power: DVS (dynamic voltage scaling, by reducing Vdd or
increasing Vth using the body effect with a body bias), DTM (dynamic thermal management, reduce
temperature using cooling), reducing leakage with HW topology. Power management strategies. DVFS,
RFTS (run fast then stop): Clock gating, power gating, Turbo modes. DVFS (with governor or deadline):
exploits slack (the time left until the deadline for the computation) to dynamically reduce voltage and frequency so as to spread the computation evenly across the time quantum. Potentially cubic power saving, because Vdd and fck both scale down. Clock gating: the clock signal to each FF can be gated, in order to be blocked when no input transition is detected. It saves dynamic power (but the core continues to leak) by preventing unwanted FF internal switching; the transition into and out of this mode is instantaneous. Power gating disconnects the
logic circuit from the Vdd lines using high Vth transistors (lower leakage current than regular gates). This sets
to zero the consumption of the block, but all registers' content is lost. This requires costly (in terms of time and energy) state saving and restoring before and after the power gating. Uncore power: the uncore power
accounts for all the components on the chip that are not the core itself: PLLs, peripherals, memory
controllers. The dynamic consumption by those components can be avoided with clock gating when the
system is idle. Their leakage power can be suppressed using power gating. The uncore power may or may
not scale with voltage in DVFS (it does not scale with frequency because only the core frequency is changed), depending on whether the uncore runs on the same supply or not. P = Pdyn+Pstat+Puncore =
CV^2f+VIleak+Puncore. E= Pdyn*CPU_time+(Puncore+Pstat)*(CPU_time+Idle_time).
Clock gating: the uncore becomes inactive during idle, so E = Pdyn*CPU_time + Pstat*(CPU_time+Idle_time) + Puncore*CPU_time. Power gating: no power is used while inactive, but save and restore are needed: E =
(Pdyn+Pstat+Puncore)*CPU_time+Esave+Erestore. In DVFS everything has to be computed using the new
supply value and frequency, taking into account that the execution time gets longer.
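A quick numeric comparison of the three energy formulas (all power, time and save/restore figures are made-up, illustrative values):

    #include <stdio.h>

    int main(void) {
        double Pdyn = 1.0, Pstat = 0.3, Puncore = 0.4;   /* W, illustrative       */
        double cpu = 2.0, idle = 8.0;                    /* s, illustrative       */
        double Esave = 0.5, Erestore = 0.5;              /* J, power-gating cost  */

        double e_base  = Pdyn*cpu + (Puncore + Pstat)*(cpu + idle);
        double e_cgate = Pdyn*cpu + Pstat*(cpu + idle) + Puncore*cpu;
        double e_pgate = (Pdyn + Pstat + Puncore)*cpu + Esave + Erestore;

        printf("baseline:     %.1f J\n", e_base);   /* 2 + 0.7*10      = 9.0 */
        printf("clock gating: %.1f J\n", e_cgate);  /* 2 + 3 + 0.8     = 5.8 */
        printf("power gating: %.1f J\n", e_pgate);  /* 1.7*2 + 0.5+0.5 = 4.4 */
        return 0;
    }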
CPU_time and idle time can depend on number of processors and/or parallelizability of code. All processors
consume static power all the time, they all consume dynamic power when they are concurrently running on
the parallel part of the code, one consumes dynamic power during the serial part of the code. Dynamic
power scales with V^2 and f, static power scales (approximately) with V (if Ileak does not depend on the supply). Break-even time is the time that the core needs to stay powered off to compensate for the save and restore energy; this time is shorter if leakage is higher. The technique giving the highest power saving is a
runtime decision. DVFS tries to save power by reducing the core clock frequency and voltage, but this does
not reduce the memory/bus clock. CPU bound applications are those with a high number of ALU instructions,
high data locality and cache hit rate. Memory bound applications are those with a high number of memory
access instructions (large data set, complex data patterns), low data locality and low cache hit rate. DVFS
severely impacts CPU-bound applications, because it increases the execution time of most instructions and, despite reducing power, leads to an unfavorable energy balance (longer time but less power). For memory-bound applications this does not happen because they run at the speed of the memory, which is not affected by the scaling: they run for a marginally longer time but save power, therefore reducing energy. A memory-bound application has a CPI much larger than 1 (because of wait states and dependencies), whereas most ALU operations have a CPI of 1. In general CPU_time = CPI*instruction_number/f_clock. CPU_time_DVFS = (CPI-1)*instruction_number/f_clock_max + instruction_number/f_clock_reduced, summing the time needed for memory accesses (at the unscaled memory/bus speed) and the time needed for ALU operations (at the reduced clock). For a memory-bound application CPI >> 1, so CPU_time_DVFS is almost (CPI-1)*instruction_number/f_clock_max, barely changing from the regular time without DVFS.
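A quick numeric check of the CPU_time formulas (instruction count, CPI and frequencies are illustrative values):

    #include <stdio.h>

    int main(void) {
        double instr = 1e9;                  /* instruction count               */
        double f_max = 2e9, f_red = 1e9;     /* nominal and DVFS-reduced clocks */

        /* CPU bound: CPI ~ 1, everything slows down with the clock */
        double t_cpu   = 1.0 * instr / f_max;                         /* 0.5 s */
        double t_cpu_d = 1.0 * instr / f_red;                         /* 1.0 s */

        /* memory bound: CPI = 10, only the 1 ALU cycle per instruction scales */
        double cpi = 10.0;
        double t_mem   = cpi * instr / f_max;                         /* 5.0 s */
        double t_mem_d = (cpi - 1.0) * instr / f_max + instr / f_red; /* 5.5 s */

        printf("CPU bound:    %.2f s -> %.2f s (2x slower)\n", t_cpu, t_cpu_d);
        printf("memory bound: %.2f s -> %.2f s (~10%% slower)\n", t_mem, t_mem_d);
        return 0;
    }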