
HDL is not a program: it does not run on anything, it just describes hardware or behaviour.

Synth code: corresponds to real HW, can be synthesized and simulated. Non-synth code: meant for simulation only, no correspondence to HW, used to test synth code. Value set: 0, 1, z (high-Z), x (unknown). Begin/end: used instead of the { } of the C language.
Wire: connects different elements, can be treated as a physical wire, just combinational logic, can be read or assigned, no value stored, MUST be driven by an assign or by an output port of a module (cannot appear on the left of = or <=). ex. wire a; Reg: no correspondence with physical registers, represents a data storage element that retains its value until a new one is assigned (cannot be on the left of assign, cannot be a module input, cannot be driven by a module output); it can be synthesized to a FF, a latch, or combinational logic. ex. reg a; Both: can drive module inputs and can appear on the right of assign, =, <=. If there are multiple drivers, the output is x. Logic: unifies reg and wire; can be on the left of =, <= or assign, and can be a module output. The last assignment wins in determining the output. ex. logic a; Parameter: like a constant in C. ex. parameter lsb = 7; Can be used to define states in an FSM ex. parameter [1:0] s0=2'b00, s1=2'b01, s2=2'b10, s3=2'b11; Enum: can be used to define FSM states or other named values ex. enum {s0,s1,s2,s3} cs, ns; Vectors: represent buses. ex. logic [7:0] a; wire [1:4] b; reg [1:0] c;
a[7:4] // takes a slice. Arrays: ex. integer count[1:5]; // 5 integers; reg var[-15:16]; // 32 1-bit registers; reg [7:0] mem[0:1023]; // 1024 8-bit registers. Access an entire element: mem[10]=8'b10110110; Accessing a subfield needs temporary storage: reg [7:0] temp; temp=mem[10]; temp[3]=1; Operators: && || (logical and/or), & | ^ (and/or/xor, bitwise or reduction of a multibit value), ~ (bitwise negation), ! (logical negation), arithmetic as usual, relational as usual, == != (equality/inequality), {A,B} (concatenation), A<<3, A>>3 (shifts). Number formats: 8'b0 (8-bit 0), 6'b101001 (6-bit 101001), 4'b10_01 (_ can be used as a separator). Conditional operator: c = (a) ? a+b : a-b; Repetition: {3{c}} (concatenates c 3 times). Case ex. case (m) 2'b00: // ... 2'b01: // ... 2'b10: // ... 2'b11: // ... default: // ... endcase. Assignment: assign continuously drives the left side with the right. ex. assign w=a&b; = is a
blocking assignment: it evaluates the RHS and completes the assignment before moving on (blocking other assignments); the order of = assignments matters. A sequence of = is executed one after the other, even
when the same variable appears on both sides. ex. for(i=0; i<=3; i=i+1) N=N|A[i]; is a sequence of 4 cascaded OR gates in which one input is the previous iteration's output (gives N 4 different values one
after the other). <= is a non-blocking assignment: it evaluates the RHS at the beginning of the timestep but
updates the LHS at the end of the timestep. Between evaluation and update, other statements may be
evaluated or updated or scheduled to be updated at the end of the timestep. Generally, use = in
always_comb combinational blocks and <= in sequential blocks, use <= if you mix comb and sequential in
the same block. Always: a procedural block. Within the procedural block we can write statements that are evaluated sequentially; this has nothing to do with the order in which the logic operates. Always blocks run continuously, and actions inside them take zero time to execute. The body is a group of statements that is executed when a signal in the sensitivity list changes: always @ (a or b). If all the RHS signals are present in the list and all the cases are covered (in an if or case) it's a combinational block, otherwise it's sequential (always @ (*) inserts all the relevant signals in the sensitivity list). If it's edge sensitive, always @ (posedge clk), it's a FF, otherwise it's a latch. ex. reg A; if (B) A=B|C; is a latch because if B=0 A keeps its value; if (B) A=B|C; else A=C|D; is a combinational block synthesizable as a multiplexer.
It's better to use always_comb, always_ff @(posedge clk), always_latch (the sensitivity list is inferred).
Modules: describe complex HW behaviours. ex. module module_name(input in1, input [7:0] in2, ..., output o1, output reg [2:0] o2, ..., output logic o3); //module body endmodule. To instantiate a module: module_name instance_name(.in1(w1), .in2(w2), ..., .o2(wo1), ...); specifying .port(port_connection) or using the order of the declaration ex. module_name instance_name(w1, w2, ..., o2, ...);
RTL design: high-level description of the design. Partition it into data path and control, design both parts, then interconnect them. Data path: given data inputs, computes data outputs in a programmable way set by control signals; produces flag/status signals about the computation. Control: controls the flow of data through the DP; it's made of one or more state machines that keep the state of the circuit and decide when and which control signals to activate, depending on inputs, state and flag signals.
Finite state machine (FSM): a first combinational block computes the output given input and current state. A second combinational block computes the next state given input and current state. A register bank makes the next state become the current state at each active edge of the clock. The state variables may be explicit (state signals declared) or implicit (counters). Always specify next state and output for every possible state/input (use default in case statements).
Coding styles: 1) two/three always blocks: one/two always_comb for next state/output and one always_ff for curr_state <= next_state; needs both a current-state and a next-state signal. 2) One always_ff block: needs just one state signal but is more confusing/complex in some situations. 3) Datapath + control: partition the system into a data path that elaborates data and a control unit that generates the control signals for the datapath.
Mealy: output depends on input and state. Generally asynchronous, because the inputs are not synchronized; synchronous if the outputs are registered so that they change only on the clock edge. If asynchronous, the outputs must not be reg type. If synchronous you need an always_ff with currentoutput <= nextoutput, where nextoutput is computed in a combinational block.
Moore: output depends only on the state (necessarily synchronous: the output changes when the state changes, so on the clock edge).
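
The same current-state / next-state partition can be sketched as a software FSM in C (a toy Moore detector for three consecutive 1s; purely illustrative, not part of the HDL flow): a "combinational" next-state function, a Moore output function of the state only, and a single state update per step playing the role of the register.

#include <stdio.h>

typedef enum { S0, S1, S2, S3 } state_t;       /* explicit state encoding, like the parameter/enum styles above */

/* "combinational" next-state function: depends on current state and input */
static state_t next_state(state_t cs, int in) {
    switch (cs) {
        case S0: return in ? S1 : S0;
        case S1: return in ? S2 : S0;
        case S2: return in ? S3 : S0;
        case S3: return in ? S3 : S0;
        default: return S0;                     /* cover every case, like the default branch */
    }
}

/* Moore output: depends only on the current state */
static int output(state_t cs) { return cs == S3; }

int main(void) {
    int in[] = {1, 1, 1, 0, 1, 1, 1, 1};
    state_t cs = S0;                            /* reset state */
    for (unsigned i = 0; i < sizeof in / sizeof in[0]; i++) {
        printf("in=%d state=%d out=%d\n", in[i], cs, output(cs));
        cs = next_state(cs, in[i]);             /* the "register": one state update per step (clock edge) */
    }
    return 0;
}
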
HDL advantages: 1) capture the design in RTL form, which is easier/faster/less error prone than schematic capture; 2) simulation and verification tools can be used; 3) automatic synthesis tools are used to obtain a reasonably optimal gate-level design that meets the requirements.
Synthesizer: basically a Boolean combinational logic optimizer that is aware of timing constraints. HDL designs can be implemented in HW as semi-custom or programmable solutions.
ASIC: application-specific integrated circuit, a chip designed to perform a particular operation instead of being general purpose (as CPUs are). It's generally not SW-programmable to perform different tasks, but may include a processor for suitable operations. The same designs can also be implemented on FPGAs. ex. video decoders/encoders, graphics chips, digital filters, network processors in routers. Different design styles.
Full custom: every transistor is drawn and designed by hand. The only way to design analog blocks and specific high-performance digital blocks (ALUs, memory cells). High performance, long design time, requires a full set of masks for fabrication.
Standard-cell-based or semi-custom: standard cells are custom designed and then inserted in a library together with their timing, area and power specs. These cells are then used as part of the design by being placed and connected in a suitable way with a place-and-route CAD tool. Some cells can be put together to create a single macrocell. Standard-cell ASICs are usually synthesized from an RTL description of the design, using the library components. IP (intellectual property) blocks are often used to reduce time to market. Hard IP: usually supplied by the silicon foundry, low-level representation, technology dependent, predictable performance and area, delivered as transistor layout files, no meaningful modifications can be carried out. Soft IP: technology independent, delivered as RTL or logic gate netlists together with synthesis/verification scripts.
Logic synthesis: turns RTL into a gate-level netlist (in Verilog), a list of logic gates and their connections. The generic netlist is mapped onto standard cell library functions. The logic is then optimized to satisfy power/area/timing constraints.
Floorplanning: tentative placement of the major functional blocks (including macros) of the IC on the die area, satisfying the blocks' geometrical constraints. Also power planning: power estimation is needed to decide where and how many power/ground wires are needed. Clock tree synthesis: the clock is distributed to the cells and buffers are added to act as repeaters on long wires. Routing: logic cells are wired together (first loose routes, then more detailed ones). Chip signoff: using timing/power/logic libraries the IC is checked for compliance with all specs (power, timing, functional, geometric, routing) and reoptimized if needed. If the checks pass, a transistor layout file is generated to be used for mask generation.
Timing libraries: in-out times, setup/hold times, minimum pulse widths for every cell in the library, plus electrical DRC and power info. Physical libraries: layout abstracts, dimensions, number of pins, DRC, capacitance models. In synthesis, constraints drive the choice of architecture. Possible optimizations: resource sharing in HW and expression elimination in RTL.
Static timing analysis (STA): a method of computing the expected timing of an IC without simulation. No functional verification; dependent on the process technology. Path delays: gate delays (increase with the L of the MOS) and interconnect delays (increase with wire length). Timing constraints: target clock frequencies, modeling of clock skew, jitter, insertion delays, transition times, input/output drivers/loads, external signals and delays.
Power optimization/analysis: tools can use the default switching activity of 50% or the activity obtained through meaningful simulations. Optimization techniques: clock gating (the clock signal is kept constant, through an enable, for blocks that don't need to switch), gate-level optimization (mapping multiple gates into a single cell that saves power), operand isolation / data gating (data inputs of combinational blocks are kept constant when not needed, to prevent unneeded switching activity), multi-Vth transistors (if available in the selected process).
Embedded computing system: any device that includes a programmable computer but is not itself a general-purpose computer. A microcontroller is a small CPU with many support devices built into the chip: self contained (CPU, memory, I/O), application or task specific (not a general-purpose computer), appropriately scaled for the job, small power consumption, low cost. CPU + memory + peripherals (clock generation, ADC/DAC, I/O ports, communication interfaces UART-SPI-I2C, timers, DMA, ...) + interconnection bus. MPU: architecture (# of bits that can be processed at the same time, also the width of the register file, ex. 32 bit) and instruction set (ISA, usually RISC: few, fast, simple register-register instructions). CPU: datapath (register file, ALU) [inputs: control signals and operands; outputs: flags] + control (state machine) [in: flags and operands; out: control signals].
General-purpose registers + special registers: program counter (keeps the address of the next instruction to fetch), stack pointer (keeps the address of the top of the stack in memory, used for returning from nested procedures), status register, instruction register (keeps the instruction to be decoded), memory address register / memory data register. CPU architectures: von Neumann (memory contains both data and code and is accessed through a single port: 1 address bus + 1 data/code bus), Harvard (separate memory ports and buses for code and data that can be accessed separately: 1 split address bus + 1 data bus + 1 instruction bus), plus variations. VN: easier to implement. H: faster access because 2 accesses happen in 1 cycle, can be pipelined, more flexible and no data/code mix-up. Pipelining: an instruction takes more than 1 cycle to be executed: fetch + decode + execute. But during the same cycle one instruction is fetched, one decoded, and another executed. Throughput is 1 instruction/cycle (ideal), latency is 3 or more cycles.
Low power: longer battery life, smaller size, less EMI, simpler power supplies, smaller product, easier cooling, better reliability. Power in CMOS: P = A*C*V^2*f + A*V*Ishort*ts*f + V*Ileak. V: supply voltage, f: clock frequency. Active power: A: activity factor, how often the wires switch on average; C: total capacitance. Short-circuit power: the finite slope of signals causes a low-impedance path between GND and VDD during transients; ts = slope time, Ishort = short-circuit current. Static power: subthreshold leakage of the FETs, increases exponentially with T and decreases with increasing Vth. How to reduce power: reduce the rate of charge of nodes with high capacitive loads, prevent glitches, reduce switching when not needed, decrease frequency, decrease voltage, reduce area, reduce leakage with power gating with high-Vth MOS or with body biasing (increase Vth through the body effect). Application profile: MCUs are typically idle most of the time, performing tasks just when receiving an interrupt; therefore they need power-saving modes to reduce consumption. Several layers of power-saving modes: first reduce the maximum clock frequency and voltage, which increases the efficiency in MHz/µA; then stop the CPU clock but retain the memory, then retain just a small part of the memory, then nothing; each step has progressively longer wake-up times. Some peripherals may not be available in low-power modes. Voltage regulators provide the CPU with lower voltages to save power. Transitions between modes can be triggered by peripheral interrupts or real-time clock events, in order to wake up the CPU. Low consumption in active mode has to be balanced against very low consumption in idle mode; the best trade-off depends on the real duty cycle (active/idle).
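
A quick back-of-the-envelope sketch in C of why the active/idle duty cycle decides that trade-off (the current figures are made up, not from any datasheet):

#include <stdio.h>

/* Average current for a periodic task: active for a fraction 'duty' of the time, asleep otherwise. */
int main(void) {
    double i_active_mA = 4.0;     /* hypothetical run-mode current */
    double i_sleep_uA  = 2.5;     /* hypothetical deep-sleep current */
    double duty        = 0.001;   /* active 0.1% of the time, e.g. waking on an RTC event */

    double i_avg_uA = duty * i_active_mA * 1000.0 + (1.0 - duty) * i_sleep_uA;
    printf("average current = %.2f uA\n", i_avg_uA);
    /* With such a small duty cycle the sleep current dominates, so lowering
     * idle consumption matters more than shaving active-mode current. */
    return 0;
}
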
Clock subsystem: multiple oscillators, both internal and external (a crystal/capacitor has to be added externally), PLLs and prescalers to vary the frequency; different frequencies and power consumptions. They clock everything, from buses (synchronous) to peripherals (ex. ADC analog/digital clocks). The clock to each peripheral can be gated to reduce consumption (the default at startup), but beware of non-synchronous consumption (ex. GPIO pin surge currents).
Buses: all of them are synchronous. AHB: separate read/write buses, control bus, ready signal, high performance, pipelined (address + control phase, then the data phase follows in the next cycle), split transactions (the two phases of the same transaction can be in non-adjacent clock periods, handled by the arbiter), burst mode (a single address phase followed by multiple data phases), multiple masters. APB: low power, simple, slow (no pipeline or fancy modes). Multiple AHB/APB buses can be present in the same chip, bridged together.
Memory. RAM (usually SRAM): on-chip, volatile, small, fast access, used for run-time execution. Flash: bigger, slower access, on-chip, non-volatile, limited (~10k) number of write cycles, used for permanent code and data storage. External memory: connected through a communication interface. Address space: everything is mapped into a single continuous address space (4 GB for 32 bit). Flash, RAM and peripheral registers are all seen as memory addresses. A read/write from/to a peripheral is seen as a load/store from/to a memory address by the CPU.
GPIO: each MCU pin can be used as GPIO. Input: read the digital value of the pin (can trigger an interrupt). Output: set the digital value of the specified pin. GPIO pins are independent but are grouped in ports. No floating inputs (by default): use external or programmable internal pull-up/pull-down networks.
Interrupts: a hardware interrupt is an electronic signal sent to the processor from a device, either a part of the chip, such as an internal peripheral, or an external peripheral. It is a way to respond to an external event (a flag being set) without polling it continuously. HW senses the signal transition and automatically transfers control to SW: the PC is updated with the start address of a procedure called ISR (interrupt service routine), taken from a vector table in memory, which takes the appropriate action to service the interrupt (first of all, clearing the flag). When finished, the ISR hands control back to where it left off (pops the old PC back from the stack). Transparent to the user, easier coding, no time wasted polling. The main usually runs an endless background loop (low priority) executing a low-power command, putting the CPU to sleep until an interrupt wakes it up. The ISR is executed, usually changing the background mode to non-sleep, and then hands the CPU back to the main. NVIC: HW unit that coordinates interrupts from multiple sources (part of the ARM core). It defines the priority level and the enable flag of each interrupt source. Priority levels are programmable, have subpriority levels and can be changed dynamically; external non-maskable interrupt (cannot be ignored by the CPU, must be serviced); nested interrupts (if an interrupt with higher priority occurs during another ISR, it is served first, interrupting the other ISR); tail-chaining (the processor avoids popping the registers of main from the stack after servicing one ISR when it immediately services another, which otherwise would require pushing the registers back); stack pop preemption (popping is abandoned if another interrupt request arrives during it). EXTI: peripheral that maps events on GPIO pins or other external sources (e.g. RTC wake-up) into interrupt requests for the NVIC.
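
A minimal sketch of the background-loop + ISR pattern described above, in C; the ISR name, the flag-clearing call and enter_sleep() are hypothetical placeholders, not a specific vendor API:

#include <stdbool.h>
#include <stdio.h>

static volatile bool data_ready = false;   /* flag shared between ISR and main: volatile is mandatory */

/* Hypothetical ISR: keep it short, acknowledge the source, set a flag,
 * and let the background loop do the real work. */
void TIMER_IRQHandler(void) {
    /* clear_peripheral_flag();   <- first of all, clear the interrupt flag */
    data_ready = true;
}

/* Hypothetical sleep primitive: on a real MCU this would execute a wait-for-interrupt instruction. */
static void enter_sleep(void) { }

int main(void) {
    for (;;) {                             /* endless low-priority background loop */
        if (data_ready) {
            data_ready = false;
            printf("servicing the event outside the ISR\n");
        }
        enter_sleep();                     /* CPU sleeps here until the next interrupt wakes it up */
    }
}
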
Timers: may be driven by an internal or external clock (possibly divided in HW), with multiple capture/compare blocks with interrupt capabilities. Uses: generate periodic events, periodic wake-up from sleep, count external events, PWM generation, power saving by implementing delays while the CPU sleeps. A timer is essentially a counter, counting cycles of a known clock rate. Each clock cycle the counter is incremented by 1 (or decremented); an interrupt occurs when it reaches the value in the auto-reload register and then overflows back to 0. Capture: when the capture input signal occurs, the counter value is placed in the CCR (capture/compare register) and an interrupt is generated (or other actions are taken, such as changing the timer output). If a second capture occurs before the CCR is read, an overcapture interrupt is generated. Compare: when the value of the counter equals the value of the CCR, actions occur (interrupt, change of the timer output). The counter keeps counting up to the auto-reload value before overflowing. This mode is useful to generate periodic waveforms. Pulse times and periods can be varied by varying the clock (using timer prescalers and clock tree prescalers, programmable on the fly). PWM: method to use a digital 2-value signal to control an analog quantity. The PWM signal has constant period, whereas the duty cycle (on_time/period) is varied as a function of the analog value it represents. The average value of the PWM signal depends on its duty cycle: small on time, small average; large on time, large average. Varying the duty cycle varies the average, which can be extracted from the PWM signal with a low-pass filter (some physical systems don't need this because they are inherently low-pass: any device whose response time is slower than the period of the PWM signal can be controlled using PWM). In PWM mode the timer controls one or more output channels. When the counter reaches 0, the maximum, or the CCR value, the output value of the channel can be changed. Various options define which events change the output value and how (center-aligned: the counter counts up from 0 to the auto-reload value and back down to 0, with the output toggling on CCR matches; edge-aligned: the counter counts up or down only, with the output changing on CCR match and on over/underflow).
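
A small C sketch (generic arithmetic, no vendor registers; the names and the prescaler+1 / auto-reload+1 conventions are assumptions) of how prescaler, auto-reload and CCR values map to PWM frequency and duty cycle:

#include <stdio.h>

int main(void) {
    /* Illustrative numbers only */
    double timer_clk_hz  = 16e6;   /* clock feeding the timer */
    unsigned prescaler   = 15;     /* assumed: counter clock = timer_clk / (prescaler + 1) */
    unsigned auto_reload = 999;    /* counter counts 0..auto_reload, then overflows */
    unsigned ccr         = 250;    /* compare value */

    double counter_clk = timer_clk_hz / (prescaler + 1.0);    /* 1 MHz  */
    double pwm_freq    = counter_clk / (auto_reload + 1.0);   /* 1 kHz  */
    double duty        = (double)ccr / (auto_reload + 1.0);   /* on_time / period */

    printf("PWM frequency = %.1f Hz, duty cycle = %.1f %%\n", pwm_freq, duty * 100.0);
    /* The average value seen after a low-pass filter is duty * V_high. */
    return 0;
}
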
ADC: the first block is the sample and hold: a clock-controlled switch connects (or not) the input to a capacitor. When the switch is closed the voltage on the capacitor follows the input; when the switch opens, the last value is sampled and kept constant by the capacitor. The value is then measured by an ADC that converts the constant analog signal into a digital code. SAR: the value at the input of the ADC is compared with the content of the SAR register converted by a DAC, initialized to the digital code at half of the full scale (ex. 1000). If input > SAR, the MSB of the SAR is kept at 1, otherwise it is set to 0. In the next clock cycle the second bit of the SAR is set to 1 and the comparison takes place again, deciding the value of the second bit. It goes on like this until all the bits of the SAR are decided, so the time for converting the data is N clock periods (where N bits is the resolution of the ADC). The total conversion time (the inverse of the sampling frequency) is given by the time the switch is closed (sample/hold time, still an integer number of ADC clock periods) plus the conversion time (#bits * ADC clock period).
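
The SAR binary search can be simulated directly; a minimal sketch in C assuming an ideal DAC and comparator and an N-bit resolution:

#include <stdio.h>

/* Successive approximation: decide one bit per clock, MSB first. */
unsigned sar_convert(double vin, double vref, int nbits) {
    unsigned code = 0;
    for (int bit = nbits - 1; bit >= 0; bit--) {
        code |= 1u << bit;                          /* tentatively set the current bit        */
        double vdac = vref * code / (1u << nbits);  /* ideal DAC output for this code          */
        if (vin < vdac)
            code &= ~(1u << bit);                   /* input below the DAC level: clear the bit */
        /* else keep the bit at 1 */
    }
    return code;                                    /* after N clock cycles the code is complete */
}

int main(void) {
    printf("code = %u\n", sar_convert(1.23, 3.3, 12));  /* ~1526 for a 12-bit, 3.3 V ADC */
    return 0;
}
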
Features: the resolution in bits can be programmed via SW, and the clock frequency as well (the ADC clock is derived from another clock using prescalers). If multiple ADCs are available they can work in interleaved mode (sampling at regular intervals one after the other) to increase the maximum sampling frequency. Reducing the power supply of the ADC reduces its maximum speed and its range of convertible values. ADC conversions can be triggered by timers, interrupts or other events. Each ADC has several multiplexed channels which can be converted independently and can be either regular or injected (the latter's conversion is triggered by an event and interrupts the regular conversions); a channel can be converted just once or in continuous scan mode. Regular channel results are stored in a single data register (injected channels have their own) which has to be copied to memory by the DMA before the next channel's result arrives. The end of a conversion can generate an interrupt and an automatic DMA request. An analog watchdog/comparator may be available: a rail-to-rail comparator with programmable high/low thresholds that can generate interrupts through the EXTI to trigger ADC conversions or other events when a threshold is crossed.
Serial interfaces: short distance, low complexity, low cost, low speed. MCUs are often pin limited, so multiple functions have to be multiplexed onto the same pin: pins can be GPIO inputs, outputs, interrupts or wires for serial communications (alternate functions), and can have internal pull-ups/pull-downs. I2C: supports multiple speed modes, <3 m interconnections, supports multi-master (communication always started by a master), half-duplex synchronous communication. Two lines: SDA (data) and SCL (clock). Pull-up resistors keep the lines high while idle; the lines are pulled down by open-drain drivers in the devices on the bus (if any driver pulls down, the line goes down; any device can be master/slave). In idle, SDA=1 and SCL=1. To start a communication, the master drives SDA 1->0 (while SCL=1), then starts generating the clock. Except for the start/stop conditions, SDA transitions only while SCL=0. The master then transmits the 7 (or 10) bit slave address on SDA, broadcast to all devices on the bus, followed by a direction bit (master->slave: 0, slave->master: 1). The addressed slave acknowledges by driving SDA low (if there is no ack, the transaction must be repeated). The master then sends the data payload packets of 8 bits, each acknowledged by the slave. Multiple packets can be sent, but each must be acknowledged by the slave. At the end of the transfer the master drives SDA to 0, releases SCL (which goes to 1) and then releases SDA (which goes to 1): this is the stop condition. Reads work similarly, with the slave sending packets of 8 bits and the master acknowledging. A typical transaction could be: the master addresses the slave and writes an internal slave register address (sends a command); the master then starts a read transaction addressing the same slave and receiving data from it. If a read follows a write, the first transaction does not use a stop condition, whose role is taken by the repeated start of the second transaction. When the read is finished, the master does not acknowledge (lets the line go up to 1) and then sends the stop condition. Clock stretching: the slave can ask for more time to process a bit by driving SCL low and then releasing it when it's done.
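
The write-then-read transaction above, written out as a transaction-level sketch in C; the i2c_* helpers are hypothetical placeholders (stubbed here) for whatever driver or bit-banged layer is available, shown only to fix the order of start / address / ack / repeated start / NACK / stop:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical byte-level helpers: a real driver or a bit-banged layer would implement these. */
bool    i2c_start(void)            { return true; }            /* SDA 1->0 while SCL=1 */
void    i2c_stop(void)             { }                          /* SDA 0->1 while SCL=1 */
bool    i2c_write_byte(uint8_t b)  { (void)b; return true; }    /* returns the slave ACK */
uint8_t i2c_read_byte(bool ack)    { (void)ack; return 0; }     /* master ACKs all but the last byte */

/* Read one register from a slave: write the register address, repeated start, then read. */
int read_register(uint8_t addr7, uint8_t reg, uint8_t *value) {
    i2c_start();
    if (!i2c_write_byte((addr7 << 1) | 0)) { i2c_stop(); return -1; }  /* address + W, expect ACK */
    i2c_write_byte(reg);                                               /* command: register address */
    i2c_start();                                                       /* repeated start, no stop   */
    if (!i2c_write_byte((addr7 << 1) | 1)) { i2c_stop(); return -1; }  /* address + R               */
    *value = i2c_read_byte(false);                                     /* NACK the last byte        */
    i2c_stop();
    return 0;
}
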
SPI: just one master, multiple slaves enabled by chip selects, no addressing scheme. Based on two data lines, MISO (master in slave out) and MOSI (master out slave in), and two control lines, SCK (clock) and CSN (chip select, active low). Full duplex: data is streamed back and forth between master and slave FIFO/shift registers (the FIFO depth can be programmed and depends on the peripheral). The master pushes the content of its FIFO into the slave and the slave does the same, one bit per clock, until the two FIFOs have exchanged their content. 4 operating modes depending on polarity (the idle level of SCK) and phase (which edge of SCK is used to sample the data lines and which to switch them; 0: negedge, 1: posedge). No ack, no address, no stretching, but protocols can be mapped on top of it. SPI is easier/faster for point-to-point, but with many slaves it needs a lot of chip-select lines.
UART: used to interface the MCU with other computing devices in simplex, half-duplex or full-duplex mode. Communication speed is defined as baud rate or symbol rate (symbols/second). The interface is a pair of serial-to-parallel (RX) and parallel-to-serial (TX) converters. Asynchronous: each peripheral has its own clock, which is not shared, running faster than the bit rate. The receiver clock phase is locked to the edges of the transmitted data (as it samples faster than the bit rate, it sees the 0-1 transition of an edge). In idle, the transmission line is driven to 1. One start bit (line to 0), 5-9 data bits (usually 8, depending on the application), the last of which can be used for parity check (even or odd), one or two stop bits (line back to 1). UART can include a handshake protocol with 2 additional signals: RTS, request to send (the MCU can accept new data), and CTS, clear to send (the other side can send new data). The roles are reversed on the other side; transfers occur when CTS=RTS=1.
DMA: provides high-speed data transfer between peripherals and memory, or memory and memory, without involving the CPU. Instead of using loads and stores, you just give the peripheral (or memory) start address, whether to increment that pointer or not, how many transfers you need, the pointer to the destination and a start command. Multiple streams are available and each stream has several channels, each one dedicated to the transfers of a single peripheral. Configuration and priority levels are set in SW, with independent source/destination transfer sizes (word, byte, ...). Event flags generate interrupts (half transfer, transfer complete, failure). M2M mode: from one memory buffer to another (no circular or direct mode). M2P or P2M mode: when a threshold in the FIFO is reached, its content is drained and stored in the destination. After an event the peripheral sends a request signal to the DMA. Multiple channels are multiplexed (time division) onto one stream; streams have SW priority levels (and, in case of a tie, HW priority). Direct mode: the threshold level in the FIFO is not used; after each transfer from the peripheral to the FIFO, the FIFO is immediately drained. The peripheral and memory pointers can optionally be automatically post-incremented (+4, +2, +1 depending on the transfer size: word, half-word, byte). Non-circular mode: no DMA request is served after the last transfer (# of data to send = 0). Circular mode (for circular buffers and continuous data flows, ex. ADC scan): the # of data to be transferred is automatically reloaded with the initial value programmed during channel configuration, and DMA requests continue to be served.
WSN sensor units: low power, small, easily integrated with circuits (MEMS sensor + ADC + communication interface), offset/drift, ADC conversion comes at a power premium (analog sensors are cheaper but need to be sampled before being sent with transmission protocols), data bandwidth >> information bandwidth (because of communication protocol overhead). WSN protocols impact performance: cost, speed, power. The MCU communicates with the radio subsystem (transceiver + antenna) using a serial port (SPI, UART, ...). The radio subsystem can be a stand-alone chip with its own MCU and small memory, or can be integrated with the main MCU and HW blocks (e.g. MAC access) in a wireless microcontroller, an architecture that can be optimized for low cost and low power. WSN protocols define the 3 lower layers of the ISO/OSI model: physical,
data link (MAC) and network. Physical layer: provides the mechanical, electrical, functional and procedural characteristics to establish, maintain and release physical connections (e.g. data circuits) between data link entities. It concerns data rates, modulations (frequency hopping), bands and number of channels. Data link layer (media access control, MAC): maps network packets into radio frames; concerns transmission/reception of frames through the air, error control and security. It controls access to the shared air medium, reducing interference and trying to prevent collisions, such as the hidden node problem (A-B-C: B is in range of both A and C, but they cannot see each other; if they both try to transmit to B, since they don't know what the other is doing, there will be interference) or the exposed node problem (A-B-C-D: A is in range of B, C is in range of D, and C is in range of B and vice versa; if B transmits to A, C senses someone in range transmitting and avoids transmitting to D for fear of interference, even though D would not receive interference from B). Network
layer: provides the functional and procedural means to exchange network service data units between two transport entities over a network connection. It gives transport entities independence from switching/routing considerations. It's about the organization of the network (how it's formed, how to join/leave), routing packets through the network (shortest path, best energy efficiency), tracking the status of the links (routing tables). The network structure is not always predictable and must divide the load between nodes, adapt to changes and reduce the amount of data exchanged. Zigbee: low-cost, low-power, low-data-rate mesh (but also star and tree) network. Every network must have one coordinator node (in a star it must be the center) that has to create and maintain it. Bluetooth: frequency hopping over multiple in-band channels. A master can communicate with up to 7 slaves in a piconet (roles can then reverse, one of the slaves becoming the master). Two or more piconets form a scatternet, in which some nodes are both master and slave. At any time data can be exchanged between the master and a single slave. The master chooses the address of the slave and switches rapidly in a round-robin fashion; slaves are theoretically supposed to listen in every slot. Power in WSN: the radio is the most power-hungry part of a WSN node. Idle mode is required for the node to respond faster (radio listening, but not TX/RX) but consumes a lot of power. A wake-up radio can be added to the design to reduce the power expense: the node is kept asleep unless the wake-up radio receives a wake-up packet; it's convenient if the wake-up radio consumes less power in listening than the main radio does in idle.

Why parallel architectures? Single cores show diminishing increases in performance. Power wall: increasing the clock speed (the basic idea to improve speed, along with more, smaller transistors on a smaller die) leads to higher dissipation, and to a power density (W/cm2) beyond inexpensive cooling techniques. Multicore: keep the clock speed lower for each core (simpler, slower, less power, more power efficient) but use more cores on a single chip. Flynn taxonomy: 4 categories depending on data/instruction parallelism. SISD (single instruction single data): classic uniprocessor design that does not exploit any parallelism. MISD: multiple instructions performed in parallel on the same stream of data (not common). SIMD: perform the same instruction on multiple data in parallel; 1 control unit + multiple data paths (e.g. GPUs). SIMD exploits data-level (HW) parallelism, for example in calculations involving arrays. MIMD: multiple autonomous processors executing different operations on different streams of data (e.g. multicore). MIMD is usually SPMD (single program multiple data): a single program runs on all processors of a MIMD machine, and the execution of its parts by one processor or another is coordinated through conditional statements. SIMD architectures: the control unit tells each processing unit what to do and coordinates the exchange/sharing of data between them. Fetch one instruction, do work on multiple data. SIMD wants adjacent data in memory that can be operated on in parallel; this usually comes from for loops. Loop unrolling reveals data parallelism: instead of doing one iteration of the loop at a time we can do 4 (if they are independent) on a 4-processor machine. We just unroll the loop in order to have 4 times fewer iterations, with each iteration doing 4 times the job. Then 4 single-data instructions can be grouped into a single SIMD instruction. MIMD architectures: 2 or more processors connected by a communication network. They use thread-level parallelism: multiple program counters, each processor executes a different thread (grain size = amount of computation given to each thread). Memory: how can
multiple cores access the same memory and how does it have to be designed? Memory wall: memory access times decrease much more slowly than logic delays in processors. The way to address this in uni/multicore systems is to build a memory hierarchy of progressively slower but larger memory levels (CPU registers, cache levels, RAM, non-volatile, disk). But new issues arise: coherency, consistency, scalability. Multiple processors: + (effective use of millions of transistors, easy scaling by adding cores, uses less powerful processors which are cheaper and more energy efficient); - (parallelization cannot increase performance indefinitely, algorithms and HW limit performance, one task for many processors, there has to be coordination).
AMP (asymmetric multiprocessor): each processor has local memory, tasks statically allocated to each processor. SMP (symmetric): processors share memory, tasks dynamically scheduled onto each processor. Heterogeneous: different specialized processors, usually with different ISAs, usually AMP. Homogeneous: all processors have the same ISA, any processor can run any task, usually SMP. An MP system can be locally homogeneous and globally heterogeneous (ex. multicore CPU + GPU). Shared memory: one copy of the data is shared by all cores; if one processor asks for X in memory there's only one place to look. Communication is achieved via shared global variables, but synchronization (or using different variables) is needed to prevent race conditions (two processors accessing the same variable in a non-defined order, which can lead to faults). Ex. data parallelism (for loop): each processor performs the same instruction on different data; a single process can fork multiple concurrent threads, each with its own execution path, local state and access to the global shared resources. At the end, the forked threads are joined and synchronized. Task/control parallelism: perform different functions on different data.
Distributed memory: if one processor asks for X, X can be in any of the processors' memories and each processor has a different X: N different places to look for it. This type of architecture relies on explicit message exchange to share data: P1 has to explicitly request P2's X, and P2 in return sends a copy to P1. Coverage: all programs have a parallel part and a sequential part that cannot be parallelized in any way (e.g. data dependences). Amdahl's law: the performance improvement on a multicore is limited by the fraction of the code that can be parallelized. Let #instructions be the instruction count, p the parallelizable fraction, N the number of processors, Tck the clock period. If clocks per instruction = 1, ex_time = #instructions*Tck; ex_time_parallel = (p*#instructions/N + (1-p)*#instructions)*Tck; speedup = ex_time/ex_time_parallel = 1/(p/N + (1-p)). As N increases, the speedup asymptotically approaches 1/(1-p): if only a small part of the code is parallelizable, the performance is not going to improve much; parallelization is worth it for programs that have a large parallel fraction.
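
A tiny worked example of Amdahl's law in C (same formula as above, illustrative numbers):

#include <stdio.h>

/* speedup = 1 / (p/N + (1-p)) */
double amdahl(double p, int n) { return 1.0 / (p / n + (1.0 - p)); }

int main(void) {
    double p = 0.90;                      /* 90% of the code is parallelizable */
    int n_values[] = {2, 4, 16, 1024};
    for (int i = 0; i < 4; i++)
        printf("N=%4d  speedup=%.2f\n", n_values[i], amdahl(p, n_values[i]));
    /* Even with N=1024 the speedup stays below the asymptote 1/(1-p) = 10. */
    return 0;
}
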
Overhead of parallelism: given enough parallel work, this is the biggest barrier to speedup: cost of starting a thread, cost of message passing, cost of synchronizing, cost of redundant computation. Trade-off: the algorithm needs sufficiently large units of work to run fast in parallel (without overhead), but not so large that there isn't enough parallel work to do. Granularity is a qualitative measure of the ratio of computation to communication. In parallel computing there are typically computation stages separated by communication/synchronization ones. Fine-grain parallelism: low computation-to-communication ratio, less computation between communication events, less opportunity for performance enhancement, high communication overhead (more frequent). Coarse grain: high computation-to-communication ratio, more computation between communication events, more opportunity for performance enhancement, low communication overhead, difficult load balancing. Load balancing: processors that finish early have to wait for the processor with the largest amount of work to synchronize, leading to a lot of idle time. Static load balancing: the programmer gives a fixed amount of work to each processor a priori. Works well for homogeneous systems (all cores the same, same amount of work), badly for heterogeneous ones (they need uneven distribution for performance). Dynamic load balancing: when one core finishes its work, it takes work from the core with the heaviest load, making this good for heterogeneous systems or uneven workloads. Comm & synch: in
parallel programming processors need to communicate partial results on data or synchronize for correct processing. In shared memory: communication takes place implicitly by operating concurrently on shared variables, and synchronization primitives must be explicitly included in the code. In distributed memory: communication primitives must be inserted in the code, and synchronization is implicitly achieved through messages. Cores exchange data (longer) or control (shorter) messages. Concurrency between communication and computation can be achieved through pipelining: while one processor is computing, the other is communicating with the memory. Memory access: uniform (UMA), all processors have equal access times; non-uniform (NUMA), processors have the same address space, memory is accessible by all but different parts have different access times for each PU (placement of data affects performance). Distributed shared memory, PGAS (partitioned global address space): each processor has a memory node, globally visible by all the other processors. A processor's access to its own memory node is fast, to the others slow (the application has to exploit locality, but can be coded as regular shared memory and has the same synchronization requirements). Decomposition is fundamental for parallel programs: break the computation into tasks to be divided between processors; the number of tasks can vary in time and new ones may become available later (after others have completed); choose enough tasks to keep the processors busy. Algorithms start with a good understanding of the problem and usually lend themselves to easy decomposition (e.g. function calls and loops). Tasks should be big enough to avoid communication/synchronization overhead and need to have very few (if any) dependencies to eliminate bottlenecks. Data decomposition is usually part of task decomposition; it's useful when the computation revolves around a huge data structure and similar operations are applied to different parts of the data (ex. parts of matrices/arrays, or splitting trees). Then we have to decide which tasks can run in parallel and which have a defined order, and formalize this in a task dependency graph (there can be different ways, and tasks can have different dimensions).
OpenMP: the standard shared-memory programming model. It's a collection of compiler directives, library routines and environment variables that can be easily included in a serial program flow. Fork-join parallelism: initially just one thread executes sequential code; fork: the master thread creates/awakens a team of additional threads to execute parallel code; join: at the end of the parallel code the threads are suspended upon synchronization.
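
A minimal fork-join example in C (compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* fork: the master creates a team of 4 threads; the block is replicated on each */
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();      /* different for each running thread */
        int nt = omp_get_num_threads();     /* number of active threads in the team */
        printf("thread %d of %d\n", id, nt);
    }   /* join: implicit barrier here, then only the master continues */
    return 0;
}
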
#pragma omp parallel num_threads(4) { } : the code within the scope is replicated among threads, 4 in this case (or the default # of threads). The code within the scope of the pragma is outlined to a new function by the compiler. The pragma is replaced with calls to the runtime library, omp_parallel_start(&function_1, ...); the new function function_1() is called, then the threads are synchronized with a barrier, omp_parallel_end(). omp_get_thread_num() is a runtime call that returns a different id for each running thread; omp_get_num_threads() returns the number of active threads. #pragma omp parallel private(a) shared(b) { } : shared means that every thread sees the same address in shared memory for b; private means that every thread has a separate copy of variable a. A variable declared inside a parallel region is automatically private to each thread (one instance per thread). firstprivate is needed to create per-thread private variables initialized to the variable's value before the parallel code (with just private they are uninitialized). The lastprivate clause defines a variable as private (or firstprivate), but causes the value from the last task (the value assigned by the thread handling the last iteration of a for, or the last section) to be copied back to the original variable after the end of the loop/sections construct. #pragma omp parallel for { } (short for #pragma omp parallel { #pragma omp for for(..){ } }) splits the for loop among the issued threads. Each thread executes a consecutive chunk of the iterations, from a lower bound to an upper bound, ex. thread 1 executes iterations 1 to 5, thread 2 executes 6 to 10, and so on. The for loop is split into different functions which run a shorter for with different bounds. #pragma omp for schedule(static)
schedules #iterations/#threads iterations to each thread, ex. 12 iterations, 4 threads = 3 iterations/thread. This is suitable if the workload is balanced between iterations and has a very small overhead, because the bounds can be computed just from the thread ids. If iterations have different durations this leads to large inefficiency, so schedule(dynamic) can be used. A task is generated for each iteration, and work is fetched by the threads from the runtime environment through synchronized access to the work queue. After a thread completes a task it fetches another one from the work queue until the queue is empty, leading to a balanced workload and a reduction in total parallel time. Fine-grain parallelism: more opportunity for load balancing, small amounts of computation between parallel fetches, large parallelization overhead (more switching between tasks and scheduling). Coarse grain: harder to load balance, large amounts of computation between scheduling points, low parallelization overhead. schedule(dynamic,1): each iteration is a task, the runtime scheduling primitive is invoked at each iteration; schedule(dynamic,2): every 2 iterations form a task, the runtime scheduling primitive is invoked every two iterations. schedule(guided[,chunk]): threads dynamically grab blocks of iterations; the blocks start big and get smaller. #pragma omp for collapse(2) is used for 2 nested loops, whose iterations are then split among the threads.
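
A sketch of dynamic scheduling on a loop with uneven iteration cost (work() is just a stand-in for a variable-duration iteration body):

#include <stdio.h>
#include <omp.h>

/* Stand-in for iterations of very different duration. */
static double work(int i) {
    double s = 0.0;
    for (long k = 0; k < (long)(i + 1) * 100000L; k++) s += 1e-9 * k;
    return s;
}

int main(void) {
    static double result[64];
    /* schedule(static): fixed contiguous blocks per thread, lowest overhead, fine for even work.
     * schedule(dynamic, 2): chunks of 2 iterations are fetched from the runtime work queue,
     * which balances the load when the iteration cost is uneven, as in work() here. */
    #pragma omp parallel for schedule(dynamic, 2)
    for (int i = 0; i < 64; i++)
        result[i] = work(i);

    double sum = 0.0;
    for (int i = 0; i < 64; i++) sum += result[i];
    printf("sum = %f\n", sum);
    return 0;
}
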
#pragma omp barrier is the basic synchronization directive: all threads in a parallel region wait there until the other threads arrive before going on; it prevents later stages of the program from working with inconsistent data. It is implied at the end of parallel, for and sections constructs (unless a nowait clause is specified). #pragma omp critical { } : a critical section is a part of the program that can be executed by only one thread at a time, and it's identified by the preceding pragma. For example, if many threads are updating a shared value, ex. sum+=x[i] (inside a for loop with a for pragma), each thread fetches the value, updates it, then writes back the new value. In the meantime another thread may have updated sum, and the first thread's write-back would be wrong. This is a race condition, in which the result of sum depends in an unpredictable way on the actual order of execution and update. The code in the critical section is executed by all threads, but one thread at a time executes it while the others wait. It's a form of serialization, which inevitably hurts performance but is required for correctness. To reduce the impact on performance we should keep the critical section as short as possible. #pragma omp atomic is like critical, in that the action in the atomic statement cannot be interrupted, but it has to be a simple update that maps to a single processor opcode at assembly level (e.g. and, +, xor) (much less overhead than critical). A programming pattern in which a variable is fetched, updated and then stored back (ex. sum+=x[i]) is called a reduction. It can be handled with critical, but there is also a reduction clause, reduction(+:sum). Each thread computes partial sums on a private copy of the reduction variable; the shared variable is updated with the partial sums at the end of the loop (so each thread runs in parallel without critical sections until the end).
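
The sum+=x[i] pattern above, handled with the reduction clause instead of a critical section (minimal sketch):

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double x[N];
    for (int i = 0; i < N; i++) x[i] = 1.0;

    double sum = 0.0;
    /* Each thread accumulates into a private copy of sum; the partial sums are
     * combined into the shared sum only once, at the end of the loop, so no
     * critical section is executed inside the loop body. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %.0f\n", sum);   /* expected: 1000000 */
    return 0;
}
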
#pragma omp master is executed just by the master thread, the others simply skip it, with no synchronization implied (a barrier has to be added if necessary). #pragma omp single is a block executed by just one thread (not necessarily the master) with a barrier implied at the end. The for pragma exploits data parallelism (do the same thing on different data); there are also constructs that exploit task parallelism (do different things on different data). #pragma omp parallel sections marks a block of code whose sections can be executed in parallel, with an implied barrier at the end. #pragma omp parallel sections { #pragma omp section alpha(); #pragma omp section beta(); } : alpha and beta are code blocks that can run in parallel (as visible in the dependency graph), as made explicit by the pragma. Sections allow a limited form of task parallelism, in which the parallelism is statically outlined in the code. Tasks are units of work whose execution may be deferred or executed immediately. They can be nested and used in recursive structures such as trees. #pragma omp task outlines these units: each thread that encounters the directive creates a task. The main programming model is that one thread of the team (created by the parallel construct) creates tasks, whereas the other threads in the team execute them in any order, possibly suspending them (to wait for child task results) and resuming them. So first we use a parallel pragma and then immediately a single pragma to create the tasks (otherwise all threads would create tasks), which are then serviced by all threads. Another way to do this is inside a regular loop with a for pragma, where the task construct is inside the loop body; in this way each thread creates tasks for different iterations of the loop, which are then executed by all the threads. #pragma omp taskwait forces the task that encounters it to suspend until its child tasks are completed. A barrier guarantees that all tasks created by the current thread team are completed before moving on.
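
A common shape for the parallel + single + task pattern described above, sketched on a toy recursive sum (the cutoff and sizes are illustrative):

#include <stdio.h>
#include <omp.h>

/* Recursive sum of a[lo..hi): each half becomes a task; taskwait joins the children. */
long tree_sum(const int *a, int lo, int hi) {
    if (hi - lo < 1000) {                       /* cutoff: below this, tasks are not worth the overhead */
        long s = 0;
        for (int i = lo; i < hi; i++) s += a[i];
        return s;
    }
    long left = 0, right = 0;
    int mid = lo + (hi - lo) / 2;
    #pragma omp task shared(left)
    left = tree_sum(a, lo, mid);
    #pragma omp task shared(right)
    right = tree_sum(a, mid, hi);
    #pragma omp taskwait                        /* wait for the two child tasks */
    return left + right;
}

int main(void) {
    enum { N = 100000 };
    static int a[N];
    for (int i = 0; i < N; i++) a[i] = 1;
    long total = 0;
    #pragma omp parallel                        /* create the team of threads        */
    #pragma omp single                          /* only one thread creates tasks...  */
    total = tree_sum(a, 0, N);                  /* ...the whole team executes them   */
    printf("total = %ld\n", total);             /* expected: 100000 */
    return 0;
}
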
Pdyn = a*C*V^2*fck. a: activity factor, how often gates switch on average; C: total capacitance; a*C is a kind of effective capacitance (the average load capacitance); V is the supply voltage; fck is the clock frequency. Ileak = k1*(1 - exp(-k2*Vds/T))*exp(k3*(Vgs - Vth - Voff)/T): increases with supply voltage and temperature and decreases with Vth. Delay = C*Vdd/Ion = C*Vdd/(u(T)*(Vdd - Vth(T))), where u is the mobility, u(T) = u(T0)*(T/T0)^m, Vth(T) = Vth(T0) - k*(T - T0). For wires, resistivity depends linearly on T, so interconnect delay increases as T increases. For low Vth (Vdd >> Vth) the main temperature effect is on mobility, with delay increasing with temperature. For high Vth the main temperature effect is on the threshold, with delay decreasing with temperature.
Reduce dynamic power: DVFS (dynamic voltage and frequency scaling), reduce the effective capacitance (exploit idle or underutilized networks). Static power: DVS (dynamic voltage scaling, by reducing Vdd or increasing Vth through the body effect with a bias), DTM (dynamic thermal management, reducing temperature through cooling), reducing leakage with HW topology. Power management strategies: DVFS, RFTS (run fast then stop), clock gating, power gating, turbo modes. DVFS (with a governor or deadline): exploits the slack (time left until the deadline of the computation) to dynamically reduce voltage and frequency, spreading the computation evenly across the time quantum. Potentially cubic power saving, because Vdd and fck both scale down. Clock gating: the clock signal to each FF can be gated, in order to be blocked when no input transition is detected. It saves dynamic power (but the core continues to leak) by preventing unwanted internal FF switching; the transition into and out of this mode is instantaneous. Power gating disconnects the logic circuit from the Vdd lines using high-Vth transistors (lower leakage current than regular gates). This sets to zero the consumption of the block, but all register contents are lost. This requires costly (in terms of time and energy) state saving and restoring before and after the power gating. Uncore power: the uncore power
accounts for all the components on the chip that are not the core itself: PLLs, peripherals, memory controllers. The dynamic consumption of those components can be avoided with clock gating when the system is idle; their leakage power can be suppressed using power gating. The uncore power may or may not scale with voltage under DVFS (it does not scale with frequency, because only the core frequency is changed), depending on whether the uncore runs on the same supply or not. P = Pdyn + Pstat + Puncore = C*V^2*f + V*Ileak + Puncore. E = Pdyn*CPU_time + (Puncore + Pstat)*(CPU_time + Idle_time). Clock gating: the uncore becomes inactive during idle, E = Pdyn*CPU_time + Pstat*(CPU_time + Idle_time) + Puncore*CPU_time. Power gating: no power used while inactive, but needs save and restore: E = (Pdyn + Pstat + Puncore)*CPU_time + Esave + Erestore. Under DVFS everything has to be recomputed using the new supply value and frequency, taking into account that the execution time gets longer.
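
A numeric sketch in C (all values hypothetical) comparing the three idle-handling strategies with the energy formulas above:

#include <stdio.h>

int main(void) {
    /* Hypothetical power/time figures */
    double Pdyn = 0.80, Pstat = 0.20, Punc = 0.30;   /* W */
    double t_cpu = 2.0, t_idle = 8.0;                /* s */
    double E_save_restore = 0.15;                    /* J, power-gating overhead */

    double E_none  = Pdyn*t_cpu + (Punc + Pstat)*(t_cpu + t_idle);
    double E_cgate = Pdyn*t_cpu + Pstat*(t_cpu + t_idle) + Punc*t_cpu;   /* uncore clock-gated when idle */
    double E_pgate = (Pdyn + Pstat + Punc)*t_cpu + E_save_restore;        /* everything off when idle     */

    printf("no gating: %.2f J, clock gating: %.2f J, power gating: %.2f J\n",
           E_none, E_cgate, E_pgate);
    /* Power gating wins here because the idle time is long compared to the
     * save/restore cost; for short idle periods clock gating can be better. */
    return 0;
}
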
CPU_time and idle time can depend on the number of processors and/or the parallelizability of the code. All processors consume static power all the time; they all consume dynamic power when they are concurrently running the parallel part of the code, and one consumes dynamic power during the serial part of the code. Dynamic power scales with V^2 and f; static power scales (approximately) with V (if Ileak does not depend on the supply). The break-even time is the time the core needs to stay powered off to compensate for the save and restore energy; this time is shorter if the leakage is higher. Which technique gives the highest power saving is a runtime decision. DVFS tries to save power by reducing the core clock frequency and voltage, but this does not reduce the memory/bus clock. CPU-bound applications are those with a high number of ALU instructions, high data locality and high cache hit rate. Memory-bound applications are those with a high number of memory access instructions (large data set, complex access patterns), low data locality and low cache hit rate. DVFS severely impacts CPU-bound applications, because it increases the execution time of most instructions and, despite reducing power, leads to an unfavourable energy balance (much longer time for less power). For memory-bound applications this does not happen, because they run at the speed of the memory, which is not affected by the scaling, yet they still save power: they run for a marginally longer time but at lower power, therefore reducing energy. A memory-bound application has a CPI much larger than 1 (because of wait states and dependencies), whereas most ALU operations have a CPI of 1. In general, CPU_time = CPI*instruction_number/f_clock. CPU_time_DVFS = (CPI-1)*instruction_number/f_clock_max + instruction_number/f_clock_reduced, summing the time needed for memory accesses and the time needed for ALU operations. For a memory-bound application CPI >> 1, so CPU_time_DVFS is almost (CPI-1)*instruction_number/f_clock_max, and it barely changes from the regular time without DVFS.
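
A numeric sketch of the CPU_time_DVFS formula above for a CPU-bound vs a memory-bound program (the CPI and frequency values are illustrative):

#include <stdio.h>

/* t = (CPI-1)*N/f_max + N/f_reduced : memory part unaffected by DVFS, ALU part slowed down */
double time_dvfs(double cpi, double n, double f_max, double f_red) {
    return (cpi - 1.0) * n / f_max + n / f_red;
}

int main(void) {
    double n = 1e9, f_max = 1e9, f_red = 0.5e9;    /* 1e9 instructions, 1 GHz scaled down to 500 MHz */
    double cpu_bound_cpi = 1.1, mem_bound_cpi = 8.0;

    printf("CPU-bound:    %.2f s -> %.2f s with DVFS\n",
           cpu_bound_cpi * n / f_max, time_dvfs(cpu_bound_cpi, n, f_max, f_red));
    printf("memory-bound: %.2f s -> %.2f s with DVFS\n",
           mem_bound_cpi * n / f_max, time_dvfs(mem_bound_cpi, n, f_max, f_red));
    /* The CPU-bound run almost doubles in time; the memory-bound one increases only
     * modestly, so only the latter saves energy when frequency and voltage scale down. */
    return 0;
}
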
