UNIT I
Introduction: Need of high speed computing - increasing the speed of computers - history of parallel computers and recent parallel computers; solving problems in parallel - temporal parallelism - data parallelism - comparison of temporal and data parallel processing - data parallel processing with specialized processors - inter-task dependency. The need for parallel computers - models of computation - analyzing algorithms - expressing algorithms.
2 MARKS
1. What is high performance computing?
(APRIL 2014)
High Performance Computing most generally refers to the practice of aggregating computing power in a way that
delivers much higher performance than one could get out of a typical desktop computer or workstation in order to
solve large problems in science, engineering, or business.
2. Define Parallel Computing.
The simultaneous use of more than one processor or computer to solve a problem; the use of multiple processors or computers working together on a common task:
Each processor works on its section of the problem.
Processors can exchange information, either through shared memory (where access times vary from CPU to CPU, as in NUMA systems) or over a network.
Data parallelism is a form of parallelization of computing across multiple processors in parallel computing
environments. Data parallelism focuses on distributing the data across different parallel computing nodes. It
contrasts with task parallelism, which is another form of parallelism.
(November 2014)
The traditional scientific paradigm is first to do theory (say, on paper), and then lab experiments to confirm or deny the theory. The traditional engineering paradigm is first to do a design (say, on paper), and then build a laboratory prototype. Both paradigms are being replaced by numerical experiments and numerical prototyping. There are several reasons for this:
Real phenomena are too complicated to model on paper (e.g., climate prediction).
Real experiments are too hard, too expensive, too slow, or too dangerous for a laboratory (e.g., oil reservoir simulation, large wind tunnels, overall aircraft design, galactic evolution, whole factory or product life cycle design and optimization, etc.).
Advantages of data parallelism:
No synchronization required between the processors.
No bubbles, as in a pipeline.
More fault tolerance.
No communication needed between the processors.
Disadvantages:
The assignment of jobs is static.
The job must be partitionable into independent jobs, and the time to divide the job is assumed to be small.
(April 2013)
Parallel computing is a form of computation in which many calculations are carried out
simultaneously, operating on the principle that large problems can often be divided into smaller
ones, which are then solved concurrently.
27. Comparison Between Temporal and Data Parallelism. (November 2014)
TEMPORAL PARALLELISM                       DATA PARALLELISM
Tasks of a job are dependent               Tasks are independent
Bubbles in the pipeline cause idling       No bubbles
Assignment of tasks is static              Assignment may be static, dynamic or quasi-dynamic
Does not tolerate processor faults         Tolerates processor faults
Efficient with fine-grained tasks          Efficient with coarse-grained tasks
28. Specify the types of parallelism that can be seen in software. (November 2014)
Task Parallelism
Data Parallelism
11 MARKS
1. What is Parallel Computing? Explain.
Traditionally, software has been written for serial computation: a problem is solved by a series of instructions, executed one after the other by the CPU, and only one instruction may be executed at any moment in time.
In the simplest sense, parallel computing is the simultaneous use of multiple compute resources
to solve a computational problem.
The compute resources can include:
A single computer with multiple processors;
An arbitrary number of computers connected by a network;
A combination of both.
The computational problem usually demonstrates characteristics such as the ability to be:
Broken apart into discrete pieces of work that can be solved simultaneously;
Executed as multiple program instructions at any moment in time;
Solved in less time with multiple compute resources than with a single compute resource.
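As a minimal sketch of this idea, the following Python fragment splits a toy job (summing squares) into discrete pieces and solves them simultaneously with the standard multiprocessing module; the task, the data and the worker count are illustrative assumptions, not taken from the text.

# Sketch: break a job into discrete pieces and solve them simultaneously.
# Assumes a CPU-bound toy task (summing squares of a block of numbers).
from multiprocessing import Pool

def sum_of_squares(block):
    # one "discrete piece of work"
    return sum(x * x for x in block)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # split the data into 4 blocks, one per worker process
    blocks = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partial_sums = pool.map(sum_of_squares, blocks)  # solved concurrently
    print(sum(partial_sums))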
Parallel computing is an evolution of serial computing that attempts to emulate what has always been
the state of affairs in the natural world: many complex, interrelated events happening at the same
time, yet within a sequence. Some examples:
Planetary and galactic orbits
Weather and ocean patterns
Tectonic plate drift
Rush hour traffic in LA
Automobile assembly line
Daily operations within a business
Building a shopping mall
Ordering a hamburger at the drive through.
Traditionally, parallel computing has been considered to be "the high end of computing" and has been
motivated by numerical simulations of complex systems and "Grand Challenge Problems" such as:
weather and climate
chemical and nuclear reactions
biological, human genome
geological, seismic activity
Today, commercial applications are providing an equal or greater driving force in the development of
faster computers. These applications require the processing of large amounts of data in sophisticated
ways. Example applications include:
parallel databases, data mining
oil exploration
web search engines, web based business services
computer-aided diagnosis in medicine
management of national and multi-national corporations
advanced graphics and virtual reality, particularly in the entertainment industry
networked video and multi-media technologies
collaborative work environments
Ultimately, parallel computing is an attempt to maximize the infinite but seemingly scarce
commodity called time.
(April 2013)
Limits to serial computing - both physical and practical reasons pose significant constraints to simply
building ever faster serial computers:
Transmission speeds - the speed of a serial computer is directly dependent upon how fast data
can move through hardware. Absolute limits are the speed of light (30 cm/nanosecond) and
the transmission limit of copper wire (9 cm/nanosecond). Increasing speeds necessitate
increasing proximity of processing elements.
Real phenomena are too complicated to model on paper (e.g., climate prediction).
Real experiments are too hard, too expensive, too slow, or too dangerous for a laboratory (e.g., oil reservoir simulation, large wind tunnels, overall aircraft design, galactic evolution, whole factory or product life cycle design and optimization, etc.).
Scientific and engineering problems requiring the most computing power to simulate are commonly called "Grand Challenges". Such problems, like predicting the climate 50 years hence, are estimated to require computers computing at the rate of 1 Tflop = 1 Teraflop = 10^12 floating point operations per second, and with a memory size of 1 TB = 1 Terabyte = 10^12 bytes.
The speed of a computer can be increased in two ways:
By increasing the speed of the processing element using faster semiconductor technology (advanced device technology).
By architectural methods, which increase the speed of the computer by applying parallelism:
Use parallelism within a single processor: overlap the execution of a number of instructions by pipelining, or use multiple functional units.
Overlap the operation of different units.
4. Discuss the history of past and present parallel computers. (November 2014)
Vector supercomputers, characterized by:
The fastest clock rates, because vector pipelines can be very simple.
Vector processing.
Quite good vectorizing compilers.
High price tag; small market share.
Not always scalable because of the shared-memory bottleneck (vector processors need more data per cycle than conventional processors). Vector processing is back in various forms: SIMD
extensions of commodity microprocessors (e.g. Intel's SSE), vector processors for game consoles
(Cell), multithreaded vector processors (Cray), etc.
Vector processors declined temporarily because of:
Market issues: price/performance, the microprocessor revolution, commodity microprocessors.
Not enough parallelism for the biggest problems; hard to vectorize/parallelize automatically.
They didn't scale down.
MPPs
Glory days: 1990-96
Famous examples: Intel hypercubes and Paragon, TMC Connection Machine, IBM SP, Cray/SGI
T3E.
Characterized by:
Scalable interconnection networks, scaling up to thousands of processors.
Commodity (or at least, modest) microprocessors.
Message passing programming paradigm.
Killed by:
Small market niche, especially as a modest number of processors can do more and more.
Programming paradigm too hard.
Relatively slow communication (especially latency) compared to ever-faster processors (this is
actually no more and no less than another example of the memory wall).
Today
A state of flux in hardware.
But more stability in software, e.g., MPI and OpenMP.
Machines are being sold, and important problems are being solved, on all of the following:
Vector SMPs, e.g., Cray X1, Hitachi, Fujitsu, NEC.
SMPs and ccNUMA, e.g., Sun, IBM, HP, SGI, Dell, hundreds of custom boxes.
Distributed memory multiprocessors, e.g., Cray XT3, IBM Blue Gene.
Clusters: Beowulf (Linux) and many manufacturers and assemblers.
A complete top-down view: At the highest level you have either a distributed memory
architecture with a scalable interconnection network, or an SMP architecture with a bus.
A distributed memory architecture may or may not provide support for a global memory
consistency model (such as cache coherence, software distributed shared memory, coherent
RDMA, etc.). On an SMP architecture you expect hardware support for cache coherence.
A distributed memory architecture can be built from SMP or even (rarely) ccNUMA boxes. Each
box is treated as a tightly coupled node (with local processors and uniformly accessed shared
memory). Boxes communicate via message passing, or (less frequently) with hardware or
software memory coherence schemes. Both on distributed and on shared memory architectures,
the processors themselves may support an internal form of task or data parallelism. Processors
may be vector processors, commodity microprocessors with multiple cores, or multiple threads
multiplexed over a single core, heterogeneous multicore processors, etc.
Programming: Typically MPI is supported over both distributed and shared-memory substrates
for portability (large existing base of code written and optimized in MPI). OpenMP and POSIX
threads are almost always available on SMPs and ccNUMA machines. OpenMP implementations
over distributed memory machines with software support for cache coherence also exist, but
scaling these implementations is hard and is a subject of ongoing research.
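As a small illustration of the message-passing style mentioned above, here is a hedged sketch using the mpi4py Python binding of MPI; the choice of mpi4py (rather than C MPI) and the toy summation are assumptions made only for illustration.

# Sketch: message-passing style with MPI (via mpi4py), run e.g. with
#   mpiexec -n 4 python sum_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # this process's ID
size = comm.Get_size()      # total number of processes

# each process computes a partial sum over its own slice of 0..999999
local = sum(range(rank, 1_000_000, size))
# combine the partial sums on process 0
total = comm.reduce(local, op=MPI.SUM, root=0)
if rank == 0:
    print("total =", total)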
Future
The end of Moore's Law?
Nanoscale electronics
Exotic architectures? Quantum, DNA/molecular.
5. Explain various methods for solving problems in parallel.
(November 2014)
Temporal parallelism (pipeline processing), assuming every task takes the same time:
Let the number of jobs = n
Time to do a job = p
Each job is divided into k tasks
Time for each task = p/k
Time to complete n jobs with no pipeline processing = np
Time to complete n jobs with pipeline processing by k teachers = p + (n-1)(p/k) = p(k+n-1)/k
Speedup due to pipeline processing = np / [p(k+n-1)/k] = k / [1 + (k-1)/n]
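A quick numerical check of this speedup formula (a minimal sketch; the values of n, k and p below are illustrative assumptions):

# Pipeline (temporal parallelism) speedup: nk / (n + k - 1)
n, k, p = 1000, 4, 20.0   # papers, pipeline stages (teachers), minutes per paper

serial_time   = n * p
pipeline_time = p + (n - 1) * p / k          # = p * (k + n - 1) / k
speedup       = serial_time / pipeline_time  # = k / (1 + (k - 1) / n)

print(round(speedup, 3))   # approaches k = 4 as n grows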
Problems encountered in temporal parallelism:
Synchronization: every task should take an identical amount of time, so the pipeline stages must be synchronized.
Bubbles in pipeline: if some tasks are absent (e.g., unanswered questions), bubbles are formed and processors idle.
Fault tolerance: the method does not tolerate faults; if one stage fails, the whole pipeline stalls.
Scalability: the number of pipeline stages cannot be increased beyond the number of tasks into which each job is divided.
Data parallelism (static assignment of jobs):
Advantages:
No synchronization needed between the teachers.
No bubbles, as in a pipeline.
No communication needed between the teachers.
Disadvantages:
The assignment of jobs is static.
The job must be partitionable into independent jobs.
Data parallelism with dynamic assignment of jobs:
Advantages:
No bubbles; the load is balanced among the teachers as they become free.
Disadvantages:
Each teacher must wait (time q) to receive the next paper, which reduces the speedup, as the derivation below shows.
If the speedup of a method is directly proportional to the number of processors used, the method is said to scale well.
Let the total number of papers = n
Let there be k teachers
Time a teacher waits to get a paper = q
Time for each teacher to get, grade and return a paper = (q + p)
Total time to grade n papers with k teachers = n(q + p)/k
Speedup due to parallel processing = np / [n(q + p)/k] = k / [1 + (q/p)]
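A corresponding check for the data-parallel case, showing how the waiting time q keeps the speedup below k (the values below are illustrative assumptions):

# Data parallelism with dynamic assignment: speedup = k / (1 + q/p)
n, k = 1000, 4          # papers, teachers
p, q = 20.0, 1.0        # grading time and waiting time per paper

serial_time   = n * p
parallel_time = n * (q + p) / k
speedup       = serial_time / parallel_time   # = k / (1 + q/p)

print(round(speedup, 3))   # 3.81, slightly below k because of the waiting time q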
Comparison between temporal and data parallelism (April 2013):
TEMPORAL PARALLELISM                       DATA PARALLELISM
Tasks of a job are dependent               Tasks are independent
Bubbles in the pipeline cause idling       No bubbles
Assignment of tasks is static              Assignment may be static, dynamic or quasi-dynamic
Does not tolerate processor faults         Tolerates processor faults
Efficient with fine-grained tasks          Efficient with coarse-grained tasks
Data parallel processing with specialized processors: there is a head examiner who dispatches answer papers to the teachers. We assume that teacher 1 (T1) grades answer A1, teacher 2 (T2) grades answer A2 and, in general, teacher i (Ti) grades answer Ai to question Qi.
Procedure:
1. Give one answer book each to T1, T2, T3 and T4.
2. When a corrected answer paper is returned, check if all questions are graded. If yes, add up the marks and put the paper in the output pile.
3. If not, check which questions are not yet graded.
4. For each i, if Ai is ungraded, send the paper to teacher Ti if Ti is idle, or to any other idle teacher Tp.
5. Repeat steps 2, 3 and 4 until no answer paper remains in the input pile.
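A hedged Python sketch of this dispatch loop is given below; the number of papers, the round-based timing and the exact tie-breaking rule for idle teachers are assumptions made only to keep the example short and runnable.

# Sketch: head examiner dispatching papers to 4 specialist teachers.
# Each paper needs answers A1..A4 graded; teacher Ti prefers question Qi
# but may grade any ungraded question when idle (an assumed tie-break).
from collections import deque

K = 4
papers = deque({"id": n, "graded": set()} for n in range(10))   # input pile
busy = {}            # teacher index -> paper currently being graded
output = []          # fully graded papers

def dispatch(paper):
    """Give the paper to its specialist if idle, else to any idle teacher."""
    ungraded = [q for q in range(K) if q not in paper["graded"]]
    for q in ungraded:                       # prefer the matching specialist
        if q not in busy:
            busy[q] = paper
            return True
    for t in range(K):                       # otherwise any idle teacher
        if t not in busy:
            busy[t] = paper
            return True
    return False

while papers or busy:
    # steps 1 and 4: hand out papers while teachers are idle
    while papers and dispatch(papers[0]):
        papers.popleft()
    # steps 2 and 3: teachers finish one answer each and return their papers
    for t, paper in list(busy.items()):
        ungraded = [q for q in range(K) if q not in paper["graded"]]
        paper["graded"].add(t if t in ungraded else ungraded[0])
        del busy[t]
        if len(paper["graded"]) == K:
            output.append(paper["id"])       # all questions graded
        else:
            papers.append(paper)             # back to the pile, steps repeat

print(len(output), "papers fully graded")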
Another method: the answer papers are divided into 4 equal piles and put in the in-trays of the 4 teachers. Each teacher simultaneously repeats steps 1 to 5 below four times.
For teachers Ti (i = 1 to 4) do in parallel:
1. Take an answer paper from the in-tray.
2. Grade answer Ai to question Qi and put the paper in the out-tray.
3. Repeat steps 1 and 2 till no papers are left in the in-tray.
4. Check if teacher ((i+1) mod 4)'s in-tray is empty.
5. As soon as it is empty, empty your own out-tray into the in-tray of that teacher.
METHOD 8: Agenda Parallelism
The answer book is thought of as an agenda of questions to be graded. All teachers are asked to work on the first item on the agenda, namely to grade the answer to the first question in all papers. The head examiner gives one paper to each teacher and asks him to grade the answer A1 to Q1. When a teacher finishes this, he is given another paper. This is a data parallel method with dynamic scheduling and fine-grained tasks.
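A hedged sketch of this agenda-style dynamic scheduling with a shared work queue; the use of Python threads and the stand-in grading function are assumptions for illustration.

# Sketch: agenda parallelism - every worker takes the next item of the agenda
# (grade answer A1 of some paper) from a shared queue until the agenda is done.
import queue
import threading

agenda = queue.Queue()
for paper_id in range(100):          # agenda item: grade A1 of this paper
    agenda.put(paper_id)

marks = {}
lock = threading.Lock()

def teacher():
    while True:
        try:
            paper_id = agenda.get_nowait()   # dynamic, fine-grained scheduling
        except queue.Empty:
            return
        grade = (paper_id * 7) % 10          # stand-in for actual grading work
        with lock:
            marks[paper_id] = grade
        agenda.task_done()

threads = [threading.Thread(target=teacher) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(len(marks), "answers to Q1 graded")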
6. Briefly explain Inter-Task Dependency with an example.
The following assumptions were made in assigning tasks to the teachers:
The answer to a question is independent of the answers to the other questions.
Teachers do not have to interact.
The same instructions are used to grade all answer books.
In general, tasks are inter-related: some tasks can be done independently and simultaneously, while others have to wait for the completion of previous tasks. The inter-relations of the various tasks of a job may be represented graphically as a task graph.
Procedure: Recipe for Chinese vegetable fried rice:
T1: Clean and wash rice
T2: Boil water in a vessel with 1 teaspoon salt
T3: Put rice in boiling water with some oil and cook till soft
T4: Drain rice and cool
T5: Wash and scrape carrots
T6: Wash and string French beans
T7: Boil water with teaspoon salt in 2 vessels
T8: Drop carrots and French beans in boiling water
T9: Drain and cool carrots and French beans
T10: Dice carrots
T11: Dice French beans
T12: Peel onions and dice into small pieces
T13: Clean cauliflower and cut into small pieces
T14: Heat oil in an iron pan and fry the diced onion and cauliflower for 1 min in the heated oil
T15: Add diced carrots and French beans to above and fry for 2 min.
T16: Add cooled cooked rice, chopped onions and soya sauce to the above and stir and fry for 5 min.
There are 16 tasks in this recipe; some of them have to be carried out in sequence, while others can be carried out simultaneously. The relationships among the tasks may be represented as a task graph.
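To make the task graph concrete, the sketch below encodes one plausible reading of the dependencies among T1..T16 (the exact edges are an assumption, not the book's figure) and prints which tasks can be carried out in parallel at each step.

# Sketch: task graph for the fried-rice recipe and a level-by-level schedule.
# The dependency edges below are an assumed reading of tasks T1..T16.
deps = {
    "T1": [], "T2": [], "T3": ["T1", "T2"], "T4": ["T3"],
    "T5": [], "T6": [], "T7": [], "T8": ["T5", "T6", "T7"],
    "T9": ["T8"], "T10": ["T9"], "T11": ["T9"],
    "T12": [], "T13": [], "T14": ["T12", "T13"],
    "T15": ["T10", "T11", "T14"], "T16": ["T4", "T15"],
}

done = set()
step = 0
while len(done) < len(deps):
    # every task whose predecessors are all finished can run in parallel now
    ready = [t for t in deps if t not in done and all(d in done for d in deps[t])]
    step += 1
    print(f"step {step}: run in parallel ->", ready)
    done.update(ready)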
7. Explain about the models of Computation. (April 2013)
Models of computation include:
RAM (Random Access Machine)
Interconnection Networks
Combinational Circuits
Example systems: the Connection Machine (a parallel computer) and the Internet (a distributed system) illustrate the difference between parallel and distributed computing.
Parallel vs. Distributed computing:
Technical aspects:
Parallel computers (usually) work in tight synchrony, share memory to a large extent, and have a very fast and reliable communication mechanism between them.
Distributed computers are more independent; communication is less frequent and less synchronous, and the cooperation is limited.
Purposes:
Parallel computers cooperate to solve a single problem as fast as possible.
Distributed computers have individual goals and private activities; sometimes communication with other computers is needed (e.g., distributed database operations).
The RAM (Random Access Machine) model consists of a single processor attached to a memory with M locations. Each step of the RAM consists of:
A READ phase, in which the processor reads a datum from a memory location and copies it into a register.
A COMPUTE phase, in which the processor performs a basic operation on the data from one or two of its registers.
A WRITE phase, in which the processor copies the contents of an internal register into a memory location.
8. Explain the PRAM Model of Computation. (April 2014)
The PRAM (Parallel Random Access Machine) consists of a number of processors, each with its own local memory, attached to a common shared memory.
The processors communicate using m shared (or global) memory locations, U1, U2, ..., Um.
Allowing both local and global memory is typical in studies of such models.
All processors operate synchronously (i.e. using same clock), but can execute a different sequence
of instructions.
Some authors inaccurately restrict PRAM to simultaneously executing the same sequence
of instructions (i.e., SIMD fashion)
Each processor has a unique index, called the processor ID, which can be referenced by the processor's program.
This is often an unstated assumption of a parallel model.
Each PRAM step consists of three phases, executed in the following order:
A read phase in which each processor may read a value from shared memory
A compute phase in which each processor may perform basic arithmetic/logical operations
on their local data.
A write phase where each processor may write a value to shared memory.
Note that this prevents reads and writes from being simultaneous.
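A hedged simulation of synchronous PRAM steps, using the classic parallel-sum example; the sample data and the number of simulated processors are assumptions. Each iteration performs the read phase for all processors before any result is written back, mirroring the three-phase step described above.

# Sketch: simulating log(n) synchronous PRAM steps that sum n numbers.
# Shared memory M holds the data; processor i handles index pairs at stride d.
M = [3, 1, 4, 1, 5, 9, 2, 6]          # shared memory (assumed sample data)
n = len(M)

d = 1
while d < n:
    # READ phase: every active processor reads its two operands first
    reads = [(M[i], M[i + d]) for i in range(0, n, 2 * d)]
    # COMPUTE + WRITE phase: results are written back only after all reads
    for k, i in enumerate(range(0, n, 2 * d)):
        a, b = reads[k]
        M[i] = a + b
    d *= 2

print(M[0])   # 31, the sum of the original values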
Access to the shared memory can be restricted in the following ways:
Exclusive Read (ER): Two or more processors cannot simultaneously read the same memory location.
Concurrent Read (CR): Any number of processors can read the same memory location simultaneously.
Exclusive Write (EW): Two or more processors cannot write to the same memory location simultaneously.
Concurrent Write (CW): Any number of processors can write to the same memory location simultaneously.
Priority CW: The processor with the highest priority writes its value into a memory location.
Common CW: Processors writing to a common memory location succeed only if they write the
same value.
Arbitrary CW: When more than one value is written to the same location, any one of these values
(e.g., one with lowest processor ID) is stored in memory.
Random CW: One of the processors is randomly selected to write its value into memory.
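A small hedged sketch of how these concurrent-write rules resolve a set of simultaneous writes to one location; the competing (processor, value) pairs and the priority ordering are assumptions for illustration.

# Sketch: resolving concurrent writes to one shared location under the
# Priority, Common, Arbitrary and Random CW rules (values are assumed).
import random

# (processor_id, value) pairs all trying to write the same location in one step
writes = [(3, 7), (1, 7), (5, 7)]

def priority_cw(ws):  # assumed priority order: lowest processor ID wins
    return min(ws)[1]

def common_cw(ws):    # succeeds only if every processor writes the same value
    values = {v for _, v in ws}
    if len(values) != 1:
        raise ValueError("Common CW requires identical values")
    return values.pop()

def arbitrary_cw(ws): # any one value may be stored, e.g. lowest processor ID
    return min(ws)[1]

def random_cw(ws):    # a randomly selected processor's value is stored
    return random.choice(ws)[1]

for rule in (priority_cw, common_cw, arbitrary_cw, random_cw):
    print(rule.__name__, "->", rule(writes))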
11 Marks:
1. Explain about the models of Computation. (April 2013) (Q. No. 7, Ref.Pg.No.22)
2. Write a comparison of Temporal and Data Parallel Processing. (April 2013) (Q. No. 5, Ref.Pg.No.17)
3. Consider an examination paper that has 4 questions to be answered and there are 1000 answer books.
Illustrate how data parallel processing with specialized processors is done for the above problem.
(November 2013)
4. Discuss the various abstract machine models for parallel computers in detail. (November 2013)
5. A. Compare Temporal and Data Parallelism. (April 2014) (Q. No. 5, Ref.Pg.No.17)
B. Compare BSP and PRAM.
6. Explain the PRAM Model of Computation. (April 2014) (Q. No. 8, Ref.Pg.No.20)
7. Discuss the history of past and present parallel Computers. (November 2014) (Q. No. 4,
Ref.Pg.No.11)
8. Discuss the various parallel computing models. (November 2014) (Q. No. 5, Ref.Pg.No.13)