
A PARALLEL ALGORITHM FOR FAST EDGE DETECTION ON

THE GRAPHICS PROCESSING UNIT

An Honors Thesis

Presented to

The Faculty of the Department of Computer Science

Washington and Lee University

In Partial Fulfillment Of the Requirements for

Honors in Computer Science

by

Alexander Lee Jackson

May 2009
To my Mother and Father. . .
Contents

1 Introduction
  1.1 General Purpose GPU Programming
  1.2 Image Processing
  1.3 Motivation
  1.4 Thesis Layout

2 Background
  2.1 Parallel Computing
  2.2 Graphics Processing Units
    2.2.1 General Purpose Graphics Processing Unit
    2.2.2 Compute Unified Device Architecture
  2.3 Image Processing

3 GPU Edge Detection Algorithms
  3.1 One Pixel Per Thread Algorithm
  3.2 Multiple Pixels Per Thread Algorithm

4 Evaluation and Results
  4.1 Method
  4.2 Results

5 Conclusions
  5.1 Conclusion
  5.2 Future Work

Bibliography

ACKNOWLEDGMENTS

I would like to thank my adviser Rance Necaise for assisting me through this long and, at
times, frustrating process.
Thanks also to Tania S. Douglas and her research team at the University of Cape Town
in South Africa for inspiring this project and supplying a number of sample images.
A special thank you to Ryleigh for talking me down from many ledges when it got
overwhelming.

ABSTRACT

Often, it is a race against time to make a proper diagnosis of a disease. In areas of the world
where qualified medical personnel are scarce, work is being done on the automated diagnosis
of illnesses. Automated diagnosis involves several stages of image processing on lab samples
in search of abnormalities that may indicate the presence of such things as tuberculosis.
These image processing tasks are good candidates for migration to parallelism, which would
significantly speed up the process. However, a traditional parallel computer is not a very
accessible piece of hardware to many. The graphics processing unit (GPU) has evolved into
a highly parallel component that recently has gained the ability to be utilized by developers
for non-graphical computations.
This paper demonstrates the parallel computing power of the GPU in the area of medical
image processing. We present a new algorithm for performing edge detection on images using
NVIDIA's CUDA programming model to program the GPU in C. We evaluated our algorithm on a number of sample images and compared it to two other implementations: one sequential and one parallel. This new algorithm produces impressive speedup in the
edge detection process.

A PARALLEL ALGORITHM FOR FAST EDGE DETECTION ON
THE GRAPHICS PROCESSING UNIT
Chapter 1

Introduction

Graphics processing units (GPUs) have evolved over the past decade into highly parallel,

multicore processors[2]. Until recently, these extremely powerful pieces of hardware were

typically only used for processing graphical data. Unless the user is playing a graphics-intensive computer game or executing some other application of a graphical nature, these

high-powered GPUs are often underutilized. In recent years researchers have begun to

investigate the viability of using this highly parallel, highly efficient processing unit for

computations of a non-graphical nature.

This project focused on using GPUs for image processing. We implemented two different

load-balancing algorithms for use with the GPU and showed the advantages of programming

the GPU over the central processing unit (CPU) for such problems.

1.1 General Purpose GPU Programming

Parallel computing has become the go-to method for dealing with problems that have large

data sets and computationally intensive calculations. Some examples of such problems are

scientific modeling simulations, weather forecasting, and modeling the relations between

many heavenly bodies. However, there are limitations to the use of the high-powered com-

puters that are necessary for executing these parallel applications. These super-computers


are often very large in size and are typically quite pricey.

Working with the GPU for computationally intensive problems has several advantages

over the alternative options for parallelism. The GPU is a much more physically and

financially manageable piece of machinery, with a top-of-the-line unit going for at most

a few thousand dollars. It also has a growing community of enthusiasts that have been

showing impressive speed-up capabilities through GPU utilization. NVIDIA has become

the industry's leading proponent of GPGPU (General Purpose Graphics Processing Unit)

programming through their release and support of CUDA. CUDA is an extension to the C

programming language that has helped to make GPGPU programming more accessible to

developers.

1.2 Image Processing

Image processing refers to the manipulation and analysis of pictorial information[3]. Using

forms of image processing has become an important part of society. It is utilized in the scientific and entertainment communities to do such things as convert photographs to black and white or increase the sharpness of an image. In relation to this paper, manipulating an image is an important step in implementing computer vision. Computer vision involves the automation of such important tasks as the diagnosis of illnesses or facial recognition.

The general idea behind image processing involves examining image pixels and manip-

ulating them as defined by the type of image processing desired. Image processing can be

a time consuming task and, luckily, lends itself nicely to conversion to a parallel algorithm.

1.3 Motivation

Performing research under the R.E. Lee Scholar program initially sparked interest in uti-

lizing parallel algorithms for high-performance computing. The summers of 2007 and 2008 were spent working with this concept. Preliminary research was done for this thesis prior

to the beginning of classes in the Fall. We spent this time becoming proficient in program-

ming with CUDA and converted a simple physics heat-diffusion algorithm to a GPU based

solution.

Inspiration for this project was taken from Professor Tania S. Douglas and her research

group, MRC/UCT Medical Imaging Research Unit, Department of Human Biology, Univer-

sity of Cape Town in South Africa. This research group has been developing an automated

process for the diagnosis of tuberculosis, particularly for low-income and developing countries[4].

In the current environment, the diagnosis of tuberculosis is a time-consuming task that re-

quires a highly trained technician to examine a sputum smear underneath a microscope.

Examining sputum smears (smears of matter taken from the respiratory tract) under a

microscope is the number one method for diagnosing tuberculosis, according to the World

Health Organization (WHO) [1].

The problem with the current method for TB diagnosis is the human element inherent

to it. Each slide must be closely examined by a medical technician with a level of compe-

tency that is not necessarily guaranteed. This problem is compounded by the fact that in

developing countries where TB is still a major health risk, there is usually a shortage of se-

nior pathologists to verify manual screening, a requirement of the WHO. Additionally, slide

examination is a tedious and time-consuming task. On average, a technician will examine

each sputum slide for five minutes and examine around 25 slides per day [4]. Automating

TB diagnosis will help to alleviate the need for highly trained medical technicians to per-

form unexciting tasks while at the same time increasing the accuracy of diagnosis and the

number of diagnoses that can be made in a given period of time.

Our work on the GPU with CUDA is significant for the University of Cape Town group

because of the aforementioned advantages of utilizing the GPU in parallel applications.

Using CUDA has the potential to yield speedups that are orders of magnitude over the sequential counterpart. Additionally, the small space requirement and relatively low cost of

implementing a GPU for scientific applications allows for a degree of portability unavailable

to a typical high powered computer. Creating a cost-effective method for medical computing

in undeveloped areas has the potential to help improve the conditions in places that do not

have the level of health care enjoyed in other countries.

The automated diagnosis of TB through the examination of sputum smears requires

several steps, one of which is the recognition of abnormal smears. This recognition requires

that the image of the smear go through some form of image processing involving edge

detection. There are a number of well-documented edge detection algorithms, but the one we chose to implement for this research was the Laplacian method of pixel

classification[3].

1.4 Thesis Layout

Our intention in this thesis is to show the advantages of utilizing CUDA for image processing.

In the second chapter we provide some background information on parallel computing in

general and GPGPU programming in addition to a more in-depth description of image

processing. The third chapter describes our two GPU algorithms. Chapter four reports our

method of experimentation and the results of our work, and we make our conclusions in

chapter five. We show that the edge detection algorithm on the GPU is considerably faster

and more efficient than the sequential version, and discuss how a GPU implementation

of the entire diagnostic process could be achieved.


Chapter 2

Background

This thesis makes reference to the area of parallel computing. We used this area of computa-

tional science as a foundation for our work on GPGPU programming with CUDA. Typical

parallel computing has a number of advantages and disadvantages, which we discuss in this

chapter. We also lay out how GPGPU programming compares to traditional parallel computing.

2.1 Parallel Computing

Parallel computing is a simple idea. The human brain is well versed in performing tasks in

parallel; however, a single computer processor can only do one thing at a time in sequential

order. A parallel computer, which can perform multiple computations at once, can be used

to solve simple problems in a matter of minutes that would normally take hours or days on

a single processor.

The most basic concept behind parallel computing is the distribution of the workload

between the individual processors that are working together to perform computations. For

example, consider the problem of matrix addition, which is an embarrassingly parallel ap-

plication. Embarrassingly parallel problems are those that require no other effort beyond

dividing up the work and having each processor operate on its portion as though it were

a sequential algorithm. Suppose we have a parallel machine with p processors, and we want


to add two matrices with p elements each. This problem only requires that we distribute

the two corresponding elements from each matrix to their own processor, let the processor

add the elements, and then collect them into a solution matrix, as illustrated in Figure 2.1.

Other problems, such as matrix multiplication, are more complicated due to their require-

ment for processor cooperation or more careful and efficient distribution of data for their

calculations.

Figure 2.1: Parallel matrix addition. Each processor, Pn, gets an element from each
matrix.
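To make the distribution concrete, the following C sketch shows the work a single processor would perform in this matrix addition example; the flattened arrays a, b, and c are hypothetical names used only for illustration.

/* Sketch of the work performed by processor p in the matrix addition
 * example: each processor adds the one pair of corresponding elements
 * it was given. The arrays a, b, and c are hypothetical. */
void add_element(const int *a, const int *b, int *c, int p)
{
    c[p] = a[p] + b[p];    /* processor p handles element p only */
}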

The ideal theoretical parallel computer consists of p processors with an infinite amount of

memory. With such resources available, we can divide the workload between p processors to

the point where each processor works with the smallest possible piece of data. In practice, however, such a computer does not exist. Efficient and proper use of a parallel

computer involves careful workload distribution between processors.

There are two basic architectures used with parallel computers: SIMD and MIMD.

Single Instruction Multiple Data (SIMD) architecture is defined by a simultaneous execution

of a single instruction on multiple pieces of data[9]. For example, referring back to the

matrix addition case, every processor has access to each element in both matrices and each

processor executes the same instructions, adding the corresponding elements based on their

processor ids. In Multiple Instruction Multiple Data (MIMD) architecture, each processor

runs independently using its own set of instructions[9]. A Beowulf cluster, a common type

of parallel computer, is implemented using the MIMD architecture[9].

Inter-processor communications allow for the transmission of data between processors



and sending of signals for reporting on their status. In many cases this allows for the syn-

chronization of the individual processors among their group, preventing some from moving

forward before the others are ready. Often, synchronization is crucial for reliable execution

of applications due to sharing of memory space. A race condition may occur if two proces-

sors require read/write access to the same data register. When this happens, you cannot predict which processor gains access to the data first, and therefore you cannot guarantee

the integrity of the data[8]. Race conditions can be avoided through inter-processor com-

munications. How the processors communicate usually depends on the type of processor

relationship being implemented.

Washington and Lee's Beowulf cluster, The Inferno, is a 64-processor cluster that uses

the MPI (Message Passing Interface) protocol for communication between processors. In

clusters such as this one, there is no central shared memory repository for data; instead data

is often distributed by a central processor and stored in each processor's local memory. Early

parallel computers made use of a central shared memory pool, but as time has progressed it

has become difficult and expensive to make larger machines with this form of memory[6].

MPI is used to distribute data between processors and allow for inter-processor

communication. With the message-passing model, we are able to work with very large sets

of data without being restricted by the size of a global shared memory pool[6].

The type of relationship between processors working in parallel determines the structure

for the program itself. The master/slave relationship is a common paradigm for working

in parallel. This consists of a sole master processor that presides over a group of slave

processors. The master is responsible for organizing and distributing the data while the

slaves operate on the data[6]. Typically, the slaves, upon starting, signal the master that

they are ready and waiting. As these signals are received, the master processor distributes

the data among the slave processors. The master is only responsible for managing the

data. In some cases, the master processor may clean up loose ends, but generally the slave

processors do the majority of the computing, as illustrated in Figure 2.2. Once they have

finished, the master collects the resulting data set from the slaves. The work-pool method

for processor communication, as illustrated in Figure 2.3, is similar to the master/slave

method, except that the slave processors make requests for smaller chunks of data from a

pool managed and distributed by the master processor [8].


Figure 2.2: Master/Slave workload distribution.


Figure 2.3: Work pool workload distribution.
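As an illustration of the master/slave paradigm, a minimal MPI sketch in C might have the master process send one chunk of data to each slave and collect the results, as shown below. The chunk size and the process_chunk computation are hypothetical placeholders, not part of this thesis.

#include <mpi.h>

#define CHUNK 1024                 /* hypothetical number of values per chunk */

/* Hypothetical computation each slave performs on its chunk. */
static void process_chunk(double *data, int n)
{
    for (int i = 0; i < n; i++)
        data[i] *= 2.0;
}

int main(int argc, char **argv)
{
    int rank, size;
    double chunk[CHUNK];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Master: distribute one chunk to each slave, then collect results.
           A real program would first fill chunk with the portion of the
           data set assigned to slave p. */
        for (int p = 1; p < size; p++)
            MPI_Send(chunk, CHUNK, MPI_DOUBLE, p, 0, MPI_COMM_WORLD);
        for (int p = 1; p < size; p++)
            MPI_Recv(chunk, CHUNK, MPI_DOUBLE, p, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    } else {
        /* Slave: receive a chunk, operate on it, and return the result. */
        MPI_Recv(chunk, CHUNK, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        process_chunk(chunk, CHUNK);
        MPI_Send(chunk, CHUNK, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}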

When the data set is large or the computation complex, utilizing a parallel computer and

an algorithm that employs an effective load-balancing scheme can result in a dramatic increase in performance. With enough processors working together, the only performance

limitation is the time for inter-processor communication.

2.2 Graphics Processing Units

A GPU is quite different from a CPU. The GPU is concerned with one thing, the processing of graphics data, whereas the CPU is responsible for general computations and system

administration. GPUs process graphical data in the form of vertices in a geometric space.

This data is converted into a 2-dimensional image for display on the monitor through a

process known as the graphics pipeline[5]. The pipeline consists of a number of stages,

as illustrated in Figure 2.4. At each stage, the input consists of a set of vertices that are

manipulated or transformed. Data can be streamed into the graphics pipeline since vertices

can be processed independently of each other. Thus, the basic operations of the

graphics pipeline can be performed in parallel.

Figure 2.4: The OpenGL graphics pipeline.

Over time, the GPU has evolved into a very different component. The first genera-

tion of graphics processors were not themselves programmable[5]. Their instructions were

hard-coded in the chip-set with data transmitted from the CPU. However, this changed

with the development of the programmable GPU. The original programmable GPUs were

utilized with graphics processing APIs such as OpenGL and DirectX[5]. Developers gained

more control over the GPU in the next generation with graphics-specific languages such as NVIDIA's Cg[5].

Due in large part to increased consumer demand over the years for computer graphics

that continue to dazzle the eye, graphics processing units have evolved into highly parallel,

multithreaded, many-core processors with tremendous computational horsepower and very

high memory bandwidth[2]. Today's GPUs are capable of performing the necessary cal-

culations in real-time without skipping a beat so that gamers and researchers can become

immersed in the newest state-of-the-art computer games and scientific imaging applications.

The tremendous power and level of control the developer has over the current generation of

GPUs has made it attractive for developers to utilize the GPU for general-purpose applications in addition to graphics processing, a practice known as General Purpose Graphics Processing Unit (GPGPU) programming. The rising popularity of GPGPU has led to the newest gen-

eration of GPUs being constructed with the idea that they might not necessarily be used

exclusively for graphics processing.

2.2.1 General Purpose Graphics Processing Unit

In recent years a new area of parallel computing has begun to garner a good deal of attention

due to its affordability and power. GPGPU programming takes the highly parallel nature

of the GPU and applies it to computationally expensive algorithms. The area of GPGPU

programming focuses on using these programmable graphics cards for more than what they

were intended for, that is, general-purpose, high-end computations.

GPGPU programming came about from the powerful nature of the GPU. Graphics

processing involves a heavy volume of mathematically intensive operations to create and

transform objects within a geometric space. Since the GPU traditionally only handles one

aspect of the computer, there is considerably more space on the chip for data processing

and storage to the point where the number of transistors present on the GPU has, over the

past five years or so, greatly surpassed the number of transistors on the CPU,

as illustrated in Figure 2.5[2]. Furthermore, since Moore's Law of increasing computational power also applies to the GPU, we can expect the number of transistors on a state-of-the-art graphics processor to double approximately every two years[2]. Addition-

ally, a powerful GPU simply resides in the computer tower along with the other pieces of

hardware as opposed to a parallel cluster, which takes up an entire room.

Figure 2.5: GPU vs. CPU speeds over time.

Taking advantage of the GPU's capacity for speed is not that simple. Early GPGPU

methods for programming were tedious with high learning curves because data had to

be represented in ways completely different from typical programming methods on the

CPU. Data in the GPU must be stored in the form of vertices. This adds complication to

programming the GPU for tasks other than those of a graphical nature.

When considering the use of GPGPU programming, it is important to consider the

advantages and disadvantages. As stated earlier, the highly parallel nature of the GPU is

capable of far superior performance on certain types of computations when compared to the

CPU. The price of a graphics card is another huge draw. A top-of-the-line GPU sells

for only a few thousand dollars whereas a traditional parallel computer of any reasonable

size goes for many times that.

2.2.2 Compute Unified Device Architecture

NVIDIA, the most prevalent graphics card producer in the computing industry,

released its Compute Unified Device Architecture and the corresponding CUDA language in

2006 as one of the first programming languages meant specifically for GPGPU programming.

CUDA is an extension of the C programming language that adds some syntax for working

with the GPU. NVIDIA's newest generation of graphics cards is CUDA-capable, allowing GPU programming with this simple extension to the C programming language.

The CUDA-based graphics cards use a SIMD-like architecture referred to by NVIDIA

as Single Instruction Multiple Thread (SIMT)[2]. As an application is executing, each

thread is mapped to one of the multiprocessor cores (8 to 128 cores per multiprocessor,

up to 30 multiprocessors, depending on the card). Each thread has an id that is used to

distribute the workload among them. Each core is able to run in parallel, resulting in as

many as 3840 threads running simultaneously. When there are too many threads for a fully parallel execution, the scheduling of threads on cores is handled by the hardware. Because scheduling is handled on the GPU, context switches are very fast, giving the illusion of a fully parallel execution.

The section of code that is executed by the GPU is known as the device kernel. Before

this kernel can execute, data must be transferred to an allocated memory space on the GPU.

To make the most of CUDA, the programmer must distribute data throughout the GPU's memory and determine what type of memory to utilize[7]. Allocating memory correctly requires a basic understanding of how the different memory types are accessed and which type is most efficient for the task at hand, because if memory distribution is handled incorrectly, performance can be even worse than if the problem were simply solved on the CPU.
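A minimal sketch of this host-side allocate, copy, launch, and free sequence is shown below. The kernel named process and the buffer sizes are hypothetical and stand in for whatever computation is being migrated to the GPU.

#include <cuda_runtime.h>

__global__ void process(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                         /* placeholder computation */
}

void runOnGPU(float* hostData, int n)
{
    float* devData;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void**)&devData, bytes);                          /* allocate device memory    */
    cudaMemcpy(devData, hostData, bytes, cudaMemcpyHostToDevice); /* copy input to the GPU     */

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    process<<<blocks, threads>>>(devData, n);                     /* execute the device kernel */

    cudaMemcpy(hostData, devData, bytes, cudaMemcpyDeviceToHost); /* copy the results back     */
    cudaFree(devData);                                            /* free device memory        */
}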

When allocating memory on the device, there are primarily three kinds of memory that

can be accessed: global memory, shared memory, and local registers[2]. Figure 2.6 illustrates

how the different types of memory are arranged with respect to each other. Data stored in global

memory is available to all threads at once. Global data is loaded into the device from the

CPU during a memory allocation stage that occurs prior to the kernel execution. Memory

locations that are accessed consecutively within memory are most efficiently allocated to

global memory; however, care must be taken because if different thread blocks on the GPU

require read/write access to the same global memory register, there is no guarantee the

value in the memory location is correct due to race conditions[2].

In the case of shared memory, data stored in this memory type is only accessible by

threads in a common block. If able to divide data into chunks, loading this into shared

memory allows for extremely fast retrieval within the thread block. Much like global mem-

ory, the developer must be careful using shared memory to avoid race conditions and the

resulting data corruption. Shared memory on this most recent generation of GPUs is limited

to 16 KB per block[2].

Figure 2.6: CUDA memory organization and access.

CUDA takes care of organizing the threads in the GPU. The user simply specifies how

the threads are divided amongst a collection of blocks; the number of these blocks, and how they relate to each other in a grid of up to three dimensions, is also specified by the user. Thread blocks can have up to three dimensions. Taking advantage of these dimensions is useful

for working with data of various sizes. There are restrictions, however, as a thread block

has a limit to the number of threads that can be allocated to the block. CUDA allows

for a maximum of 512 threads per block[2]. How these threads are organized is up to the

user, so a one-dimensional block of 512 threads and a two-dimensional block of 32 x 16

are both valid. Some possible thread blocks are shown in Figure 2.7. Additionally, in the

background, threads within a block are grouped together in what is referred to as a warp[2].

These warps are made up of at most 32 threads and always contain consecutive threads

of increasing ids. The order of execution of warps is undefined. However, warps within

the same block are able to synchronize with each other for safe global and shared memory

access.
Figure 2.7: Some possible thread block allocations (1-D, 2-D, and 3-D thread blocks).

Each thread within a block is assigned a unique thread id that is determined by its

placement within the block dimensions. Typically, these thread ids play a part in the division of labor between threads of a block. For instance, the thread with an id of one may be responsible for all data points in the first column of some collection. Additionally, each

block of threads is also assigned a unique block id that, like the thread id, determines the

part of the workload each block is responsible for[2].
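As a concrete illustration (separate from the edge detection kernels described later), a launch might organize a 512 x 512 data set into 16 x 16 thread blocks and have each thread combine its block id and thread id to locate its own element; the kernel name and dimensions below are hypothetical.

#include <cuda_runtime.h>

__global__ void exampleKernel(float* data, int width, int height)
{
    /* Combine the block id and thread id to find this thread's element. */
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height)
        data[y * width + x] = 0.0f;               /* placeholder work */
}

int main()
{
    int width = 512, height = 512;                /* hypothetical data dimensions */
    float* devData;
    cudaMalloc((void**)&devData, width * height * sizeof(float));

    dim3 block(16, 16);                           /* 256 threads per two-dimensional block */
    dim3 grid(width / block.x, height / block.y); /* 32 x 32 blocks in the grid            */
    exampleKernel<<<grid, block>>>(devData, width, height);

    cudaFree(devData);
    return 0;
}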

Blocks of threads execute a CUDA kernel. A kernel is a globally defined function that

is run by all threads. The threads within a block execute the kernel independently of each

other[2]. This independent thread execution results in the need for a way of synchronizing the threads when data is shared, in order to ensure reliable data retrieval. CUDA has a built-in __syncthreads() function that, when called within the kernel, forces each thread to wait upon reaching the __syncthreads() call until all threads within the same block reach that point within the kernel[2]. Typically, the __syncthreads() function is needed after the threads have loaded data into shared memory, before they begin retrieving and performing computations on it. This synchronization process is illustrated in Figure 2.8.


Figure 2.8: Threads loading data into memory and synchronizing (shared memory load, synchronize, execute calculations).
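A minimal sketch of this load, synchronize, and compute pattern, assuming a one-dimensional block and a hypothetical three-element averaging computation, might look like the following; it is illustrative only and is unrelated to the edge detection kernels of Chapter 3.

#define BLOCK_SIZE 256                            /* hypothetical threads per block */

__global__ void smooth(const float* in, float* out, int n)
{
    __shared__ float tile[BLOCK_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    /* Stage 1: every thread loads one element into shared memory. */
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;

    /* Stage 2: wait until the entire block has finished loading. */
    __syncthreads();

    /* Stage 3: compute using the shared data. */
    if (i < n && threadIdx.x > 0 && threadIdx.x < blockDim.x - 1)
        out[i] = (tile[threadIdx.x - 1] + tile[threadIdx.x] + tile[threadIdx.x + 1]) / 3.0f;
}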



2.3 Image Processing

Image processing is defined as the manipulation of images. Operations on images that are

considered a form of image processing include zooming, converting to gray scale, increas-

ing/decreasing image brightness, red-eye reduction in photographs, and, in the case of this

study, edge detection, as illustrated in Figures 2.9 and 2.10. These operations typically

involve an exhaustive iteration over each individual pixel in an image.

Figure 2.9: Test image before edge detection.

Figure 2.10: Test image after edge detection.



A common method for image processing is pixel classification. Pixel classification defines a pixel's class based on one of its features; in the case of edge detection, the feature examined is its intensity versus the intensity of its neighboring pixels. Pixel classification is not limited to edge detection alone; it is also used for converting an image to gray scale. (Gray-scale conversion is also used in our work, since we found that the edges of images that had been run through an edge-detection algorithm were easier to discern if the images had been converted to gray scale first.) Pixel classification works as follows: for each pixel in an image, its

desired feature is examined, and the pixel is modified as specified. For the Laplacian edge

detection method, this process is defined by an image kernel. (Note: this kernel is unrelated

to the CUDA kernel that is executed on the GPU). The Laplacian image kernel is a 3 x

3 two-dimensional array, as shown in Figure 2.11. This kernel is applied to each pixel in

the image and takes into account the pixel's neighbors in a 3 x 3 area around it. Given the pixel identified as x_{i,j} and the kernel k, the formula for the new value of x_{i,j} is as follows:

out_{i,j} = x_{i-1,j-1} k_{i-1,j-1} + x_{i,j-1} k_{i,j-1} + x_{i+1,j-1} k_{i+1,j-1}
          + x_{i-1,j} k_{i-1,j} + x_{i,j} k_{i,j} + x_{i+1,j} k_{i+1,j}
          + x_{i-1,j+1} k_{i-1,j+1} + x_{i,j+1} k_{i,j+1} + x_{i+1,j+1} k_{i+1,j+1}

This algorithm is non-trivial for large images as it must perform this calculation three

times for each pixel in an image, once each for the red, green, and blue values. The

number of computations that must be performed along with the ability to represent the

data as a two-dimensional array indicated that edge detection would greatly benefit from a

parallel implementation on the GPU with CUDA.


Kernel:              Pixel neighborhood:
-1  -1  -1           x_{i-1,j-1}   x_{i,j-1}   x_{i+1,j-1}
-1   8  -1           x_{i-1,j}     x_{i,j}     x_{i+1,j}
-1  -1  -1           x_{i-1,j+1}   x_{i,j+1}   x_{i+1,j+1}

Figure 2.11: Laplacian edge detection.
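To make the formula concrete, a sequential C sketch of applying the Laplacian kernel to a single interior pixel of a gray-scale image might look like the following. It illustrates the formula above and is not the sequential implementation used in our evaluation.

/* The 3 x 3 Laplacian kernel from Figure 2.11. */
static const int kernel[3][3] = {
    { -1, -1, -1 },
    { -1,  8, -1 },
    { -1, -1, -1 }
};

/* Apply the kernel to the interior pixel (i, j) of a width-pixel-wide
   gray-scale image and clamp the result to the range [0, 255]. */
int laplacianAt(const unsigned char* image, int width, int i, int j)
{
    int sum = 0;
    for (int u = -1; u <= 1; u++)
        for (int v = -1; v <= 1; v++)
            sum += image[(j + v) * width + (i + u)] * kernel[u + 1][v + 1];

    if (sum < 0)   sum = 0;
    if (sum > 255) sum = 255;
    return sum;
}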


Chapter 3

GPU Edge Detection Algorithms

The organization of multidimensional thread blocks into multidimensional grids makes CUDA development well suited to the processing of arrays of data. As a result, data that can easily be

represented in these forms are usually best suited for migration to CUDA for processing.

Such data types include images that can be represented as a two-dimensional matrix where

each entry corresponds to a single pixel in the image. An image pixel consists of a discrete

red, green, and blue component in the range [0 . . . 255].

To develop a CUDA parallel algorithm for Laplacian edge detection, we took two ap-

proaches. The first was straightforward in its data distribution scheme and organization

of thread blocks, while the second took a new approach in an attempt to increase efficiency

within thread blocks.

3.1 One Pixel Per Thread Algorithm

Our first implementation of the Laplacian edge detection algorithm using CUDA is fairly

straightforward. We create a two-dimensional grid that is overlaid on the image, segmenting

it into several rectangular sections, as illustrated in Figure 3.1. For simplicity, we assume

the image can be evenly divided into full-sized segments. Processing images whose dimensions do not divide evenly would not be an overly complicated addition to the application.


Each thread within the thread block corresponds to a single pixel within the image.

However, each thread is not necessarily only responsible for loading one pixel entry into

the shared memory. The nature of the Laplacian pixel group processing method for edge

detection requires that a 3x3 area surrounding the target pixel be analyzed to calculate

the output. Therefore, threads on the edge of a thread block must examine pixels that

are outside the dimensions of the thread block. In order to ensure accuracy of the output image, these threads are responsible for loading into shared memory the adjacent pixels that do not have a mapping in the thread block. That is, the threads on the edge of

the block load the boundary pixels into shared memory. This extra step is performed after

the initial shared memory load that all threads perform. To compensate for the required

extra space, the two-dimensional shared memory array is allocated to have dimensions of

(blockDim.x + 2, blockDim.y + 2). This allocates two additional rows and two additional

columns of shared memory.


Figure 3.1: Thread blocks for single pixel per thread method.

Once the block has loaded its respective section of the target image into shared memory,

the __syncthreads() function is called so that the threads can regroup before proceeding. With the integrity of the data verified, the kernel then proceeds with the convolution of the image.

The source code follows:

__global__ void edged(uchar3* pixelsIn, uchar3* pixelsOut, int1* devKernel, int width)
{
    /* Shared tile holding this block's pixels plus a one-pixel border on every
       side. BLOCK_DIM_X and BLOCK_DIM_Y are the block dimensions, given as
       compile-time constants. */
    __shared__ uchar3 sData[BLOCK_DIM_X + 2][BLOCK_DIM_Y + 2];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int ndx = compute2DOffset(width, threadIdx, blockIdx, blockDim);

    /* Load this block's section of the image, including boundary pixels. */
    loadImageBlock(sData, pixelsIn, ndx);
    __syncthreads();

    /*****solve*****/
    int3 value = make_int3(0, 0, 0);
    for (int u = -1; u < 2; u++)
    {
        for (int v = -1; v < 2; v++)
            convolve(value, devKernel[(u + 1) * 3 + (v + 1)].x * sData[tx + 1 + u][ty + 1 + v]);
    }

    /*****clamp RGB values*****/
    clampRGB(value);
    pixelsOut[ndx] = make_uchar3(value.x, value.y, value.z);
}

For each thread in the block we iterate through the convolution kernel and the 3x3

pixel group in which the target pixel is the center element. After applying the convolution formula, the pixel's red, green, and blue values may be outside the range of [0 . . . 255]. We fix this by clamping the RGB values so that if a value is greater than 255, it is set to 255, and if a value is less than 0, it is set to 0. Without this clamping step, the value mod 256 would become the new color value, which would produce an incorrect pixel. As pixels

are calculated, they are stored in the out-pixel array that belongs to the designated output

image. Once the CUDA kernel has finished executing, the allocated memory within the

GPU is freed and the program exits.
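The clampRGB helper referenced in the kernel is not reproduced here; a minimal sketch of such a device function, written only to illustrate the clamping step, might look like the following.

/* Clamp each color component of value to the range [0, 255]; this sketch
   assumes the helper modifies the int3 in place. */
__device__ void clampRGB(int3 &value)
{
    value.x = min(max(value.x, 0), 255);
    value.y = min(max(value.y, 0), 255);
    value.z = min(max(value.z, 0), 255);
}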

3.2 Multiple Pixels Per Thread Algorithm

Our second implementation of the edge detection algorithm takes a different approach to

the data distribution aspect of the problem. Instead of creating two-dimensional thread

blocks that map directly to the target image, we create a series of one-dimensional thread

blocks that are each responsible for a 2-D shared memory space, as illustrated in Figure 3.2.


Figure 3.2: Thread blocks for multiple pixels per thread method.

The basic flow of the second GPU implementation works the same as the initial one.

After allocating GPU memory and copying the source image to the GPU, the threads

are responsible for copying the global memory into shared memory. Because each block is responsible for more pixels than it has threads, iteration through the image is necessary. A number of for loops are utilized within the CUDA kernel to cycle through the portions of the image each block is responsible for. In order to maximize efficiency, there is one for loop

per stage in the algorithm. This allows each stage to complete before moving on rather

than alternating between loading data into shared memory and solving.

Loading data into the 2-D shared array works similarly to the first implementation. An

initial load is done of the pixel that maps directly to a thread in the block. Then the kernel

iterates down the image segment within a for loop. In each iteration we increment the

index by width, where width is the width of the image. The first and last threads in the

block are responsible for loading the left and right boundaries of the shared memory block,

and all threads load the pixels they correspond to in the upper and lower boundaries.
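A hedged sketch of this downward iteration, with a hypothetical per-row load written in place of the loadImageBlock helper used in our code, is shown below; the boundary loads described above are omitted for brevity.

#define SHARED_SIZE_Y 16                          /* hypothetical rows of pixels per block */

/* Sketch of the per-thread column load: starting from this thread's first
   pixel (startNdx), walk down the image segment one row per iteration by
   adding the image width to the global index. The boundary loads for the
   left, right, top, and bottom edges are not shown. */
__device__ void loadColumn(uchar3 sData[][SHARED_SIZE_Y + 2],
                           const uchar3* pixelsIn, int startNdx, int width)
{
    int ndx = startNdx;
    for (int i = 1; i < SHARED_SIZE_Y - 1; i++)
    {
        sData[threadIdx.x + 1][i] = pixelsIn[ndx];
        ndx += width;                             /* move down one row in the image */
    }
}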

Following the successful data transfer to shared memory space, the actual computation

algorithm is nearly identical to the first implementation. The main difference is that, because fewer CUDA threads are executing, the calculations must be done for one row of the shared memory space at a time. After each pixel is determined through the convolu-

tion process, it is stored temporarily in the shared 2-D array until the entire convolution

algorithm is completed. We can store the end-pixel in the shared memory space without

corrupting the data needed for the next computation since the algorithm iterates through

the shared array one row at a time. The convolution process only requires knowledge of

a pixel's immediate neighbors in a 3x3 region; therefore, once one row is completed, the

kernel no longer needs the row above in any future calculations. This allows the kernel

to store the newly determined pixels in the previous row of shared memory with impunity

so long as the __syncthreads() function is invoked to ensure thread synchronization. Once

all threads have finished convolution for the entire array of shared memory, the end-pixels

stored in shared space are then copied to the output pixel array.

__global__ void edged(uchar3* pixelsIn, uchar3* pixelsOut, int1* devKernel, int width)
{
    /* One-dimensional thread block; the shared tile holds SHARED_SIZE_Y rows
       of the image segment plus a one-pixel border. BLOCK_DIM_X is the block
       width as a compile-time constant. */
    __shared__ uchar3 sData[BLOCK_DIM_X + 2][SHARED_SIZE_Y + 2];

    int tx = threadIdx.x;
    int3 value = make_int3(0, 0, 0);
    int ndx = compute2DOffset(width, threadIdx, blockIdx, blockDim);

    /*****load shared data*****/
    for (int i = 1; i < SHARED_SIZE_Y - 1; i++)
        loadImageBlock(sData, pixelsIn, ndx, i);

    /*****sync from shared memory load*****/
    __syncthreads();

    for (int i = 1; i < SHARED_SIZE_Y - 1; i++)
    {
        value.x = value.y = value.z = 0;
        for (int u = -1; u < 2; u++)
        {
            for (int v = -1; v < 2; v++)
                convolve(value, devKernel[(u + 1) * 3 + (v + 1)].x * sData[tx + 1 + u][i + v]);
        }

        /*****clamp RGB values*****/
        clampRGB(value);

        /* make sure all threads are done with this round of data */
        __syncthreads();

        /* store the calculated pixel in the row of shared memory that will
           no longer be used */
        sData[tx + 1][i - 1] = make_uchar3(value.x, value.y, value.z);
    }

    for (int i = 1; i < SHARED_SIZE_Y - 1; i++)
        loadOutImage(sData, pixelsOut, ndx, i);
}
Chapter 4

Evaluation and Results

To evaluate the parallel GPU algorithms, we performed several tests on images and com-

pared the times to the sequential implementation. The before and after sputum smear

images can be seen in Figures 4.1 and 4.2. The results of these tests showed an impressive

speedup of our parallel algorithms over the sequential one. In addition, we further investigated

the ideal number of threads per block to maximize the efficiency of our algorithms.

4.1 Method

We evaluated our algorithms on three different GPUs whose specifications are in Table 4.1.

All machines run Linux Fedora 9 and use CUDA 2.0. For each implementation, a C++ driver runs a main method that processes the arguments from the user, opens/creates

the image objects to be manipulated, and initializes the convolution kernel. This driver then

invokes a function in the CUDA source file, which in turn launches the CUDA kernel (the GPU device function) that executes the respective algorithm described above.

Our edge detection algorithms were evaluated on a collection of example images of

sputum smears provided by the University of Cape Town research group. Each image had

dimensions of 1280x968 pixels, and the block dimensions were adjusted accordingly so that

they would fit the image evenly. To make the detected edges in the output image as stark


as possible, it was usually best to first convert the image to gray scale. Since edge detection usually works by detecting a sudden change in pixel brightness levels, we observed that a gray-scale image works best for an edge detection algorithm. Converting the image to

gray-scale helped bring out the contrast at object edges, making them more distinguishable

than in images with color.

Model                           GeForce 8400GS   GeForce 8500GT   GeForce 9800GTX
Total graphics memory           512 MB           512 MB           512 MB
Number of multiprocessors       1                2                8
Number of cores                 8                16               128
Clock rate                      1.4 GHz          0.92 GHz         1.89 GHz
Concurrent copy and execution   No               Yes              Yes

Table 4.1: GPU specifications of the NVIDIA graphics cards.

4.2 Results

Figure 4.1: Original sputum smear image.

Results differed, as expected, depending on the hardware being utilized. On average, the speedup of the GPU algorithm versus the sequential algorithm was on the scale of an order of magnitude on the fastest GPU running the single pixel per thread algorithm, and two to three orders of magnitude with the multiple pixels per thread algorithm, as illustrated in Figures 4.3 and 4.4. Running on the slowest machine, the 8400GS, times averaged only about 20 ms faster than the sequential algorithm under the single pixel per thread algorithm, while the multiple pixels per thread algorithm achieved a 2.5x speedup. As can be seen in Figure 4.5, the multiple pixels per thread algorithm was

consistently and considerably faster than the single pixel per thread algorithm.

After showing that the results of these implementations were far superior to the sequen-

tial one, we wanted to investigate how much of an impact different-sized thread blocks have on execution. We ran each algorithm a number of times with thread block sizes starting at 32 and doubling each time, stopping at 256, as illustrated in Figures 4.6 and 4.7.

Figure 4.2: Sputum smear after being run through edge detection algorithm.


Figure 4.3: Sequential time vs one pixel per thread GPU algorithm; 32 threads.


Figure 4.4: Sequential time vs multiple pixels per thread GPU algorithm; 32 threads.


Figure 4.5: Comparison of GPU algorithms on different machines; 64 threads.



Figure 4.6: Differences in computation time with different sized thread blocks; one pixel
per thread algorithm.


Figure 4.7: Differences in computation time with different sized thread blocks; multiple
pixels per thread algorithm.
Chapter 5

Conclusions

The GPU is an impressively powerful piece of hardware that has become well suited for par-

allel applications. We have shown that an excellent candidate for one of these applications is edge detection, especially where speed is of the essence. There are also a number

of possible projects that can be taken up as future work with this thesis as a foundation.

5.1 Conclusion

GPGPU programming is a powerful tool that, when applied correctly, can give impressive

results. In this paper we have described two possible load-balancing algorithms for performing edge detection and compared them to the sequential counterpart. Our findings indicate that our second implementation, the multiple pixels per thread method, was significantly more efficient than the single pixel per thread method.

5.2 Future Work

Edge detection is only one part of the auto-diagnosis project that the South African re-

search group is undertaking. Further investigation of aspects of their research that could

be implemented in GPGPU programming with CUDA has the potential to make automatic

diagnosis an even more attractive project. Parts of their research that could potentially benefit include the auto-focus algorithm for the microscope and the actual diagnosis

of the image after having gone through the edge detection process.

GPUs are intended for working with data streams. Future work could involve investi-

gating the possibility of streaming multiple images into the GPU for edge detection. Cre-

ating data streams would enable copying of data to the device as the prior image is being

processed[2]. Working with multiple images over the course of a single execution would

likely make working with CUDA even more efficient than it already is.
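For illustration, overlapping transfers with computation using CUDA streams might look like the sketch below. The buffer names, dimensions, and batching scheme are assumptions rather than part of the implementation evaluated in this thesis, and the host buffers are assumed to be page-locked; such overlap also requires a card that supports concurrent copy and execution (see Table 4.1).

#include <cuda_runtime.h>

/* Sketch of streaming a batch of images through the edge detection kernel,
   alternating between two CUDA streams so that the transfer of one image can
   overlap the processing of the previous one. The edged kernel, the device
   buffers, and the grid/block dimensions are assumed to be set up elsewhere,
   and the host buffers are assumed to be allocated with cudaMallocHost. */
void edgeDetectBatch(uchar3* hostImages[], uchar3* hostResults[], int numImages,
                     uchar3* devIn[2], uchar3* devOut[2], int1* devKernel,
                     size_t imageBytes, dim3 grid, dim3 block, int width)
{
    cudaStream_t streams[2];
    for (int s = 0; s < 2; s++)
        cudaStreamCreate(&streams[s]);

    for (int i = 0; i < numImages; i++) {
        int s = i % 2;
        cudaMemcpyAsync(devIn[s], hostImages[i], imageBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        edged<<<grid, block, 0, streams[s]>>>(devIn[s], devOut[s], devKernel, width);
        cudaMemcpyAsync(hostResults[i], devOut[s], imageBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaThreadSynchronize();                      /* wait for all streams to finish */

    for (int s = 0; s < 2; s++)
        cudaStreamDestroy(streams[s]);
}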

GPU programming has the potential to be just as effective at performing fast com-

putations as traditional parallel computing. Further investigation into this process could

produce a very fast, relatively inexpensive method for efficient medical computations and

image processing. With this technology, it is feasible that medical care could be

improved in areas without access to expensive hospitals or medical experts.


Bibliography

[1] Tuberculosis Fact Sheets. http://who.int/mdiacentre/factsheets/fs104/en/, 2007.

[2] CUDA Programming Guide. NVIDIA Corporation, Santa Clara, CA, 2009.

[3] Gregory A. Baxes. Digital Image Processing: Principles and Applications. John
Wiley and Sons, Inc, New York, NY, 1994.

[4] Tania S. Douglas, Rethabile Khutlang, Sriram Krishnan, Andrew Whitelaw, and Genevieve Learmonth. Image segmentation for automatic detection of tuberculosis in sputum smears. 2008.

[5] Randima Fernando and Mark J. Kilgard. The Cg Tutorial: The Definitive Guide
to Programmable Real-Time Graphics. Addison-Wesley, New York, NY, 2003.

[6] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable
Parallel Programming with the Message-Passing Interface; second edition. The MIT
Press, Cambridge, MA, 1999.

[7] Tom R. Halfhill. Parallel processing with CUDA. Microprocessor Report, 2008.

[8] Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne. Operating Sys-
tem Concepts; sixth edition. John Wiley and Sons, Inc, New York, NY, 2002.

[9] Barry Wilkinson and Michael Allen. Parallel Programming: Techniques and
Applications Using Networked Workstations and Parallel Computers. Prentice Hall,
Upper Saddle River, NJ, 2005.

