
CSULB

Paper on GPU Computing


CECS 570
Manmath Kirsan and Karan Bhandari 12/13/2011

Keywords: GPU computing, high performance computing, massively parallel processing

GPU Computing

Introduction

To mitigate the endless burdens associated with real-time processing, large-scale parallelism and massive computational requirements, the modern computing world has integrated the graphics processing unit (GPU) into its computational arsenal. Gone are the days when GPUs were used merely for graphics, vertex and fragment computation: today they form a vital component of supercomputers, help save lives, process gargantuan volumes of data and perform computation at amazing rates. In this paper we examine various aspects of the GPU. We discuss its architecture, frameworks, variations, performance and issues, and then present applications and examples where GPUs are deployed.

Architecture

GPU computing resources are ample, and the GPU's evolution from a special-purpose graphics processor to a fully operational, programmable parallel processor is remarkable. The GPU was initially designed as a graphics pipeline that manipulates geometric primitives, performing vertex operations such as transformation and shading, fragment operations, composition, light interaction and rasterization, and it was built for throughput rather than latency. Over the years, after a long period of separate instruction sets for vertex and fragment operations, a paradigm shift toward the unified shader model took place; one can now visualize the GPU as a programmable engine surrounded by fixed-function supporting units. [17] The pipeline of a modern GPU exposes both task parallelism (the output of each stage is fed as input to the next stage) and data parallelism (many entities are computed within each stage). Customizing hardware for specific tasks yields greater compute and area efficiency, and in the modern GPU the special-purpose fixed-function components have been replaced by programmable ones. [17] The ALUs in the GPU meet the IEEE requirements for single-precision floating-point computation, and the shift from a graphics-specific approach to general-purpose computation, together with the ability to read and write memory arbitrarily, allows the chip to be steered toward general-purpose parallel computing. [31]

These programmable parallel machines are built around an array of multiprocessors, each of which is multithreaded: a single multiprocessor can host up to 1,024 co-resident threads, and both scheduling and thread management are handled by the hardware. A high-end GPU such as the Tesla C1060, for example, supports 30,720 co-resident threads across its 30 multiprocessors (240 cores) [4]; these figures illustrate how the GPU can overpower a CPU. GPUs reduce their reliance on caches and rest their weight on multithreading instead, which implies that algorithms need a high degree of parallelism. Fig. 1 depicts the Tesla C1060 with its four subsections: the PCIe (Peripheral Component Interconnect Express) bus, the GPU itself, the streaming multiprocessors and the off-chip memory. There is no need to save and restore thread state, because each thread has a dedicated set of registers: each multiprocessor is equipped with a 64-KB register file (16,384 32-bit registers) [4]. A shared, low-latency on-chip memory with speed comparable to an L1 cache is also available, apart from the managed RAM at the disposal of all the threads.

Fig. 1. Tesla C1060
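As a rough illustration of how these per-device limits can be inspected in practice (a minimal sketch, not part of the cited works; the values printed depend entirely on the installed GPU), CUDA exposes them through cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0; error checking omitted
    printf("Device: %s\n", prop.name);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("32-bit registers per block: %d\n", prop.regsPerBlock);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}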


Frameworks

CuPP

CuPP is a C++ framework for easy integration of CUDA into existing C++ applications; it was designed primarily for integrating NVIDIA's GPGPU capability. A low-level interface is provided through smart pointers and memory allocation functions, while the high-level interface includes a C++ STL vector wrapper and type transformers; the wrapper keeps the device and the host synchronized. If the supplied vectors do not meet the needs of an application, developers can define their own data structures using type transformations, and type transformers can also provide different representations of the same data for the host and the device. CuPP is built on the Boost libraries and can be roughly divided into five components: device management, memory management, support for classes, data structures, and C++ kernel calls. [1]

EcoG

EcoG is a power-efficient GPU cluster architecture for scientific computing; the main idea behind its development was to create a GPU-based cluster that is power efficient. EcoG is a joint effort of the University of Illinois, NVIDIA and the Innovative Systems Lab at NCSA (the National Center for Supercomputing Applications). Its main goals were to use high-performance GPUs at low power, to use lower-power CPUs, and to use only as much RAM as the GPUs require. The target applications are fluid dynamics and molecular dynamics, and HPL also runs passably well. Each node consists of an Intel P55 micro-ATX motherboard with 4 GB of DDR3 memory, a Tesla C2050 GPU and a Core i3 530 2.93 GHz dual-core processor. EcoG won the greenest self-built cluster award on the Green500 list at SC10. [19]

Variations

CUDA (Compute Unified Device Architecture)

In 2006 NVIDIA laid the foundation for CUDA with the launch of the GeForce 8800 GTX, a GPU that was not only CUDA-ready but also the industry's first DirectX 10 part. CUDA harnessed the power of the unified shader pipeline, unlike earlier graphics cards whose computing resources were separated into vertex and pixel shaders. [31] CUDA is a blend of standard C and a small collection of NVIDIA keywords; one neither needs to delve into OpenGL or DirectX nor map the problem onto a graphics task. Data-parallel sections of the code are written as kernels, while the CPU acts as the host. Threads are grouped into blocks, which are in turn gathered into grids, and during execution the threads are scheduled in groups of 32 called warps. A warp executes one common instruction at a time, so full efficiency is achieved when its threads follow the same path; conditional branches should be avoided, since the threads of a warp diverge when they occur. The GPU and the host communicate with each other via the GPU's global memory. CUDA allows the programmer to read from and write to the GPU's global memory, but such traffic should be used sparingly, since it can become a performance bottleneck. [7] NVIDIA's CUDA architecture is an excellent platform for writing parallel programs: it provides easy-to-use abstractions for hierarchical thread organization, memory and synchronization without forcing the programmer to learn a plethora of new constructs, and it supports OpenCL, DirectX Compute, and the older but still relevant Fortran and C. [7]
Because of CUDA's ease of porting programs from a single CPU to a cluster, its low learning curve, and the shared-memory multithreaded model it provides, its power is harnessed by engineers and scientists in data-intensive domains such as data mining, bioinformatics, high-energy physics, weather prediction and scientific simulation. [7] As an example, to run CUDA code we would write something like the code depicted in Fig. 2. The code designated to run on the host looks just like everyday C. To execute code on the device, we invoke a kernel function, marked by adding the __global__ qualifier to standard C; the call site appends triple angle brackets containing a numeric tuple, syntax that tells the compiler the marked code should be executed on the GPU while the rest runs on the host. [31]
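A minimal sketch in the same spirit (the function and variable names are illustrative, not taken from Fig. 2; cudaMalloc and cudaFree are described just below):

#include <cstdio>
#include <cuda_runtime.h>

// Kernel: executed on the device, as indicated by the __global__ qualifier.
__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

int main(void) {
    int c, *dev_c;
    cudaMalloc((void **)&dev_c, sizeof(int));                     // allocate device memory
    add<<<1, 1>>>(2, 7, dev_c);                                   // launch 1 block of 1 thread
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);   // copy the result back
    cudaFree(dev_c);
    printf("2 + 7 = %d\n", c);
    return 0;
}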


Fig. 2. CUDA code

To allocate memory on the device one calls cudaMalloc, the analogue of standard C's malloc, and to deallocate it one calls cudaFree(). [31]

1) Atomics: There are often occasions where multiple threads access shared values and a wrong result is produced because the accesses happen out of sequence: a read-modify-write interrupted by another thread can lead to a whole family of errors. To avoid this, CUDA C provides operations declared atomic, which guarantee safe access to the memory location.

2) Streams: Apart from data-parallel execution, CUDA-enabled graphics processors can achieve task parallelism, performing two entirely different tasks at the same time; updating data over the network and rendering the GUI are disparate tasks, for example. Creation, destruction, synchronization, scheduling and parallelization of such tasks are possible via the CUDA stream APIs.

3) Tool Suite: The toolkit that accompanies CUDA makes parallel execution on the GPU far more approachable. The CUDA Toolkit compiles and runs CUDA applications; CUFFT performs fast Fourier transforms; CUBLAS provides linear algebra routines. The GPU Computing SDK from NVIDIA contains a rich repository of sample code for many domains, such as image/video processing, system integration, textures, finance, data compression and physical simulation. CUDA-GDB lets developers set breakpoints, trace output, inspect elements, view state and perform a wide range of other debugging operations, and NVIDIA's Parallel Nsight is a visual debugger for Microsoft Visual Studio. Finally, performance analysis can be done with the CUDA Visual Profiler. [31]

Coprocessing

As covered in the CECS 570 lectures, applications have both serial and parallel sections. Co-processing executes the sequential portion of the code on the CPU and transfers the computational burden of the parallel portion to the GPU: a CPU core is optimized for low latency on one thread, while the GPU is tuned for the aggregate throughput of parallel code. A typical co-processing system grants about ten percent of its area to CPU cores and ninety percent to many GPU cores. A comparison of two scenarios highlights the advantage of co-processing systems. In the first scenario, a predominantly parallel program: a system with 1 CPU core takes 200 time units, a system with 500 GPU cores takes 5.4, a system with 10 CPU cores takes 20.9, and a co-processing system with 1 CPU core and 450 GPU cores takes 1.44. In the second scenario, a predominantly sequential program: a system with 1 CPU core takes 200 time units, a system with 450 GPU cores takes 750.1, a system with 10 CPU cores takes 155, and a co-processing system with 1 CPU core and 450 GPU cores takes 150.11. [16] A co-processing system thus delivers very favorable performance and should be the preferred choice for a wide range of computational problems.
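Before moving on to OpenCL, a brief sketch of the atomic access described in the Atomics subsection above (a hypothetical histogram kernel; names and sizes are illustrative). Without atomicAdd, concurrent read-modify-write sequences on the same bin could interleave and lose updates:

// Hypothetical histogram kernel: many threads may increment the same bin,
// so the read-modify-write must be performed atomically.
__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < n) {
        atomicAdd(&bins[data[i]], 1u);   // safe concurrent update of a shared counter
        i += stride;
    }
}
// Launched, for example, as histogram<<<64, 256>>>(dev_data, n, dev_bins),
// with dev_bins holding 256 counters initialized to zero.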
OpenCL

To harness the power of heterogeneous GPUs, CPUs, embedded processors, Cell processors, DSPs and other computing devices, an open standard called OpenCL was introduced. Before OpenCL, parallel programming models such as OpenMP and MPI were used, but they were not designed for GPUs; OpenCL is portable and reduces the learning curve. OpenCL runs on CUDA-enabled platforms and on other devices as well. OpenMP's programming model, by comparison, consists of library routines, environment variables, and a set of compiler directives that affect the program at run time. OpenCL has a lower learning curve in the sense that OpenMP requires the programmer to allocate resources, dig out the parallelizable sections and annotate them with compiler directives, make sure locks and protections are in place, and have a master thread spawn child threads. Writing efficient OpenCL programs does require understanding the memory structure and supporting architecture, and more effort goes into crafting the kernel code to exploit the computing capability, but the resulting programs are assuredly portable. OpenMP targets expensive servers and multicore architectures, whereas OpenCL targets


relatively inexpensive GPUs, and the effort is rewarded: speedups ranging between 20x and 100x are observed. Ubiquitous supercomputing thus becomes possible with OpenCL. [20]

SkelCL (Portable Skeleton Library for High-Level GPU Programming)

SkelCL arose because low-level GPU programming models such as OpenCL and CUDA require explicit data transfers from the system's main memory (accessible to the CPU) to the GPU's memory and back, and the programmer is responsible for explicit memory allocation and deallocation; as a result, GPU programs contain a great deal of low-level boilerplate code. SkelCL is a library built on OpenCL, so it is hardware independent and runs on any OpenCL-capable device. Its main advantages over OpenCL and CUDA are its ease of use, its high-level approach, and the implicit data exchange between CPU and GPU, which is implemented through an abstract vector data type. SkelCL is based on data-parallel algorithmic skeletons, formally higher-order functions: a function takes simple values as well as user-defined variables as arguments and customizes a skeleton to execute a specified calculation in parallel. Basic skeletons such as Zip and Reduce are provided: the Zip skeleton combines two vectors element-wise, and the Reduce skeleton combines the elements of a vector using a binary operation. The Vector class provided by SkelCL offers a unified abstraction of contiguous memory accessible by both the CPU and the GPU; a vector consists of pointers to memory reachable by the CPU and by the GPU, and when a vector is created on the host, memory is allocated on the GPU accordingly. This provides implicit data transfer when the CPU accesses memory previously modified by the GPU and vice versa. The OpenCL kernel is generated by merging the function's source code with the skeleton's code, and previously compiled kernels are stored on disk, which avoids the overhead of compiling the source every time. The challenges that CUDA and OpenCL face when programming multiple GPU devices are also addressed by SkelCL: a vector can be distributed evenly across all available devices, or a complete copy can be kept on each device. When a vector is distributed over multiple devices, the devices compute concurrently, each executing its part of the input vector in parallel; the user can customize the distribution of a vector among devices, while the data exchange is handled automatically by SkelCL. [21]

CUDACL

CUDACL is a tool for CUDA and OpenCL developers that lets them declare one or more parallel code blocks in a program and run them on a GPU. It gives developers an edge by providing CUDA and OpenCL kernel calls inside an existing program, and it is based on a detailed study of CUDA and OpenCL. [9]

Fig. 3. Block Diagram of CUDACL


The diagram above shows the block structure of a typical CUDACL workflow: ovals represent inputs and outputs, boxes show internal details, and the double circle shows the configuration file. CUDACL provides an abstract API for CUDA and OpenCL with automatic management of the mapping between CPU and GPU variables; no explicit mapping from GPU to CPU is performed by the programmer. Formerly, programmers had to create these mappings themselves and specify the variable sizes, which is now handled by CUDACL. The programmer provides several configuration settings before executing a code block in parallel from a sequential program, and the efficiency of the program depends on these customized configurations with their specified thread and block sizes; three-dimensional thread indices and two-dimensional blocks are present in the existing architecture. The API can retrieve the best work-group value for a given local-memory usage, register usage and device, and an Excel(TM) sheet is used on the CUDA side to calculate the best block value. Users can choose between customized configurations or let the OpenCL API automatically find the best work-item and work-group sizes for the given problem. [9]

Performance

GPUs are a powerhouse: their highly parallel nature and architectural strength at remarkably low cost make them an attractive option for high performance computing. Many GPUs run thousands of parallel computation threads on hundreds of thread processors. More transistors are devoted to data processing than to data caching and flow control, which makes the GPU the favorite choice for arithmetic-intensive, data-parallel computations, and its main memory bandwidth is far higher than that of conventional processors. GPU performance has been growing by roughly a factor of two per year, well ahead of the curve suggested by Moore's law. [32]

CPU versus GPU

Fig. 4. Growth of GPU performance compared with CPU performance

GPUs have a system-memory bandwidth of about 50 GB/s, whereas the CPU's main-memory bandwidth is about 8.5 GB/s; the current peak performance of a CPU is around 50 GFlops per die, whereas a GPU surpasses 500 GFlops. [32] The authors of [32] evaluated the speedup and cost of a parallel matrix-multiplication algorithm, comparing an AMD FireStream 9250 graphics card (625 MHz, 1 GB GDDR3 at 993 MHz) with an AMD Phenom 9950 quad-core CPU. The measured times (in seconds) and the resulting speedups are shown in Table 1.


Table 1: Speed-up of GPU over CPU for matrix multiplication (times in seconds)

N          2           16          256         4096
GPU time   0.009834    0.009781    0.012142    6.487091
CPU time   0.000002    0.000023    0.118239    3678.568022
Speed-up   0.000227    0.000780    9.737783    567.059751

As the size of the matrix increases, the GPU pulls far ahead of the CPU. The GPU is not beneficial for small problems, but for massive computations it is the best bet for the money. The GPU code ran a modified version of the generic n^3 matrix multiplication with explicitly defined kernels; the kernel function is invoked on every element of the stream. [32]

CUDA versus OpenCL

The authors of [10] used 16 benchmarks to make a comprehensive comparison of two extremely powerful programming frameworks, CUDA and OpenCL. OpenCL is the more portable of the two, since CUDA is NVIDIA-centric; it is also the more recently conceived, so criticizing it for minor performance lags would be unfair. Both real and simulated performance runs were made, and the real ones matter because platforms such as ATI GPUs and the Cell Broadband Engine run OpenCL but not CUDA. [10] CUDA and OpenCL share concepts such as global memory, constant memory and shared memory, and the OpenCL equivalent of a CUDA thread is the work-item. To compare the two frameworks, popular benchmarks such as radix sort, matrix multiplication, prefix sums, FFT, breadth-first search and finite difference were used. For convenience a unifying metric is used, termed the Performance Ratio (PR): the ratio of OpenCL performance to CUDA performance.

Accordingly, when PR > 1 the performance of OpenCL is better than that of CUDA. The GPU with the higher memory bandwidth will obviously emerge as the winner, and a further complication is that both CUDA and OpenCL have their own sets of compiler directives that can further optimize performance; sometimes, when a framework runs out of memory, it simply exits with an out-of-memory message. When comparing peak performance, the theoretical peak bandwidth (TPBW) is taken into consideration:

TPBW = (memory clock) * (memory interface width / 8) * 2 * 10^-9 GB/s

(for example, a 1,000 MHz memory clock on a 512-bit interface gives 10^9 * 64 * 2 * 10^-9 = 128 GB/s). By this measure OpenCL leads CUDA by 8.5% when the test is run on the GTX280 and by 2.4% on the GTX480. When the floating-point capabilities are pitted against each other, the theoretical peak floating-point operations per second of the two frameworks are nearly equivalent, with OpenCL holding a slight edge even on CUDA platforms. When the programming models are pushed harder, by using directives or by enabling and disabling hardware features, performance differences do appear, because OpenCL avoids such features in order to preserve its portability. Many optimizations exist at the level of native kernels: enabling or disabling texture memory, coalescing global memory accesses, using shift operations to avoid expensive divisions and modulus operations, and optimizations that improve branch prediction. CUDA's loop unrolling is more advanced than OpenCL's, and in certain applications such as FFT CUDA outperforms OpenCL because CUDA's front-end compiler is heavily optimized; OpenCL's performance drops in some applications because of portability. On the whole, however, CUDA's and OpenCL's performance are comparable. When vendor-independent directives are used, OpenCL is faster to program, and tools such as an auto-tuner can be used to fine-tune OpenCL's performance; there is no reason for OpenCL to underperform provided the conditions are the same. [10]

GPU versus FPGA

Let us now turn from this comparison to the two heavyweights: both the GPU and the FPGA show delightful performance statistics compared to the CPU. For this comparison the HC-1 FPGA-based high-productivity computing system and NVIDIA's GeForce GTX285 (GT200b) GPU were used. The HC-1 is a 2U server with four Virtex-5 FPGAs as application engines, 128 GB of DDR2 RAM and a memory bandwidth of 80 GB/s, whereas the GPU is a 240-core part running at 1.4 GHz with 4 GB of external DDR3 RAM and a memory bandwidth of 159 GB/s. The development model for the FPGA is C/C++, while the GPU code is written in CUDA. The benchmarks both systems had to undergo were batch generation of random numbers, matrix multiplication, a second-order N-body simulation, and summation of large vectors of random numbers.


The figures below compare the performance of the GPU and the FPGA for random number generation, summation of a vector of random numbers, and matrix multiplication.

Fig. 5. Performance comparison of GPU (red) versus FPGA (blue): graph 1 is random number generation, graph 2 is matrix multiplication and graph 3 is summation of 64-bit vectors

For most of the benchmarks the GPU surpassed the FPGA, the exception being the generation of random numbers. FPGAs remain well suited to specialized applications, but given the predominance of GPUs, Cray has opted to integrate NVIDIA GPUs rather than FPGAs into its supercomputers.

Issues

Thread Cooperation

Most algorithms require concurrent processes to exchange values for synchronization. The traveling salesman problem, for instance, requires threads to share the costs of intermediate nodes so that pruning can be performed, and summing a large array by divide and conquer requires partial sums to be combined; problems whose subproblems compute and terminate without any message passing are rare. This section discusses the CUDA way of achieving thread cooperation. In CUDA, threads are grouped into blocks and multiple threads execute within each block; parallel threads within a block have a greater range of abilities than parallel blocks. A kernel invocation resembles the snippet below:

nameOfTheFunction<<<NumBlocks, NumThreads>>>(device_a, device_b, device_c);

If we need N threads within one block, the angle brackets contain <<<1, N>>>; if we need to launch N blocks of one thread each, they contain <<<N, 1>>>. A thread's index within its block is read from threadIdx.x, and the index of its block from blockIdx.x. A programmer may use at most 65,535 blocks per grid dimension and 512 threads per block; to process more elements than either limit allows, threads and blocks must be combined. The programmer need not match these counts to the hardware: the actual GPU has far fewer processing units, and the hardware deals with scheduling and waiting, since CUDA decouples parallelization, execution and scheduling from the programmer's burden. Nor is the exact number of processing units a concern, because NVIDIA GPUs ship with anywhere between 8 and 480 arithmetic units. To obtain a linear index from the two-dimensional space of blocks and threads, one writes:

int tid = threadIdx.x + blockIdx.x * blockDim.x;

A variable can be made shared with the keyword __shared__; such a variable is replicated once per block. Synchronization, achieved by calling __syncthreads(), acts as a barrier for the threads of a block and helps to evade race conditions on shared memory. The book [31] presents dot-product and shared-bitmap examples to further illustrate synchronization. Beyond CPU-GPU collaboration, cooperation within and among GPU blocks must therefore also be addressed, and the resulting code still bears a close resemblance to standard C.

Optimization

A large constellation of optimizations is available for the various GPU architectures. Despite the disparate hardware, the programming community has invested a reasonable amount of effort in making GPU programming easier and in exposing the capability of the underlying hardware. [8] It is recommended to divide the algorithm into threads that use shared memory, since CUDA allows rapid communication through shared memory and redundant reads are avoided; once the threads are identified, they must be grouped into blocks. [8]
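As a hedged sketch of the shared-memory cooperation just described (modeled loosely on the dot-product example in [31]; the names and sizes are illustrative), each block accumulates partial sums in shared memory, synchronizes, and reduces them:

#define N        (32 * 1024)
#define THREADS  256
#define BLOCKS   32

__global__ void dot(const float *a, const float *b, float *partial) {
    __shared__ float cache[THREADS];          // one slot per thread in the block
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float temp = 0.0f;
    while (tid < N) {                         // grid-stride loop over the vectors
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    cache[threadIdx.x] = temp;
    __syncthreads();                          // wait until every thread has written its sum

    // Tree reduction within the block; THREADS must be a power of two.
    for (int i = blockDim.x / 2; i > 0; i /= 2) {
        if (threadIdx.x < i)
            cache[threadIdx.x] += cache[threadIdx.x + i];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];       // one partial result per block
}
// The host launches dot<<<BLOCKS, THREADS>>>(dev_a, dev_b, dev_partial)
// and sums the BLOCKS partial results on the CPU.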


Certain programming choices can then make GPU computing even more effective. One of them is global memory coalescing, which combines accesses to successive memory locations into a single load instruction and thereby increases throughput. [8] Other optimizations include reducing global memory bandwidth demand and raising thread occupancy so that threads are idle for the least possible time; slow off-chip memory should be accessed as little as possible. Above and beyond these optimizations, one must still follow the usual principles and avoid the usual pitfalls: no dangling pointers or memory leaks, fewer clock cycles per instruction, and fewer memory bank conflicts. [8]

Resource Sharing

Computing resources are unbalanced when a few microprocessors are paired with numerous GPU cores, which can leave many components underutilized. To combat this asymmetry, the resources ought to be virtualized so that the underutilized microprocessors can share the GPUs. High performance systems that combine CPUs and GPUs are certainly more efficient and impressive, but they bring their own set of challenges. Currently, SPMD (Single Program Multiple Data) is used for programming parallel, heterogeneous systems, and offloading sections of code to the GPU is daunting because of the differing instruction set architectures (ISAs). Maintaining a one-to-one mapping between GPUs and CPUs is essential, because every program section must have access to a GPU accelerator, GPUs being crafted for computation-intensive applications. [22] It has therefore been proposed in [22] that a virtual view of the GPU be exposed to each CPU, ensuring that each CPU has access to its own virtual GPU; co-scheduling between them can introduce overheads. The infrastructure for setting up the virtual GPUs alongside the CPUs is shown in Fig. 6.

Fig. 6. Virtual GPU infrastructure

The runtime virtualization layer manages the memory resources and GPU computing by setting up virtual memories and shared request/response channels (implemented as POSIX message queues with handshaking to synchronize the responses), by setting upper limits and barriers, and by arranging concurrent I/O and kernel execution. The API process layer sits above the runtime layer; this abstract API layer interacts with the user processes so that they enjoy transparent access to the runtime layer, and programs can specify which GPU function they wish to perform. A good deal of overhead arises from message synchronization and data transfer between the API layer and the runtime layer, but it is offset by the performance gains: speedups of two or even three times were observed in the extensive benchmark-based performance analysis of [22], using kernels such as CG [33], MM (2048 x 2048 single-precision floating point) and BlackScholes [34].


A sound analytical execution model can thus strike a balance between the asymmetric numbers of scarce CPUs and plentiful GPUs, and the virtualized SPMD execution scenario shows significant performance improvement when applied to high performance systems.

Virtualization

To let multiple virtual machines operate within a single physical environment, a virtual machine monitor is needed to oversee device management, intrusion detection, device reuse and the memory management module. This is difficult for GPUs because the interfaces are proprietary and the designers keep their cards close to their chest, which makes it hard to abstract the hardware layer for a virtualization scheme. [14] To mitigate this, [14] proposed vCUDA. It works by redirecting the CUDA API in virtual machines to a privileged domain with a real GPU enabled, and a virtual-machine remote procedure call tool helps to speed up data transfer; the result is a high-performing virtual machine monitoring agent, and virtualization at the system level manages to decouple the hardware from the software. [14] GPU virtualization can be achieved by either front-end or back-end virtualization: front-end virtualization involves API remoting and device emulation, whereas back-end virtualization is virtual machine monitor pass-through; other methods include device emulation, protocol redirection and self-virtualizing hardware. Current APIs such as OpenCL, CUDA, Direct3D and DirectCompute do not support front-end virtualization in the form of API redirection, and device emulation is difficult because of insufficient documentation for the devices. [14] vCUDA has three user-space modules. One is the virtual GPU, a database residing in the guest operating system; another is the stub, which behaves as the server when vCUDA is viewed as a client-server model (the vCUDA library on the guest side acting as the client). The server memory is considered part of the host OS's address space, and the device memory part of the memory space of the host OS's graphics device. [14] The vCUDA library lies in the guest OS in place of the actual CUDA library; it intercepts API calls from the application and redirects them to the vCUDA module. vCUDA offers two execution modes: a traditional RPC mode such as XMLRPC, and a SHARE mode built on the foundation of VMRPC. [14] Despite the secrecy surrounding hardware specifications, vCUDA, together with some functional extensions and a communication platform such as RPC, lets us get a virtual setup up and running on the GPU and thereby achieve virtualization on the GPU platform.

Applications

GRID

GridCUDA is a grid-enabled toolkit, built on NVIDIA's CUDA, for exploiting GPGPU resources in parallel execution. It lets programmers write programs against the CUDA API and use the GPGPU resources available in a grid, with support for multithreaded programs; instead of linking the CUDA libraries, programmers link the GridCUDA libraries. A resource broker is responsible for allocating resources according to the needs of the program: GridCUDA interacts with the broker, which looks for available resources and allocates them transparently. Execution uses remote procedure calls, the GridCUDA client redirecting invoked CUDA functions to the allocated GPU resources, and the resource providers running the GridCUDA server perform the execution on their respective hosts.
The GridCUDA server registers the information of its GPU resources automatically and waits for remote procedure call requests from remote user programs. GridCUDA follows a client-server model: the client redirects the CUDA function calls issued by user programs, via remote procedure calls, to remote GPU devices, and the GridCUDA server handles the RPC requests coming from the client. The CUDA functions called by the user program are carried out through invocation of the CUDA driver. Operation objects such as CUmodule, CUcontext and CUfunction are created by GridCUDA for each client to bind GPUs and are reused across remote CUDA procedure calls during the execution of CUDA applications; when the execution of an application finishes, the operation objects are destroyed and the bound GPU devices are released for use by other applications. [24]

CUDA and MPI

The Message Passing Interface (MPI) is a standardized message-passing system used in parallel programming and has been a favorite choice for high performance computing for more than a decade. MPI can be combined with CUDA to deliver high performance in parallel computations; the two differ in programming style, but both depend on the inherent parallelism of the application. The CUDA-with-MPI approach was applied to Strassen's algorithm and the Conjugate Gradient algorithm, with MPI providing the distribution mechanism and CUDA acting as the main execution engine. The results showed increased performance for applications with inherently parallel characteristics: CUDA + MPI delivered a higher performance level than an MPI-only cluster, so low-cost, high-performance clusters can be built using MPI and CUDA. [12]
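A hedged sketch of this division of labor, with illustrative names, a simple vector reduction in place of Strassen or CG, and the assumption of one GPU per MPI rank (compiled with nvcc and linked against an MPI implementation):

#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// MPI distributes the work across processes; CUDA is the execution engine on each node.
__global__ void square(float *x, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) x[i] *= x[i];
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, devCount;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&devCount);
    cudaSetDevice(rank % devCount);              // assign one GPU per rank, round-robin

    const int n = 1 << 20;                       // elements per rank (illustrative)
    std::vector<float> h(n, 1.0f + rank);        // each rank holds its own chunk
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    square<<<(n + 255) / 256, 256>>>(d, n);      // CUDA does the heavy lifting locally
    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    double local = 0.0, global = 0.0;
    for (int i = 0; i < n; ++i) local += h[i];
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);  // combine chunks
    if (rank == 0) printf("global sum = %f\n", global);
    MPI_Finalize();
    return 0;
}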


Cluster

GPUs have evolved into high performance accelerators for parallel programming. A GPU consists of many processing units and can reach about 1 TFlops for single-precision arithmetic and about 80 GFlops for double-precision calculations, so GPUs have become essential for applications whose floating-point arithmetic and memory operations run in parallel. They are cost-effective high performance computing accelerators that reduce power, space and cooling demands while surpassing CPU-only clusters of similar aggregate capability. NVIDIA provides the commercially available Tesla GPUs for high performance computing, offered both as add-on boards and in rack-mount cases containing four GPU devices. [28]

Others

1) Medical imaging: Breast cancer has plagued numerous women. Advances have been made toward its cure through chemotherapy and radiation, but the side effects are harsh; some patients survive, many do not, and surgery leaves dreaded scars. The mammogram is an efficient way to detect breast cancer early, but it has limitations: multiple X-rays expose the patient's chest, specific imaging must be performed, and in extreme cases a biopsy is required. To overcome these limitations TechniScan introduced a 3D ultrasound approach, which was not widely used at first because of computational limits; the GPU changed that. With the computing power of NVIDIA's Tesla C1060, the TechniScan Svara puts that approach to work, and within 20 minutes doctors are well informed of the complications affecting the patient. [31]

2) Computational Fluid Dynamics: Extremely efficient and robust rotors and blades long remained elusive, because fluid dynamics and the modeling of air movements entail a great deal of number crunching that used to require supercomputers. Clusters combining NVIDIA GPUs with CUDA filled that void: faster turnaround and streamlined feedback enabled researchers to achieve breakthroughs in fluid dynamics. [31]

3) Environmental Science: Detergents rely on surfactants for their cleaning power. One must determine the cleaning capability and texture of a detergent, test for environmental hazards, and test its interaction with water, which normally demands extensive laboratory testing and intense research; the GPU is well suited to such a load. Temple University and Procter & Gamble have teamed up to use molecular simulation to measure how surfactants interact with dirt, water and other materials, using the Highly Optimized Object-oriented Many-particle Dynamics (HOOMD) code developed at the Department of Energy.

Examples

Fig. 7. (Left) Realization of FDTD; (Right) Periodic analysis of computing threshold and binarizing of images


FDTD

The FDTD algorithm is implemented in CUDA to achieve high efficiency. The implementation uses rectangular coordinates, treated as elementary cells, so that a generic space point P is identified by the vector (i, j, k). The value of a function F at point P is written F|^n_(i,j,k), meaning F evaluated at time n*dt (dt being the time step) at the point (i, j, k). FDTD solves Maxwell's differential equations in the time domain, the time derivative of the E field being obtained from the spatial variation of the H field and vice versa. A leapfrog scheme interleaves the temporal and spatial updates: at each mesh point (i, j, k), the component H^(n+1/2) at time t = (n + 1/2)dt is computed from its previous value H^(n-1/2) at the same point and from the E field at t = n*dt on the neighboring mesh points of (i, j, k); the E components are computed similarly. The realization of FDTD is sketched in Fig. 7 (left). [27] The CUDA implementation of FDTD comprises two kernels that compute the H and E components separately. Data transfer between host and device takes place before and after the kernel invocations: the GPU's global memory is first updated by the CPU and handed to the kernels as needed, and when all the computations are finished the resulting data is copied back to the CPU's main memory.
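A hedged sketch of this two-kernel leapfrog structure, reduced to one dimension for brevity; the field names, sizes and update coefficients are illustrative, and the implementation in [27] is three-dimensional:

#include <cuda_runtime.h>

#define NX 4096   // number of spatial cells (illustrative)

// Kernel 1: advance the H field by a half step from the spatial difference of E.
__global__ void update_h(float *hy, const float *ez, float ch) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < NX - 1) hy[i] += ch * (ez[i + 1] - ez[i]);
}

// Kernel 2: advance the E field from the spatial difference of H.
__global__ void update_e(float *ez, const float *hy, float ce) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i > 0 && i < NX) ez[i] += ce * (hy[i] - hy[i - 1]);
}

// Host-side time stepping: copy the fields in, alternate the two kernels, copy the results out.
void run_fdtd(float *h_ez, float *h_hy, int steps, float ce, float ch) {
    float *d_ez, *d_hy;
    cudaMalloc((void **)&d_ez, NX * sizeof(float));
    cudaMalloc((void **)&d_hy, NX * sizeof(float));
    cudaMemcpy(d_ez, h_ez, NX * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_hy, h_hy, NX * sizeof(float), cudaMemcpyHostToDevice);
    dim3 block(256), grid((NX + 255) / 256);
    for (int n = 0; n < steps; ++n) {
        update_h<<<grid, block>>>(d_hy, d_ez, ch);   // H at t = (n + 1/2) dt
        update_e<<<grid, block>>>(d_ez, d_hy, ce);   // E at t = (n + 1) dt
        // A real simulation would also inject a source term here.
    }
    cudaMemcpy(h_ez, d_ez, NX * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_hy, d_hy, NX * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_ez);
    cudaFree(d_hy);
}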
Image Processing

The GPU's rapidly growing speed advantage over the CPU can be exploited in image processing for algorithms such as DCT encoding and decoding, histogram equalization and edge detection. The gap is apparent when an NVIDIA G80-series GPU is compared with an Intel 64-bit dual-core CPU: the G80 delivers about 520 GFlops while the dual-core CPU delivers only about 32 GFlops. Image processing requires common procedures to be computed in parallel over many pixels and applies to both still images and video. [30] A case study on FamilySearch digital image processing shows that a GPU is more efficient for image processing than multiple CPUs. FamilySearch acquires images of genealogical records in large quantities; the sources are digital capture and microfilm, processed by CPU-based servers called DPCs (digital processing centers). The image processing in the DPC uses function calls to the IPP (Integrated Performance Primitives) and OpenCV libraries, and replacing these CPU functions with GPU equivalents increases performance and relieves developers of the optimization burden. NVIDIA has produced the NPP (NVIDIA Performance Primitives) libraries, implemented on the GPU in CUDA, as a response to Intel's IPP; solutions provided by the NPP libraries can easily be integrated into existing projects, such as FamilySearch, that use IPP. Performance testing with both libraries was carried out for image cropping, an algorithm FamilySearch uses extensively to provide a uniform border around images. Cropping has three main phases: first the threshold value of the image is computed, then the image is binarized, and finally the image bounds that delimit the document are computed. The threshold is calculated from a histogram, which is easily parallelized on the GPU. As in Fig. 7, three images of varying dimensions were selected for the performance analysis. For the small image the GPU was 14% faster than the CPU, the medium image showed a slightly larger gain, and for the large image performance increased by 16x. The results show that GPUs deliver a higher performance level in image processing than multiple CPUs. [30]

Atmospheric Modeling

Highly data-parallel applications such as atmospheric models require heavy computations that are carried out in parallel and can be implemented on the GPU. NCAR (the National Center for Atmospheric Research) uses CAM (the Community Atmosphere Model) for climate and weather research. CAM simulates the entire climate system and the earth's atmosphere, including ocean, sea ice and land; it comprises about 139,000 lines of Fortran 90 code simulating computationally expensive processes, including the emission, transmission, absorption and reflection of various light wavelengths in the atmospheric layers. The GPU implementation uses an NVIDIA GeForce 9800 GX2, whose specifications are as follows: dual GPUs, each with 16 multithreaded streaming multiprocessors and 512 MB of DDR memory; each streaming multiprocessor contains eight scalar processor cores, 16 KB of local memory and two special function units. The atmosphere is subdivided into three-dimensional cells on a latitude-longitude grid. All the programs are written in CUDA, which relies on a huge number of threads running in parallel: grids are made up of blocks, which in turn are made up of organized threads; a unique index is given to each thread in a block and to each block in a grid, and a globally unique index is given to a thread by combining the local thread index and the block index in the CUDA application.


Thread indices are mapped onto data elements, with multiple threads performing the same operation on different data.
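A hedged sketch of this index-to-cell mapping for a latitude-longitude grid (the grid dimensions, variable names and placeholder computation are illustrative and not taken from CAM):

#define NLON 256   // grid points in longitude (illustrative)
#define NLAT 128   // grid points in latitude  (illustrative)

// Each thread handles one (lon, lat) column of the atmosphere.
__global__ void process_columns(float *field) {
    int lon = threadIdx.x + blockIdx.x * blockDim.x;   // local thread index + block index
    int lat = threadIdx.y + blockIdx.y * blockDim.y;
    if (lon < NLON && lat < NLAT) {
        int cell = lat * NLON + lon;                    // globally unique index
        field[cell] *= 2.0f;                            // placeholder computation
    }
}

// Host-side launch: a 2-D grid of 2-D blocks covering the latitude-longitude mesh.
// dim3 block(16, 16);
// dim3 grid((NLON + 15) / 16, (NLAT + 15) / 16);
// process_columns<<<grid, block>>>(d_field);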

Fig. 8. Block diagram of the GeForce 9800 GX2

The central part is the global, constant and texture memory. Above and below the memory are the 16 streaming multiprocessors (SM); each streaming multiprocessor consists of 8 scalar processors (SP) and two special function (SF) units, and each is equipped with 16 KB of local memory plus caches for texture (tc) and constant (cc) memory. The programmer specifies the number of threads per block and the number of blocks per grid needed to execute the program. CUDA code is written in special kernel functions that are executed by the GPU and launched on the grid, and each thread block executes independently on a streaming multiprocessor. Global memory is shared by all threads, while the threads within a block have local shared memory for faster access, an optimization of memory access that is important for CUDA performance.
H.264/AVC

H.264/AVC is a video coding standard developed by the ITU-T VCEG (Video Coding Experts Group) and ISO/IEC MPEG (Moving Picture Experts Group), with a wide range of uses in high-definition TV (HDTV) and 3D video coding. It consists of modules such as the decoder and the deblocking filter; these are very complex and can be optimized using CUDA technology. [23]


Fig. 9. Proposed architecture for a CUDA-capable GPU

In the CUDA programming environment, the GPU is programmed from a traditional CPU host. The fundamental component of CUDA is the kernel function, which runs on the GPU acting as a coprocessor to the CPU. A kernel executes across threads, and threads are arranged in blocks: each block contains at most 512 threads and has its own local memory, so the threads of a block can communicate and work together, whereas threads in different blocks cannot communicate with each other. There are global, local and shared memory spaces, and only the shared memory space is cached and on-chip. The vertex shader and pixel shader are integrated into the stream processors; each streaming multiprocessor is made up of eight stream processors and features a SIMD (single instruction, multiple data) architecture in which the same instruction executes on different data. Each streaming processor provides a multiply-add (MAD) unit and an additional multiply (MUL) unit, each running at 1.35 GHz. CUDA-enabled GPUs show high performance in the execution of such data-adaptive processing algorithms.

Conclusion

From these discussions we conclude that GPUs are the clear front-runners for high performance computing thanks to their rapidly growing power. The GPU has pushed past old computational boundaries and made formerly infeasible solutions entirely practical. It now falls to CPU-GPU co-processing to take up the mantle of responsibility and clear the road to massively parallel computation.

REFERENCES
[1] J. Breitbart, "CuPP - A framework for easy CUDA integration," in Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, 2009, pp. 1-8.
[2] I. Buck, "GPU computing: Programming a massively parallel processor," in Code Generation and Optimization, 2007. CGO '07. International Symposium on, 2007, pp. 17-17.
[3] Chien-Ping Lu, "K3: Moore's law in the era of GPU computing," in VLSI Design Automation and Test (VLSI-DAT), 2010 International Symposium on, 2010, pp. 5-5.
[4] J. Cohen and M. Garland, "Novel Architectures: Solving Computational Problems with GPU Computing," Computing in Science & Engineering, vol. 11, pp. 58-63, 2009.
[5] F. Feinbube, P. Tröger and A. Polze, "Joint Forces: From Multithreaded Programming to GPU Computing," Software, IEEE, vol. 28, pp. 51-57, 2011.
[6] Feng Cui, Changjian Cheng, Feiyue Wang, Wei Wei, Lefei Li and Yumin Zou, "Accelerated GPU computing technology for parallel management systems," in Intelligent Control and Automation (WCICA), 2010 8th World Congress on, 2010, pp. 5343-5347.
[7] M. Garland, "Parallel computing with CUDA," in Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, 2010, pp. 1-1.


[8] P. Goorts, S. Rogmans, S. Vanden Eynde and P. Bekaert, "Practical examples of GPU computing optimization principles," in Signal Processing and Multimedia Applications (SIGMAP), Proceedings of the 2010 International Conference on, 2010, pp. 46-49.
[9] F. Jacob, D. Whittaker, S. Thapaliya, P. Bangalore, M. Mernik and J. Gray, "CUDACL: A tool for CUDA and OpenCL programmers," in High Performance Computing (HiPC), 2010 International Conference on, 2010, pp. 1-11.
[10] Jianbin Fang, A. L. Varbanescu and H. Sips, "A comprehensive performance comparison of CUDA and OpenCL," in Parallel Processing (ICPP), 2011 International Conference on, 2011, pp. 216-225.
[11] D. H. Jones, A. Powell, C.-S. Bouganis and P. Y. K. Cheung, "GPU versus FPGA for high productivity computing," in Field Programmable Logic and Applications (FPL), 2010 International Conference on, 2010, pp. 119-124.
[12] N. P. Karunadasa and D. N. Ranasinghe, "Accelerating high performance applications with CUDA and MPI," in Industrial and Information Systems (ICIIS), 2009 International Conference on, 2009, pp. 331-336.
[13] R. Kelly, "GPU Computing for Atmospheric Modeling," Computing in Science & Engineering, vol. 12, pp. 26-33, 2010.
[14] Lin Shi, Hao Chen and Jianhua Sun, "vCUDA: GPU accelerated high performance computing in virtual machines," in Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, 2009, pp. 1-11.
[15] D. Luebke, "CUDA: Scalable parallel programming for high-performance scientific computing," in Biomedical Imaging: From Nano to Macro, 2008. ISBI 2008. 5th IEEE International Symposium on, 2008, pp. 836-838.
[16] J. Nickolls and W. J. Dally, "The GPU Computing Era," Micro, IEEE, vol. 30, pp. 56-69, 2010.
[17] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone and J. C. Phillips, "GPU Computing," Proceedings of the IEEE, vol. 96, pp. 879-899, 2008.
[18] Qihang Huang, Zhiyi Huang, P. Werstein and M. Purvis, "GPU as a general purpose computing resource," in Parallel and Distributed Computing, Applications and Technologies, 2008. PDCAT 2008. Ninth International Conference on, 2008, pp. 151-158.
[19] M. Showerman, J. Enos, C. Steffen, S. Treichler, W. Gropp and W.-m. W. Hwu, "EcoG: A Power-Efficient GPU Cluster Architecture for Scientific Computing," Computing in Science & Engineering, vol. 13, pp. 83-87, 2011.
[20] Slo-Li Chu and Chih-Chieh Hsiao, "OpenCL: Make ubiquitous supercomputing possible," in High Performance Computing and Communications (HPCC), 2010 12th IEEE International Conference on, 2010, pp. 556-561.
[21] M. Steuwer, P. Kegel and S. Gorlatch, "SkelCL - A portable skeleton library for high-level GPU programming," in Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on, 2011, pp. 1176-1182.
[22] Teng Li, V. K. Narayana, E. El-Araby and T. El-Ghazawi, "GPU resource sharing and virtualization on high performance computing systems," in Parallel Processing (ICPP), 2011 International Conference on, 2011, pp. 733-742.
[23] Ting Liu, Eryan Yang, Ronghui Cheng and Ying Fu, "CUDA-based H.264/AVC deblocking filtering," in Audio Language and Image Processing (ICALIP), 2010 International Conference on, 2010, pp. 1547-1551.
[24] Tyng-Yeu Liang and Yu-Wei Chang, "GridCuda: A grid-enabled CUDA programming toolkit," in Advanced Information Networking and Applications (WAINA), 2011 IEEE Workshops of International Conference on, 2011, pp. 141-146.
[25] R. Vuduc and K. Czechowski, "What GPU Computing Means for High-End Systems," Micro, IEEE, vol. 31, pp. 74-78, 2011.
[26] Wei Cao, Lu Yao, Zongzhe Li, Yongxian Wang and Zhenghua Wang, "Implementing sparse matrix-vector multiplication using CUDA based on a hybrid sparse matrix format," in Computer Application and System Modeling (ICCASM), 2010 International Conference on, 2010, pp. V11-161-V11-165.
[27] Zhang Bo, Xue Zheng-hui, Ren Wu, Li Wei-ming and Sheng Xin-qing, "Accelerating FDTD algorithm using GPU computing," in Microwave Technology & Computational Electromagnetics (ICMTCE), 2011 IEEE International Conference on, 2011, pp. 410-413.
[28] Zhe Fan, Feng Qiu, A. Kaufman and S. Yoakum-Stover, "GPU cluster for high performance computing," in Supercomputing, 2004. Proceedings of the ACM/IEEE SC2004 Conference, 2004, pp. 47-47.
[29] Zhihui Zhang, Qinghai Miao and Ying Wang, "CUDA-based Jacobi's iterative method," in Computer Science-Technology and Applications, 2009. IFCSTA '09. International Forum on, 2009, pp. 259-262.


[30] Zhiyi Yang, Yating Zhu and Yong Pu, "Parallel image processing based on CUDA," in Computer Science and Software Engineering, 2008 International Conference on, 2008, pp. 198-201.
[31] Jason Sanders and Edward Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley, 2010.
[32] Gang Chen, Guobo Li, Songwen Pei and Baifeng Wu, "High performance computing via a GPU," in Information Science and Engineering (ICISE), 2009 1st International Conference on, 2009, pp. 238-241.
[33] M. Malik, T. Li, U. Sharif, R. Shahid, T. El-Ghazawi and G. Newby, "Productivity of GPUs under Different Programming Paradigms," submitted to Concurrency and Computation: Practice and Experience.
[34] F. Black and M. Scholes, "The pricing of options and corporate liabilities," Journal of Political Economy, vol. 81, no. 3, May-June 1973, pp. 637-654.
