
CUDA Workshop

High Performance GPU computing


EXEBIT - 2014
Karthikeyan
CPU vs GPU
• CPU – few powerful cores, very fast serial execution, low latency
• GPU – many simpler cores, massively parallel, high throughput
• Play Demonstration
Compute Unified Device Architecture
CUDA
• Exposes GPU computing for general purpose
• Flexible and scalable architecture
• Based on industry-standard C/C++
• Small set of extensions to enable heterogeneous
programming
• Straightforward APIs to manage devices, memory
etc.
• For NVIDIA GPUs only
Concepts to be covered
• Heterogeneous computing
• Blocks, Threads
• Indexing
• Shared memory
• __syncthreads()
• Warps, Divergence
• Asynchronous operation
• Handling errors
• Managing devices
Heterogeneous Computing
• CPU – Host, CPU RAM – Host Memory
• GPU – Device, GPU RAM – Device Memory

www.nvidia.com
Hello World!
• GPU code – kernel
• __global__ indicates it runs on the device
• Triple angle brackets mark a call from host code to device code – a “kernel launch”
• Returns ‘void’

__global__ void mykernel(void) {
    cuPrintf("Hello World!\n");
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("CPU Hello World!\n");
    return 0;
}
Hello World!
• nvcc helloworld.cu
• ./a.out
Working with codes
• Open Terminal
• ssh -X user#@10.21.1.166 (user 1-25)
• ssh -X guest@10.6.5.254 (user 26-50)
• ssh -X guest@192.168.1.211 (user 26-50)
• cd codes/helloworld/
• make
• ./helloworld
• gedit &
Hello World! Parallel
• Change to mykernel<<<N,1>>>();
• Launches N blocks
• CPU calls the kernel & continues its own work

__global__ void mykernel(void) {
    cuPrintf("Hello World!\n");
}

int main(void) {
    int N = 100;
    mykernel<<<N,1>>>();
    printf("CPU Hello World!\n");
    return 0;
}

Compile:
$ cd helloworld_blocks
$ make
$ ./helloworld_blocks
Processing Flow

• Step 1: Copy input data from Host Memory (CPU) to Device Memory (GPU) over the PCI bus
Processing Flow
• Step 2: CPU launches the kernel
• The kernel accesses device memory at a much faster rate
• Utilizes on-chip cache memory
Processing Flow

• Step 3: Copy the results back from Device Memory (GPU) to Host Memory (CPU) over the PCI bus
Device Memory Management
• cudaError_t cudaMalloc( void ** devPtr, size_t size_bytes)
• cudaError_t cudaMemcpy( void* dst, const void* src, size_t count,
  enum cudaMemcpyKind kind)
– cudaMemcpyHostToHost Host -> Host
– cudaMemcpyHostToDevice Host -> Device
– cudaMemcpyDeviceToHost Device -> Host
– cudaMemcpyDeviceToDevice Device -> Device
• Example: int a[100], *dev_a;
cudaMalloc (&dev_a, sizeof(int)*100);
cudaMemcpy( dev_a , a , sizeof(int)*100,
cudaMemcpyHostToDevice);
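
A minimal allocate / copy / compute / copy-back / free round trip, combining the calls above (a sketch; the variable names are illustrative and error checking is omitted):

    int a[100], b[100], *dev_a;
    cudaMalloc(&dev_a, sizeof(int)*100);                      // allocate device memory
    cudaMemcpy(dev_a, a, sizeof(int)*100,
               cudaMemcpyHostToDevice);                        // host -> device
    // ... launch kernels that work on dev_a ...
    cudaMemcpy(b, dev_a, sizeof(int)*100,
               cudaMemcpyDeviceToHost);                        // device -> host
    cudaFree(dev_a);                                           // release device memory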
Vector Addition
• How to identify which block it is?
• Each block takes care of one element: block i computes c[i] = a[i] + b[i]
• blockIdx.x
• You can have a 3-dimensional grid of blocks
• blockIdx.x, blockIdx.y, blockIdx.z
• Maximum grid dimensions: dim3(65535, 65535, 1024) – see gridDim.x, .y, .z

CPU (serial):
    void vectoradd(int *a, int *b, int *c) {
        for (int i = 0; i < 100; i++)
            c[i] = a[i] + b[i];
    }

GPU (one block per element):
    __global__ void vectoradd(int *a, int *b, int *c) {
        int i = blockIdx.x;
        c[i] = a[i] + b[i];
    }
Vector Addition
__global__ void vectoradd(int *a, int *b, int *c) {
    int i = blockIdx.x;
    c[i] = a[i] + b[i];
}

int main(void) {
    int host_a[100], host_b[100], host_c[100];
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc(&dev_a, sizeof(int)*100);          // memory allocation
    cudaMalloc(&dev_b, sizeof(int)*100);
    cudaMalloc(&dev_c, sizeof(int)*100);

    cudaMemcpy(dev_a, host_a, sizeof(int)*100, cudaMemcpyHostToDevice);   // memory copy
    cudaMemcpy(dev_b, host_b, sizeof(int)*100, cudaMemcpyHostToDevice);

    vectoradd<<<100,1>>>(dev_a, dev_b, dev_c);    // kernel launch: 100 blocks, 1 thread each

    cudaMemcpy(host_c, dev_c, sizeof(int)*100, cudaMemcpyDeviceToHost);
    return 0;
}

Compile:
$ cd vectoradd/
$ make
$ ./vectoradd
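
The full program would also release the device buffers once the result has been copied back (not shown on the slide; a minimal sketch):

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);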
Threads
• A block can have many threads
• For vector addition, the kernel launch would be vectoradd<<<1,N>>>(da, db, dc);
• Maximum thread dimensions (3-dimensional): (1024, 1024, 64)
• threadIdx.x
• blockDim.x

__global__ void vectoradd(int *a, int *b, int *c) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

Compile:
$ cd vectoradd_threads/
$ make
$ ./vectoradd_threads
Threads
• A kernel launch is a 3-D mesh of blocks, each of which is a 3-D mesh of threads
• Why threads?
  – Threads in a block can communicate
  – Threads in a block can synchronize
• Blocks can't do either
Built-in Variables
• threadIdx.x, threadIdx.y, threadIdx.z
• blockIdx.x, blockIdx.y, blockIdx.z
• blockDim.x, y, z (1024,1024,64) – Number of threads
per block.
• gridDim.x, y, z (65535, 65535, 1024) – Number of blocks in a
  kernel call (called a grid of blocks)
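
As a quick illustration (a sketch, not from the workshop code), these variables combine to give each thread a unique global index:

    __global__ void show_index(int *out) {
        // global 1-D index of this thread within the whole grid
        int gid = threadIdx.x + blockIdx.x * blockDim.x;
        out[gid] = gid;
    }
    // e.g. show_index<<<4, 8>>>(dev_out); fills dev_out[0..31] with 0..31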
Index Calculation
• Using blocks and threads simultaneously
• i = threadIdx.x + blockDim.x * blockIdx.x;

  threadIdx.x:  0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7
  blockIdx.x:   ------ 0 ------ | ------ 1 ------ | ------ 2 ------ | ------ 3 ------

• blockDim.x = 8 (number of threads in a block)
• gridDim.x = 4 (number of blocks in that kernel launch)
• add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(…);
Boundary Conditions
• Usually blockDim.x is a multiple of 32
• Always put a boundary condition on the data size

__global__ void vectoradd(int *a, int *b, int *c, int N) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < N)
        c[i] = a[i] + b[i];
}

Compile:
$ cd vectoradd_full/
$ make
$ ./vectoradd_full
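
On the host side, the grid size is usually rounded up so that an N not divisible by the block size is still fully covered (a sketch; THREADS_PER_BLOCK is an illustrative constant):

    #define THREADS_PER_BLOCK 256
    int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;   // round up
    vectoradd<<<blocks, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c, N);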
For Very Large N
• Very large N (N > 10^6)

__global__ void vectoradd(int *a, int *b, int *c, long N) {
    for (long i = threadIdx.x + blockDim.x * blockIdx.x; i < N;
         i += gridDim.x * blockDim.x)
        c[i] = a[i] + b[i];
}
Compile:
$ cd vectoradd_large/
$ make
$ ./vectoradd_large
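
With this grid-stride loop the launch no longer has to match N exactly; a fixed-size grid sweeps over the whole array (a sketch; the block and grid sizes are illustrative):

    vectoradd<<<128, 256>>>(dev_a, dev_b, dev_c, N);   // 32768 threads stride over N elements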
Block Scheduling
• Streaming Multiprocessors (SMs) are the execution units of the GPU
• Different GPUs have a different number of SMs
• Threads within a block can communicate
• There is no communication between blocks
• There is no specific order in which blocks are scheduled
Block Scheduling
• All threads in a block execute on a single SM
• No guarantee on the order of execution
• Hardware schedules blocks onto the available SMs

  [Figure: with 3 SMs available, Blocks 1-4 are distributed across the SMs;
   a block waits until an SM becomes free]
1-D Stencil
• Compute a(i) + a(i+1) + a(i+2)

__global__ void stencil(int *a, int *b) {
    int i = threadIdx.x;
    b[i] = a[i] + a[i+1] + a[i+2];
}

Compile:
$ cd 1dstencil/
$ make
$ ./1dstencil

  [Figure: thread 0 reads a[0..2], thread 1 reads a[1..3] – neighbouring
   threads read overlapping elements]
Global Memory
• Till now we have been using global memory
for our computations.
• Very slow to access
• Allocated using cudaMalloc(..)
1-D Stencil Revisited
• Compute a(i)+a(i+1)+a(i+2)
__global__ void stencil(int *a, int *b) {
int i=threadIdx.x;
b[i]=a[i]+a[i+1]+a[i+2];
}
• Data could be shared among threads

  [Figure: thread 0 reads a[0..2], thread 1 reads a[1..3] – each element is
   read by up to 3 different threads]

• 3 global reads + 1 global write per thread
Shared Memory
• Memory shared among threads inside a block.
• Cannot be accessed from another block
• Declared inside kernel code
• __shared__ int a[100];
• On-chip, very fast
1-D Stencil Shared
• Copy to shared memory

__global__ void stencil(int *a, int *b) {
    int i = threadIdx.x;
    __shared__ int sa[100];
    sa[i] = a[i];
    __syncthreads();     // wait until all threads have filled sa[] (see the __syncthreads() slide)
    b[i] = sa[i] + sa[i+1] + sa[i+2];
}

Compile:
$ cd 1dstencil_shared
$ make
$ ./1dstencil_shared

• Write the result to global memory
• Shared memory is visible to a block only
• Cannot be accessed by other blocks or by the CPU
Access Times
• Registers (1-2 cycles)
• Shared memory (10 cycles)
• Global memory (100s of cycles)
• Local memory (100s of cycles)
Run time Comparison
• Global-memory version: 3 global reads + 1 global write per thread
  3*100 + 100 = 400 cycles

• Shared-memory version: 1 global read + 3 shared reads + 1 global write per thread
  1*100 + 3*10 + 1*100 = 230 cycles

• Use nvprof ./file_name to see the runtime of the programs
Memory Hierarchy
• Registers
– Per thread on chip
– Data lifetime = thread lifetime
• Local memory
– Per thread off-chip memory (DRAM)
– Data lifetime = thread lifetime
• Shared memory
– Per thread block : on-chip memory
– Data lifetime = block lifetime
• Global (device) memory
– Accessible by all threads and host (CPU)
– Data lifetime= Entire program
from allocation to de-allocation
• Host (CPU) memory
– Not directly accessible by
CUDA threads
__syncthreads()
• Synchronizes all threads within a block
• Waits until all threads in the block have reached the __syncthreads() call
• Used to prevent RAW, WAR, WAW hazards
– RAW – Read After Write
– WAR – Write After Read
– WAW – Write After Write
• Synchronize to commit all the memory writes, reads and
computation.

__syncthreads();
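
In the shared-memory stencil shown earlier, the barrier prevents a RAW hazard: thread i must not read sa[i+1] and sa[i+2] before threads i+1 and i+2 have written them (a sketch of the relevant lines):

    sa[i] = a[i];                        // every thread writes one shared element
    __syncthreads();                     // all writes to sa[] are now visible to the block
    b[i] = sa[i] + sa[i+1] + sa[i+2];    // safe to read the neighbours' elements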
Reduction
• Addition of N numbers
• Other operations: +, *, AND, OR, XOR, maximum, minimum, etc.

void reduce(int *a, int *result) {
    *result = 0;
    for (int i = 0; i < 100; i++)
        *result = *result + a[i];
}

• Serial – How to parallelize?


Reduction
• Addition of N numbers
• Other operations: +, *, AND, OR, XOR, maximum, minimum, etc.

void reduce(int *a, int *result) {
    *result = 0;
    for (int i = 0; i < 100; i++)
        *result = *result + a[i];
}

• Serial – How to parallelize?


• Using the associative property!
• a+b+c+d = (a+b) + (c+d)
Reduction
• N numbers -> log2(N) steps to compute
• Share the result of the 1st step with the other threads in the 2nd step

  Input (threadIdx.x 0..7):   0   1   2   3   4   5   6   7
  Step 1 (pairwise sums):       1       5       9      13
  Step 2:                           6              22
  Step 3:                                  28

• Some algorithms are not straightforward to implement in parallel
Reduction kernel
• Read into shared memory
• Operate & write to shared memory
• Write the result to global memory

__global__ void reduce(int *a, int *result) {
    int i = threadIdx.x;                 // one block of N threads (N is a compile-time constant)
    __shared__ int s_a[N];
    s_a[i] = a[i];
    __syncthreads();
    for (int stride = 1; stride < N; stride *= 2) {
        if (i % stride == 0 && 2*i + stride < N)
            s_a[2*i] = s_a[2*i] + s_a[2*i + stride];
        __syncthreads();
    }
    if (i == 0)
        *result = s_a[0];
}

Compile:
$ cd reduction/
$ make
$ ./reduction
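
Since the kernel uses a single block of N threads, the host side would launch it roughly like this (a sketch; dev_result and sum are illustrative names):

    reduce<<<1, N>>>(dev_a, dev_result);                                // one block, N threads
    cudaMemcpy(&sum, dev_result, sizeof(int), cudaMemcpyDeviceToHost);  // fetch the result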
CUDA programming model
[Figure: the grid of thread blocks is mapped onto the SMs of the GPU]
Warps
• Inside SM, threads are split into group of 32 threads called
warps.
• All threads in single warp execute in parallel.
• If an executing warp has to wait (e.g. at a barrier or for memory), it is put on
  hold and another warp is dispatched for execution.
• This is taken care of by the warp scheduler.
• All threads in a warp execute the SAME instruction.
Warp
• No guarantee on order of warps dispatched.
• GPU architectures – Tesla, Fermi, Kepler
• Warp size = 32
• Fermi
– 2 warp schedulers
– 2 instruction units
Divergence
• Alternate threads in a warp take different branches
• Each warp takes 2 time steps

if (threadIdx.x % 2 == 0)
    a[threadIdx.x] += 1;
else
    a[threadIdx.x] += 2;

  Warp 1:  threads 0, 2, 4, 6, 8, ...  take the if branch
           threads 1, 3, 5, 7, 9, ...  take the else branch
Divergence
• All threads in a warp execute the same instruction
• Each warp takes 1 time step

if (threadIdx.x < 32)
    a[threadIdx.x] += 1;
else
    a[threadIdx.x] += 2;

  Warp 1 (threads 0-31):  if branch
  Warp 2 (threads 32-63): else branch
Reduction revisited
• Divergence at all strides: only some threads of a warp satisfy i % stride == 0,
  so threads within the same warp execute different instructions

  Input (threadIdx.x 0..7):   0   1   2   3   4   5   6   7
  Step 1:                       1       5       9      13
  Step 2:                           6              22

• Solution: modify the condition (next slide)

    for (int stride = 1; stride < N; stride *= 2) {
        if (i % stride == 0 && 2*i + stride < N)
            s_a[2*i] = s_a[2*i] + s_a[2*i + stride];
        __syncthreads();
    }
    if (i == 0)
        *result = s_a[0];
Reduction (No Divergence)
• Add elements that are ‘stride’ apart, halving the stride each step

  Input (s_a[0..7]):   0   1   2   3   4   5   6   7
  stride = 4:          4   6   8  10
  stride = 2:         12  16
  stride = 1:         28

    // assumes blockDim.x (and N) is a power of 2
    for (int stride = blockDim.x/2; stride > 0; stride /= 2) {
        if (i < stride)
            s_a[i] = s_a[i] + s_a[i + stride];
        __syncthreads();
    }
    if (i == 0)
        *result = s_a[0];

• No divergence for stride >= 32: every warp takes the same branch

Compile:
$ cd reduction_nodiv/
$ make
$ ./reduction_nodiv
Resource allocation
• Split your program into small kernels. Why?
• Each SM has limited registers, shared memory.
• The amount depends on compute capability of GPU
• 1.0, 1.1, 1.2, 1.3, 2.x, 3.0, 3.5, 5.0
• Fermi, Tesla, Kepler
• Global memory is large (>512MB)
• nvcc -Xptxas=-v filename.cu
Resource limits
• Number of thread blocks per SM is limited by
– Registers
– Shared memory usage
– No. of blocks per SM
– Number of threads
  Limits             1.3     2.x     3.x     5.0
  Registers/SM       16K     32K     64K     64K
  Shared Memory/SM   16KB    48KB    48KB    64KB
  Blocks/SM          8       8       16      32
  Threads/SM         1024    1536    2048    2048

• Occupancy = active warps per SM / maximum warps per SM
Asynchronous
• Kernel launches are Asynchronous
• cudaMemcpy, cudaMalloc – Synchronous
• cudaMemcpyAsync() - Asynchronous, does not block the CPU
• cudaDeviceSynchronize() - Blocks the CPU until all preceding
CUDA calls have completed
• Asynchronous calls let the CPU do useful work while the GPU is busy.
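
A minimal sketch of the pattern (note: cudaMemcpyAsync only overlaps with host work when the host buffer is page-locked, e.g. allocated with cudaMallocHost; the names below are illustrative):

    cudaMemcpyAsync(dev_a, host_a, bytes, cudaMemcpyHostToDevice);  // returns immediately
    mykernel<<<blocks, threads>>>(dev_a);                           // kernel launch is asynchronous too
    do_cpu_work();                                                  // CPU is free to work here
    cudaDeviceSynchronize();                                        // block until the GPU has finished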
Handling Errors
• All CUDA API calls return an error code (cudaError_t)
– Error in the API call itself
– Error in an earlier asynchronous operation (e.g. kernel)
• Get the error code for the last error:
– cudaError_t cudaGetLastError(void)
• Get a string to describe the error:
– const char *cudaGetErrorString(cudaError_t)
• printf("%s\n", cudaGetErrorString(cudaGetLastError()));
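
A common convenience is to wrap every API call in a checking macro (a sketch, not part of the workshop code):

    #define CUDA_CHECK(call)                                               \
        do {                                                               \
            cudaError_t err = (call);                                      \
            if (err != cudaSuccess)                                        \
                printf("CUDA error %s at %s:%d\n",                         \
                       cudaGetErrorString(err), __FILE__, __LINE__);       \
        } while (0)

    // usage:
    CUDA_CHECK(cudaMalloc(&dev_a, sizeof(int)*100));
    CUDA_CHECK(cudaMemcpy(dev_a, a, sizeof(int)*100, cudaMemcpyHostToDevice));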
Device Management
• Application can query and select GPUs
– cudaGetDeviceCount(int *count)
– cudaSetDevice(int device)
– cudaGetDevice(int *device)
– cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
• Multiple host threads can share a device
• A single host thread can manage multiple devices
– cudaSetDevice(i) to select current device
– cudaMemcpy(…) for peer-to-peer copies
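
For example, querying the available GPUs and selecting one might look like this (a sketch):

    int count;
    cudaGetDeviceCount(&count);                   // number of CUDA-capable GPUs
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, %d SMs\n", d, prop.name, prop.multiProcessorCount);
    }
    cudaSetDevice(0);                             // make device 0 current for this host thread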
Summary
• Write and launch CUDA C/C++ kernels
– __global__, <<<>>>, blockIdx, threadIdx, blockDim
• Manage GPU memory
– cudaMalloc(), cudaMemcpy(), cudaFree()
• Manage communication and synchronization
– __shared__, __syncthreads()
– cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize()
• Resource limits
– Registers, Shared memory, blocks/SM, threads/SM
Advanced concepts (not covered)
• Memory Coalescing
• Constant memory
• Streams
• Atomics
• Shared memory conflicts
• Texture memory
Tools
• nvcc – NVIDIA compiler
• nvprof - command line profiler
• nvvp – Visual profiler
• cuda-memcheck – Memory bugs
• Nsight – Visual Studio, Eclipse
• Allinea DDT
Libraries
• CUBLAS – CUDA accelerated Basic Linear Algebra
• CUFFT – Fast Fourier Transform (1D, 2D, 3D)
• Thrust – C++ template library (similar to C++ STL)
• CULA – Dense, Sparse Linear Algebra
• OpenCV – Computer Vision, Image processing
• AccelerEyes ArrayFire
• MATLAB, LabVIEW, Mathematica, Python
• ABACUS, AMBER, ANSYS, GROMACS, LAMMPS, NAMD, …
Online Resources
• http://developer.nvidia.com/cuda-training
• Coursera - Heterogeneous computing
• Udacity - CS344 Intro To Parallel Programming
• GPU computing Webinars

• CUDA Documentation
• Books
– CUDA by Example
– Programming Massively Parallel Processors: A Hands-on Approach
– GPU GEMS
Questions?
