
CUDA Workshop

High Performance GPU computing


EXEBIT - 2014
Karthikeyan
CPU vs GPU
• CPU – few powerful cores, very fast serial execution, low latency
• GPU – many simpler cores, massively parallel, high throughput
• Play Demonstration
Compute Unified Device Architecture
CUDA
• Exposes GPU computing for general purpose
• Flexible and scalable architecture
• Based on industry-standard C/C++
• Small set of extensions to enable heterogeneous
programming
• Straightforward APIs to manage devices, memory
etc.
• For NVIDIA GPUs only
Concepts to be covered
• Heterogeneous computing
• Blocks, Threads
• Indexing
• Shared memory
• __syncthreads()
• Warps, Divergence
• Asynchronous operation
• Handling errors
• Managing devices
Heterogeneous Computing
• CPU – Host, CPU RAM – Host Memory
• GPU – Device, GPU RAM – Device Memory

www.nvidia.com
Hello World!
• GPU code – kernel
• __global__ indicates it runs on the device
• Triple angle brackets mark a call from host code to device code – a “kernel launch”
• Returns ‘void’

__global__ void mykernel(void) {
    cuPrintf("Hello World!\n");
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("CPU Hello World!\n");
    return 0;
}
Hello World!
• nvcc helloworld.cu
• ./a.out
Working with codes
• Open Terminal
• ssh -X user#@10.21.1.166 (user 1-25)
• ssh -X guest@10.6.5.254 (user 26-50)
• ssh -X guest@192.168.1.211 (user 26-50)
• cd codes/helloworld/
• make
• ./helloworld
• gedit &
Hello World! Parallel
• Change to mykernel<<<N,1>>>();
• Launches N blocks
• CPU calls the kernel & continues its own work

__global__ void mykernel(void) {
    cuPrintf("Hello World!\n");
}

int main(void) {
    int N = 100;
    mykernel<<<N,1>>>();
    printf("CPU Hello World!\n");
    return 0;
}

Compile:
$ cd helloworld_blocks
$ make
$ ./helloworld_blocks
Processing Flow

• Step 1: Copy input data from Host Memory (CPU) to Device Memory (GPU) over the PCI bus
Processing Flow
• Step 2: CPU launches the kernel
• The kernel accesses device memory at a much faster rate
• Utilizes on-chip cache memory
Processing Flow

• Step 3: Copy the results back from Device Memory (GPU) to Host Memory (CPU) over the PCI bus
Device Memory Management
• cudaError_t cudaMalloc( void ** devPtr, size_t size_bytes)
• cudaError_t cudaMemcpy( void* dst, const void* src, size_t count,
  enum cudaMemcpyKind kind)
– cudaMemcpyHostToHost Host -> Host
– cudaMemcpyHostToDevice Host -> Device
– cudaMemcpyDeviceToHost Device -> Host
– cudaMemcpyDeviceToDevice Device -> Device
• Example: int a[100], *dev_a;
cudaMalloc (&dev_a, sizeof(int)*100);
cudaMemcpy( dev_a , a , sizeof(int)*100,
cudaMemcpyHostToDevice);
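
A minimal allocate / copy / compute / copy-back / free round trip, combining the calls above (a sketch; the variable names are illustrative and error checking is omitted):

    int a[100], b[100], *dev_a;
    cudaMalloc(&dev_a, sizeof(int)*100);                      // allocate device memory
    cudaMemcpy(dev_a, a, sizeof(int)*100,
               cudaMemcpyHostToDevice);                        // host -> device
    // ... launch kernels that work on dev_a ...
    cudaMemcpy(b, dev_a, sizeof(int)*100,
               cudaMemcpyDeviceToHost);                        // device -> host
    cudaFree(dev_a);                                           // release device memory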
Vector Addition
• How to identify which block it is?
• Each block takes care of one element: block i computes c[i] = a[i] + b[i]
• blockIdx.x
• You can have a 3-dimensional grid of blocks
• blockIdx.x, blockIdx.y, blockIdx.z
• Maximum grid dimensions: dim3(65535, 65535, 1024) – see gridDim.x, .y, .z

CPU (serial):
    void vectoradd(int *a, int *b, int *c) {
        for (int i = 0; i < 100; i++)
            c[i] = a[i] + b[i];
    }

GPU (one block per element):
    __global__ void vectoradd(int *a, int *b, int *c) {
        int i = blockIdx.x;
        c[i] = a[i] + b[i];
    }
Vector Addition
__global__ void vectoradd(int *a, int *b, int *c) {
    int i = blockIdx.x;
    c[i] = a[i] + b[i];
}

int main(void) {
    int host_a[100], host_b[100], host_c[100];
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc(&dev_a, sizeof(int)*100);          // memory allocation
    cudaMalloc(&dev_b, sizeof(int)*100);
    cudaMalloc(&dev_c, sizeof(int)*100);

    cudaMemcpy(dev_a, host_a, sizeof(int)*100, cudaMemcpyHostToDevice);   // memory copy
    cudaMemcpy(dev_b, host_b, sizeof(int)*100, cudaMemcpyHostToDevice);

    vectoradd<<<100,1>>>(dev_a, dev_b, dev_c);    // kernel launch: 100 blocks, 1 thread each

    cudaMemcpy(host_c, dev_c, sizeof(int)*100, cudaMemcpyDeviceToHost);
    return 0;
}

Compile:
$ cd vectoradd/
$ make
$ ./vectoradd
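
The full program would also release the device buffers once the result has been copied back (not shown on the slide; a minimal sketch):

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);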
Threads
• A block can have many threads
• For vector addition, the kernel launch would be vectoradd<<<1,N>>>(da, db, dc);
• Maximum thread dimensions (3-dimensional): (1024, 1024, 64)
• threadIdx.x
• blockDim.x

__global__ void vectoradd(int *a, int *b, int *c) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

Compile:
$ cd vectoradd_threads/
$ make
$ ./vectoradd_threads
Threads
• A kernel launch is a 3-D mesh of blocks, each of which is a 3-D mesh of threads
• Why threads?
  – Threads in a block can communicate
  – Threads in a block can synchronize
• Blocks can't do either
Built-in Variables
• threadIdx.x, threadIdx.y, threadIdx.z
• blockIdx.x, blockIdx.y, blockIdx.z
• blockDim.x, y, z (1024,1024,64) – Number of threads
per block.
• gridDim.x, y, z (65535, 65535, 1024) – Number of blocks in a
  kernel call (called a grid of blocks)
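
As a quick illustration (a sketch, not from the workshop code), these variables combine to give each thread a unique global index:

    __global__ void show_index(int *out) {
        // global 1-D index of this thread within the whole grid
        int gid = threadIdx.x + blockIdx.x * blockDim.x;
        out[gid] = gid;
    }
    // e.g. show_index<<<4, 8>>>(dev_out); fills dev_out[0..31] with 0..31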
Index Calculation
• Using blocks and threads simultaneously
• i = threadIdx.x + blockDim.x * blockIdx.x;

  threadIdx.x:  0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7
  blockIdx.x:   ------ 0 ------ | ------ 1 ------ | ------ 2 ------ | ------ 3 ------

• blockDim.x = 8 (number of threads in a block)
• gridDim.x = 4 (number of blocks in that kernel launch)
• add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(…);
Boundary Conditions
• Usually blockDim.x is a multiple of 32
• Always put a boundary condition on the data size

__global__ void vectoradd(int *a, int *b, int *c, int N) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < N)
        c[i] = a[i] + b[i];
}

Compile:
$ cd vectoradd_full/
$ make
$ ./vectoradd_full
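
On the host side, the grid size is usually rounded up so that an N not divisible by the block size is still fully covered (a sketch; THREADS_PER_BLOCK is an illustrative constant):

    #define THREADS_PER_BLOCK 256
    int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;   // round up
    vectoradd<<<blocks, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c, N);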
For Very Large N
• Very large N (N > 10^6)

__global__ void vectoradd(int *a, int *b, int *c, long N) {
    for (long i = threadIdx.x + blockDim.x * blockIdx.x; i < N;
         i += gridDim.x * blockDim.x)
        c[i] = a[i] + b[i];
}
Compile:
$ cd vectoradd_large/
$ make
$ ./vectoradd_large
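
With this grid-stride loop the launch no longer has to match N exactly; a fixed-size grid sweeps over the whole array (a sketch; the block and grid sizes are illustrative):

    vectoradd<<<128, 256>>>(dev_a, dev_b, dev_c, N);   // 32768 threads stride over N elements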
Block Scheduling
• Streaming Multiprocessors (SMs) are the execution units of the GPU
• Different GPUs have a different number of SMs
• Threads within a block can communicate
• There is no communication between blocks
• There is no specific order in which blocks are scheduled
Block Scheduling
• All threads in a block execute on a single SM
• No guarantee on the order of execution
• Hardware schedules blocks onto the available SMs

  [Figure: with 3 SMs available, Blocks 1-4 are distributed across the SMs;
   a block waits until an SM becomes free]
1-D Stencil
• Compute a(i) + a(i+1) + a(i+2)

__global__ void stencil(int *a, int *b) {
    int i = threadIdx.x;
    b[i] = a[i] + a[i+1] + a[i+2];
}

Compile:
$ cd 1dstencil/
$ make
$ ./1dstencil

  [Figure: thread 0 reads a[0..2], thread 1 reads a[1..3] – neighbouring
   threads read overlapping elements]
Global Memory
• Till now we have been using global memory
for our computations.
• Very slow to access
• Allocated using cudaMalloc(..)
1-D Stencil Revisited
• Compute a(i)+a(i+1)+a(i+2)
__global__ void stencil(int *a, int *b) {
int i=threadIdx.x;
b[i]=a[i]+a[i+1]+a[i+2];
}
• Data could be shared among threads

  [Figure: thread 0 reads a[0..2], thread 1 reads a[1..3] – each element is
   read by up to 3 different threads]

• 3 global reads + 1 global write per thread
Shared Memory
• Memory shared among threads inside a block.
• Cannot be accessed from another block
• Declared inside kernel code
• __shared__ int a[100];
• On-chip, very fast
1-D Stencil Shared
• Copy to shared memory

__global__ void stencil(int *a, int *b) {
    int i = threadIdx.x;
    __shared__ int sa[100];
    sa[i] = a[i];
    __syncthreads();     // wait until all threads have filled sa[] (see the __syncthreads() slide)
    b[i] = sa[i] + sa[i+1] + sa[i+2];
}

Compile:
$ cd 1dstencil_shared
$ make
$ ./1dstencil_shared

• Write the result to global memory
• Shared memory is visible to a block only
• Cannot be accessed by other blocks or by the CPU
Access Times
• Registers (1-2 cycles)
• Shared memory (10 cycles)
• Global memory (100s of cycles)
• Local memory (100s of cycles)
Run time Comparison
• Global-memory version: 3 global reads + 1 global write per thread
  3*100 + 100 = 400 cycles

• Shared-memory version: 1 global read + 3 shared reads + 1 global write per thread
  1*100 + 3*10 + 1*100 = 230 cycles

• Use nvprof ./file_name to see the runtime of the programs
Memory Hierarchy
• Registers
– Per thread on chip
– Data lifetime = thread lifetime
• Local memory
– Per thread off-chip memory (DRAM)
– Data lifetime = thread lifetime
• Shared memory
– Per thread block : on-chip memory
– Data lifetime = block lifetime
• Global (device) memory
– Accessible by all threads and host (CPU)
– Data lifetime= Entire program
from allocation to de-allocation
• Host (CPU) memory
– Not directly accessible by
CUDA threads
__syncthreads()
• Synchronizes all threads within a block
• Waits until all threads in the block have reached the __syncthreads() call
• Used to prevent RAW, WAR, WAW hazards
– RAW – Read After Write
– WAR – Write After Read
– WAW – Write After Write
• Synchronize to commit all the memory writes, reads and
computation.

__syncthreads();
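
In the shared-memory stencil shown earlier, the barrier prevents a RAW hazard: thread i must not read sa[i+1] and sa[i+2] before threads i+1 and i+2 have written them (a sketch of the relevant lines):

    sa[i] = a[i];                        // every thread writes one shared element
    __syncthreads();                     // all writes to sa[] are now visible to the block
    b[i] = sa[i] + sa[i+1] + sa[i+2];    // safe to read the neighbours' elements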
Reduction
• Addition of N numbers
• Other operations: +, *, AND, OR, XOR, maximum, minimum, etc.

void reduce(int *a, int *result) {
    *result = 0;
    for (int i = 0; i < 100; i++)
        *result = *result + a[i];
}

• Serial – How to parallelize?


Reduction
• Addition of N numbers
• Other operations: +, *, AND, OR, XOR, maximum, minimum, etc.

void reduce(int *a, int *result) {
    *result = 0;
    for (int i = 0; i < 100; i++)
        *result = *result + a[i];
}

• Serial – How to parallelize?


• Using the associative property!
• a+b+c+d = (a+b) + (c+d)
Reduction
• N numbers -> log2(N) steps to compute
• Share the result of the 1st step with the other threads in the 2nd step

  Input (threadIdx.x 0..7):   0   1   2   3   4   5   6   7
  Step 1 (pairwise sums):       1       5       9      13
  Step 2:                           6              22
  Step 3:                                  28

• Some algorithms are not straightforward to implement in parallel
Reduction kernel
• Read into shared memory
• Operate & write to shared memory
• Write the result to global memory

__global__ void reduce(int *a, int *result) {
    int i = threadIdx.x;                 // one block of N threads (N is a compile-time constant)
    __shared__ int s_a[N];
    s_a[i] = a[i];
    __syncthreads();
    for (int stride = 1; stride < N; stride *= 2) {
        if (i % stride == 0 && 2*i + stride < N)
            s_a[2*i] = s_a[2*i] + s_a[2*i + stride];
        __syncthreads();
    }
    if (i == 0)
        *result = s_a[0];
}

Compile:
$ cd reduction/
$ make
$ ./reduction
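
Since the kernel uses a single block of N threads, the host side would launch it roughly like this (a sketch; dev_result and sum are illustrative names):

    reduce<<<1, N>>>(dev_a, dev_result);                                // one block, N threads
    cudaMemcpy(&sum, dev_result, sizeof(int), cudaMemcpyDeviceToHost);  // fetch the result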
CUDA programming model
[Figure: the grid of thread blocks is mapped onto the SMs of the GPU]
Warps
• Inside SM, threads are split into group of 32 threads called
warps.
• All threads in single warp execute in parallel.
• If an executing warp has to wait (e.g. at a barrier or for memory), it is put on
  hold and another warp is dispatched for execution.
• This is taken care of by the warp scheduler.
• All threads in a warp execute the SAME instruction.
Warp
• No guarantee on order of warps dispatched.
• GPU architectures – Tesla, Fermi, Kepler
• Warp size = 32
• Fermi
– 2 warp schedulers
– 2 instruction units
Divergence
• Alternate threads in a warp take different branches
• Each warp takes 2 time steps

if (threadIdx.x % 2 == 0)
    a[threadIdx.x] += 1;
else
    a[threadIdx.x] += 2;

  Warp 1:  threads 0, 2, 4, 6, 8, ...  take the if branch
           threads 1, 3, 5, 7, 9, ...  take the else branch
Divergence
• All threads in a warp execute the same instruction
• Each warp takes 1 time step

if (threadIdx.x < 32)
    a[threadIdx.x] += 1;
else
    a[threadIdx.x] += 2;

  Warp 1 (threads 0-31):  if branch
  Warp 2 (threads 32-63): else branch
Reduction revisited
• Divergence at all strides: only some threads of a warp satisfy i % stride == 0,
  so threads within the same warp execute different instructions

  Input (threadIdx.x 0..7):   0   1   2   3   4   5   6   7
  Step 1:                       1       5       9      13
  Step 2:                           6              22

• Solution: modify the condition (next slide)

    for (int stride = 1; stride < N; stride *= 2) {
        if (i % stride == 0 && 2*i + stride < N)
            s_a[2*i] = s_a[2*i] + s_a[2*i + stride];
        __syncthreads();
    }
    if (i == 0)
        *result = s_a[0];
Reduction (No Divergence)
• Add elements that are ‘stride’ apart, halving the stride each step

  Input (s_a[0..7]):   0   1   2   3   4   5   6   7
  stride = 4:          4   6   8  10
  stride = 2:         12  16
  stride = 1:         28

    // assumes blockDim.x (and N) is a power of 2
    for (int stride = blockDim.x/2; stride > 0; stride /= 2) {
        if (i < stride)
            s_a[i] = s_a[i] + s_a[i + stride];
        __syncthreads();
    }
    if (i == 0)
        *result = s_a[0];

• No divergence for stride >= 32: every warp takes the same branch

Compile:
$ cd reduction_nodiv/
$ make
$ ./reduction_nodiv
Resource allocation
• Split your program into small kernels. Why?
• Each SM has limited registers, shared memory.
• The amount depends on compute capability of GPU
• 1.0, 1.1, 1.2, 1.3, 2.x, 3.0, 3.5, 5.0
• Fermi, Tesla, Kepler
• Global memory is large (>512MB)
• nvcc -Xptxas=-v filename.cu
Resource limits
• Number of thread blocks per SM is limited by
– Registers
– Shared memory usage
– No. of blocks per SM
– Number of threads
  Limits             1.3     2.x     3.x     5.0
  Registers/SM       16K     32K     64K     64K
  Shared Memory/SM   16KB    48KB    48KB    64KB
  Blocks/SM          8       8       16      32
  Threads/SM         1024    1536    2048    2048

• Occupancy = active warps per SM / maximum warps per SM
Asynchronous
• Kernel launches are Asynchronous
• cudaMemcpy, cudaMalloc – Synchronous
• cudaMemcpyAsync() - Asynchronous, does not block the CPU
• cudaDeviceSynchronize() - Blocks the CPU until all preceding
CUDA calls have completed
• Asynchronous calls let the CPU do useful work while the GPU is busy.
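
A minimal sketch of the pattern (note: cudaMemcpyAsync only overlaps with host work when the host buffer is page-locked, e.g. allocated with cudaMallocHost; the names below are illustrative):

    cudaMemcpyAsync(dev_a, host_a, bytes, cudaMemcpyHostToDevice);  // returns immediately
    mykernel<<<blocks, threads>>>(dev_a);                           // kernel launch is asynchronous too
    do_cpu_work();                                                  // CPU is free to work here
    cudaDeviceSynchronize();                                        // block until the GPU has finished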
Handling Errors
• All CUDA API calls return an error code (cudaError_t)
– Error in the API call itself
– Error in an earlier asynchronous operation (e.g. kernel)
• Get the error code for the last error:
– cudaError_t cudaGetLastError(void)
• Get a string to describe the error:
– const char *cudaGetErrorString(cudaError_t)
• printf("%s\n", cudaGetErrorString(cudaGetLastError()));
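
A common convenience is to wrap every API call in a checking macro (a sketch, not part of the workshop code):

    #define CUDA_CHECK(call)                                               \
        do {                                                               \
            cudaError_t err = (call);                                      \
            if (err != cudaSuccess)                                        \
                printf("CUDA error %s at %s:%d\n",                         \
                       cudaGetErrorString(err), __FILE__, __LINE__);       \
        } while (0)

    // usage:
    CUDA_CHECK(cudaMalloc(&dev_a, sizeof(int)*100));
    CUDA_CHECK(cudaMemcpy(dev_a, a, sizeof(int)*100, cudaMemcpyHostToDevice));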
Device Management
• Application can query and select GPUs
– cudaGetDeviceCount(int *count)
– cudaSetDevice(int device)
– cudaGetDevice(int *device)
– cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
• Multiple host threads can share a device
• A single host thread can manage multiple devices
– cudaSetDevice(i) to select current device
– cudaMemcpy(…) for peer-to-peer copies
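
For example, querying the available GPUs and selecting one might look like this (a sketch):

    int count;
    cudaGetDeviceCount(&count);                   // number of CUDA-capable GPUs
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, %d SMs\n", d, prop.name, prop.multiProcessorCount);
    }
    cudaSetDevice(0);                             // make device 0 current for this host thread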
Summary
• Write and launch CUDA C/C++ kernels
– __global__, <<<>>>, blockIdx, threadIdx, blockDim
• Manage GPU memory
– cudaMalloc(), cudaMemcpy(), cudaFree()
• Manage communication and synchronization
– __shared__, __syncthreads()
– cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize()
• Resource limits
– Registers, Shared memory, blocks/SM, threads/SM
Advanced concepts (not covered)
• Memory Coalescing
• Constant memory
• Streams
• Atomics
• Shared memory conflicts
• Texture memory
Tools
• nvcc – NVIDIA compiler
• nvprof - command line profiler
• nvvp – Visual profiler
• cuda-memcheck – Memory bugs
• Nsight – Visual Studio, Eclipse
• Allinea DDT
Libraries
• CUBLAS – CUDA accelerated Basic Linear Algebra
• CUFFT – Fast Fourier Transform (1D, 2D, 3D)
• Thrust – C++ template library (similar to C++ STL)
• CULA – Dense, Sparse Linear Algebra
• OpenCV – Computer Vision, Image processing
• AccelerEyes ArrayFire
• MATLAB, LabVIEW, Mathematica, Python
• ABACUS, AMBER, ANSYS, GROMACS, LAMMPS, NAMD, …
Online Resources
• http://developer.nvidia.com/cuda-training
• Coursera - Heterogeneous computing
• Udacity - CS344 Intro To Parallel Programming
• GPU computing Webinars

• CUDA Documentation
• Books
– CUDA by Example
– Programming Massively Parallel Processors: A Hands-on Approach
– GPU GEMS
Questions?
