Hello World!
• GPU code – kernel
• __global__ indicates it runs on the device
• Triple angle brackets mark a call from host code to device code – a "kernel launch"
• Returns 'void'

__global__ void mykernel(void) {
    cuPrintf("Hello World!\n");
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("CPU Hello World!\n");
    return 0;
}
Hello World!
• nvcc helloworld.cu
• ./a.out
Working with codes
• Open Terminal
• ssh -X user#@10.21.1.166 (user 1-25)
• ssh -X guest@10.6.5.254 (user 26-50)
• ssh -X guest@192.168.1.211 (user 26-50)
• cd codes/helloworld/
• make
• ./helloworld
• gedit &
Hello World! Parallel
• Change to mykernel<<<N,1>>>();
• Launches N blocks
• CPU calls kernel & continues its work

__global__ void mykernel(void) {
    cuPrintf("Hello World!\n");
}

int main(void) {
    int N = 100;
    mykernel<<<N,1>>>();
    printf("CPU Hello World!\n");
    return 0;
}

Compile:
$ cd helloworld_blocks
$ make
$ ./helloworld_blocks
Processing Flow
[Figure: processing flow – data and results move between CPU and GPU memory over the PCI bus]
// CPU version: a loop visits every element
void vectoradd(int *a, int *b, int *c) {
    for (int i = 0; i < 100; i++)
        c[i] = a[i] + b[i];
}

// GPU version: one block per element, the loop disappears
__global__ void vectoradd(int *a, int *b, int *c) {
    int i = blockIdx.x;
    c[i] = a[i] + b[i];
}
Vector Addition
__global__ void vectoradd(int *a, int *b, int *c) {
    int i = blockIdx.x;
    c[i] = a[i] + b[i];
}

int main(void) {
    int N = 100;
    int host_a[100], host_b[100], host_c[100];
    int *dev_a, *dev_b, *dev_c;

    // Memory Allocation on the device
    cudaMalloc((void**)&dev_a, sizeof(int)*100);
    cudaMalloc((void**)&dev_b, sizeof(int)*100);
    cudaMalloc((void**)&dev_c, sizeof(int)*100);

    // Memory Copy: host to device
    cudaMemcpy(dev_a, host_a, sizeof(int)*100, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, host_b, sizeof(int)*100, cudaMemcpyHostToDevice);

    // Kernel launch: one block per element
    vectoradd<<<N,1>>>(dev_a, dev_b, dev_c);

    // Memory Copy: device to host
    cudaMemcpy(host_c, dev_c, sizeof(int)*100, cudaMemcpyDeviceToHost);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}

Compile:
$ cd vectoradd/
$ make
$ ./vectoradd
Threads
• A block can have many threads.
• For vector addition, the kernel launch would be
vectoradd<<<1,N>>>(da,db,dc);
• Maximum thread dimensions (3-dimensional): (1024, 1024, 64)
• threadIdx.x
• blockDim.x

__global__ void vectoradd(int *a, int *b, int *c) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}
Compile:
$ cd vectoradd_threads/
$ make
$ ./vectoradd_threads
Threads
• Threads form a 3D mesh inside a 3D mesh of blocks
• Why threads?
– Communicate
– Synchronize
• Blocks can't communicate or synchronize with each other
Built-in Variables
• threadIdx.x, threadIdx.y, threadIdx.z
• blockIdx.x, blockIdx.y, blockIdx.z
• blockDim.x, y, z (1024, 1024, 64) – number of threads per block
• gridDim.x, y, z (65535, 65535, 1024) – number of blocks in a kernel call (the blocks of a launch form a grid)
Index Calculation
• Using blocks and threads simultaneously.
• i=threadIdx.x + blockDim.x*blockIdx.x;
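A minimal sketch (hypothetical, mirroring what the provided vectoradd_full presumably does) of a kernel that uses the combined index:

__global__ void vectoradd(int *a, int *b, int *c) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;   // global index built from block and thread
    c[i] = a[i] + b[i];
}

// Launch with, e.g., 10 blocks of 10 threads to cover 100 elements:
// vectoradd<<<10,10>>>(dev_a, dev_b, dev_c);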
Compile:
$ cd vectoradd_full/
$ make
$ ./vectoradd_full
For Very Large N
• Very large N (N > 10^6)
[Figure: the elements are split across many blocks (BLOCK 2, BLOCK 3, BLOCK 4, ...)]
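A minimal sketch (assumptions: 256 threads per block and an element count n passed to the kernel) of covering a very large array while guarding against out-of-range indices:

__global__ void vectoradd(int *a, int *b, int *c, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)                              // the last block may have surplus threads
        c[i] = a[i] + b[i];
}

// Host side (inside main, after the cudaMalloc/cudaMemcpy calls shown earlier):
// round the block count up so every element is covered
int n = 1 << 20;
int threads = 256;
int blocks = (n + threads - 1) / threads;
vectoradd<<<blocks, threads>>>(dev_a, dev_b, dev_c, n);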
Block Scheduling
• All threads in a block execute on a single SM
• No guarantee on the order of execution
• Hardware schedules blocks onto the available SMs
[Figure: animation – with 3 SMs available, BLOCK 1 to BLOCK 4 are assigned to SMs as they become free]
1-D Stencil
• Compute a(i)+a(i+1)+a(i+2)
__global__ void stencil(int *a, int *b) {
    int i = threadIdx.x;
    b[i] = a[i] + a[i+1] + a[i+2];
}

Compile:
$ cd 1dstencil/
$ make
$ ./1dstencil

[Figure: elements 0 1 2 3 4 5 6 7 – threadIdx.x=0 reads a[0..2], threadIdx.x=1 reads a[1..3]]
Global Memory
• Until now we have been using global memory for our computations
• Very slow to access
• Allocated using cudaMalloc(..)
1-D Stencil Revisited
• Compute a(i)+a(i+1)+a(i+2)
__global__ void stencil(int *a, int *b) {
int i=threadIdx.x;
b[i]=a[i]+a[i+1]+a[i+2];
}
• Data could be shared among threads
[Figure: elements 0 1 2 3 4 5 6 7 – threadIdx.x=0 and threadIdx.x=1 read overlapping elements]
• __syncthreads(); – barrier: every thread in the block waits here until all threads reach it
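A minimal sketch of the shared-memory version (assumptions: a single block of BLOCK_SIZE threads and an input with at least BLOCK_SIZE+2 elements so a[i+2] is always valid); the provided code may differ:

#define BLOCK_SIZE 256

__global__ void stencil_shared(int *a, int *b) {
    __shared__ int s[BLOCK_SIZE + 2];          // staging area shared by the block
    int i = threadIdx.x;

    s[i] = a[i];                               // each thread loads one element
    if (i < 2)
        s[BLOCK_SIZE + i] = a[BLOCK_SIZE + i]; // two threads load the halo elements

    __syncthreads();                           // wait until every load is visible to the block

    b[i] = s[i] + s[i+1] + s[i+2];             // compute from fast shared memory
}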
Reduction
• Addition of N numbers
• Other operations: +, *, AND, OR, XOR, maximum, minimum, etc.
void reduce(int *a, int *result) {
    *result = 0;
    for (int i = 0; i < 100; i++)
        *result = *result + a[i];
}
[Figure: tree reduction – pairwise partial sums 1 5 9 13, then 6 22]
Divergence
if (threadIdx.x % 2 == 0)
    a[threadIdx.x] += 1;
else
    a[threadIdx.x] += 2;

Warp 1: threads 0, 2, 4, 6, 8, ... take the if branch and threads 1, 3, 5, 7, 9, ... take the else branch, so a single warp executes both branches.
Divergence
• All threads in a warp execute the same instruction
• Each warp takes one time step

if (threadIdx.x < 32)
    a[threadIdx.x] += 1;
else
    a[threadIdx.x] += 2;
Warp 1 (threads 0-31) takes the if branch; Warp 2 (threads 32-63) takes the else branch – no warp executes both.
Reduction revisited
• Divergence at all strides
[Figure: stride-4 step on elements 0 1 2 3 4 5 6 7 produces partial sums 4 6 8 10]
• Occupancy
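A minimal sketch of a tree reduction within a single block (assumptions: blockDim.x is a power of two, at most 256, and the input has at least blockDim.x elements); this is the common sequential-addressing variant, not necessarily the one in the original code samples:

__global__ void reduce(int *a, int *result) {
    __shared__ int s[256];
    int i = threadIdx.x;
    s[i] = a[i];
    __syncthreads();

    // At each step the first 'stride' threads add a partner element;
    // the active threads stay contiguous, which limits divergence.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (i < stride)
            s[i] += s[i + stride];
        __syncthreads();
    }
    if (i == 0)
        *result = s[0];
}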
Asynchronous
• Kernel launches are Asynchronous
• cudaMemcpy, cudaMalloc – Synchronous
• cudaMemcpyAsync() - Asynchronous, does not block the CPU
• cudaDeviceSynchronize() - Blocks the CPU until all preceding
CUDA calls have completed
• Asynchronous calls let the CPU do useful work while the GPU is busy.
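A minimal sketch of the overlap (hypothetical empty kernel; the printf stands in for real CPU work):

#include <stdio.h>

__global__ void mykernel(void) { /* ... */ }

int main(void) {
    mykernel<<<100,1>>>();                 // launch is asynchronous: control returns immediately
    printf("CPU works while the GPU runs the kernel\n");
    cudaDeviceSynchronize();               // block until all preceding CUDA calls have completed
    return 0;
}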
Handling Errors
• All CUDA API calls return an error code (cudaError_t)
– Error in the API call itself
– Error in an earlier asynchronous operation (e.g. kernel)
• Get the error code for the last error:
– cudaError_t cudaGetLastError(void)
• Get a string to describe the error:
– const char *cudaGetErrorString(cudaError_t)
• printf("%s\n", cudaGetErrorString(cudaGetLastError()));
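A minimal sketch of a checking macro (a widespread convention, not part of the CUDA API; the name CUDA_CHECK is arbitrary):

#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            printf("CUDA error %s at %s:%d\n",                        \
                   cudaGetErrorString(err), __FILE__, __LINE__);      \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc((void**)&dev_a, sizeof(int)*100));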
Device Management
• Application can query and select GPUs
– cudaGetDeviceCount(int *count)
– cudaSetDevice(int device)
– cudaGetDevice(int *device)
– cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
• Multiple host threads can share a device
• A single host thread can manage multiple devices
– cudaSetDevice(i) to select current device
– cudaMemcpy(…) for peer-to-peer copies
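A minimal sketch of device enumeration using the calls listed above:

#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s\n", i, prop.name);
    }
    cudaSetDevice(0);        // make device 0 current for subsequent CUDA calls
    return 0;
}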
Summary
• Write and launch CUDA C/C++ kernels
– __global__, <<<>>>, blockIdx, threadIdx, blockDim
• Manage GPU memory
– cudaMalloc(), cudaMemcpy(), cudaFree()
• Manage communication and synchronization
– __shared__, __syncthreads()
– cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize()
• Resource limits
– Registers, Shared memory, blocks/SM, threads/SM
Advanced concepts (not covered)
• Memory Coalescing
• Constant memory
• Streams
• Atomics
• Shared memory conflicts
• Texture memory
Tools
• nvcc – NVIDIA compiler
• nvprof - command line profiler
• nvvp – Visual profiler
• cuda-memcheck – Memory bugs
• Nsight – Visual Studio, Eclipse
• Allinea DDT
Libraries
• CUBLAS – CUDA accelerated Basic Linear Algebra
• CUFFT – Fast Fourier Transform (1D, 2D, 3D)
• Thrust – C++ template library (similar to C++ STL)
• CULA – Dense and sparse linear algebra
• OpenCV – Computer Vision, Image processing
• AccelerEyes ArrayFire
• MATLAB, LabVIEW, Mathematica, Python
• ABACUS, AMBER, ANSYS, GROMACS, LAMMPS, NAMD, …
Online Resources
• http://developer.nvidia.com/cuda-training
• Coursera - Heterogeneous computing
• Udacity - CS344 Intro To Parallel Programming
• GPU computing Webinars
• CUDA Documentation
• Books
– CUDA by Example
– Programming Massively Parallel Processors: A Hands-on Approach
– GPU GEMS
Questions?