Before: solving large sparse linear systems with a parallel GMRES solver on GPU clusters, limited to sparse banded matrices
After: all types of sparse matrices (including large bandwidths!)
Parallel sparse matrix-vector products on a GPU cluster:
Problem: overheads in CPU/CPU and GPU/CPU communications
Solution: data compression of the shared vectors
GPU clusters
Experimental results
GPU clusters
[Figure: GPU architecture — an array of streaming processors (SP) sharing a global memory.]
GPGPU programming
CUDA (Compute Unified Device Architecture)
[Figure: GPU cluster — each GPU accesses its own device memory at 102 GB/s and is linked to its host CPU over PCIe at 8 GB/s; nodes are interconnected by 20 GB/s InfiniBand.]
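The bandwidth gap above is what makes communication the bottleneck. A back-of-the-envelope sketch (illustrative only — latency and overlap are ignored, and the function name is mine) using the quoted bandwidths:

```python
# Rough transfer times for a vector of n doubles at the bandwidths
# quoted above (device memory 102 GB/s, PCIe 8 GB/s, InfiniBand 20 GB/s).
def transfer_time_ms(n_doubles, bandwidth_gb_s):
    bytes_total = n_doubles * 8                # 8 bytes per double
    return bytes_total / (bandwidth_gb_s * 1e9) * 1e3

n = 10_000_000                                 # 10 M vector entries
t_device = transfer_time_ms(n, 102)            # GPU <-> device memory
t_pcie   = transfer_time_ms(n, 8)              # GPU <-> CPU over PCIe
t_ib     = transfer_time_ms(n, 20)             # CPU <-> CPU over InfiniBand
```

PCIe is roughly an order of magnitude slower than device memory, so shrinking the vectors actually transferred between CPU and GPU (and between CPUs) pays off directly.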
GMRES method
Generalized Minimal RESidual solver
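As a reference point for the method itself, here is a minimal serial sketch of full GMRES (no restarting, no preconditioning) — Arnoldi with modified Gram-Schmidt plus Givens rotations for the small least-squares problem. The function names and pure-Python style are mine; the talk's contribution is a parallel multi-GPU version of these kernels.

```python
import math

def gmres(matvec, b, tol=1e-10, max_iter=50):
    """Full GMRES: build an orthonormal Krylov basis with Arnoldi and
    solve the small least-squares problem with Givens rotations."""
    n = len(b)
    x0 = [0.0] * n
    r0 = [bi - ri for bi, ri in zip(b, matvec(x0))]
    beta = math.sqrt(sum(v * v for v in r0))
    if beta < tol:
        return x0
    V = [[v / beta for v in r0]]       # Krylov basis vectors
    H = []                             # Hessenberg columns (R after rotations)
    cs, sn, g = [], [], [beta]         # Givens rotations and rotated rhs
    m = max_iter
    for j in range(max_iter):
        w = matvec(V[j])
        h = []
        for i in range(j + 1):         # orthogonalise against the basis
            hij = sum(wi * vi for wi, vi in zip(w, V[i]))
            h.append(hij)
            w = [wi - hij * vi for wi, vi in zip(w, V[i])]
        hnext = math.sqrt(sum(wi * wi for wi in w))
        h.append(hnext)
        for i in range(j):             # apply previous rotations to column j
            h[i], h[i + 1] = (cs[i] * h[i] + sn[i] * h[i + 1],
                              -sn[i] * h[i] + cs[i] * h[i + 1])
        r = math.hypot(h[j], h[j + 1])
        cs.append(h[j] / r)
        sn.append(h[j + 1] / r)
        h[j], h[j + 1] = r, 0.0
        H.append(h)
        g.append(-sn[j] * g[j])
        g[j] = cs[j] * g[j]
        if abs(g[j + 1]) < tol or hnext < tol:   # converged or breakdown
            m = j + 1
            break
        V.append([wi / hnext for wi in w])
    y = [0.0] * m                      # back-substitution: R y = g
    for i in reversed(range(m)):
        s = g[i] - sum(H[k][i] * y[k] for k in range(i + 1, m))
        y[i] = s / H[i][i]
    return [sum(y[k] * V[k][i] for k in range(m)) for i in range(n)]
```

The only operations touching the full-size vectors are `matvec`, dot products, and axpy updates — exactly the kernels that get distributed across the GPUs, with the SpMV requiring the shared-vector exchange discussed next.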
[Figure: distribution of vector x over four processes (Proc 0–3). Each process holds a local part of x plus left-shared and right-shared parts owned by its neighbours, which it must receive before computing its rows of the SpMV product.]
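This left/local/right split can be computed directly from the column indices each process's row band touches. A minimal sketch, assuming a row-band distribution (the function name is mine):

```python
def shared_indices(col_indices, lo, hi):
    """Classify the columns touched by one process's rows [lo, hi):
    entries of x it owns (local) vs. entries owned by left/right
    neighbours (shared), which must be received before the SpMV."""
    left  = sorted({j for j in col_indices if j < lo})
    local = sorted({j for j in col_indices if lo <= j < hi})
    right = sorted({j for j in col_indices if j >= hi})
    return left, local, right
```

The `left` and `right` index sets are exactly what determines the communication volume — and, for the compressed version, which entries to pack.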
SpMV product
[Figure: data compression for the SpMV product. For the CPU/CPU communication (20 GB/s), node α compresses the shared part of x on CPU α, keeping only the needed entries (indices 0, 2, 3, 6, 10, 11 of 12 in the example). For the GPU/CPU communication (PCIe, 8 GB/s), the global vector x (indices 0–11) is assembled and transferred to GPU α.]
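The compression itself is an index-based gather on the sender and a scatter on the receiver. A minimal sketch (function names are mine):

```python
def compress(x_local, needed):
    """Pack only the entries of x that neighbouring processes actually
    use, so the CPU/CPU message (and the later CPU/GPU copy) shrinks."""
    return [x_local[i] for i in needed]

def decompress(buf, needed, x_global):
    """Scatter a received compressed buffer back into the global
    vector at its original indices."""
    for value, i in zip(buf, needed):
        x_global[i] = value
```

With the figure's example, only 6 of the 12 entries (indices 0, 2, 3, 6, 10, 11) travel over the network and the PCIe bus instead of the full vector.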
Experimental results
GPU cluster:
InfiniBand
Six Quad-Core Xeon E5530 CPUs
Two Tesla C1060 GPUs per CPU
→ cluster of 12 GPUs
Performance measure: speed-ups T_CPU / T_GPU of the cluster of 12 GPUs compared to:
Cluster of 12 CPU cores
Cluster of 24 CPU cores
[Figure: row-band partitioning of a real sparse matrix over Proc 0–3; each process's band has a left_part and a right_part of off-diagonal nonzeros that require shared entries of x from its neighbours.]
Conclusion:
Both versions of GMRES are faster on GPU clusters
GMRES with data compression is more efficient
Data compression minimizes the GPU/CPU and CPU/CPU communication overheads
Future work:
Other structures: matrices with large bandwidths
Data partitioning: minimizing the communication volume