













# Parallel Reduction Similar to brackets for a basketball tournament log(n) passes for n elements How would you implement this in CUDA?































| 2°                                                                 |
|--------------------------------------------------------------------|
| sharedfloat partialSum[]                                           |
| // load into shared memory                                         |
| unsigned int t = threadIdx.x;                                      |
| <pre>for(unsigned int stride = blockDim.x / 2;</pre>               |
| stride > 0;                                                        |
| stride /= 2)                                                       |
|                                                                    |
| syncthreads();                                                     |
| if (t < stride)                                                    |
| partialSum[t] +=                                                   |
| partialSum[t + stride];                                            |
| }                                                                  |
| Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html |















### Warp Partitioning: how threads from a block are divided into warps Knowledge of warp partitioning can be used to: Minimize divergent branches Retire warps early

























































### Memory Coalescing

- Memory coalescing rearrange access patterns to improve performance
- Useful today but will be less useful with large on-chip caches

### Memory Coalescing

- The GPU coalesce consecutive reads in a half-warp into a single read
- Strategy: read global memory in a coalesce-able fashion into shared memory
  - Then access shared memory randomly at maximum bandwidth

See Appendix G in the NVIDIA CUDA C Programming Guide for coalescing alignment requirement

Ignoring bank conflicts – next lecture

SM Resource Partitioning
Recall a SM dynamically partitions
resources:

Thread block slots
Registers
Shared memory
SM









### SM Resource Partitioning

- Performance Cliff. Increasing resource usage leads to a dramatic reduction in parallelism
  - For example, increasing the number of registers, unless doing so hides latency of global memory access

### SM Resource Partitioning

### CUDA Occupancy Calculator

http://developer.download.nvidia.com/comput e/cuda/CUDA\_Occupancy\_calculator.xls

### **Data Prefetching**

 Independent instructions between a global memory read and its use can hide memory latency

float m = Md[i];
float f = a \* b + c \* d;
float f2 = m \* f;



## Data Prefetching Independent instructions between a global memory read and its use can hide memory latency float m = Md[i]; float f = a \* b + c \* d; float f2 = m \* f; Execute instructions that are not dependent on memory read



### Data Prefetching

 Prefetching data from global memory can effectively increase the number of independent instructions between global memory read and use









