



## Agenda

- Parallelism Review
- GPU Architecture Review
- CUDA







## Matrix Multiply Reminder

- Vectors
- Dot products
- Row major or column major?
- Dot product per output element
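
As a minimal sketch of the points above, the following plain C shows how a row-major or column-major layout maps a (row, column) pair to a flat array index, and how one output element is a single dot product. The helper names `row_major_index`, `col_major_index`, and `output_element` are hypothetical and chosen only for illustration; square `width`-by-`width` matrices are assumed.

```c
#include <stddef.h>

/* Flat index of element (row, col) when rows are stored contiguously. */
size_t row_major_index(size_t row, size_t col, size_t width) {
    return row * width + col;
}

/* Flat index of the same element when columns are stored contiguously. */
size_t col_major_index(size_t row, size_t col, size_t width) {
    return col * width + row;
}

/* One element of P = M * N: the dot product of row `row` of M with
   column `col` of N. Both inputs are assumed to be row-major. */
float output_element(const float *M, const float *N,
                     size_t row, size_t col, size_t width) {
    float sum = 0.0f;
    for (size_t k = 0; k < width; ++k)
        sum += M[row_major_index(row, k, width)] *
               N[row_major_index(k, col, width)];
    return sum;
}
```

Each output element depends only on one row of M and one column of N, so all `width * width` dot products can be computed independently of one another.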

## GPU Architecture Review

GPUs are:

- Parallel
- Multithreaded
- Many-core

GPUs have:

- Tremendous computational horsepower
- High memory bandwidth

## GPU Architecture Review

- GPUs are specialized for
  - Compute-intensive, highly parallel computation
  - Graphics!
- Transistors are devoted to:
  - Processing
  - Not:
    - Data caching
    - Flow control



## Let's program this thing!

## GPU Computing History

- 2001/2002: researchers see the GPU as a data-parallel coprocessor
  - The GPGPU field is born
- 2007: NVIDIA releases CUDA
  - CUDA: Compute Unified Device Architecture
  - GPGPU shifts to GPU Computing
- 2008: Khronos releases the OpenCL specification

## CUDA Abstractions

- A hierarchy of thread groups
- Shared memories
- Barrier synchronization

## CUDA Terminology

- Host: typically the CPU
  - Code written in ANSI C
- Device: typically the GPU (data-parallel)
  - Code written in extended ANSI C
- Host and device have separate memories
- CUDA Program
  - Contains both host and device code

## CUDA Terminology

- Kernel: a data-parallel function
  - Invoking a kernel creates lightweight threads on the device
    - Threads are generated and scheduled by hardware
- Similar to a shader in OpenGL?



























- Thread blocks execute independently
  - In any order: in parallel or in series
  - Scheduled in any order by any number of cores
  - Allows code to scale with core count
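
A rough way to see why block independence matters is to emulate it on the CPU. The sketch below is plain C, not CUDA: the hypothetical `scale_element` stands in for a kernel body, and the hypothetical `run_blocks` plays the role of a scheduler that visits blocks in forward or reverse order. The index formula `block * block_size + thread` is an assumption about how a one-dimensional launch would map threads to elements.

```c
#include <stddef.h>

/* Stand-in for a kernel body: each (block, thread) pair doubles one element. */
void scale_element(float *data, size_t n,
                   size_t block, size_t thread, size_t block_size) {
    size_t i = block * block_size + thread;  /* global element index */
    if (i < n)                               /* guard the last, partial block */
        data[i] *= 2.0f;
}

/* Run every block exactly once, in forward or reverse block order. */
void run_blocks(float *data, size_t n, size_t block_size, int reverse) {
    size_t num_blocks = (n + block_size - 1) / block_size;  /* round up */
    for (size_t b = 0; b < num_blocks; ++b) {
        size_t block = reverse ? num_blocks - 1 - b : b;
        for (size_t t = 0; t < block_size; ++t)
            scale_element(data, n, block, t, block_size);
    }
}
```

Because no block reads or writes another block's elements, both orders produce the same array. That property is what lets the blocks be spread over however many cores are available.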















## CUDA Memory Transfers

<pre>
float *Md;
int size = Width * Width * sizeof(float); // Size in bytes
cudaMalloc((void**)&amp;Md, size);
cudaFree(Md);
</pre>

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

## CUDA Memory Transfers

- `cudaMemcpy()`
  - Memory transfer
    - Host to host
    - Host to device
    - Device to host
    - Device to device
- Similar to buffer objects in OpenGL











## CUDA Memory Transfers

<pre>
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice); // destination (device), source (host)
cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost); // destination (host), source (device)
</pre>

Figure: the host and the device, with the device's global memory.

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf







## Matrix Multiply: CPU Implementation

<pre>
void MatrixMulOnHost(float* M, float* N, float* P, int width)
{
  for (int i = 0; i &lt; width; ++i)
    for (int j = 0; j &lt; width; ++j)
    {
      float sum = 0;
      for (int k = 0; k &lt; width; ++k)
      {
        float a = M[i * width + k];  // element (i, k) of M, row-major
        float b = N[k * width + j];  // element (k, j) of N, row-major
        sum += a * b;
      }
      P[i * width + j] = sum;
    }
}
</pre>

Code from: http://courses.engr.illinois.edu/ece498/al/lecture3/k20uda%20threads%20spring%202010.ppt

## Matrix Multiply: CUDA Skeleton

<pre>
int main(void) {
  // 1. Allocate and initialize the matrices M, N, P
  //    I/O to read the input matrices M and N
  // 2. M * N on the device
  MatrixMulOnDevice(M, N, P, width);
  // 3. I/O to write the output matrix P
  //    Free matrices M, N, P
  return 0;
}
</pre>

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
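
Step 2 of the skeleton hides a fixed sequence: allocate device buffers, copy the inputs over, compute, copy the result back, free. The sketch below emulates that sequence in plain C so it can be run without a GPU. It is not CUDA: `malloc`, `memcpy`, and `free` stand in for `cudaMalloc`, `cudaMemcpy`, and `cudaFree`, and the names `multiply` and `MatrixMulEmulated` are hypothetical. A real device version would launch a kernel where `multiply` is called here.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the device kernel: plain loops over row-major data. */
void multiply(const float *Md, const float *Nd, float *Pd, int width) {
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += Md[i * width + k] * Nd[k * width + j];
            Pd[i * width + j] = sum;
        }
}

/* Emulated host-side flow. Returns 0 on success, -1 if allocation fails. */
int MatrixMulEmulated(const float *M, const float *N, float *P, int width) {
    size_t size = (size_t)width * width * sizeof(float);  /* size in bytes */

    /* 1. Allocate the "device" buffers (cudaMalloc in CUDA). */
    float *Md = malloc(size), *Nd = malloc(size), *Pd = malloc(size);
    if (!Md || !Nd || !Pd) { free(Md); free(Nd); free(Pd); return -1; }

    /* 2. Copy inputs, host to "device" (cudaMemcpyHostToDevice). */
    memcpy(Md, M, size);
    memcpy(Nd, N, size);

    /* 3. Compute on the "device" buffers only. */
    multiply(Md, Nd, Pd, width);

    /* 4. Copy the result, "device" to host (cudaMemcpyDeviceToHost). */
    memcpy(P, Pd, size);

    /* 5. Release the "device" buffers (cudaFree). */
    free(Md); free(Nd); free(Pd);
    return 0;
}
```

Keeping the compute step restricted to the `Md`, `Nd`, and `Pd` copies mirrors the fact that host and device have separate memories: the device never touches `M`, `N`, or `P` directly.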

































