HPC
CUDA
Streaming Multiprocessor (SM)
Runs without interruption; good at streaming data inputs
- Hardware perspective
- Contains multiple SPs
- Contains a few SFUs (Special Function Units)
Scalar Processor (SP)
The core of the GPU
- Can only perform very simple arithmetic operations
Memory contents
- L1 Caches (Instruction, Data)
- Shared memory: threads in a block share data through it
Warp Scheduler
Each SM has at least one warp scheduler
- Chooses an eligible warp and executes all the threads in it
- If any thread in the warp stalls, the warp is deactivated
- If there are no eligible warps, the SM idles
- Context switching between warps is very fast
- All threads in a warp execute at the same time (in lockstep); see the divergence sketch below
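Because a warp runs in lockstep, threads in the same warp that take different branches are serialized. A minimal sketch of such divergence (the kernel name is illustrative):

    __global__ void divergent(int *out) {
        // Threads 0-15 and threads 16-31 of each warp take different
        // branches; the warp executes both paths one after the other.
        if (threadIdx.x % 32 < 16)
            out[threadIdx.x] = 1;
        else
            out[threadIdx.x] = 2;
    }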
Grid
Set of equally sized blocks
- Each block in the grid is mapped to an SM
Block (threadblock)
Set of GPU threads (a CUDA abstraction)
- Organized as a 3D structure of threads (see the dim3 sketch below)
- A block contains a set of warps
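A block's 3D shape, and the grid's shape, are set at launch time with dim3. A minimal sketch (the kernel name and dimensions are illustrative):

    dim3 block(8, 8, 4);            // 256 threads per block, arranged in 3D
    dim3 grid(16, 16);              // 256 blocks in the grid
    myKernel<<<grid, block>>>();    // every block runs the same kernel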
Warps
Set of threads; its size matches the number of cores in an SM
- The unit of execution on the GPU
- The minimum execution unit available to the GPU
- Warps, not individual threads, are scheduled for execution
- A multiple of the warp size is the best number of threads per block
Resident Block
(CUDA specific)
- The compiler plans/allocates the block's on-chip cache resources beforehand
Threads
Each thread in a block has a unique index (x, y, z); see the indexing sketch below
- The maximum number of threads per block depends on the architecture (e.g., 1024 on Fermi)
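Inside a kernel, the per-block index combines with the block's position to form a global index. A minimal 1D sketch (the kernel name and arguments are illustrative):

    __global__ void scale(float *data, int n) {
        // Global index = block offset + index within the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)            // guard: the grid may cover more than n threads
            data[i] *= 2.0f;
    }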
Memories
Shared Memory
- On-chip
- Declared with the __shared__ qualifier before the variable declaration (sketched below)
- A block's cached data is stored here when the block is context-switched
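A minimal sketch of a block-level sum that communicates through shared memory (the kernel name is illustrative; assumes blockDim.x is 256, a power of two):

    __global__ void blockSum(const float *in, float *out) {
        __shared__ float buf[256];          // one slot per thread in the block
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                    // make the loads visible to all threads
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) buf[tid] += buf[tid + s];  // tree reduction
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = buf[0];     // one result per block
    }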
Registers
Assigned to each thread
- On-chip
- Distributed evenly among the blocks
- We can assign more 'operations' to a thread to make better use of registers and caches (see the grid-stride sketch below)
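One common way to give each thread more work is a grid-stride loop, where a single thread processes several elements while intermediate values stay in registers (the kernel name is illustrative):

    __global__ void scaleMany(float *data, int n) {
        // Each thread strides over the array instead of handling one element
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)
            data[i] *= 2.0f;
    }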
Local Memory
- It is virtual memory
- Data that doesn't fit into registers is stored here
Constant Memory
- Uses L1 cache, so it is very limited
- All threads can access it
- Declared with the __constant__ qualifier before the variable declaration (sketched below)
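A minimal sketch of declaring, filling, and reading constant memory (the names are illustrative; cudaMemcpyToSymbol is the standard host-side copy call):

    __constant__ float coeffs[16];       // resides in constant memory

    void setCoeffs(const float *host) {
        // Copy host data into the constant-memory symbol
        cudaMemcpyToSymbol(coeffs, host, 16 * sizeof(float));
    }

    __global__ void apply(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= coeffs[i % 16];  // all threads read coeffs
    }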
Global Memory
This is where the host allocates and copies memory
- Initialized by the host
- Visible to all threads
- cudaMalloc() / cudaFree() / cudaMemcpy(), as sketched below
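A minimal sketch of the usual host-side allocate/copy/free cycle (sizes and names are illustrative):

    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);    // host buffer
    float *d;
    cudaMalloc(&d, bytes);                // allocate device global memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // host -> device
    // ... launch kernels that read and write d ...
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d);                          // release global memory
    free(h);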
Device
The GPU.
- Has its own DRAM
- Runs many threads in parallel
GPU threads
- Much lighter-weight than CPU threads
Kernel
The function that runs on the Device
- Launched over a Grid; every threadblock runs the same kernel (see the launch sketch below)
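A minimal sketch of defining a kernel and launching it over a grid (names and sizes are illustrative; d_data is assumed to be a device pointer from cudaMalloc):

    __global__ void addOne(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    // Host side: every threadblock in the grid runs the same kernel
    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    addOne<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();                   // wait for the kernel to finish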