HPC
CUDA
Streaming Multiprocessor (SM)
Runs without interruption; good at streaming data inputs
- Hardware perspective
- Contains multiple SPs
- Contains a few SFUs (Special Function Units)
Scalar Processor (SP)
The core of the GPU
- Can only perform very simple arithmetic operations
Memory contents
- L1 Caches (Instruction, Data)
- Shared memory: threads in a block share data through it
Warp Scheduler
Each SM has at least one warp scheduler
- Chooses an eligible warp and executes all the threads in it
- If any thread in the warp stalls, the warp is deactivated
- If there are no eligible warps, the SM idles
- Context switching between warps is very fast
- All threads in a warp execute at the same time (in lockstep); see the divergence sketch below
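Because a warp runs in lockstep, threads in the same warp that take different branches are serialized. A minimal sketch of such divergence (the kernel name is illustrative):

    __global__ void divergent(int *out) {
        // Threads 0-15 and threads 16-31 of each warp take different
        // branches; the warp executes both paths one after the other.
        if (threadIdx.x % 32 < 16)
            out[threadIdx.x] = 1;
        else
            out[threadIdx.x] = 2;
    }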
Grid
Set of equally sized blocks
- Each block in the grid is mapped to an SM
Block (threadblock)
Set of GPU threads (a CUDA abstraction)
- Organized as a 3D structure of threads (see the dim3 sketch below)
- A block contains a set of warps
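A block's 3D shape, and the grid's shape, are set at launch time with dim3. A minimal sketch (the kernel name and dimensions are illustrative):

    dim3 block(8, 8, 4);            // 256 threads per block, arranged in 3D
    dim3 grid(16, 16);              // 256 blocks in the grid
    myKernel<<<grid, block>>>();    // every block runs the same kernel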
Warps
Set of threads; its size matches the number of cores in an SM
- The unit of execution on the GPU
- The minimum execution unit available to the GPU
- Warps, not individual threads, are scheduled for execution
- A multiple of the warp size is the best number of threads per block
Resident Block
(CUDA specific)
- The compiler plans/allocates the block's on-chip cache resources beforehand
Threads
Each thread in a block has a unique index (x, y, z); see the indexing sketch below
- The maximum number of threads per block depends on the architecture (e.g., 1024 on Fermi)
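Inside a kernel, the per-block index combines with the block's position to form a global index. A minimal 1D sketch (the kernel name and arguments are illustrative):

    __global__ void scale(float *data, int n) {
        // Global index = block offset + index within the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)            // guard: the grid may cover more than n threads
            data[i] *= 2.0f;
    }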
Memories
Shared Memory
- On-chip
- Declared with the __shared__ qualifier before the variable declaration (sketched below)
- A block's cached data is stored here when the block is context-switched
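A minimal sketch of a block-level sum that communicates through shared memory (the kernel name is illustrative; assumes blockDim.x is 256, a power of two):

    __global__ void blockSum(const float *in, float *out) {
        __shared__ float buf[256];          // one slot per thread in the block
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                    // make the loads visible to all threads
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) buf[tid] += buf[tid + s];  // tree reduction
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = buf[0];     // one result per block
    }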
Registers
Assigned to each thread
- On-chip
- Distributed evenly among the blocks
- We can assign more 'operations' to a thread to make better use of registers and caches (see the grid-stride sketch below)
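One common way to give each thread more work is a grid-stride loop, where a single thread processes several elements while intermediate values stay in registers (the kernel name is illustrative):

    __global__ void scaleMany(float *data, int n) {
        // Each thread strides over the array instead of handling one element
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)
            data[i] *= 2.0f;
    }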
Local Memory
- It is virtual memory
- Data that doesn't fit into registers is stored here
Constant Memory
- Uses L1 cache, so it is very limited
- All threads can access it
- Declared with the __constant__ qualifier before the variable declaration (sketched below)
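A minimal sketch of declaring, filling, and reading constant memory (the names are illustrative; cudaMemcpyToSymbol is the standard host-side copy call):

    __constant__ float coeffs[16];       // resides in constant memory

    void setCoeffs(const float *host) {
        // Copy host data into the constant-memory symbol
        cudaMemcpyToSymbol(coeffs, host, 16 * sizeof(float));
    }

    __global__ void apply(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= coeffs[i % 16];  // all threads read coeffs
    }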
Global Memory
This is where the host allocates and copies memory
- Initialized by the host
- Visible to all threads
- cudaMalloc() / cudaFree() / cudaMemcpy(), as sketched below
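A minimal sketch of the usual host-side allocate/copy/free cycle (sizes and names are illustrative):

    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);    // host buffer
    float *d;
    cudaMalloc(&d, bytes);                // allocate device global memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // host -> device
    // ... launch kernels that read and write d ...
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d);                          // release global memory
    free(h);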
Device
The GPU.
- Has its own DRAM
- Runs many threads in parallel
GPU threads
- Much lighter-weight than CPU threads
Kernel
The function that runs on the Device
- Launched over a Grid; every threadblock runs the same kernel (see the launch sketch below)
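A minimal sketch of defining a kernel and launching it over a grid (names and sizes are illustrative; d_data is assumed to be a device pointer from cudaMalloc):

    __global__ void addOne(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    // Host side: every threadblock in the grid runs the same kernel
    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    addOne<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();                   // wait for the kernel to finish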