Throughput-Oriented GPU Memory Allocator (Implementation (single malloc interface, two allocators…

- - - - User Level
        
        Individual threads running on the GPU request dynamic allocation by calling malloc.
      - OS level : CUDA_malloc
  - - - Linked list to track active bins (avoids active-bin replacement operations)
      - Much like a CPU allocator; here each SM is assigned one arena
      - Data types
        
        Chunk list
        
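        When chunks run short, more are obtained from TBuddy
        
        Manages a pool of chunks as the initial space.
        Based on the real-time demand for bins, hands the
        uninitialized bins in a chunk out to the different bin lists
        
        Protected by a collective mutex (4.2.2)
        
        Multiple threads (from the same thread block) request chunks at the same time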
        
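        Basic types
        
        The bin is the smallest unit of resource allocation management; allocation inside a bin works much like a slab
        
        4 KB
        
        128-byte header
        
        Contains the allocation bitmap
        
        May also hold a tail pointer
        
        Allocated in fixed sizes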
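
The slab-like allocation inside a bin can be sketched as follows. This is a minimal single-threaded Python model, not the allocator's actual code: the class name `Bin`, the methods `alloc` and `free`, and the assumption that the 128-byte header is carved out of the 4 KB bin (leaving 3968 bytes of payload) are all illustrative. A GPU implementation would update the bitmap with atomic operations.

```python
BIN_BYTES = 4096     # bin size from the notes above
HEADER_BYTES = 128   # header size from the notes above


class Bin:
    """Hypothetical model of one bin holding fixed-size objects."""

    def __init__(self, object_size):
        self.object_size = object_size
        # Assumption: the header lives inside the 4 KB bin.
        self.capacity = (BIN_BYTES - HEADER_BYTES) // object_size
        self.bitmap = 0  # bit i set means slot i is in use

    def alloc(self):
        """Return the payload offset of a free slot, or None if the bin is full."""
        for slot in range(self.capacity):
            if not (self.bitmap >> slot) & 1:
                self.bitmap |= 1 << slot
                return slot * self.object_size
        return None

    def free(self, offset):
        """Release the slot at the given payload offset."""
        slot = offset // self.object_size
        if not (self.bitmap >> slot) & 1:
            raise ValueError("slot is not allocated")
        self.bitmap &= ~(1 << slot)
```

Because every object in a bin has the same size, a single bit per slot is enough to track allocation, which is what makes the scheme resemble a slab allocator.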
        
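        The larger unit is the chunk, which is cut into multiple fixed-size bins
        
        The first two bin slots have special uses:
        a bitmap indexing the remaining space, and
        tails that supplement the header space of the other bins (128 Kb each)
        
        512 KB
        Holds 103 bins
        
        Set of bins
        (classified by the size of the objects a bin holds)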
        
        bin list
        
        Controlled by a bulk semaphore
        
        Linked list of bins of the same size
        
        RCU
        
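        The allocation flow shows that many threads walk the bin list (looking for a usable bin) while only a few threads update it; to maximize performance, RCU is used here
        
        Plain RCU cannot be used, because the per-thread reg bookkeeping takes too much space.
        
        Borrows implementation traits from SRCU to reduce reg usage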
        
        Describes the RCU technique for removing bins from a bin list
        
        The writer thread must register a callback that performs the free; the free runs only after the readers have left
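
The deferred free described above can be illustrated with a reader counter. This is a deliberately simplified single-threaded sketch and an assumption on my part, not the paper's mechanism: the class `DeferredFree` and its methods are hypothetical names, and a single shared counter is used in the spirit of counter-based SRCU rather than per-thread state. It is safe (a callback never runs while a reader is active) but, unlike real SRCU, a steady stream of new readers could delay callbacks indefinitely.

```python
class DeferredFree:
    """Hypothetical model: run free callbacks only when no reader is active."""

    def __init__(self):
        self.readers = 0    # readers currently walking the bin list
        self.pending = []   # callbacks waiting for readers to leave

    def read_lock(self):
        self.readers += 1

    def read_unlock(self):
        self.readers -= 1
        self._run_if_quiescent()

    def defer(self, callback):
        """Writer side: called after unlinking a bin from the list."""
        self.pending.append(callback)
        self._run_if_quiescent()

    def _run_if_quiescent(self):
        if self.readers == 0:
            callbacks, self.pending = self.pending, []
            for callback in callbacks:
                callback()
```

Usage: a reader brackets its list walk with `read_lock()` and `read_unlock()`, while a writer unlinks the bin first and then calls `defer` with the function that actually releases it.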
      - Allocation flow
        
        Wait on the bulk semaphore for the target bin size
        
        Find a usable bin in the corresponding bin list
        
        If no bin is usable, find a usable chunk in the chunk list,
        initialize a usable bin from it, and add it to the target bin list
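
The three steps above can be sketched as a small single-threaded model. Everything here is illustrative: the class `Arena`, the parameters `chunk_bins` and `bin_bytes`, and the use of a plain integer in place of the bulk semaphore are assumptions made for the example, and a blocking wait is replaced by an immediate check because there is only one thread.

```python
class Arena:
    """Hypothetical model of the per-size allocation path."""

    def __init__(self, chunk_bins, bin_bytes):
        self.uninitialized = chunk_bins  # uninitialized bins left in the chunk list
        self.bin_bytes = bin_bytes       # usable bytes per bin (assumed parameter)
        self.bin_lists = {}              # object size -> free-slot count per bin
        self.semaphores = {}             # object size -> free slots across all bins

    def malloc(self, size):
        """Return (size, bin index) for the slot handed out, or None."""
        bins = self.bin_lists.setdefault(size, [])
        free_slots = self.semaphores.get(size, 0)
        # Step 1: "wait" on the semaphore for this size.
        if free_slots == 0:
            # Step 3: no usable bin, so initialize one from the chunk list.
            if self.uninitialized == 0:
                return None  # a real allocator would request another chunk
            self.uninitialized -= 1
            slots = self.bin_bytes // size
            bins.append(slots)
            free_slots = slots
        self.semaphores[size] = free_slots - 1
        # Step 2: walk the bin list and take a slot from the first usable bin.
        for index, remaining in enumerate(bins):
            if remaining > 0:
                bins[index] -= 1
                return (size, index)
        return None
```

The semaphore count lets a caller know whether the list walk can succeed before it starts, so only the caller that finds the count at zero pays for initializing a new bin.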
    - - Multithreading requirements
        
        Specialized buddy system
        
        A static binary tree that represents a buddy system and can be traversed by many threads concurrently
        
        Implementation details
        
        Before any node changes state, lock that node and its parent (two nodes) to prevent race conditions.
        
        Three node states
        
        Allocation case
        
        Free case
        
        Always attempt a merge
        
        Uses a bulk semaphore to manage counts
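
A static binary tree for a buddy system can be sketched as below. This is a single-threaded illustration under stated assumptions: the notes do not name the three node states, so `FREE`, `PARTIAL`, and `BUSY` are my own labels, and the class `BuddyTree` and its heap-style array layout are hypothetical. The node-and-parent locking is only marked in comments, since the model has no concurrency.

```python
FREE, PARTIAL, BUSY = 0, 1, 2  # assumed names for the three node states


class BuddyTree:
    """Hypothetical array-backed buddy tree: node 1 is the root,
    and the children of node n are 2n and 2n + 1."""

    def __init__(self, levels):
        self.state = [FREE] * (1 << levels)  # index 0 is unused

    def alloc(self, level, node=1, depth=0):
        """Claim a free node at `level` (0 is the whole region)."""
        if self.state[node] == BUSY:
            return None
        if depth == level:
            if self.state[node] != FREE:
                return None
            self.state[node] = BUSY  # concurrent version: lock node and parent
            return node
        for child in (2 * node, 2 * node + 1):
            found = self.alloc(level, child, depth + 1)
            if found is not None:
                self.state[node] = PARTIAL
                return found
        return None

    def free(self, node):
        """Release a node and always try to merge with its buddy."""
        self.state[node] = FREE
        while node > 1:
            buddy = node ^ 1  # sibling under the same parent
            if self.state[buddy] != FREE:
                break         # buddy still in use, parent stays PARTIAL
            node //= 2
            self.state[node] = FREE  # both halves free: merge upward
```

Because the tree is static, a node's parent and buddy are found with index arithmetic alone, so no pointers need to be updated when blocks split or merge.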
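      - Cannot simply hand out contiguous space the way a kernel does,
        because there is no virtual memory (not enough address space);
        heavy fragmentation would result, so a buddy system is used
  - - - Manage dynamic resources with a counting semaphore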
        
        Two-stage resource management
        
        Non-scalable: counting semaphore
        
        Scalable:
        Bulk semaphore
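
The contrast between the two semaphores can be illustrated with a toy model. This is my own simplified reading, not the paper's definition: the class `BulkSemaphoreModel`, the `batch` parameter, and the `refills` counter are hypothetical, and the model is single-threaded. The idea shown is that when the count reaches zero, one caller produces a whole batch of units instead of every caller going to the slow path for a single unit.

```python
class BulkSemaphoreModel:
    """Hypothetical single-threaded model of batched resource refills."""

    def __init__(self, batch):
        self.count = 0      # units available right now
        self.batch = batch  # units produced by one refill (assumed parameter)
        self.refills = 0    # number of times the slow path ran

    def signal(self, units):
        """Make more units available."""
        self.count += units

    def acquire(self):
        """Take one unit, refilling a whole batch if none are available."""
        if self.count > 0:
            self.count -= 1  # fast path: a unit was already available
            return
        # Slow path: produce a batch, keep one unit for this caller,
        # and publish the rest for later callers.
        self.refills += 1
        self.signal(self.batch - 1)
```

With `batch=4`, eight calls to `acquire` trigger only two refills, whereas a plain counting semaphore that adds one unit per slow-path visit would need eight.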