LLM Systems Seminar (CS7670): Week02b, Megatron-LM (paper, hotcrp, lottery)
0. (recall) training parallelization
Research problem: models are too big to fit into a single GPU's memory (a back-of-the-envelope memory estimate follows below)
machine architecture
Borrowed from Stanford CS336
too big?
Borrowed from Stanford CS336
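Back-of-the-envelope memory estimate (my own sketch, not from the paper), assuming mixed-precision Adam keeps roughly 16 bytes of state per parameter:
```python
# Assumption: fp16 weights (2) + fp16 grads (2) + fp32 master weights (4)
# + fp32 Adam momentum (4) + fp32 Adam variance (4) ~= 16 bytes per parameter,
# before counting any activations.
params = 8.3e9            # GPT-2 8.3B, the largest model in the Megatron-LM paper
bytes_per_param = 16
print(f"{params * bytes_per_param / 2**30:.0f} GiB")   # ~124 GiB >> 32 GB on one V100
```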
3D parallelism
Borrowed from Stanford CS336
New parallelism techniques
expert parallelism (GShard, Switch Transformer)
Borrowed from GShard paper
sequence parallelism (Megatron-LM v2)
Borrowed from Stanford CS336
context parallelism (RingAttention)
Borrowed from RingAttention paper
1. overall assessment of the paper
metrics
Technical Soundness
Evaluation and Evidence
Clarity and Communication
Impact and Significance
Novelty and Contribution
If you were a reviewer of this paper, would you accept it?
Given the paper, what do you want to work on as a follow-up project?
2. quick walkthrough of the paper
2. Background
Early days (MLSys): FlexFlow and Tofu
Tofu: Supporting very large models using automatic dataflow graph partitioning (EuroSys'19)
FlexFlow: Beyond data and model parallelism for deep neural networks (MLSys'19)
Data parallelism (recall)
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Q: FSDP vs. tensor parallelism (Megatron-LM)? (see the FSDP sketch below)
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
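A minimal FSDP sketch for the question above (assumes torch.distributed is already initialized with one process per GPU; layer sizes are made up). FSDP/ZeRO-3 shards parameters, gradients, and optimizer state across data-parallel ranks and all-gathers them just in time for compute, whereas Megatron-LM's tensor parallelism splits each matmul itself:
```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = nn.Sequential(nn.Linear(4096, 16384), nn.GELU(), nn.Linear(16384, 4096)).cuda()
model = FSDP(model)                                        # parameters sharded across data-parallel ranks
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)     # optimizer state is sharded too
```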
trade-off: computation vs. memory
Training Deep Nets with Sublinear Memory Cost
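A minimal sketch of the computation-vs-memory trade-off (activation checkpointing in the spirit of "Training Deep Nets with Sublinear Memory Cost"); the layer sizes are arbitrary:
```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

layer = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)
y = checkpoint(layer, x, use_reentrant=False)   # activations inside `layer` are not stored in forward
y.sum().backward()                              # they are recomputed here, trading FLOPs for memory
```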
1. Introduction
MFU
Model FLOPs utilization (MFU) is the ratio of the observed throughput to the theoretical maximum throughput assuming 100% of peak FLOPs
In current practice, MFU typically falls between 40% and 60% (see the rough arithmetic sketch below)
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
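Rough MFU arithmetic (the throughput and cluster numbers below are placeholders, not the paper's); achieved FLOP/s for a dense Transformer is commonly estimated as 6 × parameters × tokens/s:
```python
params = 8.3e9
tokens_per_sec = 300_000              # assumed measured training throughput
peak_flops = 512 * 125e12             # e.g. 512 V100s at 125 TFLOP/s fp16 peak
mfu = 6 * params * tokens_per_sec / peak_flops
print(f"MFU = {mfu:.1%}")             # ~23% with these made-up numbers
```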
pre-norm vs. post-norm (recall)
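Minimal sketch of the two residual orderings (`sublayer` stands in for self-attention or the MLP; shapes are arbitrary):
```python
import torch
import torch.nn as nn

def post_norm(x, sublayer, norm):   # original Transformer: normalize after the residual add
    return norm(x + sublayer(x))

def pre_norm(x, sublayer, norm):    # GPT-2 / Megatron-LM style: normalize before the sublayer
    return x + sublayer(norm(x))

d = 64
x = torch.randn(2, 10, d)
print(post_norm(x, nn.Linear(d, d), nn.LayerNorm(d)).shape)
print(pre_norm(x, nn.Linear(d, d), nn.LayerNorm(d)).shape)
```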
3. model parallel Transformers
Code 1
function f
function g
pytorch all_reduce
communication ops
NCCL doc
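A runnable sketch along the lines of the paper's Code 1 (assumes a tensor-parallel process group is already initialized): f and g are conjugate operators, so each Transformer layer needs only two all-reduces in the forward pass and two in the backward pass.
```python
import torch
import torch.distributed as dist

class f(torch.autograd.Function):
    """Identity in forward, all-reduce in backward (placed before the split GEMMs)."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        dist.all_reduce(grad)   # sum gradients across tensor-parallel ranks
        return grad

class g(torch.autograd.Function):
    """All-reduce in forward, identity in backward (placed after the split GEMMs)."""
    @staticmethod
    def forward(ctx, x):
        dist.all_reduce(x)      # sum partial outputs across tensor-parallel ranks
        return x
    @staticmethod
    def backward(ctx, grad):
        return grad
```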
Megatron-LM github
Megatron-LM first commit
Megatron-LM tensor parallelism
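A sketch (class and variable names are mine) of how the Transformer MLP is sharded: the first GEMM is split column-wise so the GeLU stays local, the second row-wise, and a single all-reduce (the forward of g above) combines the partial outputs. Bias handling is omitted for brevity.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class ParallelMLP(nn.Module):
    def __init__(self, hidden, ffn, tp_size):
        super().__init__()
        self.fc1 = nn.Linear(hidden, ffn // tp_size, bias=False)   # column-parallel shard
        self.fc2 = nn.Linear(ffn // tp_size, hidden, bias=False)   # row-parallel shard

    def forward(self, x):
        y = F.gelu(self.fc1(x))   # GeLU applied locally, no communication needed
        y = self.fc2(y)           # each rank holds a partial sum of the output
        dist.all_reduce(y)        # combine partial sums across tensor-parallel ranks
        return y
```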
embedding & last FFN
Using the Output Embedding to Improve Language Models
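Minimal weight-tying sketch (the idea behind "Using the Output Embedding to Improve Language Models"); in Megatron-LM this shared matrix is additionally split along the vocabulary dimension across tensor-parallel ranks:
```python
import torch.nn as nn

vocab, hidden = 50257, 1024
embedding = nn.Embedding(vocab, hidden)
lm_head = nn.Linear(hidden, vocab, bias=False)
lm_head.weight = embedding.weight   # input embedding and output projection share one matrix
```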
GPT-2 forward (Megatron-LM)
5. Experiments
"The devil is in the details"
post by Andrej Karpathy
Weak scaling: Fix the per-device workload and increase the number of devices and the problem size proportionally.
Goal: The wall-clock time per training iteration should stay flat as you scale up.
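A tiny helper for reading weak-scaling numbers (the timings below are placeholders): efficiency is the baseline per-iteration time divided by the per-iteration time at scale, so 100% means perfectly flat.
```python
def weak_scaling_efficiency(t_iter_baseline_s, t_iter_scaled_s):
    # per-GPU workload is fixed, so ideal scaling keeps iteration time constant
    return t_iter_baseline_s / t_iter_scaled_s

print(f"{weak_scaling_efficiency(1.00, 1.30):.0%}")   # 77% if iterations slow from 1.00 s to 1.30 s
```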
Overhead analysis?
Table 3 of MegaScale
3. questions & open discussions
Sanchit Ahuja
Communication Bottlenecks & Scalability: Megatron-LM showed excellent scaling within a node with NVSwitch. But what about multi-node scaling?
The authors chose 8 GPUs as the model-parallel max. Why not 16?
Arya Wu
Generality of the approach: Megatron-LM's method is somewhat tailored to Transformers (leveraging the QKV heads, etc.). Can it be applied to other architectures?
Shahid Kamal
Reproducibility and Openness: Only large industry labs or supercomputing centers could replicate an 8.3B-parameter training run in 2019. This raises issues of access and research democratization.
Junbeom In
Ethical and Environmental Considerations: The paper does not discuss the energy consumption or carbon footprint of training on 512 GPUs for days, nor the data-privacy issues of using huge text corpora.
4. other topics in training (if time)