LLM Systems Seminar (CS7670): Week02b, Megatron-LM (paper, hotcrp, lottery)
0. (recall) training parallelization
Research problem: models are too big to fit into a single GPU's memory (a back-of-the-envelope memory estimate follows below)
machine architecture
Borrowed from Stanford CS336
too big?
Borrowed from Stanford CS336
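Back-of-the-envelope memory estimate (my own sketch, not from the paper), assuming mixed-precision Adam keeps roughly 16 bytes of state per parameter:
```python
# Assumption: fp16 weights (2) + fp16 grads (2) + fp32 master weights (4)
# + fp32 Adam momentum (4) + fp32 Adam variance (4) ~= 16 bytes per parameter,
# before counting any activations.
params = 8.3e9            # GPT-2 8.3B, the largest model in the Megatron-LM paper
bytes_per_param = 16
print(f"{params * bytes_per_param / 2**30:.0f} GiB")   # ~124 GiB >> 32 GB on one V100
```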
3D parallelism
Borrowed from Stanford CS336
New parallelism techniques
expert parallelism (GShard, Switch Transformer)
Borrowed from GShard paper
sequence parallelism (Megatron-LM v2)
Borrowed from Stanford CS336
context parallelism (RingAttention)
Borrowed from RingAttention paper
1. overall assessment of the paper
metrics
Technical Soundness
Evaluation and Evidence
Clarity and Communication
Impact and Significance
Novelty and Contribution
If you were a reviewer of this paper, would you accept it?
Given the paper, what do you want to work on as a follow-up project?
2. quick walkthrough of the paper
2. Background
Early days (MLSys): FlexFlow and Tofu
Tofu: Supporting very large models using automatic dataflow graph partitioning (EuroSys'19)
FlexFlow: Beyond data and model parallelism for deep neural networks (MLSys'19)
Data parallelism (recall)
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Q: FSDP vs. tensor parallelism (Megatron-LM)? (see the FSDP sketch below)
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
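A minimal FSDP sketch for the question above (assumes torch.distributed is already initialized with one process per GPU; layer sizes are made up). FSDP/ZeRO-3 shards parameters, gradients, and optimizer state across data-parallel ranks and all-gathers them just in time for compute, whereas Megatron-LM's tensor parallelism splits each matmul itself:
```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = nn.Sequential(nn.Linear(4096, 16384), nn.GELU(), nn.Linear(16384, 4096)).cuda()
model = FSDP(model)                                        # parameters sharded across data-parallel ranks
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)     # optimizer state is sharded too
```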
trade-off: computation vs. memory
Training Deep Nets with Sublinear Memory Cost
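A minimal sketch of the computation-vs-memory trade-off (activation checkpointing in the spirit of "Training Deep Nets with Sublinear Memory Cost"); the layer sizes are arbitrary:
```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

layer = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)
y = checkpoint(layer, x, use_reentrant=False)   # activations inside `layer` are not stored in forward
y.sum().backward()                              # they are recomputed here, trading FLOPs for memory
```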
1. Introduction
MFU
Model FLOPs utilization (MFU) is the ratio of the observed throughput to the theoretical maximum throughput assuming 100% of peak FLOPs
In current practice, MFU typically falls between 40% and 60% (see the rough arithmetic sketch below)
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
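Rough MFU arithmetic (the throughput and cluster numbers below are placeholders, not the paper's); achieved FLOP/s for a dense Transformer is commonly estimated as 6 × parameters × tokens/s:
```python
params = 8.3e9
tokens_per_sec = 300_000              # assumed measured training throughput
peak_flops = 512 * 125e12             # e.g. 512 V100s at 125 TFLOP/s fp16 peak
mfu = 6 * params * tokens_per_sec / peak_flops
print(f"MFU = {mfu:.1%}")             # ~23% with these made-up numbers
```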
pre-norm vs. post-norm (recall)
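Minimal sketch of the two residual orderings (`sublayer` stands in for self-attention or the MLP; shapes are arbitrary):
```python
import torch
import torch.nn as nn

def post_norm(x, sublayer, norm):   # original Transformer: normalize after the residual add
    return norm(x + sublayer(x))

def pre_norm(x, sublayer, norm):    # GPT-2 / Megatron-LM style: normalize before the sublayer
    return x + sublayer(norm(x))

d = 64
x = torch.randn(2, 10, d)
print(post_norm(x, nn.Linear(d, d), nn.LayerNorm(d)).shape)
print(pre_norm(x, nn.Linear(d, d), nn.LayerNorm(d)).shape)
```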
3. model parallel Transformers
Code 1
function f
function g
pytorch all_reduce
communication ops
NCCL doc
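A runnable sketch along the lines of the paper's Code 1 (assumes a tensor-parallel process group is already initialized): f and g are conjugate operators, so each Transformer layer needs only two all-reduces in the forward pass and two in the backward pass.
```python
import torch
import torch.distributed as dist

class f(torch.autograd.Function):
    """Identity in forward, all-reduce in backward (placed before the split GEMMs)."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        dist.all_reduce(grad)   # sum gradients across tensor-parallel ranks
        return grad

class g(torch.autograd.Function):
    """All-reduce in forward, identity in backward (placed after the split GEMMs)."""
    @staticmethod
    def forward(ctx, x):
        dist.all_reduce(x)      # sum partial outputs across tensor-parallel ranks
        return x
    @staticmethod
    def backward(ctx, grad):
        return grad
```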
Megatron-LM github
Megatron-LM first commit
Megatron-LM tensor parallelism
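A sketch (class and variable names are mine) of how the Transformer MLP is sharded: the first GEMM is split column-wise so the GeLU stays local, the second row-wise, and a single all-reduce (the forward of g above) combines the partial outputs. Bias handling is omitted for brevity.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class ParallelMLP(nn.Module):
    def __init__(self, hidden, ffn, tp_size):
        super().__init__()
        self.fc1 = nn.Linear(hidden, ffn // tp_size, bias=False)   # column-parallel shard
        self.fc2 = nn.Linear(ffn // tp_size, hidden, bias=False)   # row-parallel shard

    def forward(self, x):
        y = F.gelu(self.fc1(x))   # GeLU applied locally, no communication needed
        y = self.fc2(y)           # each rank holds a partial sum of the output
        dist.all_reduce(y)        # combine partial sums across tensor-parallel ranks
        return y
```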
embedding & last FFN
Using the Output Embedding to Improve Language Models
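Minimal weight-tying sketch (the idea behind "Using the Output Embedding to Improve Language Models"); in Megatron-LM this shared matrix is additionally split along the vocabulary dimension across tensor-parallel ranks:
```python
import torch.nn as nn

vocab, hidden = 50257, 1024
embedding = nn.Embedding(vocab, hidden)
lm_head = nn.Linear(hidden, vocab, bias=False)
lm_head.weight = embedding.weight   # input embedding and output projection share one matrix
```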
GPT-2 forward (Megatron-LM)
5. Experiments
"The devil is in the details"
post by Andrej Karpathy
Weak scaling: Fix the per-device workload and increase the number of devices and the problem size proportionally.
Goal: The wall-clock time per training iteration should stay flat as you scale up.
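A tiny helper for reading weak-scaling numbers (the timings below are placeholders): efficiency is the baseline per-iteration time divided by the per-iteration time at scale, so 100% means perfectly flat.
```python
def weak_scaling_efficiency(t_iter_baseline_s, t_iter_scaled_s):
    # per-GPU workload is fixed, so ideal scaling keeps iteration time constant
    return t_iter_baseline_s / t_iter_scaled_s

print(f"{weak_scaling_efficiency(1.00, 1.30):.0%}")   # 77% if iterations slow from 1.00 s to 1.30 s
```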
Overhead analysis?
Table 3 of MegaScale
3. questions & open discussions
Sanchit Ahuja
Communication Bottlenecks & Scalability: Megatron-LM showed excellent scaling within a node with NVSwitch. But what about multi-node scaling?
The authors chose 8 GPUs as the model-parallel max. Why not 16?
Arya Wu
Generality of the approach: Megatron-LM's method is somewhat tailored to Transformers (leveraging the QKV heads, etc.). Can it be applied to other architectures?
Shahid Kamal
Reproducibility and Openness: Only large industry labs or supercomputing centers could replicate an 8.3B-parameter training run in 2019. This raises issues of access and research democratization.
Junbeom In
Ethical and Environmental Considerations: The paper does not discuss the energy consumption or carbon footprint of training on 512 GPUs for days, nor the data-privacy issues of using huge text corpora.
4. other topics in training (if time)