LLM Systems Seminar (CS7670): Week07a, Fault Tolerance (paper, HotCRP, lottery)
3. Gemini
- goal: minimize the time wasted when a failure occurs
- Wasted time = training time lost since the last checkpoint + time to retrieve the latest checkpoint
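A rough way to see how the two terms trade off is to put numbers on them. The sketch below is only an illustration of the sum in the note above: the function name `expected_wasted_time`, the uniform-failure assumption (a failure is equally likely at any point in a checkpoint interval, so on average half an interval of progress is lost), and the example figures are all assumptions made here, not values from the paper.

```python
def expected_wasted_time(checkpoint_interval_s: float, retrieval_time_s: float) -> float:
    """Expected wasted time per failure, in seconds.

    wasted time = lost training time + time to retrieve the latest checkpoint

    Assumption (not from the notes): the failure instant is uniformly
    distributed within a checkpoint interval, so the expected lost
    training time is half the interval.
    """
    if checkpoint_interval_s <= 0:
        raise ValueError("checkpoint_interval_s must be positive")
    if retrieval_time_s < 0:
        raise ValueError("retrieval_time_s must be non-negative")

    lost_training_time = checkpoint_interval_s / 2.0
    return lost_training_time + retrieval_time_s


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the shape of the trade-off.
    infrequent = expected_wasted_time(checkpoint_interval_s=3600, retrieval_time_s=600)
    frequent = expected_wasted_time(checkpoint_interval_s=60, retrieval_time_s=5)
    print(f"hourly checkpoints, slow retrieval:     {infrequent:.0f} s")
    print(f"per-minute checkpoints, fast retrieval: {frequent:.0f} s")
```

Under these assumed figures, both checkpointing more often and retrieving faster shrink the total, which is why a design that targets wasted time has to address both terms.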
4. Debate
Debate Question: Is Gemini's design sufficient to scale to today's LLM training scales (hundreds of GPUs, dedicated clusters, 500B+ model parameters)?
Pro: Yes, Gemini's design is likely scalable in practice for today.
Con: No, Gemini falls short for today's LLM training scale and architecture.