LLM Systems (CS7670): Week 01 (homepage)
1. Intro to class
What is a seminar?
Quick intro: name, program (PhD/Master), focus on ML/systems
4. Transformer-based LLM (paper)
Zikai Wang
The self-attention mechanism scales quadratically with input length, yet the paper doesn't explore longer-than-standard sequence lengths. How would performance and training efficiency hold up at longer sequences?
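A minimal sketch (my own illustration, not from the paper) of why the cost grows quadratically: the score matrix QK^T is n x n, so quadrupling the sequence length multiplies the attention FLOPs and score-matrix memory by roughly sixteen. The d_model=512 width matches the base configuration; the cost formulas are rough estimates.

```python
import numpy as np

def attention_cost(n, d_model=512):
    """Rough FLOP and memory estimates for one single-head attention pass.
    n: sequence length, d_model: model width (512 as in the base model)."""
    score_flops = 2 * n * n * d_model      # Q @ K^T -> n x n score matrix
    value_flops = 2 * n * n * d_model      # softmax(scores) @ V
    score_memory = n * n * 4               # fp32 bytes for the n x n score matrix
    return score_flops + value_flops, score_memory

for n in (512, 2048, 8192):               # 4x longer input -> ~16x cost
    flops, mem = attention_cost(n)
    print(f"n={n:5d}  ~{flops/1e9:8.1f} GFLOPs  score matrix ~{mem/1e6:7.1f} MB")
```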
Muhammad Salman
Given that the model relies heavily on parameter-intensive linear transformation matrices, why is the claim framed as “attention is all you need” rather than acknowledging the contribution of these components?
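A rough parameter tally (an illustration under the assumed base-model sizes d_model=512, d_ff=2048) showing where the learned weights sit: the attention operation softmax(QK^T / sqrt(d_k)) V itself has no parameters, while the Q/K/V/output projections and the feed-forward block hold them all.

```python
# Per-layer parameter counts under assumed base-model dimensions.
d_model, d_ff = 512, 2048

qkv_out_proj = 4 * d_model * d_model        # W_Q, W_K, W_V, W_O projections
feed_forward = 2 * d_model * d_ff           # two linear layers in the FFN

print(f"attention projections per layer: {qkv_out_proj:,} params")
print(f"feed-forward per layer:          {feed_forward:,} params")
# The FFN alone holds roughly twice as many parameters as the attention
# projections, which is the tension the question points at.
```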
Arunit Baidya
In multi-head attention, how can we determine which heads meaningfully contribute to model accuracy? Which heads improve performance, and which are redundant or merely artifacts of how the model was built?
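One way to probe this, sketched below with PyTorch's nn.MultiheadAttention (an assumed stand-in, not the paper's setup): ablate one head at a time by zeroing its slice of the output projection, then compare outputs or re-run the evaluation metric per head.

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 16, 512)                      # (batch, seq, d_model) dummy input

def ablate_head(mha, head_idx):
    """Zero the slice of the output projection that reads from one head."""
    d_head = mha.embed_dim // mha.num_heads
    with torch.no_grad():
        mha.out_proj.weight[:, head_idx * d_head:(head_idx + 1) * d_head] = 0.0

baseline, _ = mha(x, x, x)
ablate_head(mha, head_idx=3)
ablated, _ = mha(x, x, x)
print("output change after ablating head 3:", (baseline - ablated).norm().item())
# In practice one would re-run the evaluation set after ablating each head and
# compare accuracy or loss, not just the raw output difference shown here.
```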