Transformers
- Although Transformers originally had (and often still have) an equal number of encoder and decoder layers, this is not strictly required: models like BERT use only encoder layers, while models like GPT use only decoder layers.
- All "N" of decoders/encoders (Layers) have the same structure (All decoders or all encoders are equal).
- The input data passes through all encoders in sequence, each encoder receiving its input from the previous one.
- Each decoder receives the output of the last encoder plus the output of the previous decoder; for the first decoder layer, that second input is the shifted target sequence (e.g., the words the model has generated so far), as sketched below.
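A minimal sketch of this encoder/decoder flow using PyTorch's built-in layers; the sizes, layer counts, and random tensors below are illustrative assumptions, not values from these notes:

```python
import torch
import torch.nn as nn

d_model, n_heads, num_layers = 512, 8, 6  # assumed sizes for illustration

# All N encoder layers share one structure; data flows through them in order.
enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)

# All N decoder layers also share one structure; each attends to the final encoder output.
dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=num_layers)

src = torch.rand(2, 10, d_model)  # already-embedded source sequence (batch, seq, d_model)
tgt = torch.rand(2, 7, d_model)   # embedded, shifted target sequence

memory = encoder(src)             # output of the last encoder, reused by every decoder layer
out = decoder(tgt, memory)        # first decoder layer consumes the shifted target embedding
print(out.shape)                  # torch.Size([2, 7, 512])
```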
- The embedding only happens in the bottom-most (the first) encoder.
Semantic information means that the embedding vector captures relationships between words based on their meanings, contextual usage, and similarities to other words.
The embedding vectors are randomly initialized when training begins.
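A small sketch of this, assuming a hypothetical vocabulary size and embedding size: PyTorch's `nn.Embedding` starts out with random weights and only acquires semantic structure as training updates it.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512      # assumed sizes for illustration
embedding = nn.Embedding(vocab_size, d_model)

# At construction time the weights are just random draws from N(0, 1);
# they carry no semantic information until gradient updates shape them.
token_ids = torch.tensor([[5, 42, 7]])  # hypothetical token ids for one sentence
vectors = embedding(token_ids)          # (1, 3, d_model) vectors fed to the first encoder
print(vectors.shape)                    # torch.Size([1, 3, 512])
```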
h times (h is the number of heads in the attention mechanism; the embedding size is divided by h to get the dimension each head works on).
Large dot products push the softmax into regions where its gradients are extremely small and can cause numerical instability, which slows down learning; this is why the scores are divided by √d_k.
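A sketch of the head split and the √d_k scaling, with assumed shapes (batch of 2, sequence length 10, d_model = 512, h = 8):

```python
import math
import torch

d_model, h = 512, 8
d_k = d_model // h                  # each head works on d_model / h dimensions

q = torch.rand(2, h, 10, d_k)       # (batch, heads, seq_len, d_k), assumed shapes
k = torch.rand(2, h, 10, d_k)
v = torch.rand(2, h, 10, d_k)

scores = q @ k.transpose(-2, -1)    # raw dot products, shape (2, h, 10, 10)
scores = scores / math.sqrt(d_k)    # scaling keeps the softmax away from its saturated,
                                    # near-zero-gradient region
weights = torch.softmax(scores, dim=-1)
head_outputs = weights @ v          # (2, h, 10, d_k), one output per head
```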
This linear layer transforms the concatenated vector back to the model's embedding size before passing it on to the next part of the Transformer.
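Continuing the sketch above, one way the per-head outputs could be concatenated and projected back to the model's width with a linear layer (shapes are the same assumptions as before):

```python
import torch
import torch.nn as nn

batch, h, seq_len, d_k = 2, 8, 10, 64   # same assumed shapes as above
d_model = h * d_k

head_outputs = torch.rand(batch, h, seq_len, d_k)

# Concatenate the heads back into one (batch, seq_len, d_model) tensor...
concat = head_outputs.transpose(1, 2).reshape(batch, seq_len, d_model)

# ...and project it to the size expected by the next part of the Transformer.
w_o = nn.Linear(d_model, d_model)
attention_output = w_o(concat)
print(attention_output.shape)           # torch.Size([2, 10, 512])
```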