Transformers don't learn
Experiments
Measure how the predictions change
when post-hoc removing the Attention or replacing it with linear layers.
Interpret the resulting changes/errors.
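The replacement experiment above can be sketched roughly as follows. This is a minimal numpy toy, not any specific model: the attention layer, weights, and sizes are all made up for illustration. A token-wise linear layer is fitted (least squares) to mimic the attention output, and the mean prediction change after the swap is measured.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Standard scaled dot-product self-attention on a (tokens, dim) input.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(16, d))                 # 16 toy tokens (hypothetical)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

y_attn = attention(x, Wq, Wk, Wv)

# Post-hoc replacement: fit a token-wise linear map to the attention
# output, then swap it in and compare predictions.
W_lin, *_ = np.linalg.lstsq(x, y_attn, rcond=None)
y_lin = x @ W_lin

# Mean absolute change in predictions after the swap; a large value
# suggests the attention computes something a linear layer cannot.
delta = np.abs(y_attn - y_lin).mean()
print(delta)
```

In a real model the same idea would apply per attention block, with the downstream segmentation error (rather than raw activations) as the quantity to interpret.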
Questions to be answered:
What did the Attention learn?
Long/Short range?
Are we finding fewer objects, or are the models learning to delineate boundaries better?
--> Extract corresponding patches from models!
Architectures
Position of the Q, K, V calculation and the resulting tensor shapes differ between architectures and look unusual
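As a reference point for checking the unusual shapes, here is the standard multi-head Q, K, V computation in numpy; the batch, token, and dimension values are illustrative, not taken from any of the architectures above.

```python
import numpy as np

# Reference shapes for standard multi-head self-attention.
B, N, D, H = 2, 16, 64, 8      # batch, tokens, embed dim, heads (hypothetical)
Dh = D // H                    # per-head dim

rng = np.random.default_rng(0)
x = rng.normal(size=(B, N, D))
Wqkv = rng.normal(size=(D, 3 * D))

# One fused projection, then split into Q, K, V and reshape into heads.
qkv = x @ Wqkv                                    # (B, N, 3*D)
q, k, v = np.split(qkv, 3, axis=-1)               # each (B, N, D)
q = q.reshape(B, N, H, Dh).transpose(0, 2, 1, 3)  # (B, H, N, Dh)
k = k.reshape(B, N, H, Dh).transpose(0, 2, 1, 3)
v = v.reshape(B, N, H, Dh).transpose(0, 2, 1, 3)

scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(Dh)  # (B, H, N, N)
print(scores.shape)
```

Comparing each architecture's Q, K, V shapes against this baseline should make the deviations concrete.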
Visualizations
Visualizes that the different architectures follow different ways of incorporating Attention
Could help explain different behaviors in Dataset ablation
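One possible metric for such visualizations, tying back to the long/short-range question above, is the mean attention distance per head. This is a generic sketch, not tied to any of the architectures in the notes; the attention maps are synthetic stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mean_attention_distance(attn):
    # attn: (tokens, tokens) row-stochastic attention map.
    # Average token distance |i - j| weighted by attention mass:
    # small values indicate local (short-range) heads, large values
    # long-range heads.
    n = attn.shape[0]
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    return float((attn * dist).sum(axis=-1).mean())

n = 32
# A sharply local map vs a uniform (maximally long-range) map.
gaps = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)
local = softmax(-5.0 * gaps)
uniform = np.full((n, n), 1.0 / n)
print(mean_attention_distance(local), mean_attention_distance(uniform))
```

Plotting this value per head and per layer for each architecture would give a compact picture of how differently they incorporate attention, and could be compared against the dataset-ablation behavior.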
Open Questions
Datasets
ACDC & Synapse seem to be rather common
maybe it is worth checking whether those would be good evaluation benchmarks?