LSTM attention, Audio, AST
LSTM attention
DAGA
LRP is usually used for image processing
Arras et al.: LRP for LSTM
LRP for audio
CRNN-based multiple DoA estimation using acoustic intensity features for Ambisonics recordings
Audio Feature Discovery with Convolutional Neural Networks
Interpreting and Explaining Deep Neural Networks for Classification of Audio Signals
1-second signals
Relevance for each time-frequency bin (sum of magnitude- and phase-channel relevance)
Additive spectrograms
The point of the paper: going beyond visual comparison
Magnitude and phase: individual channels
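In code, the combination is just an element-wise sum over the two channel relevance maps (a minimal sketch; array names and shapes are assumptions):

```python
import numpy as np

r_magnitude = np.random.randn(257, 100)   # stand-in for magnitude-channel relevance (freq, time)
r_phase = np.random.randn(257, 100)       # stand-in for phase-channel relevance
r_tf = r_magnitude + r_phase              # total relevance per time-frequency bin
```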
LRP-αβ, LRP-ε; for the LSTM: LRP-ε with ε = 0.01
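A minimal numpy sketch of the LRP-ε rule for a single dense layer, using ε = 0.01 as noted above (bias terms omitted; function and variable names are assumptions):

```python
import numpy as np

def lrp_epsilon(a, W, R_out, eps=0.01):
    """Redistribute the output relevance R_out of one dense layer
    back to its inputs a, using the epsilon-stabilized LRP rule."""
    z = a @ W                    # pre-activations z_k = sum_j a_j * w_jk
    z = z + eps * np.sign(z)     # epsilon stabilizer keeps z away from zero
    s = R_out / z                # relevance per unit of pre-activation
    return a * (s @ W.T)         # R_j = a_j * sum_k w_jk * s_k
```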
In Arras et al.: only the last LSTM time step
Here: backward pass through all LSTM time steps
Attention: simply pick a fixed number of LSTM steps to feed into the Dense layer
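A rough PyTorch sketch of that idea: keep a fixed number of final LSTM steps and feed them to a Dense layer instead of learning attention weights (all sizes and names are assumptions):

```python
import torch
import torch.nn as nn

class FixedStepHead(nn.Module):
    """LSTM classifier head that flattens a fixed number of the
    last LSTM time steps into a single Dense (Linear) layer."""
    def __init__(self, n_feat=40, hidden=64, n_steps=8, n_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(n_feat, hidden, batch_first=True)
        self.n_steps = n_steps
        self.fc = nn.Linear(n_steps * hidden, n_classes)

    def forward(self, x):                  # x: (batch, time, n_feat)
        h, _ = self.lstm(x)                # h: (batch, time, hidden)
        last = h[:, -self.n_steps:, :]     # fixed number of LSTM steps
        return self.fc(last.flatten(1))
```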
Auralization: inverse STFT will introduce artifacts.
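A sketch of such an auralization with librosa: weight the STFT by a relevance map and invert it (the map below is a random placeholder). The weighted magnitude/phase pairs no longer form a consistent STFT, which is where the artifacts come from:

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))     # any mono signal works
S = librosa.stft(y)
relevance = np.random.rand(*S.shape)            # placeholder for a real relevance map
y_aural = librosa.istft(S * relevance, length=len(y))
```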
Reliable explanations
Higher order gradient calculations
[51] Singla, “Understanding impacts of high-order loss approximations and features in deep learning interpretation”
IG, LRP -- gradient methods with a modified gradient function
“Towards better understanding of gradient-based attribution methods for deep neural networks”
Few theoretical comparisons
They analyze sensitivity and compare against perturbation methods
The problem formulations differ slightly across methods
DeepLIFT: compares activations for the baseline and for x. The baseline inside the network: the value obtained by propagating the baseline from the previous layer
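For reference, a Captum DeepLIFT call with an explicit baseline on a toy model (shapes and the zero baseline are assumptions):

```python
import torch
from captum.attr import DeepLift

model = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.ReLU(),
                            torch.nn.Linear(4, 2))
x = torch.randn(3, 8)
# Attributions compare activations on x against activations obtained
# by propagating the baseline through the network.
attr = DeepLift(model).attribute(x, baselines=torch.zeros_like(x), target=0)
```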
Occlusion: the patch size affects the result
Occlusion: the full sample is still needed
Although one could compute chunk relevance * occlusion relevance
Occlusion: slow, too many forward passes (one for each individual pixel)
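A Captum occlusion sketch on a spectrogram-shaped input; the sliding_window_shapes and strides below are arbitrary choices (the patch-size sensitivity noted above), and every window position costs a full forward pass, which is the speed problem:

```python
import torch
from captum.attr import Occlusion

model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(64 * 100, 10))
x = torch.randn(1, 1, 64, 100)                # (batch, channel, freq, time)
occ = Occlusion(model)
attr = occ.attribute(x, target=3,
                     sliding_window_shapes=(1, 8, 10),   # occluded patch size
                     strides=(1, 4, 5))
```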
Guided backprop, Grad-CAM, Deconv -- not universal (architecture-dependent)
Quality metrics
Completeness: Axiomatic attribution for deep networks.
Summation to delta: IG
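A quick numerical check of summation to delta with Captum's IG on a toy model (model and tolerance are assumptions): the attributions should sum to F(x) - F(baseline):

```python
import torch
from captum.attr import IntegratedGradients

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Tanh(),
                            torch.nn.Linear(8, 1))
f = lambda x: model(x).squeeze(-1)            # one scalar output per sample
x, b = torch.randn(2, 8), torch.zeros(2, 8)
attr, delta = IntegratedGradients(f).attribute(
    x, baselines=b, n_steps=256, return_convergence_delta=True)
assert torch.allclose(attr.sum(dim=1), f(x) - f(b), atol=1e-2)
```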
Are transformers definitely ruled out?
Training transformers from scratch requires much more data than CNNs, because CNNs encode prior knowledge about the image domain, such as translational equivariance.
However, the high performance of ViT results from pre-training on a large dataset such as JFT-300M, and its dependence on large datasets is attributed to its low locality inductive bias.
No locality inductive bias
Rizal uses a TransformerEncoderLayer, not a full network: no pretraining
Conformer
RNNs are the de facto choice for ASR
RNNs model audio sequences well: A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
Long distance interactions
High training efficiency
CNN ASR
J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde, “Jasper: An end-to-end convolutional neural acoustic model,” arXiv preprint arXiv:1904.03288, 2019.
“QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions”
“ContextNet: Improving convolutional neural networks for automatic speech recognition with global context”
“Deep convolutional neural networks for LVCSR”
“Convolutional neural networks for speech recognition”
Improving saliency guided training
Iteratively mask features with small gradients
Benchmarking also uses masking
Maximize agreement between masked and unmasked outputs
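A rough sketch of one such training step, masking with zeros for simplicity (the cited work substitutes masked values differently; model, x, y, and the masked fraction k are placeholders):

```python
import torch
import torch.nn.functional as F

def saliency_guided_loss(model, x, y, k=0.5):
    # Input-gradient magnitude serves as the per-feature saliency.
    x = x.clone().requires_grad_(True)
    grad, = torch.autograd.grad(F.cross_entropy(model(x), y), x)

    # Mask the fraction k of features with the SMALLEST |gradient|.
    flat = grad.abs().flatten(1)
    idx = flat.argsort(dim=1)[:, : int(k * flat.shape[1])]
    x_masked = x.detach().clone().flatten(1)
    x_masked.scatter_(1, idx, 0.0)
    x_masked = x_masked.view_as(x)

    # Keep masked and unmasked predictions in agreement (KL term)
    # while still fitting the labels on the unmasked input.
    out, out_masked = model(x), model(x_masked)
    kl = F.kl_div(F.log_softmax(out_masked, dim=-1),
                  F.softmax(out, dim=-1), reduction="batchmean")
    return F.cross_entropy(out, y) + kl
```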
Captum
IG uses forward_func(input) if no target is provided
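For example, with a forward function that returns one scalar per sample, attribute can be called without a target (toy model; names are assumptions):

```python
import torch
from captum.attr import IntegratedGradients

model = torch.nn.Linear(10, 1)
forward_func = lambda x: model(x).squeeze(-1)    # one scalar per sample
ig = IntegratedGradients(forward_func)
x = torch.randn(4, 10)
attr = ig.attribute(x, baselines=torch.zeros_like(x))   # no target needed
```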
grads = torch.autograd.grad(torch.unbind(outputs), inputs)
torch.autograd.grad: computes and returns the sum of gradients of outputs with respect to the inputs
https://pytorch.org/docs/stable/autograd.html
allow_unused
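A self-contained sketch of both points, the unbind trick and allow_unused, on toy tensors (illustrative, not Captum's actual code):

```python
import torch

x = torch.randn(4, 10, requires_grad=True)
w = torch.randn(10, 1)
outputs = (x @ w).squeeze(-1)        # shape (4,): one scalar per sample

# grad() returns the SUM of gradients of the unbound scalar outputs
# w.r.t. x; since sample i's output depends only on x[i], the sum
# recovers per-sample gradients.
(grads,) = torch.autograd.grad(torch.unbind(outputs), x, retain_graph=True)

# allow_unused=True: an input that never entered the graph yields
# None instead of raising a RuntimeError.
unused = torch.randn(3, requires_grad=True)
g_x, g_unused = torch.autograd.grad(torch.unbind(outputs), (x, unused),
                                    allow_unused=True)
assert g_unused is None
```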
Audio
ICASSP
Tampere
Sony
Hitachi
AST
DataParallel