Tran et al (2019) Video Classification with Channel-Separated Convolutional Networks
High computational cost of 3D convolution
Kernel factorization
Usually done to reduce the number of floating point operations (FLOPs)
Usual approach: factorize a 3D convolution into a 2D convolution over space and a 1D convolution over time
Paper approach: separate convolutions into channel convolutions and spatiotemporal convolutions
Reorganize convolution operations
Floating point operations
Number of parameters
Number of channel / feature interactions
Defined as the effect one channel has on another channel
Occurs when two input channels are connected through a common filter
Reduce number of convolutions
Group convolution
Grouping convolution filters into subsets
Filter convolves on a subset of the input features
Reduces parameters and FLOPs by a factor of G, the number of groups
Limits feature interaction since channels can only interact with channels in the same group
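The factor-of-G saving can be checked with a small weight count. This is a minimal sketch, not code from the paper: the function names `conv3d_params` and `conv3d_macs` and the 64-channel example are assumptions chosen for illustration, and biases are ignored.

```python
def conv3d_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k x k 3D convolution (bias ignored)."""
    assert c_in % groups == 0 and c_out % groups == 0
    # Each filter only sees c_in / groups input channels.
    return c_out * (c_in // groups) * k ** 3


def conv3d_macs(c_in, c_out, k, out_positions, groups=1):
    """Multiply-accumulates: every weight is used once per output position."""
    return conv3d_params(c_in, c_out, k, groups) * out_positions


# Hypothetical layer: 64 -> 64 channels, 3 x 3 x 3 kernel, G = 4.
full = conv3d_params(64, 64, 3)
grouped = conv3d_params(64, 64, 3, groups=4)
print(full, grouped, full // grouped)
```

Because the multiply-accumulate count is the weight count times the number of output positions, the same factor of G applies to both parameters and FLOPs.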
Depth-wise convolution
Extreme case of group convolution where each filter sees only one input channel (G equals the number of input channels)
The number of output channels must equal the number of input channels
Reduces channel interaction to effectively none since there is only one channel in each group
Forms the basis of the kernel factorization into convolutions for channel interactions and convolutions for spatiotemporal interactions
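One way to make the loss of channel interaction concrete is to count the unordered pairs of input channels that feed the same filter. The sketch below assumes that counting rule; the function name `channel_interactions` and the 64-channel example are hypothetical.

```python
from math import comb


def channel_interactions(c_in, groups=1):
    """Pairs of input channels that share a filter in a (group) convolution."""
    assert c_in % groups == 0
    per_group = c_in // groups
    # Channels only interact inside their own group.
    return groups * comb(per_group, 2)


print(channel_interactions(64))             # standard convolution, G = 1
print(channel_interactions(64, groups=4))   # group convolution, G = 4
print(channel_interactions(64, groups=64))  # depth-wise convolution, G = C
```

Under this counting rule the depth-wise case gives zero pairs, which matches the note that a depth-wise convolution carries no channel interaction of its own.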
Channel-separated convolutional networks
Either 1 x 1 x 1 convolutions
Models channel interactions through point-wise convolutions
Or k x k x k depth-wise convolutions
Models spatiotemporal features
Interaction-preserved channel-separated networks
Adds a 1 x 1 x 1 convolution before every depth-wise convolution
Aims to preserve channel interaction modelling capabilities of original 3D network while reducing cost
Interaction-reduced channel-separated networks
Replaces each 3D convolution with a k x k x k depth-wise convolution
Removes the channel interactions of that layer, so only the remaining 1 x 1 x 1 convolutions mix channels
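The cost difference between the two variants can be sketched by counting weights for a single replaced layer. This is an illustration under stated assumptions, not the paper's reported numbers: equal input and output channel counts, k = 3, no bias, and hypothetical function names.

```python
def standard_3d(c, k=3):
    """Weights of a full k x k x k convolution with c input and c output channels."""
    return c * c * k ** 3


def pointwise(c_in, c_out):
    """Weights of a 1 x 1 x 1 convolution: channel mixing only."""
    return c_in * c_out


def depthwise(c, k=3):
    """Weights of a k x k x k depth-wise convolution: one filter per channel."""
    return c * k ** 3


def ip_csn_layer(c, k=3):
    """Interaction-preserved: 1 x 1 x 1 convolution, then depth-wise convolution."""
    return pointwise(c, c) + depthwise(c, k)


def ir_csn_layer(c, k=3):
    """Interaction-reduced: depth-wise convolution only."""
    return depthwise(c, k)


c = 64  # assumed channel width, for illustration
for name, count in [
    ("standard 3D", standard_3d(c)),
    ("ip-CSN layer", ip_csn_layer(c)),
    ("ir-CSN layer", ir_csn_layer(c)),
]:
    print(f"{name:>12}: {count} weights")
```

The interaction-preserved layer pays for an extra point-wise convolution to keep channel mixing, while the interaction-reduced layer keeps only the spatiotemporal filters.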