Du Tran et al. (2018), A Closer Look at Spatiotemporal Convolutions for Action Recognition
Mixed 3D 2D ConvNets
MCx
Based on the hypothesis that motion modeling (3D convolution) is needed only in the early layers of the network, while deeper layers need only spatial information (2D convolution)
rMCx
Same intuition as MCx, but reversed: early layers use 2D convolutions while later layers use 3D convolutions
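The two mixed layouts can be sketched as a per-group list of convolution types. This is only an illustration under stated assumptions: the network is taken to have five convolutional groups, and `x` is taken to be the group at which the network switches convolution type. The function name `conv_layout` and its parameters are hypothetical and not from the paper.

```python
def conv_layout(x, reverse=False, num_groups=5):
    """Return the convolution type used in each group of an MCx / rMCx network.

    Assumptions (not stated in the notes): groups are numbered 1..num_groups,
    and x is the first group that uses the second convolution type
    (2D for MCx, 3D for rMCx).
    """
    if not 2 <= x <= num_groups:
        raise ValueError("x must be between 2 and num_groups")
    # MCx starts with 3D and switches to 2D; rMCx does the opposite.
    early, late = ("2D", "3D") if reverse else ("3D", "2D")
    return [early if group < x else late for group in range(1, num_groups + 1)]


print("MC3: ", conv_layout(3))
print("rMC3:", conv_layout(3, reverse=True))
```

Under these assumptions, MC3 uses 3D convolutions in groups 1 and 2 and 2D convolutions from group 3 onward, and rMC3 is the mirror image.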
(2+1)D ConvNets
Given the same number of layers as a 3D ConvNet (counting each (2+1)D block as a single 3D layer), a (2+1)D ConvNet doubles the number of nonlinearities while keeping the same number of parameters
Has lower training and testing error than a 3D model, implying that it is easier to optimize
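The parameter-matching claim can be checked with a little arithmetic. The sketch below assumes a (2+1)D block is a d x d spatial convolution into M intermediate channels followed by a t-frame temporal convolution, with M chosen so the block's weight count does not exceed that of the t x d x d 3D convolution it replaces. Biases are ignored, and the function names are hypothetical.

```python
def params_3d(n_in, n_out, t=3, d=3):
    """Weights in a full t x d x d 3D convolution (biases ignored)."""
    return n_in * n_out * t * d * d


def mid_channels(n_in, n_out, t=3, d=3):
    """Intermediate width M that matches the 3D parameter budget.

    Solves M * (d*d*n_in + t*n_out) = t*d*d*n_in*n_out for M, rounding down.
    """
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def params_2plus1d(n_in, n_out, t=3, d=3):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    m = mid_channels(n_in, n_out, t, d)
    spatial = n_in * m * d * d   # n_in -> m channels, d x d kernel
    temporal = m * n_out * t     # m -> n_out channels, t-frame kernel
    return spatial + temporal


for n_in, n_out in [(64, 64), (64, 128)]:
    print(n_in, n_out, mid_channels(n_in, n_out),
          params_3d(n_in, n_out), params_2plus1d(n_in, n_out))
```

Because a nonlinearity sits between the spatial and temporal convolutions, each block applies two nonlinearities where the 3D layer applies one, at roughly the same parameter cost.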
3D ConvNets
The vanilla 3D counterpart of a 2D ResNet: the same architecture with every 2D convolution replaced by a 3D convolution
Input preprocessing
Randomly crop clips of size 8 x 112 x 112 (frames x height x width), which yields more input samples and provides spatial and temporal jittering
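A minimal sketch of this kind of random cropping, using plain nested lists so it runs without any video library. The 16 x 128 x 171 source video in the usage lines is an illustrative assumption, and the helper names `sample_clip_window` and `crop_clip` are hypothetical.

```python
import random


def sample_clip_window(num_frames, height, width, clip_len=8, crop=112, rng=random):
    """Pick a random start frame (temporal jitter) and top-left corner (spatial jitter)."""
    if num_frames < clip_len or height < crop or width < crop:
        raise ValueError("video is smaller than the requested clip")
    t0 = rng.randint(0, num_frames - clip_len)  # randint is inclusive at both ends
    y0 = rng.randint(0, height - crop)
    x0 = rng.randint(0, width - crop)
    return t0, y0, x0


def crop_clip(video, t0, y0, x0, clip_len=8, crop=112):
    """Cut a clip_len x crop x crop clip out of a video stored as [frame][row][col]."""
    return [
        [row[x0:x0 + crop] for row in frame[y0:y0 + crop]]
        for frame in video[t0:t0 + clip_len]
    ]


# Illustrative input: 16 grayscale frames of 128 x 171 pixels (assumed sizes).
video = [[[0] * 171 for _ in range(128)] for _ in range(16)]
t0, y0, x0 = sample_clip_window(16, 128, 171, rng=random.Random(0))
clip = crop_clip(video, t0, y0, x0)
print(len(clip), len(clip[0]), len(clip[0][0]))
```

Calling `sample_clip_window` repeatedly on the same video gives different windows, which is where the extra input samples come from.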
Randomly choose five 2-second clips from every video. Video-level classification is performed by averaging the predictions over these clips
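Clip-level averaging can be sketched as follows, assuming each clip yields one vector of class scores and the video label is the class with the highest mean score. The function name `video_prediction` and the score values are made up for illustration.

```python
def video_prediction(clip_scores):
    """Average per-clip class scores and return (mean_scores, predicted_class)."""
    if not clip_scores:
        raise ValueError("need at least one clip")
    num_classes = len(clip_scores[0])
    mean = [
        sum(scores[c] for scores in clip_scores) / len(clip_scores)
        for c in range(num_classes)
    ]
    # Index of the class with the highest averaged score.
    best = max(range(num_classes), key=mean.__getitem__)
    return mean, best


# Illustrative scores for 5 clips over 3 classes (invented numbers).
scores = [
    [0.1, 0.7, 0.2],
    [0.3, 0.4, 0.3],
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],
    [0.1, 0.8, 0.1],
]
mean, best = video_prediction(scores)
print(mean, best)
```

Averaging lets a single misleading clip (the fourth one here) be outvoted by the others.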
Model Training
All examined models were trained and evaluated on both the Sports-1M and Kinetics datasets
All models examined in this research are based on the ResNet architecture, as it was a state-of-the-art architecture for image classification