Du Tran et al. (2018), A Closer Look at Spatiotemporal Convolutions for Action Recognition
Legend:
Purple - Hyperparameter Tuning
Blue - Model Training and Evaluation
Yellow - Data
Green - Architecture
Mixed 3D 2D ConvNets
(2+1)D ConvNets
3D ConvNets
2D Convolutions for Video Classification
Input preprocessing
Datasets
Model Training
Model Evaluation
f-R2D
R2D
Randomly crop clips to 8 x 112 x 112 to generate more input samples and to provide spatial and temporal jittering
Resize clips to 128 x 171
Randomly choose 5 2-second clips from every video. Video-level classification is performed by averaging the predictions over these clips
Flip clips horizontally with 50% probability (the preprocessing pipeline is sketched below)
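A minimal PyTorch sketch of this preprocessing pipeline, assuming clips arrive as float tensors in (T, H, W, C) layout; the function names, tensor layout, and the separate clip-averaging helper are illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

def preprocess_clip(clip):
    # clip: float tensor of shape (T, H, W, C) with T >= 8 frames.
    frames = clip.permute(0, 3, 1, 2)                     # (T, C, H, W)
    # Resize every frame to 128 x 171.
    frames = F.interpolate(frames, size=(128, 171),
                           mode="bilinear", align_corners=False)
    # Temporal jittering: pick a random 8-frame window.
    t0 = torch.randint(0, frames.shape[0] - 8 + 1, (1,)).item()
    frames = frames[t0:t0 + 8]
    # Spatial jittering: random 112 x 112 crop.
    y = torch.randint(0, 128 - 112 + 1, (1,)).item()
    x = torch.randint(0, 171 - 112 + 1, (1,)).item()
    frames = frames[:, :, y:y + 112, x:x + 112]
    # Horizontal flip with 50% probability.
    if torch.rand(1).item() < 0.5:
        frames = torch.flip(frames, dims=[3])
    return frames.permute(1, 0, 2, 3)                     # (C, 8, 112, 112)

def video_prediction(model, clips):
    # Video-level prediction: average the class scores of the sampled
    # clips (the paper samples 5 clips per video).
    return torch.stack([model(c.unsqueeze(0)) for c in clips]).mean(dim=0)
```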
Large-scale video classification
Action recognition benchmarks
Sports 1M
Kinetics
HMDB51
UCF101
Employs the same training methodology as the C3D paper
Trained and evaluated all examined models on both the Sports 1M and Kinetics datasets
Evaluated the best-performing model, R(2+1)D, on benchmark datasets such as HMDB51 and UCF101
Evaluated multiple versions of R(2+1)D
Pre-trained on Sports 1M and fine-tuned on benchmark datasets
Trained from scratch on benchmark datasets
Pre-trained on Kinetics and fine-tuned on benchmark datasets
2D convolutions over an entire video clip
Used frames as channels in 2D convolutions
Temporal information is only processed in the first convolution; the succeeding 2D convolutions have no access to temporal information (see the sketch below)
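A minimal sketch of the frames-as-channels idea behind R2D (layer widths and class names are illustrative; this is not the paper's full ResNet):

```python
import torch
import torch.nn as nn

class R2DStem(nn.Module):
    # Frames are folded into the channel axis, so only the first
    # convolution ever mixes information across time.
    def __init__(self, num_frames=8, width=64):
        super().__init__()
        # 3 color channels per frame -> 3 * num_frames input channels.
        self.conv1 = nn.Conv2d(3 * num_frames, width, kernel_size=7,
                               stride=2, padding=3)
        # From here on, convolutions are purely spatial: the temporal
        # axis no longer exists as a separate dimension.
        self.conv2 = nn.Conv2d(width, width, kernel_size=3, padding=1)

    def forward(self, x):                  # x: (N, 3, T, H, W)
        n, c, t, h, w = x.shape
        x = x.reshape(n, c * t, h, w)      # frames become channels
        return torch.relu(self.conv2(torch.relu(self.conv1(x))))
```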
Similar architecture to R2D; however, the input is a stream of video frames processed independently.
Temporal information is processed at the space-time pooling layer found at the end of the network.
The network convolves over the frames as if it were an image classification task (sketched below)
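A sketch of f-R2D under the same caveats: the same 2D filters run on each frame independently, and time is fused only by the global space-time pooling at the top.

```python
import torch.nn as nn

class FR2D(nn.Module):
    def __init__(self, num_classes=400, width=64):
        super().__init__()
        self.backbone = nn.Sequential(      # purely per-frame 2D convolutions
            nn.Conv2d(3, width, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(width, num_classes)

    def forward(self, x):                   # x: (N, 3, T, H, W)
        n, c, t, h, w = x.shape
        frames = x.transpose(1, 2).reshape(n * t, c, h, w)  # frames as a batch
        feats = self.backbone(frames)       # (N*T, width, H', W')
        feats = feats.reshape(n, t, *feats.shape[1:])
        pooled = feats.mean(dim=(1, 3, 4))  # global space-time average pooling
        return self.fc(pooled)
```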
MCx
rMCx
Based on the hypothesis that motion modeling is only needed in the early layers of the network and that deeper layers only need spatial information
Early layers have 3D convolutions while deeper layers have 2D convolutions
The x in MCx denotes the number of final layers that use 2D convolutions
Same intuition as MCx but reversed: early layers have 2D convolutions while the last layers have 3D convolutions
The x in rMCx denotes the number of initial layers that use 2D convolutions (the mixed-convolution idea is sketched below)
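A sketch of the MC-style mixing (block counts and widths are illustrative, not the paper's exact ResNet groups); rMCx reverses the order, placing the 2D convolutions first and the 3D convolutions last. One common way to keep the temporal axis intact, assumed here, is to implement the "2D" layers as 3D convolutions with a temporal kernel size of 1.

```python
import torch.nn as nn

def mc_style(width=64):
    # 3D convolutions early (motion modeling); "2D" convolutions late,
    # written as 1 x 3 x 3 kernels so the clip keeps its (N, C, T, H, W)
    # shape all the way to the final pooling layer.
    return nn.Sequential(
        nn.Conv3d(3, width, kernel_size=(3, 3, 3), padding=(1, 1, 1)),
        nn.ReLU(),
        nn.Conv3d(width, width, kernel_size=(3, 3, 3), padding=(1, 1, 1)),
        nn.ReLU(),
        nn.Conv3d(width, width, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
        nn.ReLU(),
    )
```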
All models examined in this research are based on the ResNet architecture, as it is the state-of-the-art architecture for image classification
Vanilla 3D-convolution equivalent of a 2D ResNet model: the same architecture, but with 3D convolutions in place of 2D convolutions
Uses 3D convolutions to convolve over a series of frames in a video clip (a minimal residual block is sketched below)
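A minimal R3D-style residual block (widths illustrative): identical to a 2D ResNet basic block except every convolution is 3 x 3 x 3, so the filters slide over time as well as space.

```python
import torch.nn as nn

class BasicBlock3D(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (N, C, T, H, W)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)               # residual connection
```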
Approximation of 3D convolutions by using 2D convolutions for space and 1D convolutions for time
Uses kernel factorization to decompose each 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution.
Given the same number of layers as a 3D ConvNet (counting a (2+1)D block as a single 3D layer), a (2+1)D ConvNet doubles the number of non-linearities while retaining roughly the same number of parameters
Able to represent more complex functions due to the additional non-linearities
Has lower training and testing error than the 3D model, implying that it is easier to optimize (the factorized block is sketched below)
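A sketch of the (2+1)D block. The intermediate width follows the paper's parameter-matching rule M_i = floor(t d^2 N_{i-1} N_i / (d^2 N_{i-1} + t N_i)), so the factorized block has roughly as many weights as the t x d x d 3D convolution it replaces; the BatchNorm placement here is illustrative.

```python
import torch.nn as nn

def conv2plus1d(in_ch, out_ch, t=3, d=3):
    # Parameter-matched intermediate width from the paper:
    # M = floor(t * d^2 * in_ch * out_ch / (d^2 * in_ch + t * out_ch))
    mid = (t * d * d * in_ch * out_ch) // (d * d * in_ch + t * out_ch)
    return nn.Sequential(
        # 2D spatial convolution (1 x d x d).
        nn.Conv3d(in_ch, mid, kernel_size=(1, d, d),
                  padding=(0, d // 2, d // 2), bias=False),
        nn.BatchNorm3d(mid),
        nn.ReLU(inplace=True),   # the extra non-linearity vs. a single 3D conv
        # 1D temporal convolution (t x 1 x 1).
        nn.Conv3d(mid, out_ch, kernel_size=(t, 1, 1),
                  padding=(t // 2, 0, 0), bias=False),
    )
```

For example, conv2plus1d(64, 64) gives mid = floor(27 * 64 * 64 / (9 * 64 + 3 * 64)) = 144, and 9*64*144 + 3*144*64 = 110592 weights, exactly matching the 3*3*3*64*64 = 110592 weights of the full 3D convolution.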
Results show that 3D models outperform 2D models.