Du Tran et al. (2018), A Closer Look at Spatiotemporal Convolutions for Action Recognition

Legend:

Purple - Hyperparameter Tuning

Blue - Model Training and Evaluation

Yellow - Data

Green - Architecture

Mixed 3D-2D ConvNets

(2+1)D ConvNets

3D ConvNets

2D Convolutions for Video Classification

Input preprocessing

Datasets

Model Training

Model Evaluation

f-R2D

R2D

Randomly crop resized clips to 8 x 112 x 112, both to generate more input samples and to apply spatial and temporal jittering

Resize clips to 128 x 171

Randomly choose five 2-second clips from every video; video-level classification is performed by averaging the predictions over these clips

Flip clips horizontally with 50% probability (the whole pipeline is sketched below)
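
A minimal PyTorch sketch of this preprocessing pipeline, assuming clips arrive as (T, H, W, 3) uint8 tensors; the `preprocess_clip` helper is illustrative, not code from the paper:

```python
import torch
import torch.nn.functional as F

def preprocess_clip(frames):
    """Sketch of the clip preprocessing described above.

    frames: uint8 tensor of shape (T, H, W, 3) with T >= 8 frames
    sampled from a 2-second clip. Returns (3, 8, 112, 112) floats.
    """
    x = frames.permute(0, 3, 1, 2).float()           # (T, 3, H, W)
    x = F.interpolate(x, size=(128, 171))            # resize to 128 x 171
    # temporal jittering: pick a random 8-frame window
    t0 = torch.randint(0, x.shape[0] - 8 + 1, (1,)).item()
    x = x[t0:t0 + 8]
    # spatial jittering: random 112 x 112 crop
    i = torch.randint(0, 128 - 112 + 1, (1,)).item()
    j = torch.randint(0, 171 - 112 + 1, (1,)).item()
    x = x[:, :, i:i + 112, j:j + 112]
    # horizontal flip with 50% probability
    if torch.rand(1).item() < 0.5:
        x = torch.flip(x, dims=[3])
    return x.permute(1, 0, 2, 3)                     # (3, 8, 112, 112)
```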

Large-scale video classification

Action recognition benchmarks

Sports-1M

Kinetics

HMDB51

UCF101

Employs the same training methodology as the C3D paper

Trained and evaluated all examined models on both the Sports-1M and Kinetics datasets

Evaluated the best-performing model, R(2+1)D, on benchmark datasets such as HMDB51 and UCF101

Evaluated multiple versions of R(2+1)D

Pre-trained on Sports-1M and fine-tuned on the benchmark datasets

Trained from scratch on the benchmark datasets

Pre-trained on Kinetics and fine-tuned on the benchmark datasets
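
A minimal sketch of the pre-train-then-fine-tune variants, assuming torchvision's `r2plus1d_18` (an R(2+1)D implementation pre-trained on Kinetics) as the backbone; newer torchvision releases replace `pretrained=True` with a `weights=` argument:

```python
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

# Backbone pre-trained on Kinetics; for the from-scratch variant,
# construct it with pretrained=False instead.
model = r2plus1d_18(pretrained=True)

# Swap the classification head to match the target benchmark
# (51 classes for HMDB51, 101 for UCF101), then fine-tune.
model.fc = nn.Linear(model.fc.in_features, 101)
```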

2D convolutions over an entire video clip

Uses the clip's frames as channels of the 2D convolutions

Temporal information is processed only by the first convolution; it collapses the temporal dimension, so the succeeding 2D convolutions have no temporal information to work with.
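
A minimal PyTorch sketch of this frames-as-channels input stem (`R2DStem` and the kernel sizes are illustrative assumptions):

```python
import torch.nn as nn

class R2DStem(nn.Module):
    """The frames-as-channels trick: stacking the clip's T frames along
    the channel axis means only this first 2D convolution mixes
    information across time; every layer after it is purely spatial."""
    def __init__(self, frames=8, out_channels=64):
        super().__init__()
        self.conv = nn.Conv2d(3 * frames, out_channels,
                              kernel_size=7, stride=2, padding=3)

    def forward(self, clip):              # clip: (N, 3, T, H, W)
        n, c, t, h, w = clip.shape
        x = clip.reshape(n, c * t, h, w)  # fold time into channels
        return self.conv(x)               # time dimension is now gone
```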

Similar architecture to R2D; however, each frame of the input stream is processed by the 2D network independently.

Temporal information is processed only at the space-time pooling layer at the end of the network.

The network convolves over frames as if it were performing image classification
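
A minimal PyTorch sketch of this frame-wise layout (the tiny two-layer `backbone` is a stand-in for the full 2D ResNet trunk):

```python
import torch.nn as nn

class FrameR2D(nn.Module):
    """Each frame passes through the same 2D network independently;
    time is only aggregated by the pooling at the very end."""
    def __init__(self, channels=64):
        super().__init__()
        self.channels = channels
        self.backbone = nn.Sequential(    # stand-in for the 2D ResNet trunk
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )

    def forward(self, clip):              # clip: (N, 3, T, H, W)
        n, c, t, h, w = clip.shape
        x = clip.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
        x = self.backbone(x)              # frames processed independently
        x = x.reshape(n, t, self.channels, h, w)
        return x.mean(dim=(1, 3, 4))      # space-time pooling -> (N, channels)
```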

MCx

rMCx

Based on the hypothesis that motion modeling is only needed in the early layers of the network, while deeper layers can rely on spatial reasoning over increasingly abstract features

Early layers have 3D convolutions while deeper layers have 2D convolutions

The x in MCx denotes the group at which convolutions switch to 2D: groups x and deeper use 2D convolutions

Same intuition as MCx but reversed: early layers have 2D convolutions while the last layers have 3D convolutions

The x in rMCx denotes the group at which convolutions switch to 3D: groups x and deeper use 3D convolutions
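
A toy PyTorch sketch of both mixed-convolution layouts; the indexing convention (switching kernel type at group x) follows the paper's MC3/rMC3 examples, and the fixed channel width with plain convolutions in place of residual blocks is a simplification:

```python
import torch.nn as nn

def mixed_conv_stack(x=3, groups=5, reverse=False):
    """Toy layout of MCx (reverse=False) and rMCx (reverse=True):
    a stack of conv groups that switches kernel type at group x."""
    layers = []
    for g in range(1, groups + 1):
        # MCx: groups before x are 3D, groups x and deeper are 2D.
        # rMCx: groups before x are 2D, groups x and deeper are 3D.
        is_3d = (g < x) if not reverse else (g >= x)
        if is_3d:
            conv = nn.Conv3d(64, 64, (3, 3, 3), padding=(1, 1, 1))
        else:  # a "2D" layer inside a 3D net: kernel of size 1 in time
            conv = nn.Conv3d(64, 64, (1, 3, 3), padding=(0, 1, 1))
        layers += [conv, nn.ReLU()]
    return nn.Sequential(*layers)
```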

All models examined in this research are based on the ResNet architecture, as it is the state-of-the-art architecture for image classification

Vanilla 3D-convolutional equivalent of a 2D ResNet: the same architecture, but with every 2D convolution replaced by a 3D one

Uses 3D convolutions to convolve over the stacked frames of a video clip
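
A minimal PyTorch sketch of one such residual block (the structure follows the standard ResNet basic block; stride and downsampling handling are omitted):

```python
import torch.nn as nn

class R3DBlock(nn.Module):
    """Basic R3D residual block: the standard 2D ResNet block with its
    3x3 convolutions replaced by 3x3x3 spatiotemporal convolutions."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):                 # x: (N, C, T, H, W)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)         # residual (identity) connection
```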

Approximation of 3D convolutions by using 2D convolutions for space and 1D convolutions for time

Uses kernel factorization to decompose each 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution.

Given the same number of layers as a 3D ConvNet (counting each (2+1)D block as a single 3D layer), a (2+1)D ConvNet doubles the number of non-linearities while retaining the same number of parameters (see the sketch below)
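
A minimal PyTorch sketch of the factorization; `midplanes` implements the paper's formula for the intermediate channel count M_i, while the helper names themselves are assumptions:

```python
import torch.nn as nn

def midplanes(n_in, n_out, t=3, d=3):
    """Intermediate channel count M_i from the paper, chosen so the
    factorized block matches the parameters of a full t x d x d conv:
        M_i = floor(t * d^2 * N_in * N_out / (d^2 * N_in + t * N_out))
    """
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

def conv2plus1d(n_in, n_out, t=3, d=3):
    """A 3D convolution factorized into a 2D spatial convolution, an
    extra ReLU (the added non-linearity), and a 1D temporal convolution.
    The paper's blocks also use batch norm before the ReLU; it is
    omitted here for brevity."""
    m = midplanes(n_in, n_out, t, d)
    return nn.Sequential(
        nn.Conv3d(n_in, m, (1, d, d), padding=(0, d // 2, d // 2)),
        nn.ReLU(),
        nn.Conv3d(m, n_out, (t, 1, 1), padding=(t // 2, 0, 0)),
    )
```

For a 64-to-64 block with t = d = 3 this gives M_i = 144, so the factorized pair has 1*3*3*64*144 + 3*1*1*144*64 = 110,592 weights, exactly the count of the 3x3x3 convolution it replaces.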

Able to represent more complex functions due to the additional non-linearities

Has lower training and testing error than a 3D model, implying that it is easier to optimize

Results show that 2D models are outperformed by 3D models.