Du Tran et al (2017) ConvNet Architecture Search for Spatiotemporal Feature Learning
3D Residual Networks
Based on 2D Residual Networks
ResNet uses skip (residual) connections, which mitigate the vanishing gradient problem
Because gradients can flow through the skip connections, much deeper networks can be trained
Inputs are extended from L x W (a single frame) to 8 x L x W, i.e., 8 frames are stacked and fed to the 3D ResNet model
Convolutions are extended from d x d to 3 x d x d, so each convolution examines a temporal depth of 3 frames
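The shape arithmetic above can be sketched with the standard convolution output formula. This is a minimal sketch, not the paper's code; `conv3d_output_shape` is a hypothetical helper name.

```python
def conv3d_output_shape(in_shape, kernel, stride=(1, 1, 1), padding=(1, 1, 1)):
    """Output (T, H, W) of a 3D convolution, using the usual formula
    out = (in + 2*pad - kernel) // stride + 1 per dimension."""
    return tuple(
        (i + 2 * p - k) // s + 1
        for i, k, s, p in zip(in_shape, kernel, stride, padding)
    )

# An 8 x 112 x 112 clip through a padded 3 x 3 x 3 convolution keeps
# its spatio-temporal extent:
print(conv3d_output_shape((8, 112, 112), (3, 3, 3)))  # -> (8, 112, 112)
```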
2.5D ConvNets
Uses kernel factorization to approximate a 3D convolution as a 2D convolution over space followed by a 1D convolution over time
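One motivation for the factorization is the parameter count. A rough sketch, assuming the intermediate channel width equals the output width (the paper's exact channel choice may differ):

```python
def params_3d(c_in, c_out, t=3, d=3):
    # Full 3D kernel: t x d x d weights per (input, output) channel pair.
    return c_in * c_out * t * d * d

def params_2plus1d(c_in, c_out, t=3, d=3, mid=None):
    # Factorized form: a 1 x d x d spatial conv into `mid` channels,
    # followed by a t x 1 x 1 temporal conv. `mid` defaults to c_out
    # here for simplicity (an assumption, not the paper's setting).
    mid = c_out if mid is None else mid
    return c_in * mid * d * d + mid * c_out * t

print(params_3d(64, 64))       # -> 110592
print(params_2plus1d(64, 64))  # -> 49152
```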
Mixed 3D-2D ConvNets
Based on the hypothesis that motion information is only needed in the early layers of the network, while deeper layers only need spatial information
Accordingly, early layers use 3D convolutions while deeper layers use 2D convolutions
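The mixed layout can be written down as a kernel plan per stage. This is a hypothetical sketch of the idea, not the paper's exact configuration:

```python
def mixed_kernel_plan(num_stages=5, num_3d_stages=2):
    # First `num_3d_stages` stages use 3D kernels (temporal depth 3);
    # the remaining deeper stages fall back to 2D (temporal depth 1).
    return [(3, 3, 3) if i < num_3d_stages else (1, 3, 3)
            for i in range(num_stages)]

print(mixed_kernel_plan())
# -> [(3, 3, 3), (3, 3, 3), (1, 3, 3), (1, 3, 3), (1, 3, 3)]
```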
Input Preprocessing (for model training of Res3D)
Resize clips to 128 x 171
Randomly choose five 2-second clips from every video; video-level classification is performed by averaging predictions over these clips
Randomly crop clips to 8 x 112 x 112 to obtain more input samples and to provide spatial and temporal jittering
Flip clips horizontally with 50% probability
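The cropping and flipping steps above can be sketched in numpy. A minimal sketch, assuming clips are stored as (T, H, W, C) arrays; `jitter_clip` is a hypothetical helper name:

```python
import numpy as np

def jitter_clip(clip, crop_t=8, crop_hw=112, rng=None):
    """Random spatio-temporal crop plus 50%-probability horizontal flip,
    applied to a (T, H, W, C) clip (e.g. frames resized to 128 x 171)."""
    rng = np.random.default_rng() if rng is None else rng
    t, h, w, _ = clip.shape
    t0 = rng.integers(0, t - crop_t + 1)   # temporal jitter
    y0 = rng.integers(0, h - crop_hw + 1)  # spatial jitter (rows)
    x0 = rng.integers(0, w - crop_hw + 1)  # spatial jitter (cols)
    crop = clip[t0:t0 + crop_t, y0:y0 + crop_hw, x0:x0 + crop_hw]
    if rng.random() < 0.5:                 # horizontal flip
        crop = crop[:, :, ::-1]
    return crop

clip = np.zeros((16, 128, 171, 3), dtype=np.uint8)
print(jitter_clip(clip).shape)  # -> (8, 112, 112, 3)
```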
Dataset
Sports 1M
Large-scale video classification dataset used for pretraining Res3D
UCF101
Action recognition dataset for comparing performance of Res3D
HMDB51
Another action recognition dataset for comparing performance of Res3D
ASLAN
Action similarity dataset for comparing performance of Res3D
THUMOS
Action detection dataset for comparing performance of Res3D
Sampling Rate
Examined sampling rates of {1, 2, 4, 8, 16, 32} on video clips
A sampling rate of 1 covers too short a time span; consecutive frames are nearly identical and provide little information about the video
A sampling rate of 32 covers too long a span; the sampled frames are so sparse that temporal information can no longer be inferred
Best sampling rate is 2-4
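The effect of the sampling rate is easiest to see from the frame indices an 8-frame clip actually covers. A minimal sketch; `sample_indices` is a hypothetical helper name:

```python
def sample_indices(rate, clip_len=8, start=0):
    # Indices of the `clip_len` frames fed to the network, spaced
    # `rate` frames apart; the clip spans clip_len * rate raw frames.
    return [start + i * rate for i in range(clip_len)]

print(sample_indices(1))   # -> [0, 1, 2, 3, 4, 5, 6, 7]  (8-frame span)
print(sample_indices(32))  # -> [0, 32, ..., 224]          (256-frame span)
```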
Res3D Features
Training
Employs same training methodology as C3D paper
Evaluation
Evaluates on same benchmarks as C3D
Outperforms C3D
2 times faster and 2 times smaller (more compact in parameters) than C3D
Input Resolution
Examined 224 x 224, 112 x 112, 56 x 56
Best overall is 112 x 112 in terms of cost and performance
Smaller resolutions lose a lot of information, while larger resolutions are much harder to train and would need more samples, which is not feasible in this setting
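The cost side of this tradeoff follows from convolution FLOPs scaling with spatial area. A rough sketch (it ignores architecture adjustments such as stride changes):

```python
def relative_cost(hw, base=112):
    # Convolution FLOPs scale roughly with spatial area, so cost
    # relative to the 112 x 112 baseline is (hw / base) ** 2.
    return (hw / base) ** 2

for hw in (56, 112, 224):
    print(hw, relative_cost(hw))  # 56 -> 0.25, 112 -> 1.0, 224 -> 4.0
```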
Type of Convolutions
Examined 3D convolutions, mixed 2D-3D convolutions, and 2.5D convolutions
Consistent with other literature, 3D convolutions performed best
Network Depth
Examined network depths of {10, 16, 18, 26, 34} layers
A depth of 18 layers is sufficient and provides the best tradeoff between cost and performance
In general, deeper networks perform better, but here the gains saturate beyond 18 layers
Legend:
Purple - Hyperparameter Tuning
Blue - Model Training and Evaluation
Yellow - Data
Green - Architecture