Du Tran et al. (2015) Learning Spatiotemporal Features with 3D Convolutional Networks
3D Convolutions
Based on the slow fusion approach of Karpathy et al. (2014)
Uses 3D convolutions to process both the spatial and temporal information in videos
The network is a series of 3D convolution layers followed by 2 fully connected layers and a softmax layer at the end
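The downsampling through such a conv-pool stack can be shape-traced in pure Python. A minimal sketch, assuming a C3D-style configuration (3 x 3 x 3 convolutions with stride 1 and padding 1, 2 x 2 x 2 max pooling with a 1 x 2 x 2 first pool, and spatial padding on the last pool); the helper names are illustrative:

```python
# Trace feature-map shapes through a C3D-style network (pure Python).
# Assumes shape-preserving 3x3x3 convolutions (stride 1, padding 1),
# a 1x2x2 first pool that keeps early temporal information, and spatial
# padding on the last pool so the map feeds 4096-d fully connected layers.

def conv3d_shape(shape, k=3, s=1, p=1):
    """Output (T, H, W) of a cubic conv; 3x3x3, stride 1, pad 1 keeps shape."""
    return tuple((d + 2 * p - k) // s + 1 for d in shape)

def pool3d_shape(shape, k, p=(0, 0, 0)):
    """Output (T, H, W) of max pooling with stride equal to kernel size."""
    return tuple((d + 2 * pi - ki) // ki + 1 for d, ki, pi in zip(shape, k, p))

shape = (16, 112, 112)                   # one input clip: 16 frames of 112x112
shape = conv3d_shape(shape)              # conv1 block
shape = pool3d_shape(shape, (1, 2, 2))   # pool1: spatial only
for _ in range(3):                       # conv2/3/4 blocks with 2x2x2 pools
    shape = conv3d_shape(shape)
    shape = pool3d_shape(shape, (2, 2, 2))
shape = conv3d_shape(shape)              # conv5 block
shape = pool3d_shape(shape, (2, 2, 2), p=(0, 1, 1))  # last pool, spatial pad
print(shape)  # (1, 4, 4); with 512 channels that is 8192 inputs to fc6
```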
Kernel Temporal Depths
Homogeneous temporal depth
Performs better than varying temporal depth
Network with temporal depth of 3 performs best among examined homogeneous depths
Varying temporal depth
Decreasing - Temporal depths decrease as information is propagated deeper in the network
Increasing - Temporal depths increase as information is propagated deeper in the network
No significant difference between the two
The height and width of the convolution kernels are kept constant (3 x 3) in order to isolate the effect of kernel temporal depth on the performance of the network.
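The depth schedules above can be written out concretely. A small sketch, assuming the depth experiment's 5-conv-layer test network and taking the increasing (3-3-5-5-7) and decreasing (7-5-5-3-3) schedules from the paper (from memory, so treat the exact values as an assumption):

```python
# Kernel temporal-depth schedules for a 5-conv-layer test network:
# spatial size is fixed at 3x3 and only the temporal depth d of each
# (d x 3 x 3) kernel varies.
homogeneous = {d: [d] * 5 for d in (1, 3, 5, 7)}
increasing = [3, 3, 5, 5, 7]   # depth grows with network depth
decreasing = [7, 5, 5, 3, 3]   # depth shrinks with network depth

def kernel_shapes(depths):
    """Full (T, H, W) kernel shape for each conv layer."""
    return [(d, 3, 3) for d in depths]

print(kernel_shapes(homogeneous[3]))       # the best-performing schedule
# Increasing and decreasing use the same multiset of depths, so their
# total temporal kernel budget is identical:
print(sum(increasing) == sum(decreasing))  # True
```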
Spatiotemporal Features
Uses a network with 3 x 3 x 3 kernels for all convolutions and 2 x 2 x 2 kernels for all pooling layers (except the first pooling layer, which is 1 x 2 x 2 to preserve temporal information early on)
Based on a network trained on Sports 1M dataset.
Features are obtained by removing the softmax classification layer and the last fully connected layer of the network, i.e. taking the fc6 activations
The first few layers describe image appearance, while succeeding layers describe the salient motion across frames
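Turning the per-clip activations into one video descriptor follows the paper's recipe: average the clip features and L2-normalize before handing them to an SVM. A numpy sketch, with random dummy activations standing in for real network outputs:

```python
import numpy as np

def c3d_video_descriptor(clip_fc6):
    """Average per-clip fc6 activations into one video feature,
    then L2-normalize the result."""
    video_feat = clip_fc6.mean(axis=0)
    return video_feat / np.linalg.norm(video_feat)

# Dummy stand-in for fc6 activations of 10 clips (4096-d each).
rng = np.random.default_rng(0)
clips = rng.random((10, 4096)).astype(np.float32)
feat = c3d_video_descriptor(clips)
print(feat.shape)  # (4096,), with unit L2 norm
```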
Dataset
Sports 1M
Large-scale video classification dataset used for pre-training the network
UCF101
Dataset for evaluating network features + non-deep classifiers (SVM) on action recognition tasks
ASLAN
Action similarity dataset used for testing features obtained from the network and comparing performance with other state-of-the-art video feature descriptors
YUPENN and Maryland
Dataset for evaluating network features + traditional (SVM) classifiers on scene recognition tasks
Input preprocessing
Randomly choose five 2-second clips from every video; video classification is performed by averaging the predictions over these clips
Resize clip frames to 128 x 171
Randomly crop clips to 16 x 112 x 112 to obtain more input samples and to provide spatial and temporal jittering
Flip clips horizontally with 50% probability
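The jittering steps above can be sketched in numpy. This assumes frames have already been resized to 128 x 171 and represents a clip as a (frames, height, width, channels) array; the function name is illustrative:

```python
import numpy as np

def jitter_clip(clip, rng, t=16, h=112, w=112):
    """Random 16x112x112 spatiotemporal crop plus a 50% horizontal flip.
    `clip` is (frames, height, width, channels), already resized to 128x171."""
    f, H, W, _ = clip.shape
    t0 = rng.integers(0, f - t + 1)      # temporal jitter
    y0 = rng.integers(0, H - h + 1)      # spatial jitter
    x0 = rng.integers(0, W - w + 1)
    crop = clip[t0:t0 + t, y0:y0 + h, x0:x0 + w]
    if rng.random() < 0.5:               # flip left-right half the time
        crop = crop[:, :, ::-1]
    return crop

rng = np.random.default_rng(0)
clip = np.zeros((32, 128, 171, 3), dtype=np.uint8)  # dummy 32-frame clip
print(jitter_clip(clip, rng).shape)  # (16, 112, 112, 3)
```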
C3D Features
Training
Learning rate of 0.003, divided by 2 every 150K iterations
Training is stopped at 1.9M iterations
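The step schedule above is a simple halving rule; a one-function sketch (the function name is illustrative):

```python
def learning_rate(iteration, base=0.003, drop_every=150_000):
    """Initial LR 0.003, halved every 150K iterations
    (training is stopped at 1.9M iterations)."""
    return base / (2 ** (iteration // drop_every))

print(learning_rate(0))          # 0.003
print(learning_rate(150_000))    # 0.0015
print(learning_rate(1_900_000))  # after 12 halvings
```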
Evaluation
Network is evaluated on Sports 1M dataset
Video classification is done by averaging results on 10 randomly chosen clips from a video.
Top 5 video accuracy
Top-5 accuracy counts a prediction as correct when the true label is among the network's five highest-scoring classes
Top 1 video accuracy
Top 5 clip accuracy
Clips are obtained by center-cropping videos for clip-level evaluation
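The top-k metric described above is straightforward to compute from per-class scores; a minimal pure-Python sketch with made-up scores:

```python
def topk_correct(scores, true_label, k=5):
    """True when the true label is among the k highest-scoring classes."""
    topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return true_label in topk

scores = [0.05, 0.40, 0.10, 0.20, 0.15, 0.07, 0.03]
print(topk_correct(scores, true_label=4, k=5))  # True: within the top 5
print(topk_correct(scores, true_label=6, k=5))  # False: lowest score
```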
Outperforms DeepVideo
Legend:
Blue - Model Training and Evaluation
Yellow - Data
Green - Architecture