J.Y. Ng et al (2015) Beyond Short Snippets: Deep Networks for Video Classification
RNN
Type of deep neural network that models temporal information.
LSTM
An RNN architecture that effectively models long-term information found in temporal data.
This research uses it to classify long videos rather than short clips, since an LSTM can retain information over long time spans.
The LSTM model takes image features concatenated with optical flow features as input.
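To make the LSTM notes above concrete, here is a minimal sketch of one LSTM step applied to per-frame inputs built by concatenating image features with optical flow features. It is a textbook LSTM cell without peephole connections, written in plain Python; the weight initialization, the feature sizes, and all function names (`lstm_step`, `init_params`, `run_lstm`) are assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(input_size, hidden_size, seed=0):
    """Small random weights for the four gates (illustrative values only)."""
    rng = random.Random(seed)

    def gate():
        weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
            for _ in range(hidden_size)
        ]
        biases = [0.0] * hidden_size
        return weights, biases

    return {name: gate() for name in ("input", "forget", "output", "candidate")}


def lstm_step(x, h_prev, c_prev, params):
    """One time step: returns the new hidden state and cell state."""
    z = x + h_prev  # list concatenation of the input and the previous hidden state

    def affine(name):
        weights, biases = params[name]
        return [
            sum(w * v for w, v in zip(row, z)) + b
            for row, b in zip(weights, biases)
        ]

    i = [sigmoid(a) for a in affine("input")]
    f = [sigmoid(a) for a in affine("forget")]
    o = [sigmoid(a) for a in affine("output")]
    g = [math.tanh(a) for a in affine("candidate")]

    # The cell state carries long-term information from frame to frame.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def run_lstm(frames, hidden_size, seed=0):
    """Run the cell over a sequence of per-frame feature vectors."""
    params = init_params(len(frames[0]), hidden_size, seed)
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in frames:
        h, c = lstm_step(x, h, c, params)
    return h, c


# Toy per-frame features (made-up numbers): 2 image values and 1 flow value.
image_feats = [[0.2, 0.5], [0.1, 0.9], [0.7, 0.3]]
flow_feats = [[0.0], [0.4], [0.8]]
frames = [img + flow for img, flow in zip(image_feats, flow_feats)]

h, c = run_lstm(frames, hidden_size=4)
print(h)
```

In a real model the final hidden state (or the per-step hidden states) would feed a classifier, and the weights would be learned rather than random.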
Classification
Feature Pooling
The paper uses feature pooling to combine the image features of every frame in a video.
The argument is that this mimics the bag-of-words approach from image classification, applied to video: each "word" is a feature found in a frame.
Max pooling is used in all of the feature pooling architectures.
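As a rough illustration of the pooling idea above, the sketch below max-pools a list of per-frame feature vectors into a single video-level vector. The function name `max_pool_over_time` and the toy feature values are assumptions for illustration; the paper's architectures differ in where this pooling sits relative to the other layers.

```python
def max_pool_over_time(frame_features):
    """Element-wise maximum across frames: one vector per video."""
    if not frame_features:
        raise ValueError("need at least one frame")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all frames must have the same feature size")
    # zip(*...) groups the k-th feature of every frame together.
    return [max(column) for column in zip(*frame_features)]


# Three frames, three features each (made-up numbers).
frames = [
    [0.1, 0.9, 0.3],
    [0.7, 0.2, 0.3],
    [0.4, 0.5, 0.8],
]

print(max_pool_over_time(frames))        # [0.7, 0.9, 0.8]
print(max_pool_over_time(frames[::-1]))  # [0.7, 0.9, 0.8]
```

Reversing the frame order gives the same result, which is the bag-of-words property: pooling keeps which features occurred but discards when they occurred.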
Feature Pooling Architectures
Conv Pooling
Late Pooling
Slow Pooling
Local Pooling
Time-Domain Convolution
2D Image Classification
Networks trained on image classification tasks, such as GoogLeNet and AlexNet, are used in this research as feature descriptors for the frames of a video.
This lets advances in image classification carry over to video classification.
Datasets
Sports 1M
Used for training and evaluation of feature-pooling models and LSTM models.
Large-scale video classification dataset that was publicly available at the time of publication.
UCF-101
Another benchmark dataset for video classification.
Significantly smaller than Sports 1M
Model Training
Both models are trained on the Sports 1M dataset from scratch.
Weights of the 2D classification models are kept frozen. Only the parts of the architecture related to video classification are trained.
For UCF-101, models are pre-trained on Sports 1M and then fine-tuned on UCF-101, because UCF-101 is relatively small for deep learning architectures.
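The frozen-weights note above can be pictured as a gradient update that skips the frozen parameters. This is a toy sketch with scalar parameters and plain SGD; the parameter names (`cnn.conv1`, `video.classifier`), the gradient values, and the learning rate are all hypothetical and are not taken from the paper.

```python
def sgd_step(params, grads, trainable, learning_rate=0.1):
    """Return updated parameters; names not in `trainable` stay unchanged."""
    return {
        name: value - learning_rate * grads[name] if name in trainable else value
        for name, value in params.items()
    }


# Hypothetical parameters: one from the image network, one from the video part.
params = {"cnn.conv1": 1.0, "video.classifier": 1.0}
grads = {"cnn.conv1": 0.5, "video.classifier": 0.5}

# Only the video-specific part is trained; the image network is frozen.
updated = sgd_step(params, grads, trainable={"video.classifier"})
print(updated)
```

Fine-tuning on a smaller dataset follows the same pattern: start from the pre-trained values in `params` and choose which names go into `trainable`.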
Model Evaluation
Both models are evaluated on Sports 1M after being trained from scratch
Models are evaluated on UCF-101 only after being pre-trained on Sports 1M and fine-tuned on UCF-101.
Optical Flow
Image feature that represents the pattern of motion of objects in a scene
Used by the RNN model along with 2D image features for video classification
Data Preprocessing
Videos are converted from 15 fps to 1 fps.
Optical flow features are computed for a video before it is fed to the LSTM model.
Uses 300 frames for a 5-minute video.
Frames are resized to 256 x 256, and a 220 x 220 region is sampled from each.
Frames are flipped horizontally with 50% probability.
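The preprocessing steps above can be sketched in plain Python as follows. Frames are represented as nested lists that are assumed to be already resized to 256 x 256 (resizing needs an image library and is left out); the function names and the uniform choice of crop position are assumptions for illustration.

```python
import random


def subsample_indices(num_frames, source_fps=15, target_fps=1):
    """Indices of the frames kept when reducing the frame rate."""
    step = source_fps // target_fps
    return list(range(0, num_frames, step))


def random_crop(frame, size, rng):
    """Cut a size x size region at a random position."""
    height, width = len(frame), len(frame[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in frame[top:top + size]]


def random_horizontal_flip(frame, rng, probability=0.5):
    """Mirror the frame left-to-right with the given probability."""
    if rng.random() < probability:
        return [row[::-1] for row in frame]
    return frame


def augment(frame, rng, crop_size=220):
    return random_horizontal_flip(random_crop(frame, crop_size, rng), rng)


# A 5-minute video at 15 fps has 5 * 60 * 15 = 4500 frames; keeping one
# frame per second leaves 300.
kept = subsample_indices(5 * 60 * 15)
print(len(kept))  # 300

# Stand-in for a resized 256 x 256 frame; each pixel is just a unique number.
frame = [[r * 256 + c for c in range(256)] for r in range(256)]
patch = augment(frame, random.Random(0))
print(len(patch), len(patch[0]))  # 220 220
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible for a given seed.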
Legend:
Purple - Hyperparameter Tuning
Blue - Model Training and Evaluation
Yellow - Data
Green - Architecture