Karpathy et al. (2014) Large-scale Video Classification with Convolutional Neural Networks
Model Architecture
2D Convolutions
2D convolutions operate on single images; videos additionally require modeling temporal information across frames
Use 3D Convolutions
Early Fusion
3D convolutions on first layer
Succeeding layers are 2D convolutions
Initial 3D convolution models time component
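A minimal numpy sketch of the early-fusion idea: the first-layer filter spans the entire temporal window, so time is collapsed in a single step and later layers can be ordinary 2D convolutions. The function name, shapes, and grayscale input are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def early_fusion_first_layer(clip, kernel):
    """Naive early-fusion first layer (sketch).

    clip:   (T, H, W) stack of grayscale frames; the filter spans the
            full temporal extent T, collapsing time in one step.
    kernel: (T, kH, kW) spatio-temporal filter.
    Returns one 2D feature map (valid convolution, stride 1).
    """
    T, H, W = clip.shape
    kT, kH, kW = kernel.shape
    assert kT == T, "early fusion: filter covers the whole time window"
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(clip[:, i:i + kH, j:j + kW] * kernel)
    return out

# Toy example: 4 frames of 8x8 pixels, one 4x3x3 filter -> 6x6 map
clip = np.random.randn(4, 8, 8)
kernel = np.random.randn(4, 3, 3)
fmap = early_fusion_first_layer(clip, kernel)
print(fmap.shape)  # (6, 6)
```

The output is purely spatial, which is why all succeeding layers can be 2D.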
Late Fusion
Similar to the single-frame model, but uses two input streams on frames separated by a time interval T
The two streams share parameters and are merged by a fully connected layer
Temporal information comes from the gap between the streams; the fully connected layer compares their features to detect motion
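A small numpy sketch of late fusion: the same tower (here just a stand-in linear map, a hypothetical simplification of a full ConvNet) processes two frames a time gap apart, and a fully connected layer operates on the concatenated features. All shapes and names are illustrative assumptions.

```python
import numpy as np

def tower(frame, w):
    """Stand-in for a shared-weight single-frame ConvNet tower:
    just a linear map + ReLU on the flattened frame (sketch)."""
    return np.maximum(0, w @ frame.ravel())

def late_fusion(frame_a, frame_b, w_tower, w_fc):
    """Late fusion (sketch): run two frames separated by a time gap
    through the SAME tower, concatenate their features, then let a
    fully connected layer compare them to capture motion."""
    fa = tower(frame_a, w_tower)
    fb = tower(frame_b, w_tower)
    return w_fc @ np.concatenate([fa, fb])

rng = np.random.default_rng(0)
frame_a = rng.standard_normal((8, 8))    # frame at time t
frame_b = rng.standard_normal((8, 8))    # frame at time t + T
w_tower = rng.standard_normal((16, 64))  # shared tower weights
w_fc = rng.standard_normal((10, 32))     # fusion FC layer, 10 classes
scores = late_fusion(frame_a, frame_b, w_tower, w_fc)
print(scores.shape)  # (10,)
```

No layer before the FC fusion ever sees both frames, hence "late" fusion.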
Slow Fusion
Uses 3D convolutions all throughout the model until the fully-connected layer
Performs best but is also the most computationally expensive
It performs best since it is the version that captures the most temporal information
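A numpy sketch of the slow-fusion schedule: each layer mixes only a small temporal window, so information across time is merged gradually over several layers instead of all at once. Spatial convolutions are omitted and replaced by identity here; the window sizes and strides are illustrative assumptions, not the paper's exact values.

```python
import numpy as np

def temporal_conv(x, kT, stride):
    """Convolve along time only (sketch): average kT consecutive
    feature maps, moving by `stride` frames each step."""
    T = x.shape[0]
    steps = (T - kT) // stride + 1
    return np.stack([x[s * stride:s * stride + kT].mean(axis=0)
                     for s in range(steps)])

# A 10-frame clip; the temporal extent shrinks layer by layer,
# so temporal information is fused slowly through the network.
x = np.random.randn(10, 8, 8)
h1 = temporal_conv(x, kT=4, stride=2)   # (4, 8, 8)
h2 = temporal_conv(h1, kT=2, stride=2)  # (2, 8, 8)
h3 = temporal_conv(h2, kT=2, stride=1)  # (1, 8, 8) - time collapsed
print(h1.shape, h2.shape, h3.shape)
```

Because higher layers see progressively larger temporal receptive fields, the model accesses more temporal context than early or late fusion, at a higher compute cost.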
3D convolutions can utilize time information by adding temporal depth to the convolutions.
Single-frame (baseline approach)
Classify images one frame at a time
Obtain average classification of frames to get video classification
Does not model temporal information, but provides a baseline for how well static appearance alone classifies videos
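The baseline above can be sketched in a few lines of numpy: score each frame independently, then average the per-frame class probabilities to label the video. The linear frame model is a hypothetical stand-in for the single-frame ConvNet.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_video(frames, frame_model):
    """Single-frame baseline (sketch): classify each frame on its own,
    then average the per-frame probabilities for the video label."""
    probs = np.stack([softmax(frame_model(f)) for f in frames])
    return probs.mean(axis=0).argmax()

# Hypothetical frame model: a fixed linear classifier over 3 classes.
rng = np.random.default_rng(1)
W = rng.standard_normal((3, 64))
model = lambda f: W @ f.ravel()
frames = rng.standard_normal((5, 8, 8))  # 5 frames sampled from a video
label = classify_video(frames, model)
print(label)
```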
Convolutions are expensive.
Require many floating point operations (FLOPs)
More convolutional layers mean more FLOPs
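The cost argument can be made concrete with a standard back-of-the-envelope FLOP count for a conv layer; the layer dimensions below are illustrative assumptions.

```python
def conv2d_flops(h_out, w_out, c_in, c_out, kh, kw):
    """Approximate multiply-add count for one 2D conv layer:
    every output position computes a (kh * kw * c_in) dot product
    for each of the c_out filters."""
    return h_out * w_out * c_out * (kh * kw * c_in)

# Halving the input resolution quarters the output area, and with it
# the FLOPs of the layer:
full = conv2d_flops(170, 170, 3, 96, 11, 11)
half = conv2d_flops(85, 85, 3, 96, 11, 11)
print(full // half)  # 4
```

This is why reducing input resolution is tempting as a cost cut, and why it hurts: the saved FLOPs come directly out of the spatial detail the filters can see.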
Reduce cost by reducing layers
Significantly reduces performance
Not suitable since deeper networks tend to be more accurate
Reduce cost by reducing input resolution
Reducing input resolution reduces performance
Use multi-resolution networks
Process two input streams: a fovea stream (center crop at the original resolution) and a context stream (the whole frame downsampled)
Performs better and faster than single-frame
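A numpy sketch of the two-stream split: one sharp center crop and one coarse downsampled view of the whole frame, both the same size, so each tower processes roughly a quarter of the full-resolution pixels. The 178/89 sizes follow the note above's spirit; treat the exact numbers here as illustrative.

```python
import numpy as np

def fovea_and_context(frame, crop=89, down=2):
    """Multi-resolution preprocessing (sketch): split one frame into a
    fovea stream (center crop at original resolution) and a context
    stream (whole frame downsampled). Both come out the same size."""
    H, W = frame.shape
    ci, cj = (H - crop) // 2, (W - crop) // 2
    fovea = frame[ci:ci + crop, cj:cj + crop]  # sharp center region
    context = frame[::down, ::down]            # coarse whole frame
    return fovea, context

frame = np.random.randn(178, 178)
fovea, context = fovea_and_context(frame)
print(fovea.shape, context.shape)  # (89, 89) (89, 89)
```

The fovea exploits the camera bias toward centered subjects, while the context stream preserves the global scene cheaply.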
Traditional Approaches
Hand-crafted features
Use classifiers such as SVMs
Do not benefit from representations learned from large-scale data
Data
Sports 1M Dataset
Benchmark for large-scale video classification, introduced in this paper
Widely used for training video classification models
Composed of YouTube videos labeled automatically from their tags
Labels are noisy since videos may be mistagged
Video labels may not completely describe what a video contains
Data Augmentation and preprocessing
Crop to center
Resize to 200 x 200
Randomly crop a 170 x 170 patch
Random flipping with 50% chance
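The four preprocessing steps above can be sketched as one numpy pipeline; the nearest-neighbor resize and the input frame size are simplifying assumptions for the sketch.

```python
import numpy as np

def preprocess(frame, rng, resize=200, out=170):
    """Augmentation pipeline from the notes (sketch): center-crop to a
    square, resize to 200x200 (nearest-neighbor here for simplicity),
    take a random 170x170 crop, and horizontally flip half the time."""
    H, W = frame.shape
    s = min(H, W)                                  # center square crop
    sq = frame[(H - s) // 2:(H - s) // 2 + s,
               (W - s) // 2:(W - s) // 2 + s]
    idx = (np.arange(resize) * s / resize).astype(int)
    resized = sq[np.ix_(idx, idx)]                 # crude 200x200 resize
    i = rng.integers(0, resize - out + 1)          # random 170x170 crop
    j = rng.integers(0, resize - out + 1)
    patch = resized[i:i + out, j:j + out]
    if rng.random() < 0.5:                         # flip with 50% chance
        patch = patch[:, ::-1]
    return patch

rng = np.random.default_rng(2)
frame = rng.standard_normal((240, 320))  # one raw grayscale frame
patch = preprocess(frame, rng)
print(patch.shape)  # (170, 170)
```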
Model Training and Testing
Optimizer: Downpour stochastic gradient descent
Transfer Learning
Trained the network on Sports 1M and transferred it to UCF-101 to test generalization capabilities
Layer fine-tuning
Fine-tune top layer
Best performing: Fine-tune top 3 layers
Fine-tune all layers
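The three fine-tuning regimes above differ only in how many top layers are retrained; a pure-Python sketch of selecting the trainable subset, with hypothetical layer names (conv1 ... fc8) ordered from input to output:

```python
def trainable_params(layers, fine_tune_top_k):
    """Transfer-learning sketch: keep the bottom layers of a pretrained
    network frozen and retrain only the top k layers.
    `layers` is ordered from input (bottom) to output (top)."""
    n = len(layers)
    return [name for i, name in enumerate(layers)
            if i >= n - fine_tune_top_k]

layers = ["conv1", "conv2", "conv3", "conv4", "conv5", "fc6", "fc7", "fc8"]
print(trainable_params(layers, 1))  # ['fc8'] - top layer only
print(trainable_params(layers, 3))  # ['fc6', 'fc7', 'fc8'] - best setting
print(trainable_params(layers, 8))  # all layers retrained
```

Freezing the lower layers keeps the generic features learned on Sports 1M while adapting the task-specific top of the network to the new dataset.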
Testing on Sports 1M
Outperforms the paper's own single-frame baseline
Outperforms traditional approaches that use hand-crafted features
Slow-fusion is best performing variant