Video Classification Papers
Ogawa et al. (2018), Favorite Video Classification Based on Multimodal Bidirectional LSTM
Model Architecture
RNN
LSTM
Variant of RNN that effectively captures both long-term and short-term temporal dependencies
Used in this research to take advantage of long-term information found across frames in video clips as well as in EEG signals
Sequence-to-one
RNN variant that takes sequence data as input and outputs a single classification
An example use-case is determining whether a sentence makes readers happy or not
Used in this research to determine how a user reacts or feels when watching a video
Sequence-to-sequence
RNN variant that takes a sequence as input and also outputs a sequence
Used for problems like translation where your input is a sentence and your output is also a sentence
Bi-directional RNN
An RNN approach where information is propagated forward in time and then backwards in time before generating output
Used when useful information can also be extracted by reading the sequence of inputs backwards
Input Layer
Composed of a series of vectors, each concatenating 1024-dimensional video features with 1024-dimensional EEG features (2048 dimensions per time step)
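A minimal PyTorch sketch of this sequence-to-one bidirectional LSTM, assuming each time step is the 1024-d video feature concatenated with the 1024-d EEG feature; the hidden size and the binary favorite/not-favorite head are illustrative assumptions, not the paper's exact configuration:
```python
import torch
import torch.nn as nn

class FavoriteVideoClassifier(nn.Module):
    """Sequence-to-one bidirectional LSTM over concatenated video+EEG features."""
    def __init__(self, feature_dim=2048, hidden_dim=256, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # Forward and backward final hidden states are concatenated.
        self.head = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, time, 2048) -- 1024-d video features + 1024-d EEG features
        _, (h_n, _) = self.lstm(x)
        # h_n: (2, batch, hidden_dim) for a single-layer BiLSTM
        h = torch.cat([h_n[0], h_n[1]], dim=1)
        return self.head(h)  # one prediction per sequence (sequence-to-one)
```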
Data
Representation
Videos
Represent WUW information
Inception-v3 extracts features per frame; PCA reduces the feature dimensionality to 1024
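A sketch of this feature pipeline using torchvision's Inception-v3 and scikit-learn's PCA; the pooling point and preprocessing are reasonable assumptions rather than the paper's exact setup:
```python
import torch
from torchvision.models import inception_v3, Inception_V3_Weights
from sklearn.decomposition import PCA

weights = Inception_V3_Weights.IMAGENET1K_V1
encoder = inception_v3(weights=weights)
encoder.fc = torch.nn.Identity()   # keep the 2048-d pooled features
encoder.eval()

preprocess = weights.transforms()  # resize/normalize as Inception-v3 expects

@torch.no_grad()
def frame_features(frames):
    # frames: list of PIL images (one per video frame) -> (n_frames, 2048)
    batch = torch.stack([preprocess(f) for f in frames])
    return encoder(batch).numpy()

# Fit PCA on features from many training frames (needs >= 1024 samples),
# then map every frame feature down to 1024 dimensions.
pca = PCA(n_components=1024)
# reduced = pca.fit_transform(all_training_frame_features)
```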
EEG Signals
Represent WUL information
Obtained by sampling at 1024 Hz and used raw
Dataset
Obtained by recording EEG signals of volunteers while watching video trailers
EEG signals are mapped according to the appropriate time-frame in the video
Favorite Video Classification
What a user likes (WUL)
Represents how users react to videos they watch
Useful information for video recommendation systems
Can be used for targeted ads
What a user watches (WUW)
Represents the information that can be found in videos
Most literature uses WUW features to estimate the WUL problem
Usually good enough for estimating WUL
May still fail: e.g., when a user watches an action movie, a WUW-based WUL classification model might predict that the user likes action movies when in fact the user watches the movie only because of the actor
Tran et al. (2019), Video Classification with Channel-Separated Convolutional Networks
High computational cost of 3D convolutions
Kernel refactorization
Usually done to reduce the number of floating point operations (FLOPs)
Usual approach: separate the 3D convolution into 2D convolutions for space and a 1D convolution for time
Paper approach: separate convolutions into channel convolutions and spatiotemporal convolutions
Reorganize convolution operations
Floating point operations
Number of parameters
Number of channel / feature interactions
Defined as the effect one channel has on another channel
Usually occurs when two channels share a common filter
Reduce number of convolutions
Group convolution
Grouping convolution filters into subsets
Each filter convolves over a subset of the input channels
Reduces the cost of a convolution by a factor of G (the number of groups)
Limits feature interaction since channels can only interact with channels in the same group
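A quick illustration of the factor-G saving using PyTorch's `groups` argument (channel counts are arbitrary):
```python
import torch.nn as nn

full = nn.Conv3d(64, 64, kernel_size=3, padding=1)               # dense convolution
grouped = nn.Conv3d(64, 64, kernel_size=3, padding=1, groups=4)  # G = 4 groups

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full), count(grouped))  # grouped weights are 4x smaller; FLOPs shrink likewise
```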
Depthwise convolution
Extreme case of group convolution where each filter sees only one input channel
The number of output channels must equal the number of input channels
Reduces channel interaction to effectively none, since there is only one channel per group
Forms the foundation of the kernel refactorization: separate convolutions for channel interactions and separate convolutions for spatiotemporal interactions
Channel-separated convolutional networks
Either 1 x 1 x 1 convolutions
Model channel interactions through pointwise convolutions
Or k x k x k depthwise convolutions
Model spatiotemporal features
Interaction-preserved channel-separated networks
Has a 1 x 1 x 1 convolution before every depthwise convolution
Aims to preserve channel interaction modelling capabilities of original 3D network while reducing cost
Interaction-reduced channel-separated networks
Converts each 3D convolution directly into a depthwise convolution
Removes channel interactions from those convolutions
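A sketch of the two block types under illustrative channel counts; `groups=c` is what makes the k x k x k convolution depthwise:
```python
import torch.nn as nn

def ip_block(c, k=3):
    # Interaction-preserved: a pointwise conv (channel mixing) before the
    # depthwise k x k x k conv (per-channel spatiotemporal filtering).
    return nn.Sequential(
        nn.Conv3d(c, c, kernel_size=1),                            # 1x1x1: channel interactions
        nn.Conv3d(c, c, kernel_size=k, padding=k // 2, groups=c),  # depthwise spatiotemporal
    )

def ir_block(c, k=3):
    # Interaction-reduced: the 3D conv becomes depthwise only; channel mixing
    # is left to whatever pointwise convolutions the surrounding network has.
    return nn.Conv3d(c, c, kernel_size=k, padding=k // 2, groups=c)
```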
Karpathy et al. (2014), Large-scale Video Classification with Convolutional Neural Networks
Model Architecture
2D Convolutions
2D convolutions are for images. Need to model temporal information in videos
Use 3D Convolutions
Early Fusion
3D convolutions on first layer
Succeeding layers are 2D convolutions
Initial 3D convolution models time component
Late Fusion
Similar to single-frame but uses two input streams separated by a time interval T
The two input streams are joined together by a fully connected layer
Temporal information is provided by the two input streams and is processed by the fully connected layer
Slow Fusion
Uses 3D convolutions all throughout the model until the fully-connected layer
Performs best but is also the most computationally expensive
It performs best since it is the version that captures the most temporal information
3D convolutions can utilize time information by adding temporal depth to the convolutions.
Single-frame (baseline approach)
Classify images one frame at a time
Obtain average classification of frames to get video classification
Does not model temporal information but gives a baseline where only static information is modeled
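A minimal sketch of the single-frame baseline, assuming an arbitrary pre-trained 2D image classifier: classify frames independently and average the per-frame predictions:
```python
import torch

@torch.no_grad()
def classify_video(frames: torch.Tensor, image_model) -> int:
    # frames: (n_frames, 3, H, W); image_model: any 2D image classifier
    probs = image_model(frames).softmax(dim=1)  # classify one frame at a time
    return probs.mean(dim=0).argmax().item()    # average over frames for the video label
```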
Convolutions are expensive.
Has a lot of floating point operations (FLOPs)
More convolutions mean more FLOPs
Reduce cost by reducing layers
Significantly reduces performance
Not suitable since deeper networks tend to be more accurate
Reduce cost by reducing input resolution
Reducing input resolution reduces performance
Use multi-resolution networks
Process two input streams: a center crop at the original resolution (fovea stream) and a downsampled version of the full frame (context stream)
Performs better and faster than single-frame
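A sketch of the two-stream idea, assuming hypothetical `fovea` and `context` conv stacks and a fully connected `head`; whether the two streams share weights is left open here:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResNet(nn.Module):
    def __init__(self, fovea: nn.Module, context: nn.Module, head: nn.Module):
        super().__init__()
        self.fovea, self.context, self.head = fovea, context, head

    def forward(self, frame: torch.Tensor):
        h, w = frame.shape[-2:]
        center = frame[..., h // 4 : 3 * h // 4, w // 4 : 3 * w // 4]   # crop at original res
        down = F.interpolate(frame, scale_factor=0.5, mode="bilinear")  # downsampled frame
        feats = torch.cat([self.fovea(center).flatten(1),
                           self.context(down).flatten(1)], dim=1)
        return self.head(feats)  # fully connected layers merge both streams
```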
Traditional Approaches
Hand-crafted features
Uses classifiers like SVM
Does not leverage large-scale training data or pre-trained features
Data
Sports-1M Dataset
Benchmark for video classification
Widely used for training video classification models
Composed of videos obtained from YouTube and classified via tags
Dataset is prone to errors since videos could have been mistagged
Video labels may not completely describe what a video represents
Data Augmentation and preprocessing
Crop to center
Resize to 200 x 200
Random sample 170 x 170
Random flipping with 50% chance
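The listed pipeline expressed with torchvision transforms (a sketch; the crop-to-center step is interpreted here as cropping the largest centered square):
```python
from torchvision import transforms
from torchvision.transforms import functional as TF

augment = transforms.Compose([
    # Crop to the largest centered square (the crop-to-center step).
    transforms.Lambda(lambda img: TF.center_crop(img, min(img.size))),
    transforms.Resize((200, 200)),           # resize to 200 x 200
    transforms.RandomCrop(170),              # randomly sample a 170 x 170 patch
    transforms.RandomHorizontalFlip(p=0.5),  # flip with 50% chance
    transforms.ToTensor(),
])
```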
Model Training and Testing
Optimizer: Downpour stochastic gradient descent
Transfer Learning
Trained the network on Sports-1M, then applied it to another classification dataset to test generalization capability
Layer fine-tuning
Fine-tune top layer
Best performing: Fine-tune top 3 layers
Fine-tune all layers
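A sketch of the best-performing variant, freezing everything except the top layers; it assumes the model's registered children are ordered bottom-to-top:
```python
import torch.nn as nn

def finetune_top_k(model: nn.Module, k: int = 3) -> nn.Module:
    # Freeze all weights, then unfreeze the last k top-level layers.
    for p in model.parameters():
        p.requires_grad = False
    for layer in list(model.children())[-k:]:
        for p in layer.parameters():
            p.requires_grad = True
    return model
```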
Testing on Sports-1M
Outperforms the paper's own single-frame baseline
Outperforms traditional approaches that use hand-crafted features
Slow-fusion is best performing variant
Diba et al. (2017), Temporal 3D ConvNets
Model Training
Weights initialization
Random weight initialization
Initialization of weights via transfer learning
Can use knowledge from image classification tasks for video classification methods
Saves training time and can increase performance
Initialize weights by first training the video classification model on a larger dataset
Initialization of weights by replicating the weights of a 2D base model
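A sketch of converting a 2D kernel into a 3D kernel by replicating it d times along the temporal axis; the 1/d normalization is an assumption (a common way to preserve activation scale), not necessarily the paper's exact recipe:
```python
import torch

def inflate_2d_weight(w2d: torch.Tensor, d: int) -> torch.Tensor:
    # w2d: (out_ch, in_ch, kH, kW) -> (out_ch, in_ch, d, kH, kW)
    # Dividing by d keeps activation magnitudes comparable to the 2D model.
    return w2d.unsqueeze(2).repeat(1, 1, d, 1, 1) / d
```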
Transfer Learning
Uses supervision transfer with 2D models as the teacher.
Combines the 2D model and the 3D model into one model that solves a single learning problem.
The 2D model accepts a series of images as input, while the 3D model accepts a video clip.
The combined model solves the problem of correspondence between a video clip and a series of images.
This works because, for a video clip and a series of images to match, the 2D model and the 3D model must represent them the same way.
The 3D model effectively learns how the 2D model represents an image, and transfer learning is thereby achieved.
2D model weights are frozen.
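A sketch of this correspondence objective with a frozen 2D teacher and a trainable 3D student; the pairing head and the binary loss are assumptions about one plausible formulation, not the paper's exact one:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def correspondence_loss(teacher_2d: nn.Module, student_3d: nn.Module,
                        pair_head: nn.Module, frames: torch.Tensor,
                        clip: torch.Tensor, match: torch.Tensor) -> torch.Tensor:
    # match is 1.0 when `frames` and `clip` come from the same video, else 0.0
    with torch.no_grad():          # 2D teacher weights stay frozen
        f2d = teacher_2d(frames)
    f3d = student_3d(clip)         # gradients flow only into the 3D student
    logit = pair_head(torch.cat([f2d, f3d], dim=1)).squeeze(1)
    return F.binary_cross_entropy_with_logits(logit, match)
```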
Data
Input Data
Frame Resolution
Compared 112 x 112 vs 224 x 224
Higher frame resolution yielded better performance
This could be because more information is present in more pixels
This could impact cost, since there would be more pixels to process
Frame sampling rate
Evaluated on frame strides {1, 2, 4, 16}
Best performance was achieved when stride = 2
Dataset
HMDB51
Kinetics
Largest dataset among the three.
The model is pre-trained on this dataset before being fine-tuned on the other datasets, to test the effect of pre-training on a larger dataset.
UCF101
Model Architecture
Transfer Learning
Used pre-trained image classification models like ResNet, Inception, and DenseNet
Can use knowledge from large-scale image classification tasks for video classification
Can save time on model training since training won't start from scratch
Supervision transfer of knowledge from 2D image classification models to any 3D video classification model.
Temporal 3D Convnet
Architecture similar to 3D DenseNets
Performs 3D convolutions over sequences of frames
Can be based on a 2D DenseNet model.
2D model weights can be converted to 3D model weights by replicating the 2D weights d times, where d is the temporal depth of the convolution block
Utilizes Temporal Transition Layers in between 3D DenseNet Blocks
TTL captures information at variable temporal depths
Gives a better representation of temporal features, since it can represent short-term, medium-term, and long-term temporal information.
Temporal features from temporal depths are concatenated as one feature before supplying as input to the next 3D DenseNet Block
TTL is not limited to 3D DenseNet blocks; it can be used between 3D versions of ResNet or Inception blocks as well, and it will achieve better results.
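A sketch of a Temporal Transition Layer as described above; the kernel depths are illustrative assumptions (chosen odd so branch outputs align for concatenation), not the paper's exact configuration:
```python
import torch
import torch.nn as nn

class TemporalTransitionLayer(nn.Module):
    # Parallel 3D convolutions with different temporal depths, concatenated
    # along channels before the next 3D DenseNet block.
    def __init__(self, in_ch: int, out_ch: int, depths=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, out_ch, kernel_size=(d, 3, 3),
                      padding=(d // 2, 1, 1))
            for d in depths  # short-, medium-, and long-term branches
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([b(x) for b in self.branches], dim=1)
```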
Brattoli et al. (2020), Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications
Zero-shot learning in other literature
Pre-training on a large dataset to obtain generalization capabilities
Using pre-trained video classification models for initializing weights on your specific subproblem
Usually used when there is no large dataset available for your specific subproblem
Transfer Learning
Zero-shot learning in other literature is usually achieved through transfer learning
Does not completely adhere to zero-shot learning, as pre-trained models could have already encountered the specific subproblem
These models would then carry a bias toward the specific subproblem, which would indeed increase performance
Not truly generalized, as the model may only perform well on problems it has encountered before
Saves training time since model would not start from scratch
Zero-shot learning defined in this paper
Definition
Procedure for training a model on a pre-training set and a training set, then testing it on a test set that does not overlap with either
The model must generalize to unseen classes to be proven effective
Guidelines for realistic ZSL
Training classes (labels) must not overlap with test classes (labels). This ensures that a model can generalize to scenarios it has not encountered
The domain of the pre-training and training sets must be different from that of the test set, to ensure that domain shifts between training and testing are accommodated in the learning problem
ZSL model should perform well on multiple test datasets it has not encountered
Training
Dataset preparation
Remove the training set's classes from the test set
This ensures that the information used for learning won't be available for testing.
This tests the model's generalization capability
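A minimal sketch of the class-overlap filtering, assuming test samples are (input, label) pairs:
```python
def make_zero_shot_test_set(train_classes, test_samples):
    # Keep only test samples whose label never appears during training,
    # so evaluation covers classes the model has genuinely never seen.
    seen = set(train_classes)
    return [(x, y) for (x, y) in test_samples if y not in seen]
```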
Models
Pre-trained models on large datasets
Used available 3D convolution models with weights frozen.
End-to-end model trained for zero-shot learning
Uses architectures similar to those of the pre-trained models
Difference is that these models were trained for ZSL
Evaluation
Protocol 1: Randomly choose half of the test set classes and evaluate the model on the chosen half. Repeat ten times and average the results
The ZSL model in this paper outperformed pre-trained networks under this evaluation protocol
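A sketch of evaluation Protocol 1, with a hypothetical `evaluate` helper that returns accuracy on a given subset of classes:
```python
import random

def protocol_1(model, test_classes, evaluate, runs: int = 10, seed: int = 0):
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        # Evaluate on a random half of the test classes, ten times.
        half = rng.sample(sorted(test_classes), k=len(test_classes) // 2)
        scores.append(evaluate(model, half))
    return sum(scores) / len(scores)  # average over the runs
```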
Protocol 2: Other literature uses training/test splits to evaluate models. The models in this paper are trained on training and test sets with different labels; the paper evaluates the model on a test set that shares no classes with the training set
The ZSL model also outperformed pre-trained networks under this evaluation protocol
Results may show some bias, since the pre-trained networks were not built for this definition of ZSL
The results highlighted in this paper may therefore be less impressive, since two networks with different purposes are compared on a problem that fits only one of them
Cross-paper connections
Modeling temporal information that captures short-term and long-term dependencies
Provides ways of reducing the high computational cost of slow-fusion 3D convolutions
Example of transfer learning that does not completely generalize
3D convolutions