Video Classification Papers
Ogawa et al. (2018), Favorite Video Classification Based on Multimodal Bidirectional LSTM
Model Architecture
RNN
LSTM
Variant of RNN that effectively captures both long-term and short-term temporal dependencies
Used in this research to take advantage of long-term information found across frames in video clips as well as in EEG signals
Sequence-to-one
RNN variant that takes sequence data as input and outputs a single classification
An example use-case is determining whether a sentence makes readers happy or not
Used in this research to determine how a user reacts or feels when watching a video
Sequence-to-sequence
RNN variant that takes a sequence as input and also outputs a sequence
Used for problems like translation where your input is a sentence and your output is also a sentence
Bi-directional RNN
An RNN approach where information is propagated forward in time and then backwards in time before generating output
Used when useful information can also be extracted by reading the sequence of inputs backwards
Input Layer
Composed of a series of vectors, each concatenating 1024-dimensional video features with 1024-dimensional EEG features (2048 dimensions per time step)
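A minimal PyTorch sketch of this sequence-to-one bidirectional LSTM, assuming each time step is the 1024-d video feature concatenated with the 1024-d EEG feature; the hidden size and the binary favorite/not-favorite head are illustrative assumptions, not the paper's exact configuration:
```python
import torch
import torch.nn as nn

class FavoriteVideoClassifier(nn.Module):
    """Sequence-to-one bidirectional LSTM over concatenated video+EEG features."""
    def __init__(self, feature_dim=2048, hidden_dim=256, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # Forward and backward final hidden states are concatenated.
        self.head = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, time, 2048) -- 1024-d video features + 1024-d EEG features
        _, (h_n, _) = self.lstm(x)
        # h_n: (2, batch, hidden_dim) for a single-layer BiLSTM
        h = torch.cat([h_n[0], h_n[1]], dim=1)
        return self.head(h)  # one prediction per sequence (sequence-to-one)
```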
Data
Representation
Videos
Represent WUW information
Inception-v3 extracts features per frame; PCA reduces the feature dimensionality to 1024
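A sketch of this feature pipeline using torchvision's Inception-v3 and scikit-learn's PCA; the pooling point and preprocessing are reasonable assumptions rather than the paper's exact setup:
```python
import torch
from torchvision.models import inception_v3, Inception_V3_Weights
from sklearn.decomposition import PCA

weights = Inception_V3_Weights.IMAGENET1K_V1
encoder = inception_v3(weights=weights)
encoder.fc = torch.nn.Identity()   # keep the 2048-d pooled features
encoder.eval()

preprocess = weights.transforms()  # resize/normalize as Inception-v3 expects

@torch.no_grad()
def frame_features(frames):
    # frames: list of PIL images (one per video frame) -> (n_frames, 2048)
    batch = torch.stack([preprocess(f) for f in frames])
    return encoder(batch).numpy()

# Fit PCA on features from many training frames (needs >= 1024 samples),
# then map every frame feature down to 1024 dimensions.
pca = PCA(n_components=1024)
# reduced = pca.fit_transform(all_training_frame_features)
```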
EEG Signals
Represent WUL information
Obtained by sampling at 1024 Hz and used raw
Dataset
Obtained by recording EEG signals of volunteers while watching video trailers
EEG signals are mapped according to the appropriate time-frame in the video
Favorite Video Classification
What a user likes (WUL)
Represents how users react to videos they watch
Useful information for video recommendation systems
Can be used for targeted ads
What a user watches (WUW)
Represents the information that can be found in videos
Most literature uses WUW features to estimate the WUL problem
Usually good enough for estimating WUL
May still fail: e.g., when a user watches an action movie, a WUW-based WUL classification model might predict that the user likes action movies when in fact the user watches the movie only because of the actor
Tran et al. (2019), Video Classification with Channel-Separated Convolutional Networks
High computational cost of 3D convolutions
Kernel refactorization
Usually done to reduce the number of floating point operations (FLOPs)
Usual approach: separate the 3D convolution into 2D convolutions for space and a 1D convolution for time
Paper approach: separate convolutions into channel convolutions and spatiotemporal convolutions
Reorganize convolution operations
Floating point operations
Number of parameters
Number of channel / feature interactions
Defined as the effect one channel has on another channel
Usually occurs when two channels share a common filter
Reduce number of convolutions
Group convolution
Grouping convolution filters into subsets
Each filter convolves over a subset of the input channels
Reduces the cost of a convolution by a factor of G (the number of groups)
Limits feature interaction since channels can only interact with channels in the same group
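A quick illustration of the factor-G saving using PyTorch's `groups` argument (channel counts are arbitrary):
```python
import torch.nn as nn

full = nn.Conv3d(64, 64, kernel_size=3, padding=1)               # dense convolution
grouped = nn.Conv3d(64, 64, kernel_size=3, padding=1, groups=4)  # G = 4 groups

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full), count(grouped))  # grouped weights are 4x smaller; FLOPs shrink likewise
```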
Depthwise convolution
Extreme case of group convolution where each filter sees only one input channel
The number of output channels must equal the number of input channels
Reduces channel interaction to effectively none, since there is only one channel per group
Forms the foundation of the kernel refactorization: separate convolutions for channel interactions and separate convolutions for spatiotemporal interactions
Channel-separated convolutional networks
Either 1 x 1 x 1 convolutions
Model channel interactions through pointwise convolutions
Or k x k x k depthwise convolutions
Model spatiotemporal features
Interaction-preserved channel-separated networks
Has a 1 x 1 x 1 convolution before every depthwise convolution
Aims to preserve channel interaction modelling capabilities of original 3D network while reducing cost
Interaction-reduced channel-separated networks
Converts each 3D convolution directly into a depthwise convolution
Removes channel interactions from those convolutions
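A sketch of the two block types under illustrative channel counts; `groups=c` is what makes the k x k x k convolution depthwise:
```python
import torch.nn as nn

def ip_block(c, k=3):
    # Interaction-preserved: a pointwise conv (channel mixing) before the
    # depthwise k x k x k conv (per-channel spatiotemporal filtering).
    return nn.Sequential(
        nn.Conv3d(c, c, kernel_size=1),                            # 1x1x1: channel interactions
        nn.Conv3d(c, c, kernel_size=k, padding=k // 2, groups=c),  # depthwise spatiotemporal
    )

def ir_block(c, k=3):
    # Interaction-reduced: the 3D conv becomes depthwise only; channel mixing
    # is left to whatever pointwise convolutions the surrounding network has.
    return nn.Conv3d(c, c, kernel_size=k, padding=k // 2, groups=c)
```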
Karpathy et al. (2014), Large-scale Video Classification with Convolutional Neural Networks
Model Architecture
2D Convolutions
2D convolutions are for images. Need to model temporal information in videos
Use 3D Convolutions
Early Fusion
3D convolutions on first layer
Succeeding layers are 2D convolutions
Initial 3D convolution models time component
Late Fusion
Similar to single-frame but uses two input streams separated by a time interval T
The two input streams are joined together by a fully connected layer
Temporal information is provided by the two input streams and is processed by the fully connected layer
Slow Fusion
Uses 3D convolutions all throughout the model until the fully-connected layer
Performs best but is also the most computationally expensive
It performs best since it is the version that captures the most temporal information
3D convolutions can utilize time information by adding temporal depth to the convolutions.
Single-frame (baseline approach)
Classify images one frame at a time
Obtain average classification of frames to get video classification
Does not model temporal information but gives a baseline where only static information is modeled
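A minimal sketch of the single-frame baseline, assuming an arbitrary pre-trained 2D image classifier: classify frames independently and average the per-frame predictions:
```python
import torch

@torch.no_grad()
def classify_video(frames: torch.Tensor, image_model) -> int:
    # frames: (n_frames, 3, H, W); image_model: any 2D image classifier
    probs = image_model(frames).softmax(dim=1)  # classify one frame at a time
    return probs.mean(dim=0).argmax().item()    # average over frames for the video label
```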
Convolutions are expensive.
Has a lot of floating point operations (FLOPs)
More convolutions mean more FLOPs
Reduce cost by reducing layers
Significantly reduces performance
Not suitable since deeper networks tend to be more accurate
Reduce cost by reducing input resolution
Reducing input resolution reduces performance
Use multi-resolution networks
Process two input streams: a center crop at the original resolution (fovea stream) and a downsampled version of the full frame (context stream)
Performs better and faster than single-frame
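A sketch of the two-stream idea, assuming hypothetical `fovea` and `context` conv stacks and a fully connected `head`; whether the two streams share weights is left open here:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResNet(nn.Module):
    def __init__(self, fovea: nn.Module, context: nn.Module, head: nn.Module):
        super().__init__()
        self.fovea, self.context, self.head = fovea, context, head

    def forward(self, frame: torch.Tensor):
        h, w = frame.shape[-2:]
        center = frame[..., h // 4 : 3 * h // 4, w // 4 : 3 * w // 4]   # crop at original res
        down = F.interpolate(frame, scale_factor=0.5, mode="bilinear")  # downsampled frame
        feats = torch.cat([self.fovea(center).flatten(1),
                           self.context(down).flatten(1)], dim=1)
        return self.head(feats)  # fully connected layers merge both streams
```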
Traditional Approaches
Hand-crafted features
Uses classifiers like SVM
Does not leverage large-scale training data or pre-trained features
Data
Sports-1M Dataset
Benchmark for video classification
Widely used for training video classification models
Composed of videos obtained from YouTube and classified via tags
Dataset is prone to errors since videos could have been mistagged
Video labels may not completely describe what a video represents
Data Augmentation and preprocessing
Crop to center
Resize to 200 x 200
Random sample 170 x 170
Random flipping with 50% chance
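The listed pipeline expressed with torchvision transforms (a sketch; the crop-to-center step is interpreted here as cropping the largest centered square):
```python
from torchvision import transforms
from torchvision.transforms import functional as TF

augment = transforms.Compose([
    # Crop to the largest centered square (the crop-to-center step).
    transforms.Lambda(lambda img: TF.center_crop(img, min(img.size))),
    transforms.Resize((200, 200)),           # resize to 200 x 200
    transforms.RandomCrop(170),              # randomly sample a 170 x 170 patch
    transforms.RandomHorizontalFlip(p=0.5),  # flip with 50% chance
    transforms.ToTensor(),
])
```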
Model Training and Testing
Optimizer: Downpour stochastic gradient descent
Transfer Learning
Trained the network on Sports-1M, then applied it to another classification dataset to test generalization capability
Layer fine-tuning
Fine-tune top layer
Best performing: Fine-tune top 3 layers
Fine-tune all layers
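A sketch of the best-performing variant, freezing everything except the top layers; it assumes the model's registered children are ordered bottom-to-top:
```python
import torch.nn as nn

def finetune_top_k(model: nn.Module, k: int = 3) -> nn.Module:
    # Freeze all weights, then unfreeze the last k top-level layers.
    for p in model.parameters():
        p.requires_grad = False
    for layer in list(model.children())[-k:]:
        for p in layer.parameters():
            p.requires_grad = True
    return model
```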
Testing on Sports-1M
Outperforms the paper's own single-frame baseline
Outperforms traditional approaches that use hand-crafted features
Slow-fusion is best performing variant
Diba et al. (2017), Temporal 3D ConvNets
Model Training
Weights initialization
Random weight initialization
Initialization of weights via transfer learning
Can use knowledge from image classification tasks for video classification methods
Saves training time and can increase performance
Initialize weights by first training the video classification model on a larger dataset
Initialization of weights by replicating the weights of a 2D base model
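A sketch of converting a 2D kernel into a 3D kernel by replicating it d times along the temporal axis; the 1/d normalization is an assumption (a common way to preserve activation scale), not necessarily the paper's exact recipe:
```python
import torch

def inflate_2d_weight(w2d: torch.Tensor, d: int) -> torch.Tensor:
    # w2d: (out_ch, in_ch, kH, kW) -> (out_ch, in_ch, d, kH, kW)
    # Dividing by d keeps activation magnitudes comparable to the 2D model.
    return w2d.unsqueeze(2).repeat(1, 1, d, 1, 1) / d
```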
Transfer Learning
Uses supervision transfer with 2D models as the teacher.
Combines the 2D model and the 3D model into one model that solves a single learning problem.
The 2D model accepts a series of images as input, while the 3D model accepts a video clip.
The combined model solves the problem of correspondence between a video clip and a series of images.
This works because, for a video clip and a series of images to match, the 2D model and the 3D model must represent them the same way.
The 3D model effectively learns how the 2D model represents an image, and transfer learning is thereby achieved.
2D model weights are frozen.
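A sketch of this correspondence objective with a frozen 2D teacher and a trainable 3D student; the pairing head and the binary loss are assumptions about one plausible formulation, not the paper's exact one:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def correspondence_loss(teacher_2d: nn.Module, student_3d: nn.Module,
                        pair_head: nn.Module, frames: torch.Tensor,
                        clip: torch.Tensor, match: torch.Tensor) -> torch.Tensor:
    # match is 1.0 when `frames` and `clip` come from the same video, else 0.0
    with torch.no_grad():          # 2D teacher weights stay frozen
        f2d = teacher_2d(frames)
    f3d = student_3d(clip)         # gradients flow only into the 3D student
    logit = pair_head(torch.cat([f2d, f3d], dim=1)).squeeze(1)
    return F.binary_cross_entropy_with_logits(logit, match)
```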
Data
Input Data
Frame Resolution
Compared 112 x 112 vs 224 x 224
Higher frame resolution yielded better performance
This could be because more information is present in more pixels
This could impact cost, since there would be more pixels to process
Frame sampling rate
Evaluated on frame strides {1, 2, 4, 16}
Best performance was achieved when stride = 2
Dataset
HMDB51
Kinetics
Largest dataset among the three.
The model is pre-trained on this dataset before being fine-tuned on the other datasets, to test the effect of pre-training on a larger dataset.
UCF101
Model Architecture
Transfer Learning
Used pre-trained image classification models like ResNet, Inception, and DenseNet
Can use knowledge from large-scale image classification tasks for video classification
Can save time on model training since training won't start from scratch
Supervision transfer of knowledge from 2D image classification models to any 3D video classification model.
Temporal 3D Convnet
Architecture similar to 3D DenseNets
Performs 3D convolutions over sequences of frames
Can be based on a 2D DenseNet model.
2D model weights can be converted to 3D model weights by replicating the 2D weights d times, where d is the temporal depth of the convolution block
Utilizes Temporal Transition Layers in between 3D DenseNet Blocks
TTL captures information at variable temporal depths
Gives a better representation of temporal features, since it can represent short-term, medium-term, and long-term temporal information.
Temporal features from temporal depths are concatenated as one feature before supplying as input to the next 3D DenseNet Block
TTL is not limited to 3D DenseNet blocks; it can be used between 3D versions of ResNet or Inception blocks as well, and it will achieve better results.
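A sketch of a Temporal Transition Layer as described above; the kernel depths are illustrative assumptions (chosen odd so branch outputs align for concatenation), not the paper's exact configuration:
```python
import torch
import torch.nn as nn

class TemporalTransitionLayer(nn.Module):
    # Parallel 3D convolutions with different temporal depths, concatenated
    # along channels before the next 3D DenseNet block.
    def __init__(self, in_ch: int, out_ch: int, depths=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, out_ch, kernel_size=(d, 3, 3),
                      padding=(d // 2, 1, 1))
            for d in depths  # short-, medium-, and long-term branches
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([b(x) for b in self.branches], dim=1)
```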
Brattoli et al. (2020), Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications
Zero-shot learning in other literature
Pre-training on a large dataset to obtain generalization capabilities
Using pre-trained video classification models for initializing weights on your specific subproblem
Usually used when there is no large dataset available for your specific subproblem
Transfer Learning
Zero-shot learning in other literature is usually achieved through transfer learning
Does not completely adhere to zero-shot learning, as pre-trained models could have already encountered the specific subproblem
These models would then carry a bias toward the specific subproblem, which would indeed increase performance
Not truly generalized, as the model may only perform well on problems it has encountered before
Saves training time since model would not start from scratch
Zero-shot learning defined in this paper
Definition
Procedure for training a model on a pre-training set and a training set, then testing it on a test set that does not overlap with either
The model must generalize to unseen classes to be proven effective
Guidelines for realistic ZSL
Training classes (labels) must not overlap with test classes (labels). This ensures that a model can generalize to scenarios it has not encountered
The domain of the pre-training and training sets must be different from that of the test set, to ensure that domain shifts between training and testing are accommodated in the learning problem
ZSL model should perform well on multiple test datasets it has not encountered
Training
Dataset preparation
Remove the training set's classes from the test set
This ensures that the information used for learning won't be available for testing.
This tests the model's generalization capability
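A minimal sketch of the class-overlap filtering, assuming test samples are (input, label) pairs:
```python
def make_zero_shot_test_set(train_classes, test_samples):
    # Keep only test samples whose label never appears during training,
    # so evaluation covers classes the model has genuinely never seen.
    seen = set(train_classes)
    return [(x, y) for (x, y) in test_samples if y not in seen]
```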
Models
Pre-trained models on large datasets
Used available 3D convolution models with weights frozen.
End-to-end model trained for zero-shot learning
Uses architectures similar to those of the pre-trained models
Difference is that these models were trained for ZSL
Evaluation
Protocol 1: Randomly choose half of the test set classes and evaluate the model on the chosen half. Repeat ten times and average the results
The ZSL model in this paper outperformed pre-trained networks under this evaluation protocol
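A sketch of evaluation Protocol 1, with a hypothetical `evaluate` helper that returns accuracy on a given subset of classes:
```python
import random

def protocol_1(model, test_classes, evaluate, runs: int = 10, seed: int = 0):
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        # Evaluate on a random half of the test classes, ten times.
        half = rng.sample(sorted(test_classes), k=len(test_classes) // 2)
        scores.append(evaluate(model, half))
    return sum(scores) / len(scores)  # average over the runs
```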
Protocol 2: Other literature uses training/test splits to evaluate models. The models in this paper are trained on training and test sets with different labels; the paper evaluates the model on a test set that shares no classes with the training set
The ZSL model also outperformed pre-trained networks under this evaluation protocol
Results may show some bias, since the pre-trained networks were not built for this definition of ZSL
The results highlighted in this paper may therefore be less impressive, since two networks with different purposes are compared on a problem that fits only one of them
Cross-paper connections
Modeling temporal information that captures short-term and long-term dependencies
Provides ways of reducing the high computational cost of slow-fusion 3D convolutions
Example of transfer learning that does not completely generalize
3D convolutions