Video Classification Papers that try to improve performance on benchmark…

- - - - An RNN architecture that effectively models long-term information found in temporal data.
      - Used by this research to classify long videos rather than short clips, since an LSTM can retain information over long time spans
      - The LSTM model takes image features concatenated with optical flow features as input.
      - Classification
  - - - Conv Pooling
      - Late Pooling
      - Slow Pooling
      - Local Pooling
      - Time-Domain Convolution
  - - - Used for training and evaluation of feature-pooling models and LSTM models.
      - Large-scale video classification dataset that was publicly available at the time of publication.
    - - Another benchmark dataset for video classification.
      - Significantly smaller than Sports 1M
- - - - The idea behind inflating is that learned parameters from image classification could prove to be useful in the video classification domain.
      - Inflating involves replicating 2D convolution weights d times where d denotes the temporal depth examined by a convolutional layer
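
The inflation step above can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's implementation: the function names are made up for this note, a single k x k kernel stands in for a full convolution layer, and the 1/d rescaling is not stated in the notes above (it is the usual companion to replication, so that a clip made of d identical frames gives the same response as the original 2D filter).

```python
def inflate_kernel(kernel_2d, d, rescale=True):
    """Replicate a k x k kernel d times along a new leading temporal axis.

    With rescale=True every copy is divided by d (an assumption, see lead-in).
    """
    scale = 1.0 / d if rescale else 1.0
    return [[[w * scale for w in row] for row in kernel_2d] for _ in range(d)]


def response_2d(kernel_2d, patch):
    """Response of a 2D kernel at one position: sum of elementwise products."""
    return sum(w * x
               for k_row, p_row in zip(kernel_2d, patch)
               for w, x in zip(k_row, p_row))


def response_3d(kernel_3d, clip):
    """Response of a 3D kernel at one position: sum of per-frame 2D responses."""
    return sum(response_2d(k, frame) for k, frame in zip(kernel_3d, clip))


# Hypothetical 3 x 3 kernel and image patch, for illustration only.
kernel = [[1.0, 0.0, -1.0],
          [2.0, 0.0, -2.0],
          [1.0, 0.0, -1.0]]
patch = [[3.0, 1.0, 0.0],
         [4.0, 1.0, 1.0],
         [5.0, 9.0, 2.0]]

d = 4                               # temporal depth of the inflated kernel
inflated = inflate_kernel(kernel, d)
static_clip = [patch] * d           # the same frame repeated d times

print(response_2d(kernel, patch))
print(response_3d(inflated, static_clip))
```

Both printed values agree, which is the point of the rescaling: the inflated filter starts out behaving like the image filter on a video where nothing moves.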
  - - - Another large dataset for large-scale video classification
      - Also used in this research for testing and fine-tuning
    - - UCF101
      - HMDB51
- - - - Based on the hypothesis that motion modeling is only needed on the early layers of the network and that deeper layers need spatial information
      - Early layers have 3D convolutions while deeper layers have 2D convolutions
      - The x in MCx denotes the number of the last few layers which will have 2D convolutions
    - - Same intuition as MCx but reversed: early layers have 2D convolutions while the last layers have 3D convolutions
      - The x in rMCx denotes the number of the first few layers which will have 2D convolutions
  - - - Similar architecture to R2D, however, input is a stream of video frames.
      - Temporal information is processed at the space-time pooling layer found at the end of the network.
      - Network convolves over images as if it was an image classification task
    - - 2D convolutions over an entire video clip
      - Used frames as channels in 2D convolutions
      - Temporal information is only processed on the first convolution. Afterwards, the succeeding 2D convolutions don't have temporal information.
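
The difference between the two 2D designs above is mostly a question of how the clip is handed to the network. The sketch below only illustrates the input layouts and the final temporal pooling; the helper names and the toy clip are assumptions made for this note, and no actual convolution is performed.

```python
def make_clip(num_frames, channels=3, height=2, width=2):
    """Build a toy clip indexed as clip[t][c][y][x]; values encode the frame index."""
    return [[[[float(t)] * width for _ in range(height)]
             for _ in range(channels)]
            for t in range(num_frames)]


def frames_as_channels(clip):
    """Frames-as-channels layout: one image with channels * num_frames planes.

    The first 2D convolution mixes all frames at once; later layers no longer
    see a separate time axis.
    """
    return [plane for frame in clip for plane in frame]


def frames_as_batch(clip):
    """Frame-wise layout: each frame stays its own image, processed independently."""
    return [frame for frame in clip]


def pool_over_time(per_frame_features):
    """Average per-frame feature vectors; stands in for the final space-time pooling."""
    n = len(per_frame_features)
    return [sum(col) / n for col in zip(*per_frame_features)]


clip = make_clip(num_frames=4)
stacked = frames_as_channels(clip)    # 1 image with 12 channels
batch = frames_as_batch(clip)         # 4 images with 3 channels each
print(len(stacked), len(batch), len(batch[0]))

# Hypothetical per-frame feature vectors, as if produced by a 2D network.
features = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 2.0]]
print(pool_over_time(features))
```

In the frame-wise layout the only place where frames interact is `pool_over_time`, which matches the note that temporal information is handled at the end of the network.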
  - - - Sports 1M
      - Kinetics
    - - HMDB51
      - UCF101
  - - - Pre-trained from Sports 1M and fine-tuned on benchmark datasets
      - Trained from scratch on benchmark datasets
      - Pre-trained from Kinetics and fine-tuned on benchmark datasets
- - - - ResNet contains skip connections to prevent vanishing gradients and overfitting
      - ResNet solves the vanishing gradient problem and allows neural networks to be deeper.
  - - - Large-scale video classification dataset used for pretraining Res3D
    - - Action recognition dataset for comparing performance of Res3D
    - - Another action recognition dataset for comparing performance of Res3D
    - - Action similarity dataset for comparing performance of Res3D
    - - Action detection dataset for comparing performance of Res3D
  - - - Employs same training methodology as C3D paper
    - - Evaluates on same benchmarks as C3D
      - Outperforms C3D
      - 2 times faster at inference, 2 times smaller in model size, and yields a more compact representation than C3D
- - - - Homogeneous temporal depth performs better than varying temporal depth
      - Network with temporal depth of 3 performs best among examined homogeneous depths
    - - Decreasing - Temporal depths decrease as information is propagated deeper in the network
      - Increasing - Temporal depths increase as information is propagated deeper in the network
      - No significant difference between the two
  - - - Large-scale video classification dataset used for pre-training the network
    - - Dataset for evaluating network features + non-deep classifiers (SVM) on action recognition tasks
    - - Action similarity dataset used for testing features obtained from the network and comparing performance with other state-of-the-art video feature descriptors
    - - Dataset for evaluating network features + traditional (SVM) classifiers on scene recognition tasks
  - - - Initial learning rate of 0.003, divided by 2 every 150k iterations
      - Training runs for 1.9M iterations
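
The schedule above is a plain step decay. A minimal sketch, assuming the halving happens exactly at each multiple of 150k iterations (the function name and the hard step boundary are assumptions of this note):

```python
def step_lr(iteration, base_lr=0.003, step=150_000, factor=0.5):
    """Learning rate after `iteration` steps: base_lr halved once per completed step."""
    return base_lr * factor ** (iteration // step)


for it in (0, 149_999, 150_000, 300_000, 1_900_000):
    print(it, step_lr(it))
```

Over 1.9M iterations the rate is halved 12 times (1,900,000 // 150,000 = 12), so under these assumptions training ends at 0.003 / 4096.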
    - - Network is evaluated on Sports 1M dataset
      - Video classification is done by averaging results on 10 randomly chosen clips from a video. Reported metrics:
        - Top 5 video accuracy: a prediction counts as correct when the true class is among the network's 5 highest-scoring classes
        - Top 1 video accuracy
        - Top 5 clip accuracy
      - Video clips are obtained by center-cropping on videos
      - Outperforms DeepVideo
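
The evaluation described above (average clip scores per video, then check whether the true class is among the top k) can be sketched as follows. The function names and the toy scores are assumptions for illustration, and the example uses 3 clips and 4 classes instead of the 10 clips mentioned in the notes to keep it short.

```python
def average_clip_scores(clip_scores):
    """Average per-class scores over all clips sampled from one video."""
    n = len(clip_scores)
    return [sum(col) / n for col in zip(*clip_scores)]


def top_k_hit(scores, true_label, k):
    """True when true_label is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return true_label in ranked[:k]


def top_k_accuracy(all_scores, labels, k):
    """Fraction of items (clips or videos) whose true label is in the top k."""
    hits = sum(top_k_hit(s, y, k) for s, y in zip(all_scores, labels))
    return hits / len(labels)


# Hypothetical class scores for 3 clips of one video, 4 classes.
clips = [[0.1, 0.6, 0.2, 0.1],
         [0.5, 0.2, 0.2, 0.1],
         [0.2, 0.5, 0.2, 0.1]]
video_scores = average_clip_scores(clips)

print(top_k_hit(video_scores, true_label=1, k=1))   # class 1 has the highest average
print(top_k_hit(video_scores, true_label=0, k=1))   # class 0 is only second
print(top_k_hit(video_scores, true_label=0, k=2))   # but it is within the top 2
print(top_k_accuracy(clips, labels=[1, 1, 1], k=1)) # clip-level top 1 accuracy
```

Clip accuracy applies `top_k_accuracy` to individual clip scores, while video accuracy applies it to the averaged scores, which is why the two can differ for the same network.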