Karpathy et al. (2014) Large-scale Video Classification with Convolutional Neural Networks
Model Architecture
2D Convolutions
2D convolutions operate on single images; videos additionally require modeling temporal information across frames
Use 3D Convolutions
Early Fusion
3D convolutions on first layer
Succeeding layers are 2D convolutions
Initial 3D convolution models time component
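A minimal numpy sketch of the early-fusion idea: the first-layer filter spans the entire temporal window, so time is collapsed in a single step and later layers can be ordinary 2D convolutions. The function name, shapes, and grayscale input are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def early_fusion_first_layer(clip, kernel):
    """Naive early-fusion first layer (sketch).

    clip:   (T, H, W) stack of grayscale frames; the filter spans the
            full temporal extent T, collapsing time in one step.
    kernel: (T, kH, kW) spatio-temporal filter.
    Returns one 2D feature map (valid convolution, stride 1).
    """
    T, H, W = clip.shape
    kT, kH, kW = kernel.shape
    assert kT == T, "early fusion: filter covers the whole time window"
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(clip[:, i:i + kH, j:j + kW] * kernel)
    return out

# Toy example: 4 frames of 8x8 pixels, one 4x3x3 filter -> 6x6 map
clip = np.random.randn(4, 8, 8)
kernel = np.random.randn(4, 3, 3)
fmap = early_fusion_first_layer(clip, kernel)
print(fmap.shape)  # (6, 6)
```

The output is purely spatial, which is why all succeeding layers can be 2D.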
Late Fusion
Similar to the single-frame model, but uses two input streams on frames separated by a time interval T
The two streams share parameters and are merged by a fully connected layer
Temporal information comes from the gap between the streams; the fully connected layer compares their features to detect motion
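A small numpy sketch of late fusion: the same tower (here just a stand-in linear map, a hypothetical simplification of a full ConvNet) processes two frames a time gap apart, and a fully connected layer operates on the concatenated features. All shapes and names are illustrative assumptions.

```python
import numpy as np

def tower(frame, w):
    """Stand-in for a shared-weight single-frame ConvNet tower:
    just a linear map + ReLU on the flattened frame (sketch)."""
    return np.maximum(0, w @ frame.ravel())

def late_fusion(frame_a, frame_b, w_tower, w_fc):
    """Late fusion (sketch): run two frames separated by a time gap
    through the SAME tower, concatenate their features, then let a
    fully connected layer compare them to capture motion."""
    fa = tower(frame_a, w_tower)
    fb = tower(frame_b, w_tower)
    return w_fc @ np.concatenate([fa, fb])

rng = np.random.default_rng(0)
frame_a = rng.standard_normal((8, 8))    # frame at time t
frame_b = rng.standard_normal((8, 8))    # frame at time t + T
w_tower = rng.standard_normal((16, 64))  # shared tower weights
w_fc = rng.standard_normal((10, 32))     # fusion FC layer, 10 classes
scores = late_fusion(frame_a, frame_b, w_tower, w_fc)
print(scores.shape)  # (10,)
```

No layer before the FC fusion ever sees both frames, hence "late" fusion.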
Slow Fusion
Uses 3D convolutions all throughout the model until the fully-connected layer
Performs best but is also the most computationally expensive
It performs best since it is the version that captures the most temporal information
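A numpy sketch of the slow-fusion schedule: each layer mixes only a small temporal window, so information across time is merged gradually over several layers instead of all at once. Spatial convolutions are omitted and replaced by identity here; the window sizes and strides are illustrative assumptions, not the paper's exact values.

```python
import numpy as np

def temporal_conv(x, kT, stride):
    """Convolve along time only (sketch): average kT consecutive
    feature maps, moving by `stride` frames each step."""
    T = x.shape[0]
    steps = (T - kT) // stride + 1
    return np.stack([x[s * stride:s * stride + kT].mean(axis=0)
                     for s in range(steps)])

# A 10-frame clip; the temporal extent shrinks layer by layer,
# so temporal information is fused slowly through the network.
x = np.random.randn(10, 8, 8)
h1 = temporal_conv(x, kT=4, stride=2)   # (4, 8, 8)
h2 = temporal_conv(h1, kT=2, stride=2)  # (2, 8, 8)
h3 = temporal_conv(h2, kT=2, stride=1)  # (1, 8, 8) - time collapsed
print(h1.shape, h2.shape, h3.shape)
```

Because higher layers see progressively larger temporal receptive fields, the model accesses more temporal context than early or late fusion, at a higher compute cost.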
3D convolutions can utilize time information by adding temporal depth to the convolutions.
Single-frame (baseline approach)
Classify images one frame at a time
Obtain average classification of frames to get video classification
Does not model temporal information, but provides a baseline for how well static appearance alone classifies videos
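The baseline above can be sketched in a few lines of numpy: score each frame independently, then average the per-frame class probabilities to label the video. The linear frame model is a hypothetical stand-in for the single-frame ConvNet.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_video(frames, frame_model):
    """Single-frame baseline (sketch): classify each frame on its own,
    then average the per-frame probabilities for the video label."""
    probs = np.stack([softmax(frame_model(f)) for f in frames])
    return probs.mean(axis=0).argmax()

# Hypothetical frame model: a fixed linear classifier over 3 classes.
rng = np.random.default_rng(1)
W = rng.standard_normal((3, 64))
model = lambda f: W @ f.ravel()
frames = rng.standard_normal((5, 8, 8))  # 5 frames sampled from a video
label = classify_video(frames, model)
print(label)
```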
Convolutions are expensive.
Require many floating point operations (FLOPs)
More convolutional layers mean more FLOPs
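The cost argument can be made concrete with a standard back-of-the-envelope FLOP count for a conv layer; the layer dimensions below are illustrative assumptions.

```python
def conv2d_flops(h_out, w_out, c_in, c_out, kh, kw):
    """Approximate multiply-add count for one 2D conv layer:
    every output position computes a (kh * kw * c_in) dot product
    for each of the c_out filters."""
    return h_out * w_out * c_out * (kh * kw * c_in)

# Halving the input resolution quarters the output area, and with it
# the FLOPs of the layer:
full = conv2d_flops(170, 170, 3, 96, 11, 11)
half = conv2d_flops(85, 85, 3, 96, 11, 11)
print(full // half)  # 4
```

This is why reducing input resolution is tempting as a cost cut, and why it hurts: the saved FLOPs come directly out of the spatial detail the filters can see.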
Reduce cost by reducing layers
Significantly reduces performance
Not suitable since deeper networks tend to be more accurate
Reduce cost by reducing input resolution
Reducing input resolution reduces performance
Use multi-resolution networks
Process two input streams: a fovea stream (center crop at the original resolution) and a context stream (the whole frame downsampled)
Performs better and faster than single-frame
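A numpy sketch of the two-stream split: one sharp center crop and one coarse downsampled view of the whole frame, both the same size, so each tower processes roughly a quarter of the full-resolution pixels. The 178/89 sizes follow the note above's spirit; treat the exact numbers here as illustrative.

```python
import numpy as np

def fovea_and_context(frame, crop=89, down=2):
    """Multi-resolution preprocessing (sketch): split one frame into a
    fovea stream (center crop at original resolution) and a context
    stream (whole frame downsampled). Both come out the same size."""
    H, W = frame.shape
    ci, cj = (H - crop) // 2, (W - crop) // 2
    fovea = frame[ci:ci + crop, cj:cj + crop]  # sharp center region
    context = frame[::down, ::down]            # coarse whole frame
    return fovea, context

frame = np.random.randn(178, 178)
fovea, context = fovea_and_context(frame)
print(fovea.shape, context.shape)  # (89, 89) (89, 89)
```

The fovea exploits the camera bias toward centered subjects, while the context stream preserves the global scene cheaply.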
Traditional Approaches
Hand-crafted features
Use classifiers such as SVMs
Do not benefit from representations learned from large-scale data
Data
Sports 1M Dataset
Benchmark for large-scale video classification, introduced in this paper
Widely used for training video classification models
Composed of YouTube videos labeled automatically from their tags
Labels are noisy since videos may be mistagged
Video labels may not completely describe what a video contains
Data Augmentation and preprocessing
Crop to center
Resize to 200 x 200
Randomly crop a 170 x 170 patch
Random flipping with 50% chance
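The four preprocessing steps above can be sketched as one numpy pipeline; the nearest-neighbor resize and the input frame size are simplifying assumptions for the sketch.

```python
import numpy as np

def preprocess(frame, rng, resize=200, out=170):
    """Augmentation pipeline from the notes (sketch): center-crop to a
    square, resize to 200x200 (nearest-neighbor here for simplicity),
    take a random 170x170 crop, and horizontally flip half the time."""
    H, W = frame.shape
    s = min(H, W)                                  # center square crop
    sq = frame[(H - s) // 2:(H - s) // 2 + s,
               (W - s) // 2:(W - s) // 2 + s]
    idx = (np.arange(resize) * s / resize).astype(int)
    resized = sq[np.ix_(idx, idx)]                 # crude 200x200 resize
    i = rng.integers(0, resize - out + 1)          # random 170x170 crop
    j = rng.integers(0, resize - out + 1)
    patch = resized[i:i + out, j:j + out]
    if rng.random() < 0.5:                         # flip with 50% chance
        patch = patch[:, ::-1]
    return patch

rng = np.random.default_rng(2)
frame = rng.standard_normal((240, 320))  # one raw grayscale frame
patch = preprocess(frame, rng)
print(patch.shape)  # (170, 170)
```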
Model Training and Testing
Optimizer: Downpour stochastic gradient descent
Transfer Learning
Trained the network on Sports 1M and transferred it to UCF-101 to test generalization capabilities
Layer fine-tuning
Fine-tune top layer
Best performing: Fine-tune top 3 layers
Fine-tune all layers
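The three fine-tuning regimes above differ only in how many top layers are retrained; a pure-Python sketch of selecting the trainable subset, with hypothetical layer names (conv1 ... fc8) ordered from input to output:

```python
def trainable_params(layers, fine_tune_top_k):
    """Transfer-learning sketch: keep the bottom layers of a pretrained
    network frozen and retrain only the top k layers.
    `layers` is ordered from input (bottom) to output (top)."""
    n = len(layers)
    return [name for i, name in enumerate(layers)
            if i >= n - fine_tune_top_k]

layers = ["conv1", "conv2", "conv3", "conv4", "conv5", "fc6", "fc7", "fc8"]
print(trainable_params(layers, 1))  # ['fc8'] - top layer only
print(trainable_params(layers, 3))  # ['fc6', 'fc7', 'fc8'] - best setting
print(trainable_params(layers, 8))  # all layers retrained
```

Freezing the lower layers keeps the generic features learned on Sports 1M while adapting the task-specific top of the network to the new dataset.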
Testing on Sports 1M
Outperforms the paper's own single-frame baseline
Outperforms traditional approaches that use hand-crafted features
Slow-fusion is best performing variant