Deep Learning
Activation Function
Sigmoid Function
monotonic, continuous, and easy to differentiate
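A minimal sketch of the sigmoid and its derivative, assuming NumPy (illustrative only):

import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + exp(-x)): monotonic and continuous
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # easy to differentiate: f'(x) = f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)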
Softplus
has a continuous derivative and defines a relatively smooth surface
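A sketch of softplus, assuming the standard definition f(x) = ln(1 + e^x); its derivative is the sigmoid, which is why the surface is smooth:

import numpy as np

def softplus(x):
    # f(x) = ln(1 + exp(x)); log1p improves numerical stability
    return np.log1p(np.exp(x))

def softplus_derivative(x):
    # continuous derivative: f'(x) = sigmoid(x)
    return 1.0 / (1.0 + np.exp(-x))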
single-layer perceptron
When W^T X <= 0, then O = -1 and the sample is classified into the other type
When W^T X > 0, then O = 1 and the sample is classified into the given type
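A minimal sketch of the decision rule above; the weights and sample below are hypothetical:

import numpy as np

def perceptron_output(W, X):
    # O = 1 when W^T X > 0, otherwise O = -1
    return 1 if W @ X > 0 else -1

W = np.array([0.5, -0.2, 0.1])  # hypothetical weights
X = np.array([1.0, 2.0, 3.0])   # hypothetical sample
print(perceptron_output(W, X))  # -> 1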
deep neural network
neural network
an information processing system designed to imitate the structure and functions of the human brain, characterized by its source, features, and interpretation
Type
CNN
pooling layer
combines nearby units, reducing the size of the feature map and the number of dimensions
Common pooling layers
max pooling layer
divides a feature map into several regions and uses the maximum value of each region to represent the entire region
average pooling layer
divides a feature map into several regions and uses the average value of each region to represent the entire region
ignores how elements are arranged within each region and uses only their statistical features
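A sketch of both pooling layers over non-overlapping k x k regions, assuming a 2D feature map whose sides are divisible by k:

import numpy as np

def pool2d(fmap, k, mode="max"):
    # split the feature map into k x k regions; keep one statistic
    # per region, ignoring how elements are arranged inside it
    h, w = fmap.shape
    blocks = fmap.reshape(h // k, k, w // k, k)
    if mode == "max":
        return blocks.max(axis=(1, 3))   # max pooling
    return blocks.mean(axis=(1, 3))      # average pooling

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fmap, 2, "max"))   # [[ 5.  7.] [13. 15.]]
print(pool2d(fmap, 2, "mean"))  # [[ 2.5  4.5] [10.5 12.5]]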
Training Rule
Gradient Descent
Searches along the negative gradient direction of the loss function and updates the parameters iteratively
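A minimal sketch of the update rule w <- w - lr * grad L(w); the quadratic loss below is a hypothetical example:

import numpy as np

def gradient_descent_step(w, grad_L, lr=0.1):
    # move the parameters along the negative gradient direction
    return w - lr * grad_L(w)

grad_L = lambda w: w        # gradient of L(w) = ||w||^2 / 2
w = np.array([1.0, -2.0])
for _ in range(100):
    w = gradient_descent_step(w, grad_L)
print(w)                    # close to the minimum at [0, 0]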
Optimizer
Common optimizers
momentum optimizer
Advantages
accelerates convergence where the gradient direction is stable
Disadvantages
The learning rate η and momentum α need to be set manually; more experiments are required to determine appropriate values.
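A sketch assuming the common momentum form v <- α*v - η*g, with the manually set η and α noted above:

def momentum_step(w, v, grad, lr=0.01, alpha=0.9):
    # the velocity v (start it at zero) accumulates past gradients;
    # where the gradient direction is stable, v grows and
    # convergence accelerates
    v = alpha * v - lr * grad
    return w + v, v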
AdaGrad Optimizer
ε indicates the global learning rate, which needs to be set manually
the accumulation term r keeps increasing as the algorithm iterates, so the overall learning rate keeps decreasing
Pros
The learning rate is automatically updated. As the number of updates increases, the learning rate decreases.
Cons
The denominator keeps accumulating, so the learning rate eventually becomes very small and the algorithm becomes ineffective.
the initial value of the accumulation term r is 0, and it increases continuously
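A sketch of the AdaGrad update under the notation above (global learning rate ε, accumulation term r starting at 0; δ is the usual small stabilizing constant):

import numpy as np

def adagrad_step(w, r, grad, eps=0.01, delta=1e-7):
    # r accumulates squared gradients and never shrinks, so the
    # effective step eps / (delta + sqrt(r)) keeps decreasing
    r = r + grad * grad
    return w - eps * grad / (delta + np.sqrt(r)), r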
RMSProp Optimizer
the initial value of the accumulation term r is 0, but it does not necessarily keep increasing; it is controlled by a decay-rate parameter ρ
ε indicates the global learning rate, which needs to be set manually
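A sketch of the RMSProp update, assuming the usual decay rate ρ; unlike AdaGrad, r is an exponential moving average and need not grow without bound:

import numpy as np

def rmsprop_step(w, r, grad, eps=0.01, rho=0.9, delta=1e-7):
    # r decays old squared gradients instead of accumulating them
    # forever, so the learning rate does not shrink to zero
    r = rho * r + (1.0 - rho) * grad * grad
    return w - eps * grad / (delta + np.sqrt(r)), r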