(Activation function, LSTM internal structure, Optimization,…
LSTM internal structure
Normalization
Min-max
Scales each value to the range [0, 1].
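A minimal NumPy sketch of the min-max scaling described above; applying it per column (feature) is an assumption, since the node does not state the axis.

    import numpy as np

    def min_max_scale(x, eps=1e-12):
        # Scale each column (feature) to [0, 1] using its own min and max.
        x = np.asarray(x, dtype=float)
        x_min = x.min(axis=0)
        x_max = x.max(axis=0)
        return (x - x_min) / (x_max - x_min + eps)   # eps guards against constant columns

    data = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
    print(min_max_scale(data))   # every column now lies in [0, 1]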
Optimization
Optimizer function
Adam
Uses the first-order moment estimate and the second-order moment estimate of the gradient to dynamically adjust the learning rate.
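A minimal sketch of one Adam update following that description; the hyperparameter defaults (beta1=0.9, beta2=0.999, eps=1e-8) and the toy quadratic in the usage are assumptions.

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # First-order and second-order moment estimates of the gradient.
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        # The effective learning rate is adjusted per parameter via sqrt(v_hat).
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

    theta, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
    target = np.array([3.0, -1.0])
    for t in range(1, 2001):
        grad = 2 * (theta - target)               # gradient of ||theta - target||^2
        theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
    print(theta)                                  # close to target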
GD
Solves for the optimal value by moving along the direction of steepest (gradient) descent. The method converges at a linear rate.
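A minimal sketch of plain gradient descent; the toy quadratic objective and the fixed step size are illustrative assumptions.

    import numpy as np

    def gradient_descent(grad_fn, theta0, lr=0.1, n_iters=100):
        # Repeatedly step along the negative gradient with a fixed learning rate.
        theta = np.asarray(theta0, dtype=float)
        for _ in range(n_iters):
            theta = theta - lr * grad_fn(theta)
        return theta

    # Minimize f(theta) = ||theta - target||^2, whose gradient is 2 * (theta - target).
    target = np.array([3.0, -1.0])
    print(gradient_descent(lambda th: 2 * (th - target), np.zeros(2)))   # converges to target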
SGD
The parameter update is computed from a randomly sampled mini-batch. The method converges at a sublinear rate.
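A minimal sketch of mini-batch SGD for a least-squares problem; the loss (mean of 0.5 * residual^2), batch size, and learning rate are assumptions.

    import numpy as np

    def sgd_least_squares(X, y, lr=0.05, batch_size=8, n_epochs=50, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(n_epochs):
            idx = rng.permutation(n)
            for start in range(0, n, batch_size):
                b = idx[start:start + batch_size]            # randomly sampled mini-batch
                grad = X[b].T @ (X[b] @ w - y[b]) / len(b)   # mini-batch gradient estimate
                w -= lr * grad
        return w

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true
    print(sgd_least_squares(X, y))   # approaches w_true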
NAG
Accelerates gradient descent by accumulating previous gradients as momentum, evaluating the gradient at the momentum look-ahead point, and performing the update with that momentum.
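A minimal sketch of Nesterov accelerated gradient on a toy quadratic; the momentum coefficient 0.9 and the step size are assumed defaults.

    import numpy as np

    def nag(grad_fn, theta0, lr=0.1, momentum=0.9, n_iters=200):
        theta = np.asarray(theta0, dtype=float)
        velocity = np.zeros_like(theta)
        for _ in range(n_iters):
            lookahead = theta + momentum * velocity          # peek ahead along the momentum direction
            velocity = momentum * velocity - lr * grad_fn(lookahead)
            theta = theta + velocity                         # update with the accumulated momentum
        return theta

    target = np.array([3.0, -1.0])
    print(nag(lambda th: 2 * (th - target), np.zeros(2)))    # converges to target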
Frank-Wolfe
The method approximates the objective with a linear function, solves the resulting linear program to find a feasible descent direction, and performs a one-dimensional search along that direction within the feasible domain.
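A minimal sketch of Frank-Wolfe for a quadratic objective over the probability simplex; on the simplex the linear sub-problem reduces to picking the vertex with the smallest gradient coordinate, and the 2/(t+2) step size stands in for the one-dimensional search. Both choices are assumptions.

    import numpy as np

    def frank_wolfe_simplex(grad_fn, d, n_iters=200):
        x = np.ones(d) / d                        # start at the barycenter of the simplex
        for t in range(n_iters):
            g = grad_fn(x)
            s = np.zeros(d)
            s[np.argmin(g)] = 1.0                 # vertex minimizing the linear approximation
            gamma = 2.0 / (t + 2.0)               # diminishing step size instead of a line search
            x = (1 - gamma) * x + gamma * s       # stays inside the feasible domain by construction
        return x

    # Minimize ||x - target||^2 over the simplex (target itself lies on the simplex).
    target = np.array([0.2, 0.5, 0.3])
    print(frank_wolfe_simplex(lambda x: 2 * (x - target), d=3))   # approaches target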
SVRG
Instead of storing the gradient of every sample, the average (full) gradient is saved at regular intervals at a snapshot of the parameters. At each iteration, the update direction is formed from the gradient of a randomly selected sample at the current parameters, the gradient of the same sample at the snapshot parameters, and the stored average gradient.
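A minimal sketch of SVRG for least squares; the snapshot interval (one pass), the learning rate, and the per-sample loss 0.5 * residual^2 are assumptions.

    import numpy as np

    def svrg_least_squares(X, y, lr=0.01, n_outer=30, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)

        def grad_i(w, i):
            return X[i] * (X[i] @ w - y[i])                  # gradient for a single sample

        for _ in range(n_outer):
            w_snap = w.copy()
            avg_grad = X.T @ (X @ w_snap - y) / n            # average gradient stored at the snapshot
            for _ in range(n):                               # inner loop: one stochastic pass
                i = rng.integers(n)
                # Variance-reduced direction: current gradient minus the old one plus the stored average.
                g = grad_i(w, i) - grad_i(w_snap, i) + avg_grad
                w -= lr * g
        return w

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true
    print(svrg_least_squares(X, y))   # approaches w_true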
AdaGrad
The learning rate is adaptively adjusted according to the sum of the squares of all historical gradients.
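A minimal sketch of AdaGrad on a toy quadratic; the base learning rate is an assumption.

    import numpy as np

    def adagrad(grad_fn, theta0, lr=0.5, eps=1e-8, n_iters=500):
        theta = np.asarray(theta0, dtype=float)
        sq_sum = np.zeros_like(theta)                        # sum of squares of all historical gradients
        for _ in range(n_iters):
            g = grad_fn(theta)
            sq_sum += g**2
            theta -= lr * g / (np.sqrt(sq_sum) + eps)        # per-coordinate adaptive learning rate
        return theta

    target = np.array([3.0, -1.0])
    print(adagrad(lambda th: 2 * (th - target), np.zeros(2)))   # approaches target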
ADMM
The method solves optimization problems with linear constraints by adding a penalty term to the objective and splitting the variables into sub-problems that are solved alternately and iteratively.
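A minimal sketch of ADMM applied to the Lasso problem as a concrete instance of that splitting; the Lasso choice, the penalty parameter rho, and lambda are assumptions, not taken from the map.

    import numpy as np

    def soft_threshold(v, k):
        return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

    def admm_lasso(A, b, lam=0.1, rho=1.0, n_iters=200):
        # Split min 0.5*||Ax - b||^2 + lam*||z||_1 subject to x - z = 0 into alternating sub-problems.
        n, d = A.shape
        x, z, u = np.zeros(d), np.zeros(d), np.zeros(d)      # u is the scaled dual variable
        inv = np.linalg.inv(A.T @ A + rho * np.eye(d))
        Atb = A.T @ b
        for _ in range(n_iters):
            x = inv @ (Atb + rho * (z - u))                  # quadratic sub-problem (penalized least squares)
            z = soft_threshold(x + u, lam / rho)             # l1 sub-problem has a closed-form solution
            u = u + x - z                                    # dual update enforcing the constraint x = z
        return z

    rng = np.random.default_rng(0)
    A = rng.normal(size=(100, 5))
    x_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0])            # sparse ground truth
    b = A @ x_true
    print(admm_lasso(A, b))   # roughly recovers the sparse coefficients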
SAG
The most recent gradient of each sample and the sum of these gradients are maintained in memory. For each update, one sample is randomly selected, its stored gradient is replaced with a freshly computed one, and the updated gradient sum (averaged) is used as the update direction.
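A minimal sketch of SAG for least squares; the learning rate, iteration count, and per-sample loss 0.5 * residual^2 are assumptions.

    import numpy as np

    def sag_least_squares(X, y, lr=0.01, n_iters=20000, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        grad_table = np.zeros((n, d))              # the old gradient of each sample, kept in memory
        grad_sum = np.zeros(d)                     # running sum of the stored gradients
        for _ in range(n_iters):
            i = rng.integers(n)
            g_new = X[i] * (X[i] @ w - y[i])       # fresh gradient for the selected sample
            grad_sum += g_new - grad_table[i]      # swap the old gradient for the new one in the sum
            grad_table[i] = g_new
            w -= lr * grad_sum / n                 # averaged gradient sum as the update direction
        return w

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true
    print(sag_least_squares(X, y))   # approaches w_true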
Regularization
Dropout
A higher dropout rate drops more features and may cause underfitting.
A lower dropout rate keeps most features, so overfitting may still happen.
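A minimal NumPy sketch of (inverted) dropout at training time; here rate is the probability of dropping a unit, matching the reading above, and the scaling by 1/(1 - rate) is the common inverted-dropout convention.

    import numpy as np

    def dropout(activations, rate=0.5, training=True, seed=None):
        # rate is the probability of dropping a unit: a higher rate removes more features
        # (risk of underfitting), a lower rate keeps almost all of them (overfitting may remain).
        if not training or rate == 0.0:
            return activations
        rng = np.random.default_rng(seed)
        mask = rng.random(activations.shape) >= rate
        return activations * mask / (1.0 - rate)   # rescale so the expected activation is unchanged

    h = np.ones((2, 4))
    print(dropout(h, rate=0.3, seed=0))            # surviving units are scaled up to 1/0.7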