Deep learning
Deep Neural Networks
Improvement
optimization
algorithms
gradient descent with momentum (sketched below)
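A minimal sketch of one momentum update, assuming a NumPy parameter array `w`, its gradient `dw`, a velocity `v`, learning rate `alpha`, and momentum coefficient `beta` (all names are illustrative):

```python
import numpy as np

def momentum_update(w, dw, v, alpha=0.01, beta=0.9):
    """One gradient-descent step with momentum.

    v keeps an exponentially weighted average of past gradients,
    which damps oscillations and speeds up convergence.
    """
    v = beta * v + (1 - beta) * dw   # update the velocity (moving average of gradients)
    w = w - alpha * v                # step along the averaged gradient
    return w, v

# usage: start with v = np.zeros_like(w) and call once per mini-batch
```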
Batch normalization
\( Z^i_{norm} = \frac{Z^i - \mu}{\sqrt{\sigma^2 + \epsilon}} \)
\( \hat{Z}^i = \gamma*Z^i_{norm} + \beta \)
at test time: estimate \( \mu, \sigma \) (e.g. as exponentially weighted averages over mini-batches)
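A minimal NumPy sketch of the two formulas above, including the test-time path that reuses running estimates of \( \mu \) and \( \sigma^2 \); the function name, the shapes, and the running-average bookkeeping are assumptions for illustration:

```python
import numpy as np

def batch_norm(Z, gamma, beta, running_mu, running_var,
               momentum=0.9, eps=1e-8, training=True):
    """Normalize Z (shape: features x batch), then scale and shift."""
    if training:
        mu = Z.mean(axis=1, keepdims=True)           # batch mean
        var = Z.var(axis=1, keepdims=True)           # batch variance
        # keep exponentially weighted averages for use at test time
        running_mu = momentum * running_mu + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mu, running_var            # estimates gathered during training
    Z_norm = (Z - mu) / np.sqrt(var + eps)           # Z_norm = (Z - mu) / sqrt(var + eps)
    Z_hat = gamma * Z_norm + beta                    # learnable scale and shift
    return Z_hat, running_mu, running_var
```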
softmax
loss function
\( L(y,\hat{y}) = -\sum_i{y_i*log(\hat{y_i})} \)
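A small sketch of softmax plus this cross-entropy loss, assuming `z` holds the output-layer scores and `y` is a one-hot label vector (names are illustrative):

```python
import numpy as np

def softmax(z):
    """Turn scores into probabilities that sum to 1."""
    z = z - np.max(z)                # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def cross_entropy(y, y_hat, eps=1e-12):
    """L(y, y_hat) = -sum_i y_i * log(y_hat_i) for a one-hot y."""
    return -np.sum(y * np.log(y_hat + eps))

# example: 3-class problem, true class is index 1
y = np.array([0.0, 1.0, 0.0])
y_hat = softmax(np.array([2.0, 1.0, 0.1]))
loss = cross_entropy(y, y_hat)
```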
network 
building block
back propagation
\( \Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T \)
\( D_{i,j}^{(l)} = \frac{1}{m}(\Delta_{i,j}^{(l)} + \lambda\Theta_{i,j}^{(l)})\) if \( j \neq 0 \)
\( D_{i,j}^{(l)} = \frac{1}{m}\Delta_{i,j}^{(l)} \) if \( j = 0 \)
\( \delta^{(L)} = a^{(L)} - y^{(t)} \)
\( \delta^{(l)} = ((\Theta^{(l)})^T*\delta^{(l+1)}).*g'(z^{(l)}) \)
\( g'(z^{(l)}) = a^{(l)} .* (1-a^{(l)}) \)
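A sketch of one backward pass built from the equations above, assuming sigmoid activations, weight matrices `Theta[l]` that include a bias column (index j = 0), and illustrative names throughout:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(Theta, X, Y, lam=0.0):
    """Gradients D[l] of the regularized cost with respect to Theta[l].

    Theta[l] has shape (units in layer l+1, units in layer l + 1 bias);
    X is (m, n_features), Y is (m, n_outputs).
    """
    m = X.shape[0]
    Delta = [np.zeros_like(T) for T in Theta]       # gradient accumulators

    for x, y in zip(X, Y):
        # forward pass, keeping activations a[l] (bias unit prepended on non-output layers)
        a = [np.concatenate(([1.0], x))]
        for l, T in enumerate(Theta):
            act = sigmoid(T @ a[l])
            a.append(act if l == len(Theta) - 1 else np.concatenate(([1.0], act)))

        # backward pass: delta(L) = a(L) - y, then propagate towards the input
        delta = a[-1] - y
        for l in range(len(Theta) - 1, -1, -1):
            Delta[l] += np.outer(delta, a[l])       # Delta(l) += delta(l+1) * a(l)^T
            if l > 0:
                # drop the bias component, then multiply by g'(z) = a * (1 - a)
                delta = (Theta[l].T @ delta)[1:] * (a[l][1:] * (1 - a[l][1:]))

    # D = (1/m) * Delta, regularizing every column except the bias column (j = 0)
    D = []
    for T, Dl in zip(Theta, Delta):
        reg = lam * T
        reg[:, 0] = 0.0                             # do not regularize the bias weights
        D.append((Dl + reg) / m)
    return D
```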
loss function: \( J(\Theta) = \frac{1}{m}\sum_{output \, units} J_i + \frac{\lambda}{2m}\sum_{all \, \theta}\theta_j^2 \)
\( J_i = L(y_i,\hat{y_i}) = -(y_i*log(\hat{y_i}) + (1-y_i)*log(1-\hat{y_i})) \)
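A short sketch of this regularized cost, reusing the same illustrative conventions (labels `Y`, predictions `Y_hat`, weight matrices `Theta` whose bias column is not regularized):

```python
import numpy as np

def cost(Y, Y_hat, Theta, lam=0.0, eps=1e-12):
    """J = (1/m) * sum_i J_i + (lambda / 2m) * sum of squared weights.

    Y and Y_hat have shape (m, n_outputs); the bias column (j = 0)
    of each Theta[l] is excluded from the regularization term.
    """
    m = Y.shape[0]
    # J_i = -(y*log(y_hat) + (1 - y)*log(1 - y_hat)), summed over the output units
    J_i = -(Y * np.log(Y_hat + eps) + (1 - Y) * np.log(1 - Y_hat + eps)).sum(axis=1)
    reg = sum(np.sum(T[:, 1:] ** 2) for T in Theta)  # skip the bias column
    return J_i.mean() + (lam / (2 * m)) * reg
```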